NAME

Parse::Marpa::CONCEPTS - Concepts helpful for Using Marpa

BEWARE: THIS DOCUMENT IS UNDER CONSTRUCTION AND VERY INCOMPLETE

THIS DOCUMENT IS UNDER CONSTRUCTION AND VERY INCOMPLETE

OVERVIEW

This document is about concepts for putting Marpa to work in your applications. It's not about the mathematics or theory. The theory behind Marpa is documented elsewhere.

It's also not a tutorial on grammars or BNF. For that consult a modern textbook, such as Grune and Jacobs Parsing Techniques - Second Edition, or Wikipedia. In Wikipedia, the article on Backus-Naur form is a good place to start.

Only concepts common to all of Marpa's grammar interfaces are covered. Speaking of which ...

GRAMMAR INTERFACES

A grammar is specified to Marpa through a grammar interface, which may itself be described by Marpa grammar. Right now there are only two grammar interfaces: the Marpa Demonstration Language and the raw grammar interface.

The Raw Grammar Interface

The raw grammar interface is a set of options to the constructor for Marpa grammar objects, Parse::Marpa::new(). All the other grammar interfaces use the raw grammar interface indirectly. The raw grammar interface is efficient, but users will usually want something higher level. The documentation for the raw grammar interface (as yet unwritten) is Parse::Marpa::RAW.

The Marpa Demonstration Language

In Marpa's eyes all higher level grammar interfaces will be equal. I call the one that I am delivering with Marpa the Marpa Demonstration Language instead of the "Marpa Language" to emphasize it's lack of special status. Its documentation is at Parse::Marpa::LANGUAGE.

Your Grammar Interface Here

Users are encouraged to design their own high-level Marpa interfaces.

TOKENS AND EARLEMES

As a reminder, the usual method of parsing first breaks the input text up into tokens. Typically tokens are recognized (or "lexed") with regular expressions or something similar. The actual parsing is then done on the sequence of tokens. In conventional parsing, it's required that the token sequence will be deterministic -- that is, that there'll only be one sequence of tokens and that that sequence can be found by the lexer more or less on its own. In conventional Earley parsing, there is one "earley set" per token.

Marpa allows ambiguous tokens. Marpa tokens allows recognition, at a single location, of several different tokens of varying length. Tokens may occur in the input stream in arbitrary patterns. Nothing, for example, prevents them from overlapping.

What is considered a "location" is up to the user, with two restrictions:

  1. Tokens cannot be zero or negative in length.

  2. Tokens must be grouped into "locations" called earlemes. and must be scanned in order. That is, all the tokens at earleme N must be recognized before any token at earleme N+1.

A parse is said to start at earleme 0, and "earleme N" means the location N earlemes after earleme 0. (Aside to experts: The implementation uses one Earley set for each earleme.) I also refer to "distances" in earlemes and lengths in earlemes, and my meaning is probably what you expect. The length from earleme 3 to earleme 6, for instance, is 3 earlemes.

The conventional model of Earley parsing corresponds to a one-earleme-per-token model in Marpa. Marpa's Parse::Marpa::Parse::text() method uses a model where there's one earleme per character. text() lexes the input string for the user, and feeds lexed substrings as tokens to Marpa, giving as their earleme length their length in characters.

In conventional Earley parsing, a "location" without a token means the parse is exhausted. In Marpa, tokens can span many earlemes. A parse is viable as long as some token has been recognized which ends at or after the current earleme. Only when there is no token at the current location, and no token reaches to the current location or past it, is the parse exhausted. Parses using the grammar for the MDL typically contain many stretches of empty earlemes, some quite long. (Note to experts: an "empty earleme" corresponds to an Earley set with no Earley items.)

Users of Marpa are not restricted to either the one-token-per-earleme or the one-character-per-earleme scheme. Input tokens may be fed directly to Marpa with the Parse::Marpa::Parse::earleme() method and a user may supply earleme lengths according to any rules he finds useful, subject to the two restrictions above.

THE STEPS OF MARPA PARSING

In parsing a text, Marpa follows a strict sequence, some of all of which is usually invisible to the user. For example, when a parse object is created from a grammar which has not been precomputed, the parse object constructor will silently perform not just the precomputation of the grammar, but also a deep copy of it. If the Parse::Marpa::marpa() routine is used, all the steps will be performed automatically and none of the methods listed below them can or should be called directly.

In each step, the method mentioned is the lowest level one available to the user. Calling those methods in that sequence is not the easiest approach and will rarely be the best one. See the main Parse::Marpa documentation page for pointers to easier interfaces, as well as instructions on how to exercise step-by-step control when that is what you want. The detailed documentation of each method often has hints on how that method is best used.

Here are the steps:

  • Creation of a grammar object

    A grammar object is created with Parse::Marpa::new(), although it may called indirectly.

  • Adding rules to the grammar object

    Rules must added to the grammar object. This is done using the interfaces, and the raw interface is always involved. The raw interface may be called directly, or it may be hidden behind a higher level interface. At the lowest level, rules are added with the Parse::Marpa::new() and the Parse::Marpa::set() methods.

  • Precomputing the grammar object

    Before a parse object can be created, Marpa must do a series of precomputations on the grammar. This step rarely needs to be performed explicitly, but when that is necessary, the method call is Parse::Marpa::precompute().

  • Deep copying the grammar

    Marpa parse objects work with a copy of the grammar, so that multiple parses from a single grammar can be performed at the same time without interfering with each other. The deep copy is done in two subphases. First, the grammar is written out with Data::Dumper. Second, the string from the first subphase is eval'ed back into a grammar object and tweaked.

    These two subphases are available to the user as Parse::Marpa::compile() and Parse::Marpa::decompile(). The result of compile() is a string, which may be written into a file. A subsequent Marpa process may read this file and continue the parse. See the descriptions of Parse::Marpa::compile() and Parse::Marpa::decompile() for more details.

  • Creating the parse object

    To parse a text, a parse object must be created. The constructor Parse::Marpa::Parse::new() is always called to do this, whether directly or indirectly. Strings and options describing the semantics may have been set in earlier phases, but it is now that the semantics are finalized. After this point they cannot change.

  • Token Input

    A series of tokens is input to the parse object, and recognition is performed as the input is received. Marpa, therefore, will eventually be capable of on-line or stream processing.

    Currently input may be specified as text, with the Parse::Marpa::Parse::text() method, or directly as tokens with the Parse::Marpa::Parse::earleme() method.

  • Initial Parse Evaluation

    Once the input is recognized, it can be evaluated. The first value is computed with the Parse::Marpa::Parse::initial() method. initial()'s return value indicates success or failure. The actual value of the parse is returned by the Parse::Marpa::Parse::value() method.

  • Parse Iteration

    From a given grammar and input, Marpa might produce more than one parse. These are iterated through with the Parse::Marpa::Parse::next() method, which returns failure once there are no more parses. As the parses are iterated, their values may be retrieved with Parse::Marpa::Parse::value() method.