NAME

Parse::Marpa - Earley's algorithm with LR(0) precomputation

SYNOPSIS

use 5.010_000;
use strict;
use warnings;
use English qw( -no_match_vars ) ;
use Parse::Marpa;

# remember to use refs to strings
my $value = Parse::Marpa::mdl(
    (do { local($RS) = undef; my $source = <DATA>; \$source; }),
    \("2+2*3")
);
say $$value;

__DATA__
semantics are perl5.  version is 0.212.0.  start symbol is Expression.

Expression: Expression, /[*]/, Expression.  priority 200.  q{
    $_[0] * $_[2]
}.

Expression: Expression, /[+]/, Expression.  priority 100.  q{
    $_[0] + $_[2]
}.

Expression: /\d+/.  q{ $_[0] }.

DESCRIPTION

Status

This is an Alpha release. It's intended to let people look Marpa over and try it out. Uses beyond that are risky. While Marpa is in alpha, you don't want to use it for anything mission-critical or with a serious deadline. I've no personal experience with them, but Parse::Yapp and Parse::RecDescent are alternatives to this module which are well reviewed and more mature and stable.

What Marpa can do

Marpa parses any language which can be written in BNF, with one restriction. Marpa handles all proper context-free grammars, plus those with empty rules, inaccessble rules and unproductive rules. Marpa parses left- and right-recursive grammars and ambiguous grammars.

Marpa's one restriction is that it won't parse infinitely ambiguous grammars. Since a user seldom wants a program to loop forever in an attempt to produce an infinite number of parse trees, that is not a big restriction. Infinite ambiguity only happens in grammars with cycles -- cases where a symbol string non-trivially derives exactly the same symbol string. A cycle in a grammar creates a situation where the same derivation steps might be repeated over and over again without end.

Empty productions are often necessary to express a language in a natural way. Inaccessible rules and unproductive rules aren't useful, but they cause no real harm.

So long as they are not infinitely ambiguous, ambiguous grammars are a Marpa specialty. (An ambiguous grammar is one for which there is some input which has more than one parse tree.) Ambiguity is often useful even if you are only interested in one parse. An ambiguous grammar is often the easiest and most sensible way to express a language.

Human languages are ambiguous. English sentences are often ambiguous, but human listeners hear the parse that makes most sense. Marpa allows the user to prioritize rules so that a preferred parse is returned first. Marpa can also return all the parses of an ambiguous grammar, if that's what the user prefers.

I may lift the restriction against cycles in a future version, allowing Marpa to produce whatever non-cyclical parse trees an infinitely ambiguous grammar might have. That doesn't seem like it will really offer any advantages in convenience or expressiveness, but there will be a gain in bragging rights. Marpa would then be able to say that it parses anything you can write in BNF, with no restrictions whatsoever.

Marpa incorporates major improvements from recent research into Earley's algorithm. In particular, it combines Earley's with LR(0) precomputation. Marpa also has innovations of its own, including predictive and ambiguous lexing.

The Easy Way

Most of Marpa's capabilities are available using a single static method: Parse::Marpa::mdl. The mdl method requires a grammar description in MDL (the Marpa Description Language) and a string. mdl parses the string according to the MDL description. In scalar context, mdl returns a reference to the value of the first parse. In list context, it returns references to the values of all parses. See below for more detail about the mdl static method.

Parsing Terminology

The parsing terms in these documents are either explained in these documents or are in standard use. However, just because a parsing term is in "standard use" doesn't mean it will be familiar. Even if you've studied parsing, you might not have run across that particular term, or might not remember exactly what it meant. I define all the terms I treat as "standard" in Parse::Marpa::Doc::Parse_Terms. The parse terms document is designed for skimming: the defining uses of the terms are all in boldface.

If you want an introduction to parsing concepts, the chapter on parsing in Mark Jason Dominus's Higher Order Perl is an excellent description of them in the Perl context. Online, Wikipedia is a good place to start.

Semantic Actions

Marpa's semantics are specified with Perl 5 code strings, called actions. Marpa allows rule actions, null symbol actions and a preamble action. It also allows lexing actions and a separate lex preamble action.

Rule and null symbol actions are calculated in a special namespace set aside for that purpose. The preamble action is always run first and can be used to initialize that namespace.

Lex actions are run in a special namespace devoted to lex actions. The special lex preamble action can be used to initialize that namespace.

The result of an action is the result of running its Perl 5 code string. From the synopsis, here's a rule for an expression that does addition:

Expression: Expression, /[+]/, Expression.

and here's its action:

$_[0] + $_[2]

In rule actions, @_ is an array containing the values of the symbols on the left hand side of the rule, as if they had been passed as arguments to a subroutine. Actions may not always be implemented as Perl subroutines, so so please do not return out of an action.

Marpa is targeted to Perl 6. When Perl 6 is ready, Perl 6 code will become its default semantics.

Null Symbol Values

Every symbol has a null symbol value, or more briefly, a null value, and this is used as the value of the symbol when it is nulled. The default null value is a Marpa option (default_null_value). If not explicitly set, default_null_value is a Perl 5 undefined.

Every symbol can have its own null symbol value. In cases where a null symbol derives other null symbols, only the value of the symbol highest in the null derivation is used. For details and examples, see "Null Symbol Values" in Parse::Marpa::Evaluator.

Lexing

The easiest way to parse a Perl 5 string in Marpa is to use MDL's default lexing. MDL allows terminals to be defined either as Perl 5 regexes or, for difficult cases, as lex actions, which are Perl 5 code. Unlike most parser generators, Marpa does not require that the regexes and lex actions result in a deterministic lexer. It is OK with Marpa if more than one token is possible at a location, or if possible tokens overlap. Ambiguities encountered in lexing are passed up to Marpa's parse engine, and dealt with there.

Marpa allows terminals to appear on the left hand side of rules. Most parsers have a problem with this, but Marpa does not.

Marpa is not restricted to MDL's model of lexing. Advanced users can invent new models of the input, customized to their applications. For more detail see "Tokens and Earlemes" in Parse::Marpa::Grammar.

Lack of Backward Compatibility

While this module is in alpha, versions may not be backward compatible. MDL protects users by requiring the version to be specified, and by insisting on an exact match with Marpa's version number. This strict version regime is the same as that being considered for Perl 6. Nonetheless, Marpa's version matching may become less strict once it goes beta.

Getting Started

Before you begin, you will want to look the Parse::Marpa::Doc::MDL document. Here's all you should need to get started:

What is in the Other Documents

As you get into advanced applications of Marpa, the first places to look will be the documents for the various phases of Marpa parsing: Parse::Marpa::Grammar, Parse::Marpa::Recognizer, and Parse::Marpa::Evaluator.

A few documents describe details you may never need. Parse::Marpa::Doc::Plumbing documents Marpa's plumbing. Parse::Marpa::MDL documents utilities for converting MDL symbol names to plumbing interface names. Parse::Marpa::Lex documents some lex actions which are used by MDL, and which are available to users for their own lexing.

For very advanced diagnostics or for reading Marpa's code, it is necessary to understand Marpa's internals. These are described in Parse::Marpa::Doc::Internals.

For those interesting in the theory behind Marpa and the details of its programming, Parse::Marpa::Doc::Algorithm describes the algorithms, explains how Marpa would not have been possible without the the work of others, and details what is new with Marpa. Details about sources (books, web pages and articles) referred to in these documents or used in the writing of Marpa are collected in Parse::Marpa::Doc::Bibliography. Parse::Marpa::Doc::To_Do is the list of things that might be done to Marpa in the future.

Phases

The mdl method hides the details of creating Marpa objects and using Marpa's object methods from the user. But for advanced applications, and for tracing and diagnostics, it is useful to know in detail how Marpa works.

Marpa parsing take place in three phases: grammar creation, input recognition and parse evaluation. For brevity, I'll often speak of the the parse evaluation phase as the evaluation phase, and the input recognition phase as the recognition phase.

Corresponding to the three phases, Marpa has three kinds of object: grammars, recognizers and evaluators. Recognizers are created from grammars and evaluators are created from recognizers.

Grammars

Grammar objects (Parse::Marpa::Grammar) are created first. They may be created with rules or empty. Rules may be added to grammar objects after they have been created. After all the rules have been added, but before it is used to create a recognizer, a grammar must be precomputed. Details on grammar objects and methods can be found at Parse::Marpa::Grammar.

Recognizers

To create a Marpa recognizer object (Parse::Marpa::Recognizer), a Marpa grammar object is required. Once a recognizer object has been created, it can accept input. You can create multiple recognizers from a single grammar, and can safely run them simultaneously.

Recognizing an input is answering the "yes" or "no" question: Does the input match the grammar? While recognizing its input, Marpa builds tables. Marpa's evaluation phase works from these tables. Before creation of an evaluation object, the input has not been parsed in the strict sense of the term, that is, its structure according to the grammar has not been determined. For more details on recognizer objects and methods, see Parse::Marpa::Recognizer.

Currently, Marpa fully supports only non-streaming or "offline" input. Marpa will also parse streamed inputs, but the methods to find completed parses in a streamed input are still under construction.

Evaluators

In offline mode, once input is completed, an evaluator object (Parse::Marpa::Evaluator) can be created. For each recognizer, only one evaluator object can be in use at any one time.

An evaluator object is an iterator. If the grammar is ambiguous, the evaluator object can be used to return the values of all the parses. For details on evaluator objects and methods, see Parse::Marpa::Evaluator.

Plumbing and Porcelain

A grammar is specified to Marpa through a grammar interface. There are two kinds of grammar interfaces, porcelain and plumbing. There is only one plumbing interface. As of the moment there is also only one porcelain interface, the Marpa Demonstration Language.

The plumbing is a set of named arguments to the new and set methods of Marpa grammar objects. Porcelain interfaces use the plumbing indirectly. The plumbing is efficient, but MDL is easier to read, write and maintain. Users seeking efficiency are usually better off using compiled MDL. The documentation for the plumbing is Parse::Marpa::Doc::Plumbing.

Users are encouraged to design their own porcelain. In Marpa's eyes all porcelain will be equal. I call the porcelain that I am delivering with Marpa the Marpa Demonstration Language instead of the "Marpa Language" to emphasize its lack of special status. The documentation for MDL can be found at Parse::Marpa::Doc::MDL.

Namespaces

Actions run in special namespaces unique to each recognizer object. These special namespaces belong entirely to the user.

In the following namespaces, users should use only documented methods:

Parse::Marpa
Parse::Marpa::Grammar
Parse::Marpa::Lex
Parse::Marpa::MDL
Parse::Marpa::Recognizer
Parse::Marpa::Evaluator
Parse::Marpa::Read_Only

Marpa namespaces and variables not mentioned in this section, should not be relied on or modified. Users should use variables in the the Parse::Marpa::Read_Only namespace on a read-only basis. Staying read-only can be tricky when dealing with Perl 5 arrays and hashes. Be careful about auto-vivification!

The $STRING and $START variables are made available to the lex actions. They are also on a read-only basis, except as described in the documentation.

Returns and Exceptions

Most Marpa methods return only if successful. On failure they throw an exception using croak(). If you don't want the exception to be fatal, catch it using eval. A few failures are considered "non-exceptional" and returned. Non-exceptional failures are described in the documentation for the method which returns them.

METHODS

mdl

$first_result =
    Parse::Marpa::mdl( \$grammar_description, \$string_to_parse );

@all_results
    = Parse::Marpa::mdl(\$grammar_description, \$string_to_parse);

$first_result = Parse::Marpa::mdl(
    \$grammar_description,
    \$string_to_parse,
    { warnings => 0 }
);

The mdl static method takes three arguments: a reference to a string containing an MDL description of the grammar; a reference to a string with the text to be parsed; and (optionally) a reference to a hash with options. The available options are described below.

In scalar context, mdl returns a reference to the value of the first parse. In list context, mdl returns a list of references to the values of the parses. If there are no parses, mdl returns undefined in scalar context and the empty list in list context.

Diagnostic Methods

The separate document on diagnostics deals with methods for debugging grammars and parses.

OPTIONS

Marpa has options which control its behavior. These may be set using named arguments when Parse::Marpa::Grammar and Parse::Marpa::Recognizer objects are created, with the Parse::Marpa::Grammar::set method, and with the Parse::Marpa::mdl static method. Except as noted, a recognizer object inherits the Marpa option settings of the grammar from which it was created, and an evaluator object inherits the Marpa option settings of the recognizer from which it was created.

Options for debugging and tracing are described in a separate document on diagnostics. The other Marpa options are listed below, by argument name, and described.

Porcelain interfaces have their own conventions for Marpa options. The documentation of MDL describes, and the documentation of all porcelain interfaces should describe, which options can be set through that interface, and how.

ambiguous_lex

Treats its value as a boolean. If true, ambiguous lexing is used. Ambiguous lexing means that even if a terminal is matched by a regex or a lex action, the search for other terminals at that location continues. If multiple terminals match, all the tokens found are considered for use in the parse. If the parse is ambiguous, it is possible that all the tokens will actually be used. Ambiguous lexing is the default. The ambiguous_lex option cannnot be changed after grammar precomputation.

If ambiguous_lex is false, Marpa does unambiguous lexing, which is the standard in parser generators. With unambiguous lexing, lexing at each location ends when the first terminal matches. The user must ensure the first terminal to match is the correct one. Traditionally, users have done this by making their lex patterns deterministic -- that is, they set their lex patterns up so that every valid input lexes in one and only one way.

Marpa offers users who opt for unambiguous lexing a second alternative. Terminals are tested in order of priority, and the priorities can be set by the user.

code_lines

If there is a problem with user supplied code, Marpa prints the error message and a description of where the code is being used. Marpa also displays the code. The value of code_lines tells Marpa how many lines of context to print. If it's negative, all the code is displayed, no matter how long it is. The default is 3 lines. The code_lines option can be changed at any point in a parse.

If the line with the problem cannot be determined, the first lines of code are printed, up to a maximum of twice code_lines, plus one.

default_action

Takes as its value a string, which must be code in the current semantics. (Right now Perl 5 is the only semantics available.) This value is used as the action for rules which have no explicitly specified action. If default_action is not set, the default action is to return a Perl 5 undefined. default_action cannot be changed once a recognizer object has been created.

default_lex_prefix

The value must be a regex in the current semantics. (Right now Perl 5 is the only semantics available.) The lexers allow every terminal to specify a lex prefix, a pattern to be matched and discarded before the pattern for the terminal itself is matched. Lex prefixes are often used to handle leading whitespace. default_lex_prefix cannot be changed once a grammar is precomputed.

If a terminal has no lex prefix set, default_lex_prefix is used. When default_lex_prefix is not set, the default lex prefix is equivalent to a regex which matches only the empty string.

default_null_value

The value must be an action, that is, a string containing code in the current semantics. (Right now Perl 5 is the only semantics available.) The null value of a symbol is the symbol's value when it matches the empty string in a parse. Null symbol values are calculated when a recognizer object is created. default_null_value cannot be changed after that point.

Symbols with an explicitly set null action use the value returned by that explicitly set action. Otherwise, if there is a default_null_value action, that action is run when the recognizer is created, and the result becomes that symbol's null value. If there is no default_null_value action, and a symbol has no explicitly set null action, that symbol's null value is a Perl 5 undefined. There's more about null values above and in "Null Symbol Values" in Parse::Marpa::Evaluator.

lex_preamble

The value must be a string which contains code in the current semantics. (Right now Perl 5 is the only semantics available.) The lex preamble is run when the recognizer object is created, in a namespace special to the recognizer object. A lex preamble may be used to set up globals. The lex preamble of a recognizer object cannot be changed after the recognizer object has been created.

If multiple lex preambles are specified as named arguments, the most recent lex preamble replaces any earlier one. This is consistent with the behavior of other named arguments, but it differs from the behavior of MDL, which creates a lex preamble by concatenating code strings.

max_parses

The value must be an integer. If it is greater than zero, evaluators will return no more than that number of parses. If it is zero, there will be no limit on the number of parses returned by an evaluator. The default is for there to be no limit. max_parses can be changed at any point in the parse.

Grammars for which the number of parses grows exponentially with the length of the input are common, and easy to create by mistake. This option is one way to deal with that.

online

A boolean. If true, Marpa runs in online or streaming mode. If false, Marpa runs in offline mode. The online option cannot be changed after a recognizer is created. Offline mode is the default, and the only mode supported in this release.

In offline mode, when the first evaluator is created from a recognizer, Marpa assumes that input to the recognizer has ended. The recognizer does some final bookkeeping, and refuses to accept any more input.

preamble

The value must be a string which contains code in the current semantics. (Right now Perl 5 is the only semantics available.) The preamble is run when the evaluator object is created, in a namespace special to the evaluator object. Rule actions and null symbol actions also run in this namespace. A preamble may be used to set up globals. The preamble of a recognizer object cannot be changed after the recognizer object has been created.

If multiple preambles are specified as named arguments, the most recent preamble replaces any earlier one. This is consistent with the behavior of other named arguments, but it differs from the behavior of MDL, which creates a preamble by concatenating code strings.

semantics

The value is a string specifying the type of semantics used in the semantic actions. The current default, and the only available semantics at this writing, is perl5. The semantics option cannot be changed after the grammar is precomputed.

trace_file_handle

The value is a file handle. Warnings and trace output go to the trace file handle. By default it's STDERR. The trace file handle can be changed at any point in a parse.

version

If present, the version option must match the current Marpa version exactly. This is because while Marpa is in alpha, features may change dramatically from version to version. The version option cannot be changed after the grammar is precomputed.

warnings

The value is a boolean. If true, it enables warnings about inaccessible and unproductive rules in the grammar. Warnings are written to the trace file handle. By default, warnings are on. Turning warnings on after grammar precomputation is useless, and itself results in a warning.

Inaccessible and unproductive rules sometimes indicate errors in the grammar design. But a user may have plans for these rules, may wish to keep them as notes, or may simply wish to deal with them later.

IMPLEMENTATION NOTES

Exports and Object Orientation

Marpa exports nothing by default, and allows no optional exports. Use of object orientation in Marpa is superficial. Only grammars, recognizers and evaluators are objects, and they are not designed to be inherited.

Speed

Speed seems very good for an Earley's implementation. Current performance limits are more often a function of the lexing than of the Marpa parse engine.

Ambiguous Lexing

Ambiguous lexing has a cost, and grammars which can turn ambiguous lexing off can expect to parse twice as fast. Right now when Marpa lexes with multiple regexes at a single location, it uses a series of Perl 5 regex matches, one for each terminal.

There may be a more efficient way to find all the matches in a set of alternatives. A complication is that Marpa does predictive lexing, so that the list of lexables is not known until just before the match is attempted. But I believe that lazy evaluation and memoizing could have big payoffs in the cases of most interest.

The Marpa Demonstration Language

The Marpa Demonstration Language was written to demonstrate Marpa's capabilities, including its lexing capabilities. A porcelain interface which doesn't use ambiguous lexing could easily run faster. One with a customized lexer would be faster yet.

If MDL's parsing speed becomes an issue for a particular grammar, that grammar can be precompiled. Subsequent runs of the precompiled grammar don't incur the overhead of either MDL parsing or precomputation. Marpa uses precompilation internally. When you use MDL to specify a grammar to Marpa, Marpa uses a precompiled grammar to parse the MDL.

Comparison with other Parsers

In thinking about speed, it is helpful to keep in mind Marpa's place in the parsing food chain. Marpa parses grammars that bison, yacc, Parse::Yapp, and Parse::RecDescent cannot parse. For these, Marpa is clearly faster. When it comes to time efficiency, never is not hard to beat.

Marpa allows grammars to be expressed in their most natural form. It's ideal where programmer time is important relative to running time. Right now, special-purpose needs are often addressed with regexes. This works wonderfully if the grammar involved is regular, but thousands of man-years have been spent trying to shoehorn non-regular grammars into Perl 5 regexes.

Marpa is a good alternative to parsers that backtrack. Marpa finds every possible parse the first time through. Backtracking is a gamble, and one often made against the odds.

Some grammars have constructs to control backtracking. This control comes at a high price. Solutions with these constructs built into them are as unreadable as anything in the world of programming, and fragile in the face of change to boot.

If you know your grammar will be LALR or regular, that is a good reason not to use Marpa. When a grammar is LALR or regular, Marpa takes advantage of this and runs faster. But such a grammar will run faster yet on a parser designed for it: bison, yacc and Parse::Yapp for LALR; regexes for regular grammars.

Finally, there are the many situations when we want to do some parsing as a one-shot and don't want to have to care what subcategory our grammar falls in. We want to write some quick BNF, do the parsing, and move on. For this, there's Marpa.

DEPENDENCIES

Requires Perl 5.10. Users who want or need the maturity and/or stability of Perl 5.8 or earlier are probably also best off with more mature and stable alternatives to Marpa. Marpa only uses modules that are part of its own distribution, or Perl's.

AUTHOR

Jeffrey Kegler

Why is it Called "Marpa"?

Marpa is the name of the greatest of the Tibetan "translators". In his time (the 11th century AD) Indian Buddhism was at its height. A generation of scholars was devoting itself to producing Tibetan versions of Buddhism's Sanskrit scriptures. Marpa became the greatest of them, and today is known as Marpa Lotsawa: "Marpa the Translator".

Translation in the 11th century was not a job for the indoors type. A translator needed to study in India, with the teachers who had the texts and could explain them. From Marpa's home in Tibet's Lhotrak Valley, the best way across the Himalayas to India was over the Khala Chela Pass. To reach the Khala Chela's three-mile high summit, Marpa had to cross two hundred lawless miles of Tibet. Once a pilgrim crested the Himalayas, the road to Nalanda University was all downhill. Eager to reach their destination, the first travelers from Tibet had descended the four hundred miles straight to the hot plains.

The last part of the journey had turned out to be by far the most deadly. Almost no germs live in the cold, thin air of Tibet. Pilgrims who didn't stop to acclimatize themselves reached the great Buddhist center with no immunity to India's diseases. Several large expeditions reached Nalanda only to have every single member die within weeks.

Blatant Plug

There's more about Marpa in my novel, The God Proof, in which his studies, travels and adventures are a subplot. The God Proof centers around Kurt Gödel's proof of God's existence. Yes, that Kurt Gödel, and yes, he really did work out a God Proof (it's in his Collected Works, Vol. 3, pp. 403-404). The God Proof is available as a free download (http://www.lulu.com/content/933192) and in print form at Amazon.com: http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355.

BUGS

End of Line Comment Cannot be Last in MDL Source

A Perl style end of line comment cannot be last thing in MDL source. Workaround: Add a blank line.

What! You Found More Bugs!

Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Parse::Marpa

You can also look for information at:

ACKNOWLEDGMENTS

Marpa is derived from the parser described in Aycock and Horspool 2002. I've made significant changes to it, which are documented separately (Parse::Marpa::Doc::Algorithm). Aycock and Horspool, for their part, built on the algorithm discovered by Jay Earley.

I'm grateful to Randal Schwartz for his encouragement over the years that I've been working on Marpa. My one conversation with Larry Wall about Marpa was brief and long ago, but his openness to the idea was a major encouragement, and his insights into how humans do programming, how they do languages, and how those two endeavors interconnect, a major influence. More recently, Allison Randal and Patrick Michaud have been generous with their very valuable time. They might have preferred that I volunteered as a Parrot cage-cleaner, but if so, they were too polite to say.

Many at perlmonks.org answered questions for me. I used answers from chromatic, Corion, dragonchild, jdporter, samtregar and Juerd, among others, in writing this module. I'm just as grateful to those whose answers I didn't use. My inquiries were made while I was thinking out the code and it wasn't always 100% clear what I was after. If the butt is moved after the round, it shouldn't count against the archer.

In writing the Pure Perl version of Marpa, I benefited from studying the work of Francois Desarmenien (Parse::Yapp), Damian Conway (Parse::RecDescent) and Graham Barr (Scalar::Util). Adam Kennedy patiently instructed me in module writing, both on the finer points and on issues about which I really should have know better.

LICENSE AND COPYRIGHT

Copyright 2007-2008 Jeffrey Kegler, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.