NAME

Parse::Marpa - (pre-Alpha) Jay Earley's general parsing algorithm, with LR(0) precomputation

BEWARE: THIS DOCUMENT IS UNDER CONSTRUCTION AND VERY INCOMPLETE

VERSION

This is Pre-alpha software.

It's strictly a developer's version. Nothing useful will be found here, and the documentation is also inchoate. Those not developing this module will want to wait for at least a released, beta version.

SYNOPSIS

UNDER CONSTRUCTION ...

The Easy Way

It's possible to specify the grammar and the text to be parsed all in one step

use Parse::Marpa;
my $value = Parse::Marpa::marpa(\$grammar, \$text_to_parse);

and you can even include options if you make a hash ref the third argument.

my $value = Parse::Marpa::marpa(
    \$grammar,
    \$text_to_parse
    {
        warnings => 1,
        ...
    }
);

You can get all the values of an ambiguous parse by invoking Parse::Marpa::marpa in list context.

my @values = Parse::Marpa::marpa(\$ambiguous_grammar, \$text_to_parse);

Exercising more control

You can also proceed step by step. For example, ...

use Parse::Marpa;

my @tests = split(/\n/, <<'EO_TESTS');
time  / 25 ; # / ; die "this dies!";
localtime  / 25 ; # / ; die "this dies!";
EO_TESTS

First creating the grammar object ...

my $g = new Parse::Marpa(
    warnings => 1,
    code_lines => -1,
);

Then setting the grammar ...

my $mini_perl_grammar; { local($RS) = undef; $mini_perl_grammar = <DATA> };
$g->set( source => \$mini_perl_grammar);

And then as many times as you like ...

TEST: while (my $test = pop @tests) {

Create a parse object ...

my $parse = new Parse::Marpa::Parse($g);

Pass text to the recognizer ...

$parse->text(\$test);

Set an initial parse ...

$parse->initial();
my @parses;
push(@parses, $parse->value);

Get others, if there are any ...

while ($parse->next) {
    push(@parses, $parse->value);
}

And then go on with your other processing ...

    say "I've got ", scalar @parses, " parses:";
    for (my $i = 0; $i < @parses; $i++) {
        say "Parse $i: ", ${$parses[$i]};
    }
}

__DATA__
...

DESCRIPTION

Parse::Marpa parses text given an arbitrary context-free grammar.

  • The grammar may be anything which can be specified in BNF.

  • This includes left-recursive grammars, right-recursive grammars, grammars with empty productions, grammars with cycles and ambigious grammars.

  • In fact, ambiguous grammars are a Marpa specialty. They are useful even when you never want more than one parse. Human languages have an ambiguous grammars, from which the listener pulls the parse that makes most sense. An ambiguous grammar is often the easiest and most sensible way to express a language. Marpa allows the user to prioritize rules so that the preferred parse comes up first.

  • Of course, you can also have Marpa return all the parses of an ambiguous grammar.

  • Marpa allows predictive lexing and uses LR(0) precomputation. It incorporates the latest academic research for speeding up Earley's algorithm, along with further innovations.

BEWARE! PRE-ALPHA SOFTWARE

Since this is pre-alpha software, users with immediate needs have to look elsewhere.

I've no personal experience with them, but Parse::Yapp and Parse::RecDescent are alternatives to this module which are well reviewed and much more mature and stable.

What to expect once Marpa goes alpha

The alpha version will be intended to let people look Marpa over and even try it out. Uses beyond that are risky.

There will be bugs and misfeatures when I go alpha, but (I hope) no show-stoppers and all with good workarounds. The documentation follows the industry convention of telling the user how Marpa should work. If there's a difference between that and how Marpa actually works currently, it's in the Bugs section, which you'll want to skim before using Marpa.

While Marpa is in alpha, you may not want to automatically upgrade as new versions come out. Versions will often be incompatible. To emphasize this, the version option is required in MDL, and it must match Marpa's version number exactly. That's a hassle, but so is working with alpha software. The version number regime will become less harsh before Marpa leaves beta.

Obviously, while Marpa is in alpha, you won't want to use it for anything with a serious deadline or mission-critical.

IMPORTANT CONCEPTS

The Parse::Marpa::CONCEPTS should be read before actually using Marpa, in fact even before your first careful reading of this document. The "concepts" in it are practical -- the math and the more purely theoretical discussions went into Parse::Marpa::ALGORITHM. Even experts in Earley parsing will want to skim Parse::Marpa::CONCEPTS, because, as one example, the use of ambiguous lexing has unusual implications for term "token".

THE EASY WAY

Parse::Marpa::marpa(grammar, text_to_parse, option hash);

The marpa() method takes three arguments: a reference to a string containing a Marpa source description of the grammar; a reference to a string with the text to be parsed; and (optionally) a reference to a hash with options. If there are no parses, marpa() returns failure. If there are parses, and marpa() was called in list context, it returns a list of references to the values of all the parses. If there are parses, and marpa() was called in scalar context, it returns a reference to the value of the first parse.

The description referenced by the grammar argument must use one of the high-level Marpa grammar interfaces. Currently the default (and only) high-level grammar interface is the Marpa Demonstration Language.

METHODS FOR FINER CONTROL

new Parse::Marpa(option => value, option => value, ...)

The new method takes a series of arguments which are treated as a hash with options as keys and the option values as the hash values. new either throws an exception with croak or returns a new grammar object. For the valid options see the options section.

new Parse::Marpa::Parse(option => value, [option => value, ...])

Parse::Marpa::Parse::new takes as its arguments a series of option name, option value pairs which are handled as a hash. It returns a new parse object or throws an exception using croak.

One of the options must be the grammar option, and it must have as its value a grammar object with rules in it. Any options Parse::Marpa::Parse::new does not directly handle are treated as options for this grammar. See their description in the description the grammar constructor, Parse::Marpa::new, above. The options are documented <below|/"Options">.

Parse::Marpa::Parse::text(parse, text)

Extends the parse in the parse object with the input text, a reference to a string. Returns -1 if the parse is still active after the entire input text has been processed. Otherwise the offset of the character where the parse was exhausted is returned. Failures, other than exhausted parses, are reported as exceptions thrown using croak.

The text is parsed treating each character as an earleme, and using the lexers specified in the source file, or with the raw interface.

The character offset where the parse was exhausted is reported as characters from the start of text. The first character is at offset zero. A zero return from text(), therefore, indicates that the parse was exhausted at the first character.

A parse is "exhausted" at a point in the input where no successful parses are possible. In most contexts and with most applications, an exhausted parse is treated as a failed parse.

Parse::Marpa::show_location(message, text, offset)

message must a string, text a reference to a string, and offset, a character offset within that string. show_location returns a multi-line string with a header line containing message, the line from text containing offset, and a "pointer" line. The pointer line uses the ASCII "caret" symbol to point to the exact offset.

Parse::Marpa::Parse::initial(parse, parse_end)

Performs the recognition phase on a parse. On success, initial() returns a value of 1. The user may then get values of the parse with Parse::Marpa::Parse::value, and interate it with Parse::Marpa::Parse::next.

initial() returns undef if it fails to find a parse. Other failures are thrown as exceptions.

The parse_end argument is optional. If provided, it must be an earleme number specifying where the end of the parse is to be sought. In the standard case, a successful parse in offline mode, the default is to parse to the end of the input, which is usually what the user wants.

In case of an exhausted parse, the default is the point at which the parse was exhausted. Frankly, most of the time that won't be very helpful. An exhausted parse means either that the parse has failed or that a user is using some advanced wizardry with grammars. A failed parse is usually addressed by fixing the grammar or the input, but if the user does want to attempt error recovery, the Parse::Marpa::Parse::find_complete_rule() method may help.

At this point online mode is also bleeding-edge wizardry. It is not well tested. In online mode there is no obvious "end of input". At this moment Marpa doesn't provide a lot of tools for working with online mode. It's up to the user to determine where to look for parses, perhaps using her specific knowledge of the grammar and the problem space.

Parse::Marpa::Parse::next(parse)

Takes an parse object, which must have already been evaluated once with Parse::Marpa::Parse::initial() and performs the next evaluation. Returns 1 if there was another evaluation, undef if there are no more values for this initialization of this parse object. Other failures are thrown as croak() exceptions.

Parses are returned from rightmost to leftmost, but their order may be manipulated by assigning priorities to the rules and to terminals.

Parse::Marpa::Parse::value(parse)

Takes a parse object, which must have been set up with Parse::Marpa::Parse::initial() and possibly iterated with Parse::Marpa::Parse::next() and returns a reference to its current value. Any failures are thrown as croak() exceptions.

In some unusual cases, the return value may be undef, which is considered a Marpa "no value". A "no value" is the value of a rule for which no value was ever calculated. This is not the same as a Perl undefined, which would be returned as a reference to an undefined value.

"No values" are rare. Defaults, nulling rules, and non-existent optional items all result in Perl undefineds, which are considered "calculated values" and value() will return these as a reference to an undefined. A no value will usually be the result of advanced parsing wizardry gone wrong.

LESS USED METHODS

Some of these methods explicitly put the grammar through processing phases which Marpa typically does implicitly, as necessary. For example, when a new parse object is created from a grammar which has not been throught the precomputation phase, that grammar is automatically precomputed, and then deep copied with a compile and a decompile.

A user who wants to trace Marpa's behavior during a processing phase may find it advantageous to run that phase explicitly. Similarly, a user who wants to run diagnostics on the results of a processing phase may wish to have that processing phase run as the result of an explicit method call.

Parse::Marpa::compile(grammar) or $grammar->compile()

The compile method takes as its single argument a grammar object, and "compiles" it, that is, writes it out using Data::Dumper. It returns a reference to the compiled grammar, or throws an exception using croak.

compile() is usually called indirectly by Marpa when a new parse object is created.

Parse::Marpa::decompile(compiled_grammar, [trace_file_handle])

The decompile static method takes a reference to a compiled grammar as its first argument. A second, optional, argument is a file handle both to be used to override the compiled grammar's trace file handle, and for any trace messages compile from decompile() itself. decompile() returns the decompiled grammar object unless it throws an exception using croak.

If decompile() doesn't get the trace file handle explicitly as an argument, the new grammar's trace file handle will revert to the default STDERR. The trace file handle argument is needed because in the course of compilation, the grammar's original trace file handle may have been lost. For instance, if the compiled grammar is written to a file by one process, then read into a second one for decompilation, and the original trace file handle was not STDOUT or STDERR, it's very unlikely to be available in the second process.

When Marpa compiles and decompiles and grammar implicitly, in order to deep copy it, it save the trace file handle and restores it using the trace file handle argument to decompile().

Parse::Marpa::precompute(grammar) or $grammar->precompute()

Takes as its only argument a grammar object and performs the precomputation phase on it. It returns the grammar object or throws an exception using croak.

precompute() is called internally by Marpa when a new parse object is created, and usually the user does not need to call it directly. Cases where the user might find it helpful or necessary to run precompute() directly include running traces on the precomputation; performing diagnostics on the grammar in its precomputed state; or when the user wants to compiling the grammar in his own custom code, instead of using the marpa utility's compile command for that purpose.

OPTIONS

These are the options recognized by the Parse::Marpa::new(), Parse::Marpa::Parse::new(), and Parse::Marpa::set() methods.

grammar

Takes as its value a grammar object. Only valid as an option to Parse::Marpa::Parse::new(). There's no default.

default_action

Takes as its value a string, which is expected to be Perl 5 code. This is used as the action for rules where no action is specified. Defaults to the "non-action" of having the rule return undefined.

ambiguous_lex

Treats its value a boolean. If true, ambiguous lexing is used. This means that even if a terminal in found with a closure or a regex, the search for other terminals at that location continues. If multiple terminals match, all the tokens found are considered in the parse and all may end up being used if the parse is ambiguous.

If false, lexing at a location ends with the first terminal matched, It is up to the user to ensure that lexing is deterministic. (Imposing this requirement on the grammar design is the standard in modern parser generators, and Marpa's option of ambiguous lexing is unusual.) Defaults to ambiguous lexing.

volatile
online

A boolean. If its value is true, the parser runs in online mode. The default is offline mode. In <offline> mode, Marpa assumes the input has ended when the first parse is requested. It does some final bookkeeping in the Earley sets, refuses to accept any more input, and sets its defaults to parse the entire input from beginning to end.

In online mode, which is under construction and poorly tested, new tokens may be added at any point and final bookkeeping is never done. Marpa's default idea is still to parse the entire input up to the current earleme, but it's much less clear that's what the user wants. And if it's not, it up to her has to determine the right places to look for complete parses, based on her knowledge of the structure of the grammar and the input with help from routines supplied for this purpose with Marpa.

source

This takes as its value a reference to a string containing a description of the grammar in the Marpa grammar description language.

default_null_value
default_lex_prefix
version
semantics

These predefineds may be specified as options. If a Marpa predefined is specified both in a source file and as an option, the option overrides the value in the source file. For descriptions of these options, see the documentation for the Marpa grammar description language.

ambiguous_lex
volatile

These are parse and evaluation time options. Setting them in the grammar makes that setting the default for all parses generated from that grammar.

warnings

This is a boolean which enables warnings about inaccessible and unproductive rules in the grammar. Warnings are written to the trace file handle. By default, warnings are on.

Inaccessible rules can those which can never be reached from the start symbol. Unproductive rules are those which no possible input can ever generate. Marpa simply ignores these and generates a parser from the remaining rules.

Often the presence of inaccessible and unproductive rules indicate problems in the grammar. But a user sometimes want to leave them while the grammar is under construction or use them as notes.

code_lines

If there is a problem with user supplied code, Marpa prints the error message and a description of where the code is being used. Marpa will display the code itself as well. The value of this option tells Marpa how many lines to print before truncating the code. If it's zero, no code is displayed. If it's negative, all the code is displayed, no matter how long it is. The default is 30 lines.

preamble

Another predefined which may also be specified as an option. Note that while multiple preambles may be specified in a source file, and they are concatenated, if the preamble option is used it overrides, replacing all the preamble code assembled so far. For a description of preambles, see the documentation for the Marpa grammar description language.

IMPLEMENTATION NOTES

String references

It's often said by those experienced in Perl that passing string refs instead of strings is a pointless and usually counter-productive optimization. I agree. But I believe that Marpa is an exception. Many strings are passed and returned by reference. These are situations where the strings can be very long. Copying and recopying them would be a major waste of time.

Object Orientation

Use of object orientation in Marpa is superficial. Only grammars and parses are objects, and they are not designed to be inherited.

Returns and exceptions

Most Marpa methods return only on success and throw an exception using croak() if there's a failure. If you don't want an exception to be fatal, catch it using eval. Failures are returned only where they are "non-exceptional".

An example of a "non-exceptional" failure is an exhausted parse. Exhausted parses are usually parse failures, though the user may also be doing some advanced wizardry. Even parse failures are often routine -- the user may be testing texts to see if they parse. So <Parse::Marpa::Parse::text()> returns on an exhausted parse, rather than throw an exception.

An even better example is a failed parse in <Parse::Marpa::Parse::next()>. This method is used to iterate through the multiple parses of an ambiguous parse. Its eventual failure, that is, inability to find another parse, is almost always expected and planned for. It returns this failure, rather than throw an exception.

For all methods, any returned failures are specified in the detailed description of the method.

AUTHOR

Jeffrey Kegler

DEPENDENCIES

Requires Perl 5.10. Users who want or need the maturity and/or stability of Perl 5.8 or earlier probably are also best off with more mature and stable alternatives to Marpa.

LIMITATIONS

Speed

As far as I've tested it, speed is remarkably good for an Earley's implementation. In fact, the current bottlenecks seem not to be in the Marpa parse engine, but in the lexing, and in the design of the Marpa Demonstration Language.

Ambiguous Lexing and Speed

Ambiguous lexing has a cost, and grammars which can turn ambiguous lexing off can expect to parse twice as fast. Right now when Marpa tries to lexing multiple regexes at one location, it does so with several regex matches.

I plan to investigate whether there's a more efficient way to have Perl 5 return all of the matches in a set of multiple possibilities. A complication is that since Marpa does predictive lexing, the possibilities are not known until the match is about to be attempted, so precompilation is not possible. I expect that the searches for exactly the same sets of terminals occur many times in a typical parse, and lazy evaluation and memoizing might pay off big

The Marpa Demonstration Language and Speed

The Marpa Demonstration Language was designed to show off Marpa's power, not necessarily to run quickly. This meant that where a feature's usefulness came at a relatively cost in efficiency, I kept it anyway if I thought it demonstrated an important capability of Marpa.

Some General Comments about Speed and Parsers

Potential users of Marpa need to be aware of where Marpa stands in the efficiency battle. If you rewrite a grammar to be LALR grammar, it would parse faster in Marpa, but it would parse faster yet with bison or yacc, and I'd expect with Parse::Yapp as well. And certainly a regular expression will certainly parse faster using regex'es.

When it comes to regexes that use backtracking, however, Marpa may surprise. Marpa can parse any context-free grammar and never needs to backtrack. Backtracking is horrible in worst cases, and in real life is subject to speed meltdowns.

Also worth considering is that if you do come up with the perfect backtracking strategy for your regex, the resulting regex will most likely be unreadable, even with extensive commments, and the strategy will almost certainly break with even a small change in the regex.

The same problem could be described to Marpa in the terms that seem most naturally you. Where and how any changes must be made is likely to be evident, and the speed of any changed grammar is much more likely to be about the same.

BUGS AND MISFEATURES

Options code poorly organized and probably buggy

My strategy for dealing with options to the method calls evolved as I created them and the code shows that. Most options are only valid at certain points in the parsing, but this is haphazardly enforced and poorly documented. There are probably some just plain ol' bugs. The options code needs to be cleaned up.

MDL hardwiring

The assumption that the MDL is the high-level grammar interface in use is hardwired into get_symbol. This will be addressed before going beta.

Timing of Semantics Finalization

All Semantics should be finalized in the next phase after the creation of the Parse object. This is mainly true now, but the value of null objects (for example) is calculated while rules are being added. I need to change Marpa so that all semantics are computed in the phase after creation of the parse object.

Non-intuitive Parse Order in Unusual Cases

This problem occurs when a single production has more than two nullable symbols on the right hand side, so that it is ambiguous, and the order of the parses matters. Most practical grammars don't seem to be affected by this, because it's unusual to use ambiguity within a single rule in the semantics in a way that's impacted. There is a very straightforward workaround where the issue does arise. But the problem needs to be fixed, certainly before Marpa goes beta.

The problem occurs because these productions are rewritten internally by CHAF. A rightmost parse comes first, but it is rightmost for the grammar as rewritten by CHAF. CHAF rewritting is supposed to be invisible, the order is not intuitive, and is probably not the useful choice. Priorities are not a workaround, because priorites cannot be set for rules within a CHAF rewrite.

Workaround: Rewrite the rule for which this parse ordering is a problem. The problem only occurs where a rule is rewritten in CHAF form, and CHAF rewrites are only done to rules with more than two nullables on the right hand side. It is always possible to break up a rule so that at most two nullables occur on the right hand side.

Perl style comments not recognized in some places

Perl style comments are not recognized just before literal regexes and q- and qq-quoted literal strings. The problem is that these are treated as whitespace, whitespace in implemented with lex prefixes, and Marpa internal lexing routines don't implement prefixes. This needs to be fixed.

Workaround: Move the Perl 5 style comment nearby.

Priorities cannot be set in MDL for terminals

Priorities cannot be set in MDL for terminals. I haven't actually needed them, but someone is bound to want them. Fix this before going beta.

Workaround: Use the priorities for rules. As mentioned, I've not found it necessary, but extra rules can be added to simulate terminal priorities.

What! You found even more bugs!

Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Parse::Marpa

You can also look for information at:

ACKNOWLEDGMENTS

Marpa is the parser described in John Aycock and R. Nigel Horspool's "Practical Earley Parsing", The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630. I've made significant changes to it, which are documented separately (Parse::Marpa::ALGORITHM). Aycock and Horspool, for their part, built on the algorithm discovered by Jay Earley, and described in his "An efficient context-free parsing algorithm", Communications of the Association for Computing Machinery, 13:2:94-102, 1970.

In writing the Pure Perl version of Marpa, I benefited from studying the work of Francois Desarmenien (Parse::Yapp), Damian Conway (Parse::RecDescent) and Graham Barr (Scalar::Util).

COPYRIGHT & LICENSE

Copyright 2007 Jeffrey Kegler, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.