NAME
Parse::Marpa::Doc::Algorithm - The Marpa Algorithm
DESCRIPTION
Marpa is derived from the parser described by John Aycock and R. Nigel Horspool. Aycock and Horspool combined LR(0) precomputation with the general parsing algorithm described by Jay Earley in 1970.
Marpa combines many parsing techniques, and a full description would be a small textbook. For now, I have to content myself with briefly describing the more significant ideas new with Marpa. In this document I assume that the reader is familiar with parsing in general and understands terms like Earley set and Earley item. It will be helpful to have read the other Marpa documents, up to and including Parse::Marpa::Doc::Internals.
All claims of originality are limited by my ignorance. The parsing literature is large, and I may have been preceded by someone whose work I'm not aware of. Also, as readers familiar with the academic literature will know, a claim that combining A with B is original with Marpa is not to be read as a claim that either A or B is original with Marpa.
A New Recognition Engine
Marpa takes the recognition engine as described by Aycock and Horspool and turns it inside out and upside down. In the Aycock and Horspool version of Earley's, the main loop iterates over each item of an Earley set, first scanning for tokens, then completing items in the set. Marpa turns this into two separate loops over the Earley items, the first of which completes the current Earley set, and the second of which scans for tokens.
Predictive Lexing and Earley Parsing
The advantage of this double inversion is that the lexical analysis can be put between the two loops -- after completion, but before scanning. This makes predictive lexing possible. Being able to predict which lexables are legal at a given point in the parse can save a lot of processing, especially if the lexing is complex. Any lexemes which are not predicted by items in the current Earley set do not need to be scanned for. Now that Aycock and Horspool have sped up Earley's algorithm, the time spent lexing is a significant factor in overall speed. Predictive lexing can reduce lexing time to a fraction of the original.
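To make the inverted order of operations concrete, here is a minimal sketch of a recognizer organized this way: one loop which completes and predicts over the current Earley set, then lexing restricted to the predicted terminals, then a second loop which scans. The toy grammar, the item representation and all the names are simplifications invented for this sketch, and it omits the Aycock-Horspool state precomputation that the real engine relies on.

    use strict;
    use warnings;

    # Toy grammar: S -> S '+' N ; S -> N ; N -> 'n'
    my @rules = (
        [ 'S', [ 'S', '+', 'N' ] ],
        [ 'S', [ 'N' ] ],
        [ 'N', [ 'n' ] ],
    );
    my %is_terminal = ( '+' => 1, 'n' => 1 );
    my @input = qw(n + n);    # one token per Earley set, for simplicity

    # An Earley item is [ rule index, dot position, origin set ].
    my @sets = ( [] );
    my %seen;                 # dedup key: "set:rule:dot:origin"

    sub add_item {
        my ( $set, $rule, $dot, $origin ) = @_;
        return if $seen{"$set:$rule:$dot:$origin"}++;
        push @{ $sets[$set] }, [ $rule, $dot, $origin ];
    }

    # Seed Earley set 0 with the start rules.
    add_item( 0, $_, 0, 0 ) for grep { $rules[$_][0] eq 'S' } 0 .. $#rules;

    for my $i ( 0 .. scalar @input ) {
        $sets[$i] ||= [];

        # Loop 1: complete (and predict) over set $i.  New items land
        # in this same set, so iterate until it stops growing.
        for ( my $n = 0; $n < @{ $sets[$i] }; $n++ ) {
            my ( $rule, $dot, $origin ) = @{ $sets[$i][$n] };
            my $next = $rules[$rule][1][$dot];    # symbol after the dot
            if ( not defined $next ) {            # item is complete
                for my $parent ( @{ $sets[$origin] } ) {
                    my ( $prule, $pdot, $porigin ) = @$parent;
                    my $psym = $rules[$prule][1][$pdot];
                    add_item( $i, $prule, $pdot + 1, $porigin )
                        if defined $psym and $psym eq $rules[$rule][0];
                }
            }
            elsif ( not $is_terminal{$next} ) {   # predict the nonterminal
                add_item( $i, $_, 0, $i )
                    for grep { $rules[$_][0] eq $next } 0 .. $#rules;
            }
        }

        last if $i >= @input;

        # Between the loops: predictive lexing.  Only terminals
        # expected by an item in set $i need to be looked for.
        my %expected;
        for my $item ( @{ $sets[$i] } ) {
            my $next = $rules[ $item->[0] ][1][ $item->[1] ];
            $expected{$next} = 1 if defined $next and $is_terminal{$next};
        }
        my $token = $input[$i];
        next if not $expected{$token};    # token is not legal here

        # Loop 2: scan the token into the next set.
        for my $item ( @{ $sets[$i] } ) {
            my ( $rule, $dot, $origin ) = @$item;
            my $next = $rules[$rule][1][$dot];
            add_item( $i + 1, $rule, $dot + 1, $origin )
                if defined $next and $next eq $token;
        }
    }

    # Accept if a completed start rule spans the whole input.
    my $accepted = grep {
        $rules[ $_->[0] ][0] eq 'S'
            and $_->[1] == @{ $rules[ $_->[0] ][1] }
            and $_->[2] == 0
    } @{ $sets[ scalar @input ] || [] };
    print $accepted ? "parse ok\n" : "no parse\n";

In this toy run the predictive check is visible in Earley set 1, where the only expected lexeme is the "+" token; nothing else would ever be scanned for there.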
Earlemes
Marpa allows ambiguous lexing, including recognition of lexemes of different lengths starting at the same lexical position, and recognition of overlapping lexemes. To facilitate this, Marpa introduces the earleme (named after Jay Earley). Previous implementations required the Earley parser's input to be broken up into tokens, usually by lexical analysis of the input using DFA's. (DFA's -- deterministic finite automata -- are the equivalent of regular expressions when that term is used in the strictest sense). Requiring that the first level of analysis be performed by a DFA hobbles a general parser like Earley's.
Marpa allows the scanning phase of Earley's algorithm to add items not just to the next Earley set, but to any later one. Alternative scans of the input can be put into the Earley sets, and the power of Earley's algorithm harnessed to deal with the nondeterminism.
Marpa does this by allowing each scanned item to have a length in earlemes. Call that length L. If the current Earley set is I, a newly scanned Earley item is added to Earley set I+L. From this point of view, the earleme is a unit of distance measured in Earley sets. In the same way that distances in miles can be seen as locations when they are associated with milestones, the individual Earley sets can also be seen as earleme milestones. The first Earley set is earleme zero. Subsequent Earley sets are called earleme 1, earleme 2, etc., based on their distance in earlemes from earleme 0.
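The arithmetic itself is trivial. Here is a minimal sketch, with invented token names and lengths, of two overlapping alternatives scanned from the same earleme:

    use strict;
    use warnings;

    my @sets;
    my $i = 7;    # the current earleme

    # Two overlapping readings of the same input position: say the
    # text "3.14", read either as the integer "3" (1 earleme,
    # measuring earlemes in characters) or as the float "3.14"
    # (4 earlemes).
    my @alternatives = ( [ 'integer', 1 ], [ 'float', 4 ] );

    for my $alternative (@alternatives) {
        my ( $token, $length ) = @$alternative;
        # A token of length L scanned at earleme I lands in set I + L.
        push @{ $sets[ $i + $length ] }, "item advanced over $token";
    }

    for my $earleme ( grep { defined $sets[$_] } 0 .. $#sets ) {
        print "earleme $earleme: @{ $sets[$earleme] }\n";
    }

Both readings coexist in the Earley sets; whichever of them the rest of the parse can use will survive, and Earley's algorithm sorts that out.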
An implementation can sync earlemes up with any measure that's convenient. For example, the distance in earlemes may be the length of a string, measured either in bytes or Unicode graphemes. An implementation may also mimic traditional lexing by defining the earleme to be the same as the distance in a token stream, measured in tokens.
Chomsky-Horspool-Aycock Form
Marpa's third significant change to the Aycock and Horspool algorithm is its internal rewriting of the grammar. Aycock and Horspool rewrote their grammars internally into what they called NNF (Nihilist Normal Form). Earley's original algorithm had serious issues with nullable symbols and productions, and the NNF rewrite fixes almost all of them. (A nullable symbol or production is one which derives the empty sentence.) Importantly, NNF also allows complete and easy mapping of the semantics of the original grammar to its NNF rewrite, so that NNF and the whole rewrite process can be made invisible to the user.
My problem with NNF is that the rewritten grammar is exponentially larger than the original in the theoretical worst case, and I just don't like exponential explosion, even as a theoretical possibility in pre-processing. I also think cases can arise in practice where the NNF size explosion is a real problem. One such case might be Perl 6 rules in which whitespace is significant but optional.
My solution is Chomsky-Horspool-Aycock Form (CHAF). This is Horspool and Aycock's NNF, but with the further restriction that no more than two nullable symbols may appear in any production. (The idea that any context-free grammar can be rewritten into productions of at most a small fixed size appears to originate, like so much else, with Noam Chomsky.) The shortened CHAF productions map back to the original grammar, so that, like NNF, the CHAF rewrite can be made invisible to the user. With CHAF, the theoretical worst case behavior is linear, and in those difficult cases likely to arise in practice the multiplier is much smaller.
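As a simplified illustration, take a production with three nullable symbols. The symbol names here are invented, and this shows only the factoring step; the actual rewrite also generates the NNF-style nulled and non-null variants of each piece and records the mapping back to the original rule's semantics.

    A ::= X Y Z      # X, Y and Z are all nullable: three nullables

CHAF breaks this into pieces, none of which has more than two nullables:

    A  ::= X A1      # two nullables: X and the new symbol A1
    A1 ::= Y Z       # two nullables: Y and Z

Here A1 is a new internal symbol, itself nullable because Y and Z both are. However long the original production, factoring it this way grows the grammar only linearly.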
Iterating Aycock-Horspool Parses
Aycock and Horspool give an algorithm for constructing a rightmost derivation from their version of the Earley sets. They suggest that in the case of multiple parses, their Earley sets could be iterated through, and they point out where the decision points occur in their algorithm. But they give no algorithm to do that iteration. Marpa's solution is new, as far as I know.
Precedence Parsing and Earley Parsing
In order to iterate through the parses, Marpa stores, with most of the Earley items, information about why each item was added. Marpa uses lists of "links" and tokens to do this, expanding on ideas from Aycock and Horspool. I noticed that the order of these lists controls the order of the parses, and realized that the order of those lists could be changed to anything useful. The order of the link and token lists might, in fact, be based on priorities assigned by the user to the rules and tokens of the grammar. These user-assigned priorities could then force the most desirable parse, instead of the rightmost, to be first in parsing order.
In effect, Marpa's rule and terminal priorities turn ambiguous parsing into a convenient form of precedence parsing. I don't believe Earley parsing and precedence parsing have been combined before.
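Here is a minimal sketch of the idea. The links, rule names and priorities are all invented for illustration, and a real link carries predecessor and cause information that this sketch leaves out.

    use strict;
    use warnings;

    # Each link records one way an ambiguous Earley item could have
    # been derived, tagged with a user-assigned priority for its rule.
    my @links = (
        { rule => 'expression ::= number',                    priority => 0 },
        { rule => 'expression ::= expression "*" expression', priority => 2 },
        { rule => 'expression ::= expression "+" expression', priority => 1 },
    );

    # Ordering the link list orders the parses: the evaluator tries
    # the first link first, so the highest-priority derivation
    # becomes the first parse in parsing order.
    my @ordered = sort { $b->{priority} <=> $a->{priority} } @links;
    print "$_->{rule}\n" for @ordered;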
Quasi-Deterministic Finite Automata (QDFA's)
Aycock and Horspool's precomputation uses what they call a split LR(0) ε-DFA. Not exactly the most transparent name, but a brilliant concept: if you left in some of the non-determinism when you transformed an NFA to an LR(0) DFA, you could use it to guide the Earley recognizer.
While the concept gave me an essential insight that I could never have come up with on my own, I found the split LR(0) ε-DFA awkward to program. A point came when I wanted to add some more non-determinism, and then I realized that not only would an ordinary NFA allow my change, but it would make the code shorter and better.
What Marpa does, therefore, is to transform the original NFA into a second NFA. The second NFA keeps the form of an NFA, but has most of the non-determinism taken out of it. It is almost an LR(0) DFA, so I call it a quasi-deterministic finite automaton, or QDFA.
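One way to picture the idea, reduced to a toy: a subset construction which deliberately declines to merge the targets of transitions over a chosen set of symbols, so that some states keep multiple successors on the same symbol. The automaton, the chosen symbol set and the representation below are all invented for illustration; this is a sketch of the quasi-deterministic idea, not Marpa's actual QDFA construction.

    use strict;
    use warnings;

    # An NFA as state => { symbol => [ target states ] }.
    my %nfa = (
        0 => { a => [ 1, 2 ], b => [3] },
        1 => { b => [3] },
        2 => { b => [4] },
        3 => {},
        4 => {},
    );
    my %keep = ( a => 1 );    # leave 'a'-transitions nondeterministic

    my %qdfa;    # "i,j" subset name => { symbol => [ successor names ] }
    my %done;
    my @work = ('0');
    while (@work) {
        my $state = shift @work;
        next if $done{$state}++;
        my %moves;    # symbol => { target NFA state => 1 }
        for my $nfa_state ( split /,/, $state ) {
            for my $symbol ( keys %{ $nfa{$nfa_state} } ) {
                $moves{$symbol}{$_} = 1
                    for @{ $nfa{$nfa_state}{$symbol} };
            }
        }
        for my $symbol ( keys %moves ) {
            my @targets = sort { $a <=> $b } keys %{ $moves{$symbol} };
            my @successors = $keep{$symbol}
                ? @targets                      # kept as separate states
                : ( join( ',', @targets ) );    # merged, DFA-style
            $qdfa{$state}{$symbol} = \@successors;
            push @work, @successors;
        }
    }

    for my $state ( sort keys %qdfa ) {
        for my $symbol ( sort keys %{ $qdfa{$state} } ) {
            print "$state --$symbol--> @{ $qdfa{$state}{$symbol} }\n";
        }
    }

In the output, the start state keeps two distinct successors over the symbol "a" -- the residual non-determinism -- while its other transitions are fully determinized.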
SUPPORT
See the support section in the main module.
AUTHOR
Jeffrey Kegler
COPYRIGHT
Copyright 2007 - 2008 Jeffrey Kegler
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.