NAME
Parse::Marpa::ALGORITHM - The Marpa Algorithm
THE ALGORITHM
Marpa is based on the parser described in John Aycock and R. Nigel Horspool's "Practical Earley Parsing", The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630. Their parser combines LR(0) precomputation with Earley's general parsing algorithm. Jay Earley's most accessible paper is "An efficient context-free parsing algorithm", Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
The Aycock and Horspool paper, unlike Earley's, is available on the web (http://www.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf), and it summarizes Earley's work very nicely. It is not, however, easy reading. I have a degree in the subject, and have been following this particular topic on and off for years. Nonetheless I found the Aycock and Horspool paper very difficult going.
Someday I'd like to describe the whole algorithm, including the core contributions of Aycock, Horspool and Earley, at a level that someone who has not taken a course in parsing could understand. Marpa combines many parsing techniques, and a full description of it would be a small textbook. For now, I have to content myself with briefly describing the more significant ideas new with Marpa, assuming a reader who is familiar with parsing in general and Earley parsing in particular, and who therefore understands terms like Earley set and Earley item.
Upside-down and Inside-out
Marpa takes the parse engine as described by Aycock and Horspool and literally turns it inside out and upside down. In the Aycock and Horspool version of Earley's algorithm, the main loop iterates over each item of an Earley set, first scanning for tokens, then completing items in the set. Marpa turns this into two separate loops over the Earley items, the first of which completes the current Earley set, and the second of which scans for tokens.
The advantage of this double inversion is that lexical analysis can be put between the two loops -- after completion, but before scanning. This makes predictive lexing possible. Being able to predict which lexemes are legal at a given point in the parse can save a lot of processing, especially if the lexing is complex. Any lexemes which are not predicted by items in the current Earley set do not need to be scanned for. Now that Aycock and Horspool have sped up Earley's algorithm, the time spent lexing is a significant factor in overall speed. Predictive lexing can reduce lexing time to a fraction of the original.
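As a rough sketch of this control flow (the subroutine names here -- complete_set, predicted_lexemes, lex_at and scan_token -- are hypothetical, not Marpa's actual internals):

    # One pass per earleme.  Completion runs first, then lexing
    # (restricted to the predicted lexemes), then scanning.
    for my $current ( 0 .. $last_earleme ) {

        # Loop 1: complete every item in the current Earley set.
        complete_set($current);

        # Lexical analysis sits between the two loops.  Only the
        # lexemes predicted by the completed set need to be tried.
        my $predicted = predicted_lexemes($current);
        my @tokens    = lex_at( $input, $current, $predicted );

        # Loop 2: scan the tokens, adding new Earley items.
        scan_token( $current, $_ ) for @tokens;
    }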
Earlemes
Marpa allows ambiguous lexing, including recognition of lexemes of different lengths starting at the same lexical position, and recognition of overlapping lexemes. To facilitate this, Marpa introduces the "earleme" (named after Jay Earley). Previous implementations required the Earley parser's input to be broken up into tokens, usually by lexical analysis of the input using DFAs. (DFAs -- deterministic finite automata -- are the equivalent of regular expressions when that term is used in the strictest sense.) Requiring that the first level of analysis be performed by a DFA hobbles a general parser like Earley's.
Marpa allows the scanning phase of Earley's algorithm to add items not just to the next Earley set, but to any later one. Several alternative scans of the input can be put into the Earley sets, and the power of Earley's algorithm harnessed to deal with the nondeterminism.
Marpa does this by allowing each scanned item to have a length in earlemes, call it l. If the current Earley set is at earleme i, a newly scanned Earley item is added to Earley set i+l. In other words, the earleme is a unit of distance measured in Earley sets. An implementation can sync earlemes up with any measure that's convenient. For example, the distance in earlemes may be the length of a string, as measured either in bytes or in Unicode graphemes. An implementation may also mimic traditional lexing by defining the earleme to be the distance in a token stream, measured in tokens.
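For example (a hypothetical interface, not Marpa's actual one), a scanner which recognizes two overlapping lexemes starting at the same earleme might record both alternatives:

    # add_scanned_item() is a hypothetical helper which adds a
    # newly scanned Earley item to the given Earley set.
    # At earleme 7, the input could be the keyword "for" (3
    # earlemes, one per character) or the identifier "format"
    # (6 earlemes).  Record both; Earley's algorithm carries the
    # ambiguity forward and discards whichever alternative fails
    # to fit the grammar.
    my $i = 7;    # current earleme: both lexemes start here
    add_scanned_item( $i + 3, [ KEYWORD    => 'for'    ] );
    add_scanned_item( $i + 6, [ IDENTIFIER => 'format' ] );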
Chomsky-Horspool-Aycock Form
Marpa's third significant change to the Aycock and Horspool algorithm is its internal rewriting of the grammar. Aycock and Horspool call their rewriting NNF (Nihilist Normal Form). Earley's original algorithm had serious issues with nullable symbols and productions, and NNF fixes most of them. (A nullable symbol or production is one which could eventually parse out to the empty string.) Importantly, NNF also allows complete and easy mapping of the semantics of the original grammar to its NNF rewrite, so that NNF and the whole rewrite process can be made invisible to the user.
My problem with NNF is that the rewritten grammar is exponentially larger than the original in the theoretical worst case, and I just don't like exponential explosion, even as a theoretical possibility in pre-processing. I also think that there are cases likely to arise in practice where the size explosion, while not exponential, is linear with a very large multiplier. One such case would be a Perl 6 rule in which whitespace is significant but optional.
My solution is Chomsky-Horspool-Aycock Form (CHAF). This is Aycock and Horspool's NNF, but with the further restriction that no more than two nullable symbols may appear in any production. (The idea that any context-free grammar can be rewritten into productions of at most a small fixed size appears to originate, like so much else, with Noam Chomsky.) The shortened CHAF productions map back to the original grammar, so that, as with NNF, the CHAF rewrite can be made invisible to the user. With CHAF, the theoretical worst-case behavior is linear, and in those difficult cases likely to arise in practice the multiplier is much smaller.
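To illustrate the difference (this example is my own, not taken from the Aycock and Horspool paper), consider a production whose right-hand side contains four nullable symbols:

    S ::= A B C D        # A, B, C and D are all nullable

Factoring the nullables directly, as NNF does, requires a variant of the production for each combination of nullables that derives the empty string -- up to 2^4 = 16 productions here, and up to 2^n for a production with n nullables. CHAF first breaks the production apart so that no piece carries more than two nullables:

    S  ::= A B S1        # at most two nullables per piece
    S1 ::= C D

Each piece then needs at most 2^2 = 4 factored variants, so the size of the rewritten grammar grows linearly with the length of the original productions rather than exponentially.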