NAME
Parse::Marpa::Doc::Internals - Marpa Internals
OVERVIEW
Knowledge of Marpa internals is not necessary to use Marpa. But you can make better use of Marpa if you understand how it works. This document assumes that you have read the other Marpa documentation, and that you have worked with another parser generator, or have a general knowledge of parsing.
EARLEY SETS
To speak pedantically, the algorithm in Earley's 1970 parsing article does not actually parse in the strict sense. It is a recognizer, not a parser.
In Marpa, Earley sets are built as tokens are recognized. If an Earley item of the correct form is in the right place in these sets after input is finished, the input is recognized as one of those described by the grammar. Once the recognition phase is done, the phase that does parsing in the strict sense uses the Earley sets as its input.
Each of Marpa's earlemes corresponds one-to-one to an Earley set. An Earley set is a set of Earley items. Each Earley item has an QDFA state, an origin earleme and an dot earleme. An Earley item's dot earleme is also called its current earleme. Here's a representation of an Earley item, as you might see it in debugging output from Marpa:
S5@6-7
This Earley item is for QDFA state 5 (that is the "S5
" part), The "@6-7
" part says that this Earley item originates at earleme 6, and is current at earleme 7. The number of an Earley item's current, or dot, earleme is always the same as number of the Earley set that contains the item.
(Note to experts: Those familiar with Earley parsing will note that S5@6-7
does not look like a traditional Earley item. QDFA states are not traditional -- they are my elaboration of an invention by Aycock and Horspool. And in the traditional notation for an Earley item, the dot earleme is not included at all -- you're required to figure out from context which Earley set an Earley item belongs to.)
QDFA STATES
I will mention QDFA's (quasi-deterministic finite automata) frequently and NFA's (non-deterministic finite automata) sometimes. All you need to know about NFA's is that the QDFA's are built from them. NFA state numbers sometimes appear in the diagnostic outputs. They can be ignored.
About QDFA's, it will be necessary to know a bit more. Let's start with an example of another Earley item:
S1@2-2
This states that this item is for QDFA state 1, that it originates at earleme 2, and that it is currently at earleme 2. We can get a description of the QDFA states from the show_QDFA
method. Here's what it says about QDFA state 1:
S1: predict; 1,5
e ::= . e op e
e ::= . number
<e> => S3
<number> => S4
The first line begins with the QDFA state's label: S1
. After it comes the predict
tag. That indicates this item was created by a prediction. That's one of the four ways an Earley item can be created. There's more on that below.
The two numbers after the "predict
" tag on the first line may be ignored. They are the numbers of the NFA states that were combined into this QDFA state. Marpa uses QDFA states to track the parse. Every QDFA state has one or more LR(0) items. The second and third lines in the display represent the LR(0) items. They are rules with a dot added to indicate how far recognition has proceeded into the rule.
A dot between two symbols indicates that all symbols before the dot have been recognized, but that none of the symbols after the dot have been recognized. A dot at the beginning of an LR(0) item indicates that the rule is a prediction -- none of its symbols have yet been recognized. A dot at the end indicates that the rule is complete -- all of its symbols have been recognized. The location of the dot is called the dot position.
The last two lines in the display show transitions. The first line says that, on seeing an e
, you transition to QDFA state 3. The second states that, on seeing a number
, you transition to QDFA state 4.
It's important to not confuse Earley items with LR(0) items. Earley items are built from one or more LR(0) items. In traditional Earley parsing, each Earley item contained one and only one LR(0) item. This made the internals simpler, but was not as efficient. Marpa combines LR(0) items into QDFA states based on ideas in Aycock and Horspool 2002.
Each QDFA state is a statement about possible parses. The presence of an QDFA state in an Earley item means that the parses in its LR(0) items are possible at the Earley item's dot earleme. The dot position of every LR(0) item in the QDFA state corresponds to the dot earleme of the Earley item. Also, the origin of every LR(0) item is at the origin earleme of the Earley item.
In our example, the dots in all the LR(0) items in S1 (QDFA state 1) are at the beginning of their rules. This means the origin and dot earlemes must be the same, and from the "2-2
" in the description of the Earley item S1@2-2
, you can see that that is the case.
HOW EARLEY SETS ARE BUILT
New items come onto the Earley sets in four ways: scanning, completion, prediction, and initialization.
Scanning
Scanning adds Earley items to indicate which tokens have been seen in the input, and where. Suppose the Earley item S1@2-2 (mentioned above) is present at earleme 2. Marpa knows to instruct the lexer to look for a number
token because there is a transition from S1 to S4 on number
.
If the lexer finds a number
token with a length (in earlemes) of 1 at earleme 2, a new Earley item S4@2-3 is added at earleme 3. (Earleme 3 because 3 is 2+1, the current token plus the token length.) Here (again from show_QDFA
) is the description of QDFA state 4.
S4: 6
e ::= number .
We can see that there is only one LR(0) item in QDFA state 4, and that it has the dot pointer at the end of its rule. A rule with the dot pointer at the end in an LR(0) item is a completed rule.
Marpa calls the item that was looking for the scanned symbol the predecessor of the item added by the scan. In this example, S1@2-2 is the predecessor of S4@2-3. Any item which is added to an Earley set based on an QDFA state transition from a predecessor, is called that predecessor's successor. In this example, S4@2-3 is the successor of S1@2-2.
An Example
Here's what S4@2-3 looks like in the Earley sets, after a successful scan of the input "2+2"
.
S4@2-3 [p=S1@2-2; t=2]
In the first line we see that, as mentioned, S1@2-2 is the predecessor of S4@2-3. It has no successor. The effect
item will be explained later. pre-dot
is the name of the symbol before the dot pointer in the current choice of rule. (Remember that there may be more than one rule in an Earley item.) lhs
is the symbol on the left hand side of the current rule. If the Earley item has had a value computed, that is also shown. In this case, the value came from the token and is "2".
The last two lines are very significant. They show the choices Marpa is making. This parse is not ambiguous. In this Earley item, there is only one possible token and only one choice of rule, so there are no non-trivial choices.
Each token choice may have a different predecessor. Token choices are shown as [p=
predecessor; t=
token_value]
pairs. In the above example, the token choice [p=S1@2-2; t=2]
indicates that the token has a value of "2", and that its predecessor was S1@2-2.
Rule choices are also shown in brackets. In the rule choice [ 1: e -> number ]
, "1
" is the rule number and "e -> number
" is the rule.
Completion
Whenever a rule is completed in an Earley set, new rules may be added to that Earley set as a result. The item with the completed rule is called the cause, and the new item added as a result is the effect.
Let's look at one example of a completion "cause and effect". As the cause in our example, we'll use S4@2-3, the Earley item in our example of a scanned item. Its state was QDFA state 4. QDFA state 4 contained the LR(0) item
e ::= number .
The dot is at the far right, so it's a completed rule whose the left hand side is e
.
S4@2-3 begins at earleme 2 and has a complete rule with e
as its lhs, and therefore any Earley item at earleme 2 which is looking for an e
is in luck. S5@0-2 is such a rule. QDFA state 5 looks like this:
S5: 3
e ::= e op . e
<e> => S6
This says that in QDFA state 5, when you see an e
, you transition to QDFA state 6. Here's S6@0-3, the item which is the effect of the completion:
S6@0-3 [p=S5@0-2; c=S4@2-3]
(You'll notice that S6@0-3 is not just an effect, but also has its own effect. As noted on the first line, it is the cause of item S2@0-3.) The meanings of the terms predecessor and successor are similar to what they are in token scanning. Since S6@0-3 comes from moving the dot forward in a rule in S5@0-2, we say S5@0-2 is the predecessor of S6@0-3, and that S6@0-3 is the successor of S5@0-2.
The same Earley item can be added to the same Earley set for multiple reasons. A rule can be added because of one or more scanned tokens; because of one or more completions; or because of any combination of these.
Traditionally in Earley's algorithm, the same item is not added twice. Where an item comes from is not important in recognition. But parsing in the strict sense means determining the structure of the input, and that requires us to know why an Earley item was added to an Earley set. Marpa (again following the lead of Aycock and Horspool) tracks the reasons Earley items were added to the Earley sets using links.
We've already seen a link in the scanning example. Earley item S6@0-3 has another example of a link: [p=S5@0-2; c=S4@2-3]
. This says that one reason for S6@0-3 to exist was a completion, with S4@2-3 as the cause and S5@0-2 the predecessor. Since there are no other links, this is the only reason for S6@0-3 to be in the Earley sets.
The parse in this example is unambiguous. If it were ambiguous, you would see multiple links in some of the Earley items.
Prediction
A third source of Earley items is prediction. Whenever an Earley item is expecting to see an e
, for example, it can also expect to see any symbols that are at the start of any rules that have an e
on their left hand side.
Recall QDFA state 5
S5: 3
e ::= e op . e
<e> => S6
QDFA state 5 is expecting an e
. This means that the rules
e ::= . e op e
e ::= . number
can also be expected to originate at this earleme. Marpa is smart about grouping rules into QDFA states and both these rules are in QDFA state 1:
S1: predict; 1,5
e ::= . e op e
e ::= . number
<e> => S3
<number> => S4
This means that since S5@0-2 is at earleme 2, we should also expect to see an item for QDFA state 1 at earleme 2, and we do:
S1@2-2
Every rule in a predicted item will have the dot at the beginning. That means the origin and dot earlemes will always be the same.
The descriptions of predicted items are often short. The description of S1@2-2 in our output was one line, the shortest description of an Earley item that we've seen. That's typical. Prediction items are essential to get rules started (note that S1@2-2 does have a successor). But, since they are not actually needed for parsing and evaluation, they don't need to have links or a lot of the data that the other items have.
Initialization
Some Earley items are put into Earley set 0 to start things off. In our example they are
S0@0-0
S1@0-0
QDFA state 0 contains the start rules, with the dot pointer at the beginning. Only Earley set 0 will contain an Earley item for QDFA state 0. In our example, S1@0-0 contains rules predicted by S0@0-0.
HOW RECOGNITION WORKS
As mentioned, what's usually called Earley's algorithm is just a recognizer, an algorithm to build the Earley sets. The input is recognized successfully if, in the last earleme, there is a an Earley item for a completed start rule state. A completed start rule state is an QDFA state containing a completed start rule. A completed start rule is the LR(0) item for a start rule that has its dot position at the end.
At most two of the QDFA states can be completed start rule states. One is a special case. In a null parse, the start state, QDFA state 0, is also a completed start rule state.
The other completed start rule state is not hard to spot. Marpa always adds its own, new, internal, start rule, with a special internal start symbol as its lhs, and the original start symbol as its rhs. The completed start rule state for non-null parses contains the LR(0) item which has Marpa's internal start rule and has the dot pointer at the end of that rule. The completed start rule state in our example is QDFA state 2:
S2: 8
e['] ::= e .
A parse is successful at earleme N it contains an Earley item for a completed start rule state with a origin earleme of 0 and a dot earleme of N. Here's the Earley item which makes the parse in our example successful at earleme 3:
S2@0-3 [p=S0@0-0; c=S6@0-3]
GRAMMAR REWRITING
Marpa rewrites grammars, adding internal symbols and rules in the process. This rewriting does not affect the semantics, but it does show up when you examine the internals.
Marpa's internal symbols have tags at the end, enclosed in square brackets. This means all Marpa's internal symbols end in a right square bracket.
Adding a Start Rule
For convenience, many parsers add their own start rule and their own start symbol to grammars. Marpa is no exception. The new start symbol is the old one with "[']
" suffixed. We saw a Marpa internal start symbol above: e[']
. If the grammar allows a null parse, there will also be a nulling start symbol, with "[]
" suffixed.
Elminating Proper Nullable Symbols
Nulling symbols are those which always produce the empty sentence. Nullable symbols are those which can produce the empty sentence. Non-nullable symbols are those which never produce the empty sentence.
Pedantically, nulling symbols are also nullable symbols. A proper nullable is a symbol which, in the grammar of interest, can produce either the empty sentence or a non-empty sentence. In other words, a proper nullable is any nullable symbol which is not a nulling symbol.
Nullable symbols have been a headache for previous versions of Earley parsers. Aycock and Horspool 2002 outlined a new approach for dealing with them. I use their ideas and add a few of my own.
Marpa rewrites its grammar to eliminate proper nullables. It does this by turning the original proper nullable into a non-nullable symbol, and creating a second symbol to serve as the original symbol's nullable variant. The non-nullable variant of the original symbol keeps the original name, but is no longer allowed to appear in places where it might be nulled. In places where the original symbol was nulled, the nulling variant is substituted. The name of the nulling variant is that of the original symbol with the nulling tag ("[]
") suffixed.
The newly introduced nulling symbols will not appear on any left hand sides. Marpa marks nulling symbols internally and recognizes them directly, so it does not need empty rules for nulling symbols.
More difficult is the problem of dealing with the cases where the proper nullables appeared on the right hand side of rules. These rules have to be replaced with new rules covering every possible combination of the non-nullable and nulling variants. That rewrite is described in the next section.
CHAF Rewriting
To deal with the splitting of proper nullables into two symbols, one non-nullable and one nullable, Aycock and Horspool simply created new rules for all possible combinations on the right hand side. This factoring is exponential in the worst case. I don't like leaving exponential explosions in an algorithm, even unlikely ones. That old Theory of Computation course still haunts me after all these years. And I felt the generation of all possible combinations for an arbitrarily long right hand side might be inefficient in practice.
A result due to Chomsky shows that any grammar can be rewritten as a grammar with at most two symbols on the right hand side. Relaxing Chomsky's rewrite to allow right hand sides with any number of symbols, but at most two proper nullables, produces a rewrite I call CHAF (Chomsky-Horspool-Aycock Form).
CHAF changes the worst case to linear, and in practical cases lowers the multiplier. Here's an example of a CHAF rewrite from Marpa's own self-grammar. First, the rule:
production paragraph:
non structural production sentences,
production sentence,
non structural production sentences,
optional action sentence,
non structural production sentences.
This rule contains four proper nullables, reinforcing my fear that grammars written as test cases won't be the only ones with lots of proper nullables on the right hand side. non structural production sentences
and optional action sentence
are both proper nullables and non structural production sentences
appears three times.
Here's is the output from show_rules
, showing what Marpa did with this rule:
12: production-paragraph
-> non-structural-production-sentences
production-sentence
non-structural-production-sentences
action-sentence:optional
non-structural-production-sentences /* !useful */
99: production-paragraph ->
non-structural-production-sentences
production-sentence
production-paragraph[R12:2][x5e]
/* priority=0.3 */
100: production-paragraph ->
non-structural-production-sentences[]
production-sentence
production-paragraph[R12:2][x5e]
/* priority=0.1 */
101: production-paragraph ->
non-structural-production-sentences
production-sentence
production-paragraph[R12:2][x5e][]
/* priority=0.2 */
102: production-paragraph ->
non-structural-production-sentences[]
production-sentence
production-paragraph[R12:2][x5e][]
103: production-paragraph[R12:2][x5e] ->
non-structural-production-sentences
production-paragraph[R12:3][x60]
/* priority=0.3 */
104: production-paragraph[R12:2][x5e] ->
non-structural-production-sentences[]
production-paragraph[R12:3][x60]
/* priority=0.1 */
105: production-paragraph[R12:2][x5e] ->
non-structural-production-sentences
production-paragraph[R12:3][x60][]
/* priority=0.2 */
106: production-paragraph[R12:3][x60] ->
action-sentence:optional
non-structural-production-sentences
/* priority=0.3 */
107: production-paragraph[R12:3][x60] ->
action-sentence:optional[]
non-structural-production-sentences
/* priority=0.1 */
108: production-paragraph[R12:3][x60] ->
action-sentence:optional
non-structural-production-sentences[]
/* priority=0.2 */
Rule 12 is the original rule. Because Marpa has rewritten it, the rule is marked !useful
, telling later stages in the precomputation to ignore it. Marpa breaks Rule 12 up into three pieces, each with no more than two proper nullables. Rules 99 to 102 are the first piece, with the first two symbols from Rule 12. Rules 103 to 105 are the second, with the 3rd symbol. Rules 106 to 108 are the third, with the 4th and 5th symbols from Rule 12.
Each piece is factored, so that every combination of nulling and non-nullable symbols is included. New symbols are introduced to be the left hand sides of the pieces. The tag "[R12:3]
" indicates that this symbol is the left hand side for the piece of Rule 12 which begins at right hand symbol 3 (the first symbol is symbol 0). The tags beginning with an "x
", like "[x5d]
", are arbitrary unique hex values, inserted to guarantee that the new symbols are unique.
This rule is a worst case for CHAF, because the last three symbols of the right hand side are all proper nullables. That means that the last two pieces of the original rule can be either empty or non-empty, and therefore that both of the newly created lhs symbols are also proper nullables.
There are now a total of 6 proper nullables: the original 4 plus the 2 symbols newly created to serve as left hand sides. This is why, in order to have only 2 proper nullables per piece, the original rule needed to be divided into 3 pieces. The newly created lhs symbols, because they are proper nullables, need to be split into nulling and non-nullable variants and factored, just like the proper nullables in the original rule.
Nonetheless this factoring can be done with 10 rules in CHAF, while the original Aycock-Horspool factoring (NNF) required 16. After more than 4 proper nullables, the advantage of CHAF becomes overwhelming. With 5 proper nullables, there would be 13 rules for CHAF versus 32 for NNF. With 6 proper nullables, 16 versus 64.
The user's semantics are preserved, because Marpa, while splitting the rule into pieces and factoring the pieces, inserts logic to gather and preserve the values of child nodes. These values are presented to the user's actions as if no CHAF rewrite had occurred.
Converting Sequence Productions to BNF
Internally, Marpa converts productions specified as sequences into BNF productions. The conversion is done in a standard way. For example,
paragraphs: empty line separated paragraph sequence.
becomes
2: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5]
3: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5] empty-line
4: paragraph[Seq:1-*][Sep:empty_line][x5] -> paragraph
5: paragraph[Seq:1-*][Sep:empty_line][x5] ->
paragraph[Seq:1-*][Sep:empty_line][x5] empty-line paragraph
In the added symbol, the tag "[Seq:1-*]
" indicates this is a symbol for a sequence of from 1 to an infinite number of symbols and the tag "[Sep:empty_line]
" that it is empty_line
separated.
Here's another example, this time of a sequence without a separator:
definition paragraph: definition sequence. concatenate lines.
is written as the following BNF:
9: definition-paragraph -> definition[Seq:1-*][xa]
10: definition[Seq:1-*][xa] -> definition
11: definition[Seq:1-*][xa] -> definition[Seq:1-*][xa] definition
SELF-TUTORIALS
If you want to investigate internals more on your own, here are two "self-tutorials", which should make you pretty much an expert.
First, go through the show_status
output in the appendix. For each Earley item in it, reason out how it came to be there.
Second, take the grammar used in the example and run it on the input text "2+2*3
". While the parse in this document was not ambiguous, the grammar was. The grammar's ambiguity reveals itself when there is more than one operation in the input string. Get the show_status
output from any one of the ambiguous parses and reason out how all of its Earley items came to be.
APPENDIX: DATA FOR THE EXAMPLE
Below are the MDL grammar and outputs for the example used in this document. The input text was "2+2
".
The MDL Grammar
semantics are perl5. version is 0.211.7.
start symbol is E.
E: E, Op, E.
q{
my ($right_string, $right_value)
= ($_[2] =~ /^(.*)==(.*)$/);
my ($left_string, $left_value)
= ($_[0] =~ /^(.*)==(.*)$/);
my $op = $_[1];
my $value;
if ($op eq "+") {
$value = $left_value + $right_value;
} elsif ($op eq "*") {
$value = $left_value * $right_value;
} elsif ($op eq "-") {
$value = $left_value - $right_value;
} else {
croak("Unknown op: $op");
}
"(" . $left_string . $op . $right_string . ")==" . $value;
}.
E: Number.
q{
my $v0 = pop @_;
$v0 . "==" . $v0;
}.
Number matches qr/\d+/.
Op matches qr/[-+*]/.
the default action is q{
my $v_count = scalar @_;
return "" if $v_count <= 0;
return $_[0] if $v_count == 1;
"(" . join(";", @_) . ")";
}.
The show_QDFA
Output
Start States: S0; S1
S0: 7
e['] ::= . e
<e> => S2
S1: predict; 1,5
e ::= . e op e
e ::= . number
<e> => S3
<number> => S4
S2: 8
e['] ::= e .
S3: 2
e ::= e . op e
<op> => S1; S5
S4: 6
e ::= number .
S5: 3
e ::= e op . e
<e> => S6
S6: 4
e ::= e op e .
The show_earley_sets
Output
Current Earley Set: 4; Furthest: 3
Earley Set 0
S0@0-0
S1@0-0
Earley Set 1
S4@0-1 [p=S1@0-0; t=2]
S2@0-1 [p=S0@0-0; c=S4@0-1]
S3@0-1 [p=S1@0-0; c=S4@0-1]
Earley Set 2
S5@0-2 [p=S3@0-1; t=+]
S1@2-2
Earley Set 3
S4@2-3 [p=S1@2-2; t=2]
S6@0-3 [p=S5@0-2; c=S4@2-3]
S3@2-3 [p=S1@2-2; c=S4@2-3]
S2@0-3 [p=S0@0-0; c=S6@0-3]
S3@0-3 [p=S1@0-0; c=S6@0-3]
The show_bocage
Output
package: Parse::Marpa::E_1; parse count: 0
S2@0-3L3 ::= S6@0-3L0
S6@0-3L0 ::= S5@0-2R0:3 S4@2-3L0
S5@0-2R0:3 ::= S3@0-1R0:2 '+'
S4@2-3L0 ::= '2'
S3@0-1R0:2 ::= S4@0-1L0
S4@0-1L0 ::= '2'
The show_tree
Output
Tree Node: S2@0-3L3[0]; Depth=0; Argc=1
Rule: e['] ::= e .
Cause: S6@0-3L0
Tree Node: S6@0-3L0[0]; Depth=1; Argc=3
Rule: e ::= e op e .
Predecessor: S5@0-2R0:3
Cause: S4@2-3L0
Tree Node: S4@2-3L0[0]; Depth=2; Argc=1
Rule: e ::= number .
Tree Node: S5@0-2R0:3[0]; Depth=2; Argc=3
Rule: e ::= e op . e
Predecessor: S3@0-1R0:2
Tree Node: S3@0-1R0:2[0]; Depth=3; Argc=3
Rule: e ::= e . op e
Cause: S4@0-1L0
Tree Node: S4@0-1L0[0]; Depth=4; Argc=1
Rule: e ::= number .
SUPPORT
See the support section in the main module.
AUTHOR
Jeffrey Kegler
LICENSE AND COPYRIGHT
Copyright 2007 - 2008 Jeffrey Kegler
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.