NAME
Marpa::XS::Advanced::Implementation - Marpa Implementation
OVERVIEW
This information in this document is not necessary for ordinary users of Marpa. This document describes the Marpa data structures. It assumes that the reader knows the Marpa API, and that she has a general knowledge of parsing. All the examples in this document assume that none of the data has been stripped.
GRAMMAR REWRITING
Marpa rewrites grammars, adding internal symbols and rules in the process. These rewritings do not affect the semantics, but they do show up when you examine the internals.
Marpa's internal symbols have tags at the end, enclosed in square brackets. This means that all Marpa internal symbols end in a right square bracket.
Adding a Start Rule
Many parsers add their own start rule and their own start symbol to grammars. Marpa is no exception. The new internal start symbol is the original start symbol with "[']
" suffixed. An example of a Marpa internal start symbol is "Expression[']
". If the grammar allows a null parse, there will also be a Marpa internal nulling start symbol, with "['][]
" suffixed. An example of a Marpa internal nulling start symbol would be "Script['][]
".
Elminating Proper Nullable Symbols
A symbol is nulled if it produces the empty sentence. Nulling symbols are those which are always nulled. Nullable symbols are those which are sometimes nulled. Non-nullable symbols are those which are never nulled.
All nulling symbols are also nullable symbols. A proper nullable is any nullable symbol which is not a nulling symbol. In other words, a proper nullable is a symbol that might or might not be nulled.
Nullable symbols were a problem in previous versions of Earley parsers. Aycock and Horspool 2002 outlined a new approach for dealing with them. I use their ideas with modifications of my own.
Marpa rewrites its grammar to eliminate proper nullables. It does this by turning every proper nullable into two symbols: a non-nullable variant and a nulling variant. The non-nullable variant keeps the original symbol's name, but is no longer allowed to appear in places where it might be nulled.
The name of the nulling variant is that of the original symbol with the nulling tag ("[]
") suffixed. When the nulling variant is used, it must be nulled.
The newly introduced nulling symbols will not appear on any left hand sides, with one exception: grammars that allow a null parse will have a nulling start rule. Except for the nulling start symbol, Marpa marks nulling symbols internally and recognizes them directly, without the need for empty rules.
Rules with proper nullables on the RHS have to be replaced with new rules covering every possible combination of the non-nullable and nulling variants. That rewrite is called the CHAF rewrite.
CHAF Rewriting
Here's an example of a CHAF rewrite:
{ lhs => 'statement',
rhs => [
qw/optional_whitespace expression
optional_whitespace optional_modifier
optional_whitespace/
]
}
This rule contains four instances of proper nullables, illustrating my fear that grammars of practical interest will have lots of proper nullables on right hand sides. optional_whitespace
and optional_modifier
are both proper nullables and optional_whitespace
appears three times.
Here's is the output from show_rules
, showing what Marpa does with this rule:
0: statement -> optional_whitespace expression optional_whitespace optional_modifier optional_whitespace
/* !used */
15: statement -> optional_whitespace expression statement[R0:2] /* vrhs real=2 */
16: statement -> optional_whitespace expression optional_whitespace[] optional_modifier[] optional_whitespace[]
17: statement -> optional_whitespace[] expression statement[R0:2] /* vrhs real=2 */
18: statement -> optional_whitespace[] expression optional_whitespace[] optional_modifier[] optional_whitespace[]
19: statement[R0:2] -> optional_whitespace statement[R0:3] /* vlhs vrhs real=1 */
20: statement[R0:2] -> optional_whitespace optional_modifier[] optional_whitespace[] /* vlhs real=3 */
21: statement[R0:2] -> optional_whitespace[] statement[R0:3] /* vlhs vrhs real=1 */
22: statement[R0:3] -> optional_modifier optional_whitespace /* vlhs real=2 */
23: statement[R0:3] -> optional_modifier optional_whitespace[] /* vlhs real=2 */
24: statement[R0:3] -> optional_modifier[] optional_whitespace /* vlhs real=2 */
Rule 0 is the original rule. Because Marpa has rewritten it, the rule is marked "!used
", telling later stages in the precomputation to ignore it. Marpa breaks Rule 0 up into three pieces, each with no more than two proper nullables. Marpa then eliminates the proper nullables in each piece by factoring.
To factor a piece, Marpa first rewrites it into multiple rules, one for each possible combination of nulled and non-nulled symbols. Marpa then replaces each proper nullable with its nulling or non-nullable variant, as appropriate. There are two symbol variants for each proper nullable. With a maximum of two proper nullables for each piece, each with two variants, there are at most four combinations. A rule for one of these combinations is called a factored rule, or a factor.
In the example in the above display, the original rule (Rule 0) was broken into three pieces. Rule 0 had 5 RHS symbols, and the three pieces contain, respectively, the first two RHS symbols; the third RHS symbol; and the last two (that is, 4th and 5th) RHS symbols.
The first piece of Rule 0 is factored into four rules: Rules 15 to 18. The second piece is factored into three rules: Rules 19 to 21. The third and last piece of Rule 0 is also factored into three rules: Rules 22 to 24.
When a rule is broken into pieces, the left hand side of the first piece is the left hand side of the original rule. New symbols are introduced to be the left hand sides of the other pieces. The names of the new LHS symbols are formed by suffixing a tag to the name of the original left hand side. The suffixed tag begins with a capital "R" and identifies the location of the piece in the original rule. For example, the tag "[R0:3]
" indicates that this symbol is the left hand side for the piece of Rule 0 which begins at right hand symbol 3 (the first symbol is symbol 0).
When a new LHS symbol is created for a piece, the worst case is that the new LHS is also a proper nullable. This worst case occurs when all the original symbols in the piece for the new LHS and all symbols in all subsequent pieces are proper nullables.
A newly created LHS symbol will always appear on the RHS of the previous piece. If the newly created symbol is a proper nullable, then it counts against the limit of two proper nullables for that previous piece, and it must be factored, just the same as if it was one of the proper nullables appearing in the original rule.
Rule 0 is such a worst case. The last three symbols of the right hand side are all proper nullables. That means that all the symbols in the last two pieces of the original rule are proper nullables. Therefore both of the newly created LHS symbols are proper nullables.
The original Rule 0 has 4 proper nullables. After the CHAF rewrite, there are 6 proper nullables: the original 4 plus the 2 symbols newly created to serve as left hand sides. This is why, in order to have at most 2 proper nullables per piece, the original rule must to be divided into 3 pieces.
CHAF preserves user semantics. Marpa, when it splits the rule into pieces and factors the pieces, inserts logic to gather and preserve the values of child nodes. The user's semantic actions see the original child values as if the CHAF rewrite had never occurred.
Converting Sequence Productions to BNF
Internally, Marpa converts productions specified as sequences into BNF productions. The conversion is done in a standard way. For example,
{
lhs => 'statements',
rhs => [qw/statement/],
separator => 'comma',
min => 1
}
becomes
2: statements -> statements[Subseq:8:5] /* vrhs discard_sep real=0 */
3: statements -> statements[Subseq:8:5] comma /* vrhs discard_sep real=1 */
4: statements[Subseq:8:5] -> statement /* vlhs real=1 */
5: statements[Subseq:8:5] -> statements[Subseq:8:5] comma statement /* vlhs vrhs real=2 */
In the added symbol, the tag "[Subseq:8:5]
" indicates a special symbol introduced to serve as the LHS of the sequence. The "8:5
" indentifies the sequence using the symbol ID's of the two symbols involved. "statements
", the LHS symbol, has symbol ID 8. "statement
", the RHS symbol, has symbol ID 5.
Here's another example, this time of a sequence without a separator. The rule
{ lhs => 'block', rhs => [qw/statements/], min => 0 },
is rewritten by Marpa as follows:
7: block -> /* empty !used */
8: block -> block[Subseq:0:8] /* vrhs real=0 */
9: block[Subseq:0:8] -> statements /* vlhs real=1 */
10: block[Subseq:0:8] -> block[Subseq:0:8] statements /* vlhs vrhs real=1 */
Note that rule 7, the empty rule with the block
symbol as its LHS, is marked "!used
". block
is a proper nullable, and rules from sequence conversion undergo the same rewriting of proper nullables as any other rules. In this case a nulling variant of block
(block[]
) was created. That made the empty rule created by the sequence conversion useless, and that is why it was marked "!used
".
EARLEY SETS
Creating the Earley sets is not parsing in the strict sense. Creating the Earley sets is recognition -- determining whether or not a grammar recognizes an input. During recognition, Marpa creates Earley items and puts them into Earley sets. The input is recognized if an Earley item of the correct form is in the Earley set located at the end of parsing. In Marpa, parsing in the strict sense takes place after the recognition phase is done, using the Earley sets built by the recognizer.
Every Earley item has a current earleme. An Earley item's current earleme is also called its dot earleme, its current location, or its dot location. An Earley set is the set of all the Earley items with the same dot location.
In the default, token-stream, input model, the earleme location is exactly the same as the zero-based token stream position. Marpa allows other input models, and in those models the term earleme can take on other meanings. User interested in the alternative input models should look at Marpa::XS::Advanced::Models.
In addition to its current earleme, each Earley item has a AHFA state, and an origin. The origin of an Earley item can also be called its start location. Here's a representation of an Earley item, as you might see it in debugging output from Marpa:
S5@2-3
This Earley item is for AHFA state 5 (that is the "S5
" part). The "@2-3
" part says that this Earley item originates at earleme 2, and is current at earleme 3. The number of an Earley item's current earleme is always the same as number of the Earley set that contains the item.
(Note to experts: Those familiar with Earley parsing will note that S5@2-3
does not look like a traditional Earley item. AHFA states are not traditional -- they come from Aycock and Horspool. The use of the at-sign as punctuation is from Grune and Jacobs. And I have added the current earleme to the notation.)
AHFA STATES
To understand what is going on in the Earley items, it will be necessary to understand AHFA's (Aycock-Horspool Finite Automata). Let's look at an Earley item:
S3@0-3
This states that this item is for AHFA state 3; that it originates at earleme 0; and that it is currently at (or has its dot position at) earleme 3. We can get a description of the AHFA states using the show_AHFA
method. (For those who like the big picture, the entire show_AHFA
output, along with the other trace and debugging outputs for this example, is in the appendices.)
Here's what show_AHFA
says about AHFA state 3:
S3:
Expression -> Term .
Term -> Term . Add Term
<Add> => S6; S7
The first line of show_AHFA
's description of AHFA state 3 begins with its name: S3
. Marpa uses AHFA states to track the parse. Every AHFA state has one or more LR(0) items. In the above display, the second and third lines are LR(0) items. LR(0) items are rules with a dot added to indicate how far recognition has proceeded into the rule.
A dot between two symbols indicates that all symbols before the dot have been recognized, but that none of the symbols after the dot have been recognized. A dot at the beginning of an LR(0) item indicates that none of its symbols have yet been recognized. A dot at the end indicates that the rule is complete -- that all of the rule's symbols have been recognized.
The location of the dot is called the dot position. The symbol before the dot position in an LR(0) item is called the pre-dot symbol. The symbol after the dot position in an LR(0) item is called the post-dot symbol.
The last line in the above display shows the transitions from AHFA state S3. It says that, when Marpa sees an Add
token, it will transition both to AHFA state S6 and to AHFA state S7. We will use both transitions in our examples.
It's important not to confuse Earley items with LR(0) items. Earley items are built from one or more LR(0) items. In traditional Earley parsing, each Earley item contained one and only one LR(0) item. This made the internals simpler, but was not as efficient. Marpa combines LR(0) items into AHFA states based on the ideas in Aycock and Horspool 2002.
Marpa places an important restriction on LR(0) items. The post-dot symbol is never a nullable symbol. In a completed rule, the dot position will be at the very end of a rule, after all symbols. When an LR(0) item is not a completed rule, the post-dot symbol will always be non-nullable. Intuitively, it may be helpful to think of Marpa as recognizing nulling symbols instantly, whenever they are encountered, so that the dot position never stops in front of a nulling symbol.
Because nulling symbols are never post-dot symbols, the dot position is tracked in terms of the non-nulling symbols only. Nulling symbols are skipped over as if they didn't exist. This document speaks of the non-null symbol position in a rule, or simply of the non-null position. Recognition of a rule starts with the dot position at the first non-null position. An LR(0) item with the dot position at the first non-null position is called a rule initialization.
Beginning from the rule initialization, as each post-dot symbol is recognized, the dot position advances to the next non-null position. Once all of the zero or more symbols are recognized, the LR(0) item becomes a completed rule. When an LR(0) item is not a completed rule, it is called an incomplete LR(0) item, or an incomplete rule.
Every AHFA state contains one or more LR(0) items. This means that every AHFA state is a set of statements about rule recognition. An Earley item is present in the Earley sets if and only if every rule in every LR(0) item in the AHFA state of the Earley item has been recognized as far as its dot position. The origin of each LR(0) item corresponds to the origin of the Earley item, and the dot position of each LR(0) item corresponds to the dot location of the Earley item.
HOW EARLEY SETS ARE BUILT
New items come into the Earley sets in four ways: scanning, prediction, completion, and initialization. Scanning is the addition of Earley items to represent the recognition of symbols directly in the input. Rule completion (or simply completion) occurs when one of the rules of the grammar is fully recognized, or "completed". Rule completions add Earley items to represent the recognition of symbols that are not found directly in the input, but that are found indirectly by applying the rules of the grammar.
Prediction adds Earley items to represent rules that might start at a given location. Initialization adds one or two Earley items to Earley set 0 to represent the start rules.
Scanning
Scanning adds Earley items to indicate which tokens have been recognized in the input, and where. Let's look at the case where we are at earleme 3; we are about to recognize an Add
token; and we have in our Earley set 3 the Earley item S3@0-3. S3@0-3 is the Earley item we already looked at.
The current earleme for S3@0-3 is earleme 3. Here, again, is the information for AHFA state S3, the AHFA state in Earley item S3@0-3.
S3:
Expression -> Term .
Term -> Term . Add Term
<Add> => S6; S7
Marpa knows that it can instruct the lexer to look for a Add
token at earleme 3, because in AHFA state S3, there is a transition on an Add
token. In this example, an Add
token corresponds to a plus sign ("+
").
Marpa's state machine is not fully deterministic, it is quasi-deterministic. That is, Marpa's state machine is not a DFA (Deterministic Finite Automata). A DFA state, for any given token, has at most one transition. Marpa's AHFA states, for any given token, can have up to two transitions. In the case of AHFA state S3, on an Add
token there are two transitions: one to AHFA state S6 and one to AHFA state S7. We will use both transitions in our examples.
The transition to AHFA state S7 is an example of adding a predicted Earley item to an Earley set. The transition to AHFA state S6 is an example of adding a scanned Earley item to an Earley set. At this point, we are discussing scanning, so the first transition that we will look at is the transition to AHFA state 6.
S6:
Term -> Term Add . Term
<Term> => S10; leo(Term)
Which Earley set does the new Earley item for AHFA state S6 go into? In this example, the Add
token has a length (in earlemes) of 1. The current earleme for S3@0-3 is earleme 3. Since 3+1 equals 4, the current earleme for the new Earley item will be earleme 4.
The Earley item containing the transition and from-state used to create a scanned Earley item is called the predecessor of the scanned Earley item. In this example the predecessor of the scanned Earley item is the Earley item S3@0-3. The origin of a scanned Earley item is always the same as the origin of its predecessor.
Collecting what we've determined so far, the new, scanned, Earley item will have its dot location at earleme 4, will have AHFA state S6, and will have its origin at earleme 0. This means that the new, scanned, Earley item will be S6@0-4, and will look like this:
S6@0-4 [p=S3@0-3; s=Add; t=\'+']
Any item which is added to an Earley set based on a AHFA state transition from a predecessor, is called that predecessor's successor. In this example, S6@0-4 is the successor of S3@0-3.
In the current example, the token values are "42
", "*
", "7
", "+
and "1
". The current earleme for Earley item S6@0-4 is earleme 4, indicating that the 4th token has just been seen. The fourth token value is a plus sign ("+
"), which the lexer of this example read as an Add
token.
After the name of the Earley item is an Earley item source choice, shown in brackets. When the context makes the meaning clear, an Earley item source choice may be called a source or a source choice. In this example, the source choice is [p=S3@0-3; s=Add; t=\'+']
. This parse is not ambiguous, so all Earley items have at most one source choice entry.
In this case the source choice entry is for a token choice. Token choices are shown as [p=
predecessor; s=
token_name; t=
token_value]
3-tuples. The token choice [p=S3@0-3; s=Add; t=\'+']
indicates that the scanned token was an Add
symbol, that the scanned token value is "'+'
", and that the predecessor of S6@0-4 is S3@0-3.
Token Lengths
Above, we calculated the location of the new Earley item assuming that the Add
token had a length, in earlemes, of 1. Where did this token length come from?
A length of 1 is the default length for a token. In fact, in the default, token-stream, model of input, a token length of 1 is the only token length possible. When other, non-standard, input models are used, tokens can be variable length. For details on alternative input models, see Marpa::XS::Advanced::Models.
Prediction
In the previous section, we looked at one of the two transitions from AHFA state S3 at earleme 3 on an Add
token. The transition we looked at was to AHFA state S6. Now we will look at the other transition, to AHFA state S7. Here is AHFA state S7:
* S7: predict
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S4
<Number> => S5
<Term> => S11
In the first line of the show_AHFA
output for AHFA state S7, immediately after the name of the state, comes the string "predict"
. This means that S7 is a predicted state, and that the new Earley item we add as a result of the transition to S7 will be a predicted Earley item.
The current location of a predicted Earley item is determined in the same way as the current location of a scanned item. We take the current location of the Earley item containing the transition and add the length of the token causing the transition. In this case the transition is in S3@0-3, whose current location is 3. The token is the same as it was in the previous section: an Add
token with a token length of 1. As before, 3+1 equals 4, so that the new current location will be 4.
Predicted Earley items determine their start location, or origin, in a special way. The origin of a predicted Earley item is always the same as its current location. So the origin of this new, predicted, Earley item will be 4.
Now that we know its AHFA state, current location and origin, we are in a position to add the predicted Earley item. It goes into Earley set 4, and looks like this:
S7@4-4
Note that no source choice is shown. Source choices are not recorded for predicted items.
More about Predicted States
Predicted states exist because the presence of incomplete LR(0) items at an earleme location gives rise to rule initializations at that same earleme location. When an LR(0) item with a post-dot symbol is present at a given location, we call that symbol an expected symbol at that location. At any given location, for every expected symbol, we can also expect the rule initializations for all the rules with that expected symbol on the left-hand side.
Any rule initialization we introduce will have a post-dot symbol. If that post-dot symbol is not already an expected symbol, it may cause further rule initializations to be expected at the same location. In other words, rule initializations recurse.
The AHFA states invented by Aycock and Horspool handle these recursive rule initializations efficiently. Above we looked at AHFA states S6 and S7. In the following paragraphs, we'll examine the recursive sequence of rule initializations as it originates in S6 and is collected in S7.
If an AHFA state is not a predicted AHFA state, then it is a discovered AHFA state. The idea behind these terms is that in predicted states, no symbols for any of the rules have actually been seen. In discovered states, at least one symbol in every rule was found (or "discovered") in the input.
The AHFA's invented by Aycock and Horspool pair every transition to a predicted state with a transition to a discovered state. In this case, S6 and S7 are a pair, where S6 is the discovered state. S6 contains only one LR(0) item:
Term -> Term Add . Term
In this LR(0) item the post-dot symbol is Term
. S7, the predicted AHFA state paired with discovered AHFA state S6, will contain all the rule initializations caused by the expectation of a Term
symbol. There are two rules with Term
on their left hand side. Their rule initializations are
Term -> . Term Add Term
Term -> . Factor
Rule initializations recurse. The rule initializations above have two post-dot symbols: Term
and Factor
. Term
is already an expected symbol, but Factor
is a new expected symbol. When Factor
is an expected symbol at an earleme location, all rules with Factor
on their left hand side will have rule initializations at that same earleme location. There are two such rules. Here are their rule initializations:
Factor -> . Factor Multiply Factor
Factor -> . Number
The post-dot symbols in these rule initializations are Factor
and Number
. Factor
is already an expected symbol. Number
is new as an expected symbol, but there is no rule in our grammar with Number
on its left hand side.
The recursion is finished. We have added all the rule initializations for the expected symbols found in previous steps. There are no new expected symbols in this recursion step. That means there will be no more new rule initializations, and therefore no more new expected symbols.
Starting with the expected symbol in AHFA state S6, we found four rule initializations. The predicted state paired with S6 is S7. Let's look at it again:
* S7: predict
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S4
<Number> => S5
<Term> => S11
We see that, indeed, S7 contains all four of the rule initializations that we found in this section.
Completion
Above, we saw a scanned Earley item, added as the result of recognizing a symbol directly in the input. New symbols are also recognized indirectly, when rules are completed. When a rule is recognized all the way to the end, so that the dot location is after the rightmost right hand side symbol, then the left hand side symbol of that rule is also recognized.
As a reminder, an LR(0) item with the dot at the end of the rule is called a completed rule. Whenever an Earley item contains a completed rule, that indicates that the left hand side symbol of the completed rule was recognized. The left hand side symbol of a completed rule is a completed symbol.
Earley items with completed symbols can cause other Earley items to be added to the Earley sets. Let's look at one example. S5@2-3, in Earley set 3, contains a completed symbol. Here is what AHFA state 5 looks like:
S5:
Factor -> Number .
Every completed symbol has an origin and a current earleme. The origin and the current earleme of a completed symbol are exactly the same as the origin and the current earleme of the Earley item that completes that symbol. In this case, the Earley item completing the symbol Factor
is S5@2-3 so that Factor
is a completed symbol with origin at 2 and current earleme at 3.
An Earley item that completes a symbol is called a completion Earley item, or simply a completion. As a result of recognizing a completion Earley item, Marpa may add one or more completion effect Earley items to the Earley sets. A completion effect Earley item is often called simply a completion effect, or just an effect.
When a completion effect Earley item is added to the Earley sets, the completion Earley item which caused the effect is called a completion cause Earley item. A completion cause Earley item is often called simply a completion cause, or just an cause.
For a completion effect Earley item to be added to the Earley sets, a completion cause is necessary but not sufficient. There must also be a matching predecessor Earley item. A matching predecessor Earley item is an Earley item whose origin is the same as the origin of the completed symbol, and whose the post-dot symbol is the completed symbol.
In our example, Factor
is a completed symbol with origin 2. We will add an Earley item as a completion effect if we find a matching predecessor: an Earley item in Earley set 2 which has Factor
as one of the post-dot symbols.
In this example we find two matching predecessors: S8@0-2 and S9@2-2. Completion effects are added for both of these. We'll look only at the completion effect that has S8@0-2 as its matching predecessor. Here is AHFA state 8.
* S8:
Factor -> Factor Multiply . Factor
<Factor> => S12; leo(Factor)
The transition for Factor
in AHFA state S8 is to S12, so the new completion effect will have AHFA state S12. A completion effect always has the same current earleme as its completion cause (in this case S5@2-3). The origin of a completion effect is always the same as the origin of its matching predecessor (in this case S8@0-2). The means our new new completion effect will have AHFA state S12, origin 0 and current earleme 3. Here it is:
S12@0-3 [p=S8@0-2; c=S5@2-3]
Note the source choice in brackets after the name: [p=S8@0-2; c=S5@2-3]
. Previously we saw a token source choice. This is another kind of source choice, a completion source choice. This source choice says the reason for S12@0-3 to be in the Earley sets is the completion cause S5@2-3 and its matching predecessor, S8@0-2.
When a parse is ambiguous, Earley items can have multiple source choices. In this example, the parse is unambiguous, and all scanned Earley items and all completion effect Earley items will have one and only one source choice.
Completions recurse. If a completion effect contains a completed symbol, and if that completion effect has a matching predecessor, then it is also a completion cause.
The completion effect that we just added (S12@0-3) is a completion cause as well. It is a step in a recursive sequence of completions. Earley item S12@0-3 has AHFA state S12. AHFA state S12 contains a completed rule:
* S12: leo-c
Factor -> Factor Multiply Factor .
(Ignore the "leo-c" notation for now. It indicates that S12 is what's called a Leo completion. The Leo logic is dealt in another document.)
The completed symbol is Factor
, with origin 0 and current earleme 3. Looking for matching predecessors at earleme 0, we find one: S1@0-0. Here is what AHFA state 1 looks like:
* S1: predict
Expression -> . Term
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S4
<Number> => S5
<Term> => S3
The completed symbol is Factor
and its transition in S1 is to S4, so we add S4@0-3 to Earley set 3.
S4@0-3 [p=S1@0-0; c=S12@0-3]
In our scanning example, we referred to S3@0-3. One more step in the sequence of completions and we will see S3@0-3 again. The origin of S4@0-3 is at earleme 0. Term
is only the completed symbol in AHFA state 4.
* S4:
Term -> Factor .
Factor -> Factor . Multiply Factor
<Multiply> => S8; S9
S1@0-0 contains an LR(0) item which has Term
as the post-dot symbol. In AHFA state S1, the transition on Term
is to AHFA state S3. This means S3@0-3 will be added to Earley set 3 as a completion effect, with S4@0-3 as the cause and S1@0-0 as the matching predecessor:
S3@0-3 [p=S1@0-0; c=S4@0-3]
We call a completion effect Earley item the successor of its matching predecessor. For example, S3@0-3 is the successor of S1@0-0. The matching predecessor of an Earley item is often simply called its predecessor. For example, S1@0-0 is the predecessor of S3@0-3.
Initialization
One or two Earley items are put into Earley set 0 to start things off. In our example, the initial Earley items are
S0@0-0
S1@0-0
S0@0-0 will always be present in Earley set 0. AHFA state 0 contains the start rules, with the dot pointer at the beginning. Only Earley set 0 will contain an Earley item for AHFA state 0. S1@0-0 contains the rules predicted by S0@0-0.
EVALUATION
This section discusses matters relevant to Marpa's Generation 3 evaluator.
Applicable Dotted Rules
As we have seen, Marpa uses AHFA states in its Earley items, and these can contain more than one dotted rule. To produce an actual parse result, Marpa must find the dotted rules which are applicable for that particular parse. A dotted rule that applies in a particular parse result is called an applicable dotted rule or an applicable LR(0) item.
The search for dotted rules applicable to a parse result proceeds top-down, starting with the applicable dotted rule in the start Earley item. There will always be exactly one dotted rule in a start Earley item, and that one dotted rule will always be applicable to the current parse result.
If the parse is not ambiguous, there will be exactly one applicable dotted rule for every Earley item. If the parse is ambiguous, at some points there may be more than one applicable dotted rule. Ambiguous parsing is discussed in greater depth in the next section.
To find applicable dotted rules while proceeding top-down, Marpa must be able to do two things. First, given a pairing of a completion effect with its completion cause, and an applicable dotted rule in the completion effect, Marpa must be able to determine the applicable dotted rules in the completion cause. A dotted rule in a completion cause is applicable if and only if the LHS symbol of the dotted rule in the completion cause is the same as the first non-null symbol before the dot position of the applicable dotted rule in the completion effect.
Second, given a pairing of a successor Earley item with its predecessor Earley item, and an applicable dotted rule in the successor, Marpa must be able to determine the applicable dotted rules in the predecessor Earley item. Recall that a dotted rule is a duple containing a production of the grammar and a dot position in that production. A dotted rule in a predecessor is applicable if and only if
its production is the same as the production of the applicable dotted rule in the successor, and
its dot position is exactly one non-null symbol earlier than the dot position of the applicable dotted rule in the successor.
Ambiguous Parsing
The parse in this example is not ambiguous. Marpa allows ambiguous parses. When a parse is ambiguous, the procedure for adding Earley items is the same as above, except when Marpa attempts to add the same Earley item twice.
Two Earley items are considered "the same" if they have the same AHFA state, the same origin, and the same current earleme. Marpa will never add the same Earley item twice.
In cases where Marpa might add the same Earley item twice, Marpa instead adds another source choice to the already existing Earley item. This means that some Earley items might have multiple source choices. It is possible for an Earley item to end up with
Both a token source choice and a completion source choice;
Multiple token source choices;
Multiple completion source choices;
Multiple token source choices and multiple completion source choices;
A parse is ambiguous if and only if it has one or more points of ambiguity. A point of ambiguity is an Earley item with two or more source choices, or an Earley item with two or more applicable dotted rules.
Calculating the Parse Result
To calculate the parse result, Marpa creates a parse tree from the Earley sets, using the applicable dotted rule logic as described above. For the successive values of an ambiguouos parse, the parse tree must be iterated. The process for this is complex, and will not be dealt with here.
For the evaluation of each parse tree, Marpa uses an evaluation stack, which is the standard method. The evaluation stack is initialized to empty. Every time Marpa evaluates a tree node, it pushes the result onto the evaluation stack. When a tree node has child values, they are popped from the evaluation stack. The calculation of the parse result is complete when Marpa pushes the value of the root tree-node onto the evaluation stack. This value becomes the value of the parse. The trace_values
output in the appendix traces the calculation of the parse result in our example.
Optimizations
In some very important cases, the naive implementation of the traditional method of evaluating a parse tree is extremely wasteful. Grammars of great practical interest often contain very long sequences. A C language "translation unit" is defined as sequence of definitions and declarations, and that sequence can be extremely long. A Perl lexical unit is also a sequence, again, often quite long.
When a sequence is expressed as BNF rules, most of the rule instances used to implement a long sequence will have no significant semantics of their own. Popping and pushing the symbols for each of these rule instances is wasteful. It is more efficient to leave the values of the sequence items on the evaluation stack until the sequence is complete.
To optimize sequences and other cases where symbols are best left on the stack, Marpa marks some of its internal symbols as "virtual". A symbol is virtual if it has no "real" semantics. In the presence of virtual symbols, Marpa skips much of the usual popping and pushing of values on the evaluation stack.
Marpa uses three special rule properties to track virtual symbols: vlhs
, vrhs
and real
. vlhs
means that the left hand side is a virtual symbol. vrhs
means that the right hand side contains a virtual symbol. real
is an integer, indicating the number of real (that is, non-virtual) symbols on the right hand side.
THE LEO LOGIC
Marpa uses the ideas from an algorithm developed by Joop Leo to deal with right recursion. The Leo logic is complex enough to merit its own example, and so is presented in a separate document.
DETERMINING WHETHER A PARSE IS SUCCESSFUL
As mentioned, in the strict sense, Earley's algorithm is just a recognizer. The input is recognized successfully if there is an Earley item for a completed start rule state in the Earley set at the end of parsing. A completed start rule state is a AHFA state containing a completed start rule. A completed start rule is an LR(0) item for a start rule with a dot position at the end of the rule.
A start rule is any rule that has Marpa's internal start symbol on its left hand side. There are at most two start rules in any Marpa grammar: the empty start rule and the non-empty start rule. As a consequence, there are at most two completed start rule states in any Marpa grammar: the empty completed start rule state and the non-empty completed start rule state.
The empty start rule is the rule with Marpa's internal start symbol on its left hand side and an empty right hand side. There is an empty start rule in a Marpa grammar if and only if that Marpa grammar allows the null parse. (Saying that a grammar allows the null parse is the same as saying that a grammar recognizes the zero-length, or empty, token string.) In grammars which allow a null parse, the start state, AHFA state 0, is the empty completed start rule state.
The non-empty start rule is the rule with Marpa's internal start symbol on its left hand side and exactly one symbol on its right hand side -- the grammar's original start symbol. AHFA state S2 is the non-empty completed start rule state for our example parse:
* S2: leo-c
Expression['] -> Expression .
A parse is successful at earleme N if it contains an Earley item for a completed start rule state with a dot earleme of N. The parse in our example is successful at earleme location 5, because the following Earley item appears in Earley set 5:
S2@0-5 [p=S0@0-0; c=S3@0-5]
A parse is successful if that parse is successful at its end of parsing location. Marpa's default end of parsing location is the end location of the last token read. In the default input model, if N tokens were read, the end of parsing location will be earleme N. (The end of parsing location can be also be explicitly set by the user.)
The parse in our example reads five tokens using the default input model, and it uses the default end of parsing. That means that the end of parsing for our example parse is at earleme location 5. Since parsing is successful at location 5, the parse in our example is successful.
APPENDIX: THE EXAMPLE
Below are the code and the trace outputs for the example used in this document.
Code for the example
my $grammar = Marpa::XS::Grammar->new(
{ start => 'Expression',
actions => 'My_Actions',
default_action => 'first_arg',
strip => 0,
rules => [
{ lhs => 'Expression', rhs => [qw/Term/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
],
}
);
$grammar->precompute();
my $recce = Marpa::XS::Recognizer->new( { grammar => $grammar } );
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', q{*} ],
[ 'Number', 1 ],
[ 'Add', q{+} ],
[ 'Number', 7 ],
);
for my $token_and_value (@tokens) {
$recce->read( @{$token_and_value} );
}
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
sub My_Actions::first_arg { shift; return shift; }
my $value_ref = $recce->value();
my $value = $value_ref ? ${$value_ref} : 'No Parse';
show_symbols
Output
0: Expression, lhs=[0] rhs=[5] terminal
1: Term, lhs=[1 3] rhs=[0 3] terminal
2: Factor, lhs=[2 4] rhs=[1 4] terminal
3: Number, lhs=[] rhs=[2] terminal
4: Add, lhs=[] rhs=[3] terminal
5: Multiply, lhs=[] rhs=[4] terminal
6: Expression['], lhs=[5] rhs=[]
show_rules
Output
0: Expression -> Term
1: Term -> Factor
2: Factor -> Number
3: Term -> Term Add Term
4: Factor -> Factor Multiply Factor
5: Expression['] -> Expression /* vlhs real=1 */
show_AHFA
Output
* S0:
Expression['] -> . Expression
<Expression> => S2; leo(Expression['])
* S1: predict
Expression -> . Term
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S4
<Number> => S5
<Term> => S3
* S2: leo-c
Expression['] -> Expression .
* S3:
Expression -> Term .
Term -> Term . Add Term
<Add> => S6; S7
* S4:
Term -> Factor .
Factor -> Factor . Multiply Factor
<Multiply> => S8; S9
* S5:
Factor -> Number .
* S6:
Term -> Term Add . Term
<Term> => S10; leo(Term)
* S7: predict
Term -> . Factor
Factor -> . Number
Term -> . Term Add Term
Factor -> . Factor Multiply Factor
<Factor> => S4
<Number> => S5
<Term> => S11
* S8:
Factor -> Factor Multiply . Factor
<Factor> => S12; leo(Factor)
* S9: predict
Factor -> . Number
Factor -> . Factor Multiply Factor
<Factor> => S13
<Number> => S5
* S10: leo-c
Term -> Term Add Term .
* S11:
Term -> Term . Add Term
<Add> => S6; S7
* S12: leo-c
Factor -> Factor Multiply Factor .
* S13:
Factor -> Factor . Multiply Factor
<Multiply> => S8; S9
show_earley_sets
Output
Last Completed: 5; Furthest: 5
Earley Set 0
S0@0-0
S1@0-0
Earley Set 1
S5@0-1 [p=S1@0-0; s=Number; t=\42]
S4@0-1 [p=S1@0-0; c=S5@0-1]
S3@0-1 [p=S1@0-0; c=S4@0-1]
S2@0-1 [p=S0@0-0; c=S3@0-1]
Earley Set 2
S8@0-2 [p=S4@0-1; s=Multiply; t=\'*']
S9@2-2
Earley Set 3
S5@2-3 [p=S9@2-2; s=Number; t=\1]
S12@0-3 [p=S8@0-2; c=S5@2-3]
S13@2-3 [p=S9@2-2; c=S5@2-3]
S4@0-3 [p=S1@0-0; c=S12@0-3]
S3@0-3 [p=S1@0-0; c=S4@0-3]
S2@0-3 [p=S0@0-0; c=S3@0-3]
Earley Set 4
S6@0-4 [p=S3@0-3; s=Add; t=\'+']
S7@4-4
Earley Set 5
S5@4-5 [p=S7@4-4; s=Number; t=\7]
S4@4-5 [p=S7@4-4; c=S5@4-5]
S10@0-5 [p=S6@0-4; c=S4@4-5]
S11@4-5 [p=S7@4-4; c=S4@4-5]
S3@0-5 [p=S1@0-0; c=S10@0-5]
S2@0-5 [p=S0@0-0; c=S3@0-5]
trace_values
Output
Pushed value from a12 T@0-1_Number: Number = \42
Popping 1 values to evaluate a12 T@0-1_Number, rule: 2: Factor -> Number
Calculated and pushed value: 42
Pushed value from a10 R4:1@0-1T@1-2_Multiply: Multiply = \'*'
Pushed value from a9 T@2-3_Number: Number = \1
Popping 1 values to evaluate a9 T@2-3_Number, rule: 2: Factor -> Number
Calculated and pushed value: 1
Popping 3 values to evaluate a8 R4:2@0-2F2@2-3, rule: 4: Factor -> Factor Multiply Factor
Calculated and pushed value: 42
Popping 1 values to evaluate a7 F4@0-3, rule: 1: Term -> Factor
Calculated and pushed value: 42
Pushed value from a5 R3:1@0-3T@3-4_Add: Add = \'+'
Pushed value from a4 T@4-5_Number: Number = \7
Popping 1 values to evaluate a4 T@4-5_Number, rule: 2: Factor -> Number
Calculated and pushed value: 7
Popping 1 values to evaluate a3 F2@4-5, rule: 1: Term -> Factor
Calculated and pushed value: 7
Popping 3 values to evaluate a2 R3:2@0-4F1@4-5, rule: 3: Term -> Term Add Term
Calculated and pushed value: 49
Popping 1 values to evaluate a1 F3@0-5, rule: 0: Expression -> Term
Calculated and pushed value: 49
New Virtual Rule: a0 F0@0-5, rule: 5: Expression['] -> Expression
Symbol count is 1, now 1 rules
COPYRIGHT AND LICENSE
Copyright 2010 Jeffrey Kegler
This file is part of Marpa::XS. Marpa::XS is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::XS is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::XS. If not, see
http://www.gnu.org/licenses/.