NAME
Parse::Marpa::Doc::Internals - Marpa Internals
OVERVIEW
Knowledge of Marpa internals is not necessary to use Marpa. But you can make better use of Marpa if you understand how it works. This document assumes that you have read the other Marpa documentation, and that you have a general knowledge of parsing.
EARLEY SETS
To speak pedantically, the algorithm in Earley's 1970 parsing article does not actually parse in the strict sense. It is a recognizer, not a parser.
Earley items are built as tokens are recognized. If an Earley item of the correct form is built when all of an input has been processed, that input is recognized as one of those described by the grammar. Once the recognition phase is done, the phase that does parsing in the strict sense uses the Earley items.
Every Earley item has a current earleme. An Earley item's current earleme is also called its dot earleme, its current location, or its dot location. The set of all the Earley items with the same dot location is called an Earley set.
In addition to its current earleme, each Earley item has a QDFA state, and an origin. The origin of an Earley item can also be called its start location. Here's a representation of an Earley item, as you might see it in debugging output from Marpa:
S5@6-7
This Earley item is for QDFA state 5 (that is the "S5" part). The "@6-7" part says that this Earley item originates at earleme 6, and is current at earleme 7. The number of an Earley item's current, or dot, earleme is always the same as the number of the Earley set that contains the item.
(Note to experts: Those familiar with Earley parsing will note that S5@6-7 does not look like a traditional Earley item. QDFA states are not traditional -- they are my elaboration of an invention by Aycock and Horspool. The use of the at-sign as punctuation follows Grune and Jacobs. Finally, in the traditional notation for an Earley item, the dot earleme is not included at all -- you were supposed to figure out from the context which Earley set an Earley item belongs to.)
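Purely as an illustration -- this is a sketch, not Marpa's actual internal representation, and the field names are invented for it -- an Earley item can be pictured as a record with those three fields:
use strict;
use warnings;

# Illustrative sketch only -- not Marpa's internal representation.
# An Earley item bundles a QDFA state, an origin earleme, and a
# current (dot) earleme.
sub new_earley_item {
    my ( $state, $origin, $current ) = @_;
    return { state => $state, origin => $origin, current => $current };
}

# Render an item in the notation used by Marpa's debugging output.
sub earley_item_name {
    my ($item) = @_;
    return "$item->{state}\@$item->{origin}-$item->{current}";
}

my $item = new_earley_item( 'S5', 6, 7 );
print earley_item_name($item), "\n";    # prints "S5@6-7"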
QDFA STATES
I will mention QDFA's (Quasi-Deterministic Finite Automata) frequently and NFA's (Non-deterministic Finite Automata) sometimes. All you need to know about NFA's is that the QDFA's are built from them. NFA state numbers sometimes appear in the diagnostic outputs. They can be ignored.
About QDFA's, it will be necessary to know more. Let's start with an example of another Earley item:
S1@2-2
This states that this item is for QDFA state 1; that it originates at earleme 2; and that it is currently at (or has its dot position at) earleme 2. We can get a description of the QDFA states with the show_QDFA method. Here's what it says about QDFA state 1:
S1: predict; 1,5
e ::= . e op e
e ::= . number
<e> => S3
<number> => S4
For those who like the big picture, or who simply like to skim ahead, the entire show_QDFA output, along with the other trace and debugging outputs for this example, is in the appendix.
The first line of show_QDFA's description of QDFA state 1 begins with its name: S1. Next comes the predict tag. That indicates this item was created by a prediction. That's one of the four ways an Earley item can be created. There's more on that below.
The two numbers after the "predict" tag on the first line may be ignored. They are the numbers of the NFA states that were combined into this QDFA state. Marpa uses QDFA states to track the parse. Every QDFA state has one or more LR(0) items. In this example, the second and third lines are LR(0) items. LR(0) items are rules with a dot added to indicate how far recognition has proceeded into the rule.
A dot between two symbols indicates that all symbols before the dot have been recognized, but that none of the symbols after the dot have been recognized. A dot at the beginning of an LR(0) item indicates that none of its symbols have yet been recognized. A dot at the end indicates that the rule is complete -- all of its symbols have been recognized. The location of the dot is called the dot position.
The last two lines in the display show transitions. The first of these says that, on seeing an e, you transition to QDFA state 3. The second says that, on seeing a number, you transition to QDFA state 4.
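In the same illustrative spirit (the structure and field names below are invented for this sketch, not taken from Marpa's code), a QDFA state can be pictured as a set of LR(0) items plus a transition table keyed by symbol:
# Illustrative sketch only.  An LR(0) item is a rule plus a dot
# position; a QDFA state is a set of LR(0) items plus its transitions.
my $rule_0 = { lhs => 'e', rhs => [ 'e', 'op', 'e' ] };    # e ::= e op e
my $rule_1 = { lhs => 'e', rhs => ['number'] };            # e ::= number

my $qdfa_state_1 = {
    name  => 'S1',
    items => [
        { rule => $rule_0, dot => 0 },    # e ::= . e op e
        { rule => $rule_1, dot => 0 },    # e ::= . number
    ],
    transitions => { e => 'S3', number => 'S4' },
};

# The dot counts the rhs symbols recognized so far.  When it equals the
# length of the rhs, the LR(0) item is a completed rule.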
It's important not to confuse Earley items with LR(0) items. Earley items are built from one or more LR(0) items. In traditional Earley parsing, each Earley item contained one and only one LR(0) item. This made the internals simpler, but was not as efficient. Marpa combines LR(0) items into QDFA states based on ideas in Aycock and Horspool 2002.
Every LR(0) item is a statement that a rule has been recognized as far as the dot position. The dot can be at the beginning of the rule in an LR(0) item, indicating that the part of the rule recognized so far is zero length. An LR(0) item with the dot at the beginning of the rule is a prediction that the rule might be recognized.
Every QDFA state contains one or more LR(0) items and is therefore a set of statements about rule recognition. An Earley item is present in the Earley sets if and only if every rule in every LR(0) item in the QDFA state of the Earley item has been recognized as far as its dot position.
Earley items fix the location in the earleme stream where LR(0) items begin and end. Recognition of each LR(0) item began at the origin of the Earley item, and ended at the dot location of the Earley item. Put another way, the origin of each LR(0) item corresponds to the origin of the Earley item, and the dot position of each LR(0) item corresponds to the dot location of the Earley item.
Some LR(0) items are predictions -- statements that their rules could start at the dot location of the Earley item. In predictions, no symbols have been recognized, and the origin and the dot location of the Earley item will be the same. Our example is a prediction -- both origin and dot location are at earleme 2. We can see that that is the case from the "2-2" in the description of Earley item S1@2-2.
HOW EARLEY SETS ARE BUILT
New items come into the Earley sets in four ways: scanning, completion, prediction, and initialization.
Scanning
Scanning adds Earley items to indicate which tokens have been seen in the input, and where. Suppose the Earley item S1@2-2 (mentioned above) is present at earleme 2. Marpa knows to instruct the lexer to look for a number token because there is a transition from S1 to S4 on number.
If the lexer finds a number token with a length (in earlemes) of 1 at earleme 2, a new Earley item S4@2-3 is added at earleme 3. (Earleme 3 because 3 is 2+1, the current earleme plus the token length.)
Here (again from show_QDFA) is the description of QDFA state 4:
S4: 6
e ::= number .
We can see that there is only one LR(0) item in QDFA state 4, and that it has the dot pointer at the end of its rule. An LR(0) item with the dot pointer at the end of its rule is a completed rule.
Marpa calls the item that was looking for the scanned symbol the predecessor of the item added by the scan. In this example, S1@2-2 is the predecessor of S4@2-3. Any item which is added to an Earley set based on a QDFA state transition from a predecessor is called that predecessor's successor. In this example, S4@2-3 is the successor of S1@2-2.
An Example
Here's what S4@2-3 looks like in the Earley sets, after a successful scan of the input "2+2".
S4@2-3 [p=S1@2-2; t=2]
After the name of the Earley item is an Earley item source choice, shown in brackets. When the context makes the meaning clear, an Earley item source choice may be called a source or a source choice. This parse is not ambiguous, so all Earley items have at most one source choice entry.
In this case the source choice entry is for a token choice. Token choices are shown as [p=predecessor; t=token_value] pairs. In the above example, the token choice [p=S1@2-2; t=2] indicates that the token has a value of "2", and that, when this token choice is selected, the predecessor of S4@2-3 is S1@2-2.
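Here is a sketch of the scan step in terms of the illustrative structures above. It is not Marpa's code; the argument names and the layout of the source choice record are assumptions made for this sketch.
# Illustrative sketch of scanning.  $item is an Earley item whose QDFA
# state has a transition on $token_name, and the lexer has matched that
# token over $token_length earlemes with value $token_value.
sub scan_token {
    my ( $earley_sets, $qdfa, $item, $token_name, $token_value, $token_length ) = @_;
    my $to_state = $qdfa->{ $item->{state} }{transitions}{$token_name};
    return if not defined $to_state;    # this item was not expecting the token
    my $successor = {
        state   => $to_state,
        origin  => $item->{origin},     # same origin as the predecessor
        current => $item->{current} + $token_length,
        # one token source choice: [p=predecessor; t=token_value]
        sources => [ { predecessor => $item, token => $token_value } ],
    };
    push @{ $earley_sets->[ $successor->{current} ] }, $successor;
    return $successor;
}
Applied to S1@2-2 with a number token of value "2" and length 1, this yields S4@2-3 with the token source choice [p=S1@2-2; t=2], just as shown above.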
Completion
Whenever an Earley item with a completed rule is added to an Earley set, other Earley items may be added to that Earley set as a result. The item with the completed rule is called the completion, or the cause, and any new item added as a result is called an effect.
Let's look at one example of a completion "cause and effect". As the completion in our example, we'll use S4@2-3, the Earley item in our example of a scanned item. Its state was QDFA state 4. QDFA state 4 contained the LR(0) item
e ::= number .
The dot is at the far right, so it's a completed rule whose left hand side is e.
S4@2-3 begins at earleme 2 and has a complete rule with e as its lhs, and therefore any Earley item which is looking for an e beginning at earleme 2 is in luck. S5@0-2 is such an item. QDFA state 5 looks like this:
S5: 3
e ::= e op . e
<e> => S6
This says that in QDFA state 5, when you see an e, you transition to QDFA state 6. Here's S6@0-3, the item which is the effect of the completion:
S6@0-3 [p=S5@0-2; c=S4@2-3]
For completion items, the terms predecessor and successor mean pretty much what they did in token scanning. Since S6@0-3 comes from moving the dot forward in a rule in S5@0-2, we say S5@0-2 is the predecessor of S6@0-3, and that S6@0-3 is the successor of S5@0-2.
The same Earley item can be added to the same Earley set for multiple reasons. An item can be added because of one or more scanned tokens; because of one or more completions; or because of any combination of these.
Traditionally in Earley's algorithm, the same item is not added twice. Where an item comes from is not important in recognition. But parsing in the strict sense means determining the structure of the input, and that requires us to know why an Earley item was added to an Earley set. Each reason for the existence of an Earley item in an Earley set is a different choice that can be made when it comes time to create a parse structure.
Marpa uses the Earley item source choices to track the reasons why Earley items were added. Aycock and Horspool don't call them source choices. But Marpa's method is derived from the one Aycock and Horspool describe in their 2002 article.
We've already seen a token source choice in the scanning example. Earley item S6@0-3 has another example of a source choice: [p=S5@0-2; c=S4@2-3]. This is another kind of source choice, a completion source choice. This source choice says that one reason for S6@0-3 to exist was the completion in S4@2-3, with S5@0-2 as the predecessor. Since this parse is unambiguous, there are no other source choices.
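In the same sketch terms (again, not Marpa's code), completion looks back in the Earley set at the completed item's origin for items that were expecting the completed rule's lhs:
# Illustrative sketch of completion.  $complete is an Earley item whose
# QDFA state contains a completed rule with left hand side $lhs.  The
# candidate predecessors are in the Earley set at $complete's origin.
sub complete_rule {
    my ( $earley_sets, $qdfa, $complete, $lhs ) = @_;
    for my $predecessor ( @{ $earley_sets->[ $complete->{origin} ] } ) {
        my $to_state = $qdfa->{ $predecessor->{state} }{transitions}{$lhs};
        next if not defined $to_state;
        push @{ $earley_sets->[ $complete->{current} ] }, {
            state   => $to_state,
            origin  => $predecessor->{origin},
            current => $complete->{current},
            # one completion source choice: [p=predecessor; c=cause]
            sources => [ { predecessor => $predecessor, cause => $complete } ],
        };
    }
    return;
}
For the completion S4@2-3, whose lhs is e, the loop finds S5@0-2 in Earley set 2 and adds S6@0-3 with the source choice [p=S5@0-2; c=S4@2-3].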
Prediction
A third source of Earley items is prediction. Whenever an Earley item is expecting to see an e, for example, it can also expect to see the start of all the rules that have an e on their left hand side.
Recall QDFA state 5:
S5: 3
e ::= e op . e
<e> => S6
QDFA state 5 is expecting an e. This means that the rules
e ::= . e op e
e ::= . number
can also be expected to originate at this earleme. Marpa is smart about grouping rules into QDFA states and both these rules are in QDFA state 1:
S1: predict; 1,5
e ::= . e op e
e ::= . number
<e> => S3
<number> => S4
Since S5@0-2 is at earleme 2, and it is expecting the rules which originate in QDFA state 1, there will be an Earley item for QDFA state 1 at earleme 2:
S1@2-2
In predicted items, either the dot is at the beginning of the rule, or else all the symbols to the left of the dot are nulling symbols. Either way, the origin and dot earlemes will always be the same.
Source choices are not recorded for predicted items. Predicted items are essential to get rules started in the Earley sets. But they are not actually needed when the time comes for parsing and evaluation. That means they don't need source choice entries.
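A sketch of prediction in the same terms. The map from symbols to their predicted QDFA states is an assumption made for this sketch:
# Illustrative sketch of prediction.  $here is the current earleme,
# @$expected are the symbols the predicting item's QDFA state is
# waiting for, and $predicted_state_for->{$symbol} names the QDFA state
# that groups the rules predicted for that symbol.
sub predict {
    my ( $earley_sets, $here, $expected, $predicted_state_for ) = @_;
    for my $symbol (@$expected) {
        my $state = $predicted_state_for->{$symbol};
        next if not defined $state;
        # Predicted items start and end at the same earleme and carry
        # no source choices -- they are never needed for evaluation.
        push @{ $earley_sets->[$here] },
            { state => $state, origin => $here, current => $here };
    }
    return;
}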
Initialization
One or two Earley items are put into Earley set 0 to start things off. In our example, the initial Earley items are
S0@0-0
S1@0-0
QDFA state 0 contains the start rules, with the dot pointer at the beginning. Only Earley set 0 will contain an Earley item for QDFA state 0. S1@0-0 contains the rules predicted by S0@0-0.
HOW A SUCCESSFUL PARSE IS RECOGNIZED
As mentioned, what's usually called Earley's algorithm is just a recognizer, an algorithm to build the Earley sets. The input is recognized successfully if, at the earleme location designated as the end of parsing, there is an Earley item for a completed start rule state. A completed start rule state is a QDFA state containing a completed start rule. A completed start rule is the LR(0) item for the start rule that has its dot position at the end of that rule.
At most two of the QDFA states can be completed start rule states. One is a special case. In grammars which allow a null parse, the start state, QDFA state 0, is also a completed start rule state.
The other completed start rule state contains the LR(0) item which has Marpa's internal start rule with the dot pointer at the end of the rule. The lhs of the internal start rule is a special internal start symbol. The rhs of the internal start rule is the grammar's original start symbol. QDFA state 2 is the completed start rule state for our example parse:
S2: 8
e['] ::= e .
A parse starting at earleme S is successful at earleme N if it contains an Earley item for a completed start rule state with an origin earleme of S and a dot earleme of N. Here's the Earley item which makes the parse in our example, which started at earleme 0, successful at earleme 3:
S2@0-3 [p=S0@0-0; c=S6@0-3]
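Checking for success is then a matter of scanning one Earley set. A sketch, using the illustrative structures above and a hash (an assumption for this sketch) that marks the completed start rule states:
# Illustrative sketch.  %$is_completed_start_state marks the (at most
# two) QDFA states containing a completed start rule; $start and $end
# are the start and end earlemes of the parse.
sub parse_succeeded {
    my ( $earley_sets, $is_completed_start_state, $start, $end ) = @_;
    for my $item ( @{ $earley_sets->[$end] } ) {
        return $item
            if $is_completed_start_state->{ $item->{state} }
            and $item->{origin} == $start;
    }
    return;    # no such item: the input was not recognized
}
In the example, QDFA state 2 is the only completed start rule state, and S2@0-3 satisfies the test at earleme 3.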
THE PARSE BOCAGE
After a successful parse has been recognized, it can be evaluated. The main data structure of a Marpa evaluator is a parse bocage.
Each Marpa evaluator is created with parse criteria. The parse criteria are the start and end earlemes of the parse, and the parse's top symbol. Default values are often used for these parse criteria. The default start earleme is earleme 0. The default end earleme comes from the default end of parsing which, in offline mode, is usually the furthest earleme reached by any token in the parse. The default top symbol is the start symbol of the grammar.
The parse bocage stores all parses meeting the evaluator's parse criteria. In other general parsers, a parse forest has been used to store ambiguous parses. Marpa's parse bocage is an adaptation of a parse forest. Parse bocages differ from parse forests in three ways: parse bocages may contain cycles; and-nodes and or-nodes strictly alternate in a parse bocage; and in parse bocages and-nodes are limited to two rhs symbols.
Parse forests and parse bocages both consist of and-nodes and or-nodes. In both parse forests and parse bocages, and-nodes contain pieces of parse trees, while or-nodes represent choices among the pieces.
In parse forests, the and-nodes directly represent productions of the original grammar. Marpa's parse bocages contain the original grammar in Marpa Bocage Form (MBF), which represents the original grammar's productions in pieces with at most two symbols on the rhs. In restricting the rhs to at most two symbols, MBF is very similar to Chomsky Normal Form (CNF).
Marpa uses MBF for four reasons. First, MBF is essentially the form in which the grammar is found in the Earley items. Every source choice of an Earley item contains at most two links. These links can be treated as symbols on the rhs of a production. So the parse as found in the Earley items is in a form with at most two symbols on the right hand side.
Second, MBF is more compact than a parse forest. Third, MBF can handle parses with cycles. Fourth, MBF easily produces binary trees. Binary trees are fast and easy to traverse and manipulate. The effort spent to rewrite the parse data found in the Earley sets back into the productions of the original grammar would result in a slower and more cumbersome data structure.
In MBF, every and-node represents either a production of the original grammar, or a piece of one. And-nodes and or-nodes strictly alternate in MBF. Every child of an and-node is an or-node. Every child of an or-node is an and-node.
Types of Or-Nodes
An or-node is trivial if it has only one child and-node. A trivial or-node represents a choice from a list containing only a single alternative. Marpa or-nodes are often trivial.
Each or-node corresponds to a rule, a dot position in that rule, and an earleme location. The symbol before the dot position in a rule is the or-node's pre-dot symbol. The symbol after the dot position in a rule is the or-node's post-dot symbol.
The or-node symbol is its pre-dot symbol. Or-nodes are not created for dot positions at the beginning of rules, so every or-node has an or-node symbol.
An or-node whose dot position is at the end of its rule is a closure or-node. Any other or-node is a kernel or-node. If the post-dot symbol of a kernel or-node is nulling, it is a mortar or-node. Any other kernel or-node, and every closure or-node, is a brick or-node.
The idea behind mortar or-nodes is that they fill the cracks in between brick or-nodes. Mortar or-nodes exist because Marpa has no LR(0) items with nulling post-dot symbols. Mortar or-nodes are needed because there must be an or-node for every symbol in every rule used in any possible parse. No Earley item matches a mortar or-node.
Matching and Nearest Earley Items
Every brick or-node matches an Earley item, based on its rule, dot position, and earleme location. An or-node is said to match an Earley item if the current location of the Earley item is the same as the or-node's earleme location; and for one of the LR(0) items in the Earley item's QDFA state, the rule and the dot position are the same as the rule and dot position of the or-node.
Marpa LR(0) items cannot have the dot position before a nulling symbol, so that mortar or-nodes don't match any Earley items. An Earley item is the nearest to an or-node if its current location is the same as the earleme location of the or-node; and its QDFA state contains the nearest Marpa LR(0) item to the or-node. The nearest Marpa LR(0) item to an or-node is the one with the same rule as the or-node and the nearest Marpa-correct dot position at or after the or-node's dot position. A dot position is Marpa-correct if it is not in front of a nulling symbol. If there is an Earley item which matches an or-node, that Earley item is also the one nearest to the or-node.
Child Criteria and Child Nodes
Every or-node is a parent of one or more and-nodes. Every and-node is also a parent. Children of and-nodes can be either tokens or or-nodes.
Every or-node has child criteria, which are used, along with the or-node's nearest Earley item, to create the or-node's children. The child criterion for a closure or-node is a symbol, called its child lhs symbol. The child criteria for a kernel or-node are a rule, and a dot position.
In turn, each and-node either has just one child, called a cause, or two children, a predecessor and a cause. The predecessor, if present, always comes first. The cause is always last. When the cause is the only child it is, of course, also first.
A cause can be either a closure or-node or a token. If the parent or-node's symbol is nulling, there will be only one child and-node, and the and-node's cause will be a token created from the null value of the symbol. If the parent or-node's symbol is non-nullable, child and-nodes are created from the source choices of the nearest Earley item. Where the source choice is a token source choice, the and-node's cause will be a token. Where the source choice is a completion source choice, the and-node's cause will be a closure or-node.
For every or-node, there is a set of productions which are used to create the or-node's and-node children, and the cause and predecessor grandchildren. The productions in this set are the applicable rules. When creating and-nodes, Marpa considers every pairing of an applicable rule and a source choice.
Where the parent or-node is a kernel or-node, the rule in the child criteria is the only applicable rule. For closure or-nodes, there may be more than one applicable rule. Where the parent or-node is a closure or-node, every completed rule in the QDFA state of the matching Earley item whose lhs is the same as the child lhs symbol is an applicable rule.
If there are any symbols before a parent or-node's pre-dot symbol, its child and-node will have a predecessor child. This predecessor will be a kernel or-node. As its rule, the predecessor will have one of the applicable rules. If the predecessor's grandparent or-node was a closure or-node, the predecessor or-node's dot position will be in front of the last symbol of the rule. If the predecessor's grandparent or-node was a kernel or-node, the predecessor or-node's dot position will be one less than its grandparent's dot position. That means that the pre-dot symbol of the grandparent or-node will be the post-dot symbol of the predecessor or-node.
Naming Or-Nodes
The name of an or-node is the name of its nearest Earley item, followed by a label describing its child criteria. For a kernel or-node, the label consists of the letter "R", and the number of the or-node's rule and its dot position, colon-separated. The "R" in the label is a mnemonic for "rule".
For example, S3@0-1R0:1 is the name of a kernel or-node. The nearest Earley item is S3@0-1. The child criteria are rule 0 at dot position 1 (e -> e . op e).
For closure or-nodes, the child criteria label is the letter "L", followed by the number of the child lhs symbol. The "L" in the closure or-node's label is a mnemonic for "left hand side". An example of a closure or-node is S6@0-3L0. In this case the matching Earley item is S6@0-3, and the child lhs symbol is e, which was symbol 0. (Symbol numbers for the example may be found in the appendix, in the show_symbols output.)
Naming And-Nodes
An and-node is always a child of an or-node. The name of an and-node is the name of the parent or-node, followed by a numerical index in square brackets. The child and-nodes are numbered starting at 0. For example, S6@0-3L0[0] is the first child of the or-node named S6@0-3L0.
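Putting the last few sections together -- purely as an illustration, not Marpa's data structures -- here is the neighborhood of S6@0-3L0 in our example, with each or-node holding its list of child and-nodes, and each and-node holding an optional predecessor and a cause:
# Illustrative sketch of part of the example's parse bocage.
my $or_S4_2_3_L0 = {                  # closure or-node for  e ::= number .
    name      => 'S4@2-3L0',
    and_nodes => [ { cause => { token => '2' } } ],
};
my $or_S5_0_2_R0_2 = {                # kernel or-node for  e ::= e op . e
    name      => 'S5@0-2R0:2',
    and_nodes => [
        { predecessor => { name => 'S3@0-1R0:1' }, cause => { token => '+' } },
    ],
};
my $or_S6_0_3_L0 = {                  # closure or-node for  e ::= e op e .
    name      => 'S6@0-3L0',
    and_nodes => [
        { predecessor => $or_S5_0_2_R0_2, cause => $or_S4_2_3_L0 },
    ],
};
All three or-nodes here are trivial -- each has a single child and-node -- as is to be expected in an unambiguous parse.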
The Parse Bocage Grammar
Marpa typically displays a parse bocage as a parse bocage grammar (PBG). It may seem pointless to have come this far in the parsing, only to write out another grammar. But there is a big difference between the original grammar and the parse bocage grammar. The original grammar describes all parses which are possible for any input. The PBG describes only those parses which are possible for the specific input which was accepted by the recognizer. In the parse bocage grammar, much of the work of parsing a specific input has been done.
Examples of Parse Bocage Productions
S2@0-3L3 ::= S2@0-3L3[0]
A parse bocage grammar always starts with the start or-production. The above is the start production for our example. The left hand side is an or-node which matches a start rule completion item. The completed start rule state in our example is QDFA state 2 and the input in our example was 3 earlemes long. So the start Earley item is S2@0-3. The start symbol is e['], which is symbol number 3. So the child criterion label is "L3".
Start or-productions are always trivial. The rhs of the start or-production is the and-node which is its first and only child. It has index 0.
Or-nodes appear in the PBG as one or more or-productions. Each or-production has the name of an or-node as its lhs. Also, each or-production has the name of one of the or-node's child and-nodes as the only symbol on its rhs.
In many cases, the or-nodes are trivial -- they have exactly one child. Trivial or-nodes have only one or-production in the PBG. In an unambiguous parse, like the parse of our example, all the or-nodes are trivial.
Even when they are non-trivial, or-productions contain little information, and show_bocage does not show or-productions unless called with the verbose option. Or-productions are necessary for the PBG to be a real BNF grammar, but the information in them is easily deduced from the names of the and-nodes on the lhs of the and-productions.
For every and-node on the lhs of an and-production, there will always be an or-production with the unbracketed name of the and-node as the or-production's lhs, and with the and-node itself as the or-production's only rhs symbol. For example, from the name of the and-node S2@0-3L3[0], we can deduce the existence of the or-production in the example above.
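That deduction amounts to stripping the bracketed index. A throwaway sketch:
# Illustrative sketch: deducing an or-production from an and-node name.
my $and_node_name = 'S2@0-3L3[0]';
( my $or_node_lhs = $and_node_name ) =~ s/ \[ \d+ \] \z //x;    # strip the index
print "$or_node_lhs ::= $and_node_name\n";    # prints "S2@0-3L3 ::= S2@0-3L3[0]"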
Here is the and-production which is the start or-node's only child:
S2@0-3L3[0] ::= S6@0-3L0
rule 2: e['] ::= e .
rhs length = 1; closure
Parse bocage and-productions always have one or two rhs symbols. Either a token or a closure or-node will be on the right hand side. One or the other will always be present, but never both. Additionally, there may be a kernel or-node. If a kernel or-node is present, its name is always the first symbol on the rhs.
In this example, the only symbol on the rhs is a closure or-node: S6@0-3L0. The next line contains a dotted rule which shows the rule and rule position for this and-node.
The last line tells us the length of the rule's rhs, and that there is a Perl closure corresponding to this rule. Perl closures and closure or-nodes are not to be confused. The word "closure" in "closure or-node" refers to its being the last in a rule -- that is, it "closes" the rule. A closure or-node may or may not have a corresponding Perl closure.
S5@0-2R0:2 ::= S5@0-2R0:2[0]
S5@0-2R0:2[0] ::= S3@0-1R0:1 '+'
rule 0: e ::= e op . e
rhs length = 3
This example shows an and-node with a kernel or-node and a token on the rhs of its rule. The and-node is at position 2 of rule 0, as shown by its child criteria label ("R0:2") and in the dotted rule on the third line. The pre-dot symbol ("op") matched the token ("+"). The kernel or-node (S3@0-1R0:1) contains information for deriving the symbols in rule 0 before the pre-dot symbol. In this case there is only one symbol ("e") before the pre-dot symbol.
S4@0-1L0 ::= S4@0-1L0[0]
S4@0-1L0[0] ::= '2'
rule 1: e ::= number .
rhs length = 1; closure
The and-production above has only a token on its rhs. There is no kernel or-node, because the current symbol for this and-production is the first one of its rule -- in fact, it is the only symbol in its rule.
S4@0-1L0 is a closure or-node with "L0" as its child criteria. Applicable rules will be found in QDFA state 4, and must be completed rules with symbol number 0 on the lhs. Symbol number 0 is e. There is only one LR(0) item in QDFA state 4, but it is, as required, a completed rule with e as its lhs.
THE PARSE TREES
The parse tree for our example parse is shown in an appendix. Every node in a Marpa parse tree comes from an and-node in the parse bocage.
A parse bocage contains all possible parses, but a parse tree is for a specific parse. Only those and-nodes actually used in the specific parse become nodes in the parse tree. In the case of an unambiguous parse, every parse bocage and-node becomes a tree node. In the case of an ambiguous parse, some and-nodes will not be used in the parse tree.
Tree Node: S6@0-3L0[0]; Depth = 1; Rhs Length = 3
Rule: e ::= e op e .
Kernel: S5@0-2R0:2
Closure: S4@2-3L0
Perl Closure: Y
Tree Node: S4@2-3L0[0]; Depth = 2; Rhs Length = 1
Rule: e ::= number .
Perl Closure: Y; Token: '2'
Most of the information in the tree nodes is repeated from the parse bocage and-nodes. The "depth" is the distance of the tree node from the root of the tree.
Marpa's parse trees are kept in pre-order. They can be traversed as a tree, starting from the root. They can also be traversed from the end, which is what Marpa does when it evaluates a parse.
Marpa evaluates a parse tree using an evaluation stack, which is initialized to empty. Every time Marpa evaluates a tree node, it pushes the result onto the evaluation stack. If a tree node needs values as input, perhaps because a Perl closure requires arguments, the values are popped from the evaluation stack. When Marpa reaches the root tree-node, computes its value and pushes it onto the stack, that value will be the only item on the stack. That one remaining stack item becomes the value of the parse.
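Here is a sketch of that loop, assuming each tree node is a record carrying its rhs length, an optional token value, and an optional Perl closure. The field names are assumptions made for this sketch, but the pushes and pops match the trace_values output shown below.
# Illustrative sketch of evaluating a parse tree kept in pre-order.
# Walking the node list from the end, each node pushes its token value
# (if it has one); then, if it has a closure, it pops rhs-length values
# as arguments and pushes the closure's result.
sub evaluate_tree {
    my @tree = @_;
    my @stack;
    for my $node ( reverse @tree ) {
        push @stack, $node->{token} if exists $node->{token};
        next if not $node->{closure};
        my @args = splice @stack, -$node->{rhs_length};
        push @stack, $node->{closure}->(@args);
    }
    return $stack[0];    # the single remaining value is the parse value
}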
The evaluation process can be followed by setting the trace_values option. The trace_values output for our example is in an appendix. Here's the section that evaluates the tree nodes shown above:
Popping 1 values to evaluate S4@2-3L0, rule: 1: e -> number
Calculated and pushed value: '2==2'
Popping 3 values to evaluate S6@0-3L0, rule: 0: e -> e op e
Calculated and pushed value: '(2+2)==4'
GRAMMAR REWRITING
Marpa rewrites grammars, adding internal symbols and rules in the process. This rewriting does not affect the semantics, but it does show up when you examine the internals.
Marpa's internal symbols have tags at the end, enclosed in square brackets. This means all Marpa internal symbols end in a right square bracket.
Adding a Start Rule
Many parsers add their own start rule and their own start symbol to grammars. Marpa is no exception. The new start symbol is the old one with "[']" suffixed. We saw a Marpa internal start symbol above: e[']. If the grammar allows a null parse, there will also be a nulling start symbol, with "['][]" suffixed.
Eliminating Proper Nullable Symbols
Nulling symbols are those which always produce the empty sentence. Nullable symbols are those which sometimes produce the empty sentence. Non-nullable symbols are those which never produce the empty sentence.
Pedantically, all nulling symbols are also nullable symbols. A proper nullable is any nullable symbol which is not a nulling symbol. In other words, a proper nullable is a symbol that can produce either the empty sentence or a non-empty sentence.
Nullable symbols were a problem in previous versions of Earley parsers. Aycock and Horspool 2002 outlined a new approach for dealing with them. I use their ideas with modifications of my own.
Marpa rewrites its grammar to eliminate proper nullables. It does this by turning every proper nullable into two symbols: a non-nullable variant and a nulling variant. The non-nullable variant keeps the original symbol's name, but is no longer allowed to appear in places where it might be nulled. The name of the nulling variant is that of the original symbol with the nulling tag ("[]") suffixed. When the nulling variant is used, it must be nulled.
The newly introduced nulling symbols will not appear on any left hand sides. Marpa marks nulling symbols internally and recognizes them directly. It does not need empty rules.
Rules with proper nullables on the rhs have to be replaced with new rules covering every possible combination of the non-nullable and nulling variants. That rewrite is described in the next section.
CHAF Rewriting
To deal with the splitting of rhs proper nullables into two symbols, one non-nullable and one nulling, Aycock and Horspool created new rules covering all possible rhs combinations of non-nullable and nulling symbols. This factoring was exponential in the worst case. I don't like leaving exponential explosions in an algorithm, even unlikely ones. And I suspect that the generation of all possible combinations for an arbitrarily long right hand side might be inefficient in practice.
A result due to Chomsky shows that any grammar can be rewritten as a grammar with at most two symbols on the right hand side. Relaxing Chomsky's rewrite to allow right hand sides with any number of symbols, but at most two proper nullables, produces a rewrite I call CHAF (Chomsky-Horspool-Aycock Form).
CHAF changes the worst case to linear, and in practical cases lowers the multiplier. Here's an example of a CHAF rewrite from Marpa's own self-grammar. First, the rule:
production paragraph:
non structural production sentences,
production sentence,
non structural production sentences,
optional action sentence,
non structural production sentences.
This rule contains four proper nullables, reinforcing my fear that grammars written as test cases won't be the only ones with lots of proper nullables on the right hand side. non structural production sentences and optional action sentence are both proper nullables, and non structural production sentences appears three times.
Here's the output from show_rules, showing what Marpa did with this rule:
12: production-paragraph
-> non-structural-production-sentences
production-sentence
non-structural-production-sentences
action-sentence:optional
non-structural-production-sentences /* !useful */
99: production-paragraph ->
non-structural-production-sentences
production-sentence
production-paragraph[R12:2][x5e]
/* priority=0.3 */
100: production-paragraph ->
non-structural-production-sentences[]
production-sentence
production-paragraph[R12:2][x5e]
/* priority=0.1 */
101: production-paragraph ->
non-structural-production-sentences
production-sentence
production-paragraph[R12:2][x5e][]
/* priority=0.2 */
102: production-paragraph ->
non-structural-production-sentences[]
production-sentence
production-paragraph[R12:2][x5e][]
103: production-paragraph[R12:2][x5e] ->
non-structural-production-sentences
production-paragraph[R12:3][x60]
/* priority=0.3 */
104: production-paragraph[R12:2][x5e] ->
non-structural-production-sentences[]
production-paragraph[R12:3][x60]
/* priority=0.1 */
105: production-paragraph[R12:2][x5e] ->
non-structural-production-sentences
production-paragraph[R12:3][x60][]
/* priority=0.2 */
106: production-paragraph[R12:3][x60] ->
action-sentence:optional
non-structural-production-sentences
/* priority=0.3 */
107: production-paragraph[R12:3][x60] ->
action-sentence:optional[]
non-structural-production-sentences
/* priority=0.1 */
108: production-paragraph[R12:3][x60] ->
action-sentence:optional
non-structural-production-sentences[]
/* priority=0.2 */
Rule 12 is the original rule. Because Marpa has rewritten it, the rule is marked !useful, telling later stages in the precomputation to ignore it. Marpa breaks Rule 12 up into three pieces, each with no more than two proper nullables. Rules 99 to 102 are the first piece, with the first two symbols from Rule 12. Rules 103 to 105 are the second, with the 3rd symbol. Rules 106 to 108 are the third, with the 4th and 5th symbols from Rule 12.
Each piece is factored, so that every combination of nulling and non-nullable symbols is included. New symbols are introduced to be the left hand sides of the pieces. The tag "[R12:3]" indicates that this symbol is the left hand side for the piece of Rule 12 which begins at right hand symbol 3 (the first symbol is symbol 0). The tags beginning with an "x", like "[x5e]", are arbitrary hex values, inserted to guarantee that the new symbols are unique.
This rule is a worst case for CHAF, because the last three symbols of the right hand side are all proper nullables. That means that the last two pieces of the original rule can be either empty or non-empty, and therefore that both of the newly created lhs symbols are also proper nullables.
With the CHAF rewrite, there are now a total of 6 proper nullables: the original 4 plus the 2 symbols newly created to serve as left hand sides. This is why, in order to have only 2 proper nullables per piece, the original rule needed to be divided into 3 pieces. The newly created lhs symbols, because they are proper nullables, need to be split into nulling and non-nullable variants and factored, just like the proper nullables in the original rule.
Nonetheless this factoring can be done with 10 rules in CHAF, while the original Aycock-Horspool factoring (NNF) required 16. After more than 4 proper nullables, the advantage of CHAF becomes overwhelming. With 5 proper nullables, there would be 13 rules for CHAF versus 32 for NNF. With 6 proper nullables, 16 versus 64.
The user's semantics are preserved, because Marpa, while splitting the rule into pieces and factoring the pieces, inserts logic to gather and preserve the values of child nodes. These values are presented to the user's actions as if no CHAF rewrite had occurred.
Converting Sequence Productions to BNF
Internally, Marpa converts productions specified as sequences into BNF productions. The conversion is done in a standard way. For example,
paragraphs: empty line separated paragraph sequence.
becomes
2: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5]
3: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5] empty-line
4: paragraph[Seq:1-*][Sep:empty_line][x5] -> paragraph
5: paragraph[Seq:1-*][Sep:empty_line][x5] ->
paragraph[Seq:1-*][Sep:empty_line][x5] empty-line paragraph
In the added symbol, the tag "[Seq:1-*]" indicates that this is a symbol for a sequence of from 1 to an infinite number of symbols, and the tag "[Sep:empty_line]" indicates that it is empty_line separated.
Here's another example, this time of a sequence without a separator:
definition paragraph: definition sequence.
is written as the following BNF:
9: definition-paragraph -> definition[Seq:1-*][xa]
10: definition[Seq:1-*][xa] -> definition
11: definition[Seq:1-*][xa] -> definition[Seq:1-*][xa] definition
SELF-TUTORIALS
If you want to investigate internals more on your own, here are two "self-tutorials", which should make you pretty much an expert.
First, go through the show_earley_sets output in the appendix. For each Earley item in it, reason out how it came to be there.
Second, take the grammar used in the example and run it on the input text "2+2*3". While the parse in this document was not ambiguous, the grammar is. The grammar's ambiguity reveals itself when there is more than one operation in the input string. Get the show_earley_sets output from any one of the ambiguous parses and reason out how all of its Earley items came to be.
APPENDIX: DATA FOR THE EXAMPLE
Below are the MDL grammar and outputs for the example used in this document. The input text was "2+2".
MDL Grammar
semantics are perl5. version is 0.211.11.
start symbol is E.
E: E, Op, E.
q{
my ($right_string, $right_value)
= ($_[2] =~ /^(.*)==(.*)$/);
my ($left_string, $left_value)
= ($_[0] =~ /^(.*)==(.*)$/);
my $op = $_[1];
my $value;
if ($op eq "+") {
$value = $left_value + $right_value;
} elsif ($op eq "*") {
$value = $left_value * $right_value;
} elsif ($op eq "-") {
$value = $left_value - $right_value;
} else {
croak("Unknown op: $op");
}
"(" . $left_string . $op . $right_string . ")==" . $value;
}.
E: Number.
q{
my $v0 = pop @_;
$v0 . "==" . $v0;
}.
Number matches qr/\d+/.
Op matches qr/[-+*]/.
the default action is q{
my $v_count = scalar @_;
return "" if $v_count <= 0;
return $_[0] if $v_count == 1;
"(" . join(";", @_) . ")";
}.
show_symbols Output
0: e, lhs=[0 1], rhs=[0 2]
1: op, lhs=[], rhs=[0] terminal
2: number, lhs=[], rhs=[1] terminal
3: e['], lhs=[2], rhs=[]
show_rules Output
0: e -> e op e
1: e -> number
2: e['] -> e
show_QDFA Output
Start States: S0; S1
S0: 7
e['] ::= . e
<e> => S2
S1: predict; 1,5
e ::= . e op e
e ::= . number
<e> => S3
<number> => S4
S2: 8
e['] ::= e .
S3: 2
e ::= e . op e
<op> => S1; S5
S4: 6
e ::= number .
S5: 3
e ::= e op . e
<e> => S6
S6: 4
e ::= e op e .
show_earley_sets Output
Current Earley Set: 4; Furthest: 3
Earley Set 0
S0@0-0
S1@0-0
Earley Set 1
S4@0-1 [p=S1@0-0; t=2]
S2@0-1 [p=S0@0-0; c=S4@0-1]
S3@0-1 [p=S1@0-0; c=S4@0-1]
Earley Set 2
S5@0-2 [p=S3@0-1; t=+]
S1@2-2
Earley Set 3
S4@2-3 [p=S1@2-2; t=2]
S6@0-3 [p=S5@0-2; c=S4@2-3]
S3@2-3 [p=S1@2-2; c=S4@2-3]
S2@0-3 [p=S0@0-0; c=S6@0-3]
S3@0-3 [p=S1@0-0; c=S6@0-3]
show_bocage Output
package: Parse::Marpa::E_1; parse count: 0
S2@0-3L3 ::= S2@0-3L3[0]
S2@0-3L3[0] ::= S6@0-3L0
rule 2: e['] ::= e .
rhs length = 1; closure
S6@0-3L0 ::= S6@0-3L0[0]
S6@0-3L0[0] ::= S5@0-2R0:2 S4@2-3L0
rule 0: e ::= e op e .
rhs length = 3; closure
S5@0-2R0:2 ::= S5@0-2R0:2[0]
S5@0-2R0:2[0] ::= S3@0-1R0:1 '+'
rule 0: e ::= e op . e
rhs length = 3
S4@2-3L0 ::= S4@2-3L0[0]
S4@2-3L0[0] ::= '2'
rule 1: e ::= number .
rhs length = 1; closure
S3@0-1R0:1 ::= S3@0-1R0:1[0]
S3@0-1R0:1[0] ::= S4@0-1L0
rule 0: e ::= e . op e
rhs length = 3
S4@0-1L0 ::= S4@0-1L0[0]
S4@0-1L0[0] ::= '2'
rule 1: e ::= number .
rhs length = 1; closure
show_tree Output
Tree Node: S2@0-3L3[0]; Depth = 0; Rhs Length = 1
Rule: e['] ::= e .
Closure: S6@0-3L0
Perl Closure: Y
Tree Node: S6@0-3L0[0]; Depth = 1; Rhs Length = 3
Rule: e ::= e op e .
Kernel: S5@0-2R0:2
Closure: S4@2-3L0
Perl Closure: Y
Tree Node: S4@2-3L0[0]; Depth = 2; Rhs Length = 1
Rule: e ::= number .
Perl Closure: Y; Token: '2'
Tree Node: S5@0-2R0:2[0]; Depth = 2; Rhs Length = 3
Rule: e ::= e op . e
Kernel: S3@0-1R0:1
Perl Closure: N; Token: '+'
Tree Node: S3@0-1R0:1[0]; Depth = 3; Rhs Length = 3
Rule: e ::= e . op e
Closure: S4@0-1L0
Perl Closure: N
Tree Node: S4@0-1L0[0]; Depth = 4; Rhs Length = 1
Rule: e ::= number .
Perl Closure: Y; Token: '2'
trace_values Output
Pushed value: '2'
Popping 1 values to evaluate S4@0-1L0, rule: 1: e -> number
Calculated and pushed value: '2==2'
Pushed value: '+'
Pushed value: '2'
Popping 1 values to evaluate S4@2-3L0, rule: 1: e -> number
Calculated and pushed value: '2==2'
Popping 3 values to evaluate S6@0-3L0, rule: 0: e -> e op e
Calculated and pushed value: '(2+2)==4'
Popping 1 values to evaluate S2@0-3L3, rule: 2: e['] -> e
Calculated and pushed value: '(2+2)==4'
(2+2)==4
SUPPORT
See the support section in the main module.
AUTHOR
Jeffrey Kegler
LICENSE AND COPYRIGHT
Copyright 2007 - 2008 Jeffrey Kegler
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.