NAME

Parse::Marpa::Doc::Internals - Marpa Internals

OVERVIEW

Knowledge of Marpa internals is not necessary to use Marpa. But you can make better use of Marpa if you understand how it works. This document assumes that you have read the other Marpa documentation, and that you have worked with another parser generator, or have a general knowledge of parsing.

EARLEY SETS

To speak pedantically, the algorithm in Earley's 1970 parsing article does not actually parse in the strict sense. It is a recognizer, not a parser.

Earley sets are built as tokens are recognized. If an Earley item of the correct form is in the right place in these sets after input is finished, the input is recognized as one of those described by the grammar. Once the recognition phase is done, the phase that does parsing in the strict sense uses the Earley sets as its input.

Each of Marpa's earlemes corresponds one-to-one with an Earley set. An Earley set is a set of Earley items. Each Earley item has a QDFA state, an origin earleme and a dot earleme. An Earley item's dot earleme is also called its current earleme. Here's a representation of an Earley item, as you might see it in debugging output from Marpa:

S5@6-7

This Earley item is for QDFA state 5 (that is the "S5" part), The "@6-7" part says that this Earley item originates at earleme 6, and is current at earleme 7. The number of an Earley item's current, or dot, earleme is always the same as number of the Earley set that contains the item.

(Note to experts: Those familiar with Earley parsing will note that S5@6-7 does not look like a traditional Earley item. QDFA states are not traditional -- they are my elaboration of an invention by Aycock and Horspool. The use of the at-sign as punctuation follows Grune and Jacobs. Finally, in the traditional notation for an Earley item, the dot earleme is not included at all -- you're supposed to figure out from the context which Earley set an Earley item belongs to.)

QDFA STATES

I will mention QDFA's (Quasi-Deterministic Finite Automata) frequently and NFA's (Non-deterministic Finite Automata) sometimes. All you need to know about NFA's is that the QDFA's are built from them. NFA state numbers sometimes appear in the diagnostic outputs. They can be ignored.

About QDFA's, it will be necessary to know more. Let's start with an example of another Earley item:

S1@2-2

This states that this item is for QDFA state 1; that it originates at earleme 2; and that it is currently at (or has its dot position at) earleme 2. We can get a description of the QDFA states with the show_QDFA method. Here's what it says about QDFA state 1:

S1: predict; 1,5
e ::= . e op e
e ::= . number
 <e> => S3
 <number> => S4

For those who like the big picture, or who simply like to skim ahead, the entire show_QDFA output, along with the other trace and debugging outputs for this example, is in the appendices.

The first line of show_QDFA's description of QDFA state 1 begins with its name: S1. Next comes the predict tag. That indicates this item was created by a prediction. That's one of the four ways an Earley item can be created. There's more on that below.

The two numbers after the "predict" tag on the first line may be ignored. They are the numbers of the NFA states that were combined into this QDFA state. Marpa uses QDFA states to track the parse. Every QDFA state has one or more LR(0) items. In this example, the second and third lines are LR(0) items. They are rules with a dot added to indicate how far recognition has proceeded into the rule.

A dot between two symbols indicates that all symbols before the dot have been recognized, but that none of the symbols after the dot have been recognized. A dot at the beginning of an LR(0) item indicates that none of its symbols have yet been recognized. A dot at the end indicates that the rule is complete -- all of its symbols have been recognized. The location of the dot is called the dot position.

The last two lines in the display show transitions. The first line says that, on seeing an e, you transition to QDFA state 3. The second states that, on seeing a number, you transition to QDFA state 4.

It's important not to confuse Earley items with LR(0) items. Earley items are built from one or more LR(0) items. In traditional Earley parsing, each Earley item contained one and only one LR(0) item. This made the internals simpler, but was not as efficient. Marpa combines LR(0) items into QDFA states based on ideas in Aycock and Horspool 2002.

Every LR(0) item is a statement about the recognition of a rule. Every QDFA state is a set of statements about rule recognition. Every LR(0) item in the QDFA state of an Earley item has been recognized as far as its dot position. The string of recognized symbols ends at the dot location of the Earley item -- in other words, the dot position of each LR(0) item corresponds to the dot location of the Earley item. The start of the string of recognized symbols, which is also the start of the rule, will be the at origin of the Earley item.

Some LR(0) items are predictions -- statements that their rules could start at the dot location of the Earley item. In predictions, no symbols have been recognized, and the origin and the dot location will be the same. Our example is a prediction -- both origin and dot location are at earleme 2. we can see that that is the case from the "2-2" in the description of Earley item S1@2-2.

HOW EARLEY SETS ARE BUILT

New items come into the Earley sets in four ways: scanning, completion, prediction, and initialization.

Scanning

Scanning adds Earley items to indicate which tokens have been seen in the input, and where. Suppose the Earley item S1@2-2 (mentioned above) is present at earleme 2. Marpa knows to instruct the lexer to look for a number token because there is a transition from S1 to S4 on number.

If the lexer finds a number token with a length (in earlemes) of 1 at earleme 2, a new Earley item S4@2-3 is added at earleme 3. (Earleme 3 because 3 is 2+1, the current token plus the token length.) Here (again from show_QDFA) is the description of QDFA state 4:

S4: 6
e ::= number .

We can see that there is only one LR(0) item in QDFA state 4, and that it has the dot pointer at the end of its rule. An LR(0) item with the dot pointer at the end of its rule is a completed rule.

Marpa calls the item that was looking for the scanned symbol the predecessor of the item added by the scan. In this example, S1@2-2 is the predecessor of S4@2-3. Any item which is added to an Earley set based on a QDFA state transition from a predecessor, is called that predecessor's successor. In this example, S4@2-3 is the successor of S1@2-2.

An Example

Here's what S4@2-3 looks like in the Earley sets, after a successful scan of the input "2+2".

S4@2-3  [p=S1@2-2; t=2]

After the name of the Earley item is a Earley item source choice, shown in brackets. When the context makes the meaning clear, an Earley item source choice may be called a source or a source choice. This parse is not ambiguous, so all Earley items have at most one source choice entry.

In this case the source choice entry is for a token choice. Token choices are shown as [p=predecessor; t=token_value] pairs. In the above example, the token choice [p=S1@2-2; t=2] indicates that the token has a value of "2", and that, when this token choice is selected, the predecessor of S4@2-3 is S1@2-2. Of course, since there's only one source choice, the predecessor of S4@2-3 is always S1@2-2.

Completion

Whenever an Earley item with a completed rule is added to an Earley set, other Earley items may be added to that Earley set as a result. The item with the completed rule is called the completion, or the cause, and any new item added as a result is called an effect.

Let's look at one example of a completion "cause and effect". As the completion in our example, we'll use S4@2-3, the Earley item in our example of a scanned item. Its state was QDFA state 4. QDFA state 4 contained the LR(0) item

e ::= number .

The dot is at the far right, so it's a completed rule whose the left hand side is e.

S4@2-3 begins at earleme 2 and has a complete rule with e as its lhs, and therefore any Earley item at earleme 2 which is looking for an e is in luck. S5@0-2 is such a rule. QDFA state 5 looks like this:

S5: 3
e ::= e op . e
 <e> => S6

This says that in QDFA state 5, when you see an e, you transition to QDFA state 6. Here's S6@0-3, the item which is the effect of the completion:

S6@0-3 [p=S5@0-2; c=S4@2-3]

For completion items, the terms predecessor and successor mean pretty much what they did in token scanning. Since S6@0-3 comes from moving the dot forward in a rule in S5@0-2, we say S5@0-2 is the predecessor of S6@0-3, and that S6@0-3 is the successor of S5@0-2.

The same Earley item can be added to the same Earley set for multiple reasons. A rule can be added because of one or more scanned tokens; because of one or more completions; or because of any combination of these.

Traditionally in Earley's algorithm, the same item is not added twice. Where an item comes from is not important in recognition. But parsing in the strict sense means determining the structure of the input, and that requires us to know why an Earley item was added to an Earley set. Each reason for the existence of an Earley item in an Earley set is an different choice that can be made when it comes time to create a parse structure.

Marpa uses the Earley item source choices to track the reasons why Earley items were added. Aycock and Horspool don't call them source choices. Marpa's method is derived from the one Aycock and Horspool describe in their 2002 article.

We've already seen a token source choice in the scanning example. Earley item S6@0-3 has another example of a source choice: [p=S5@0-2; c=S4@2-3]. This is another kind of source choice, a completion source choice. This source choice says that one reason for S6@0-3 to exist was the completion in S4@2-3, with S5@0-2 as the predecessor. Since this parse in unambiguous, there will be no other source choices, and this will remain the only reason for S6@0-3 to be in the Earley sets.

Prediction

A third source of Earley items is prediction. Whenever an Earley item is expecting to see an e, for example, it can also expect to see the start of all the rules that have an e on their left hand side.

Recall QDFA state 5:

S5: 3
e ::= e op . e
 <e> => S6

QDFA state 5 is expecting an e. This means that the rules

e ::= . e op e
e ::= . number

can also be expected to originate at this earleme. Marpa is smart about grouping rules into QDFA states and both these rules are in QDFA state 1:

S1: predict; 1,5
e ::= . e op e
e ::= . number
 <e> => S3
 <number> => S4

Since S5@0-2 is at earleme 2, and it is expecting the symbols on the lhs of the items in QDFA state 1, there will also be an Earley item for QDFA state 1 at earleme 2:

S1@2-2

In predicted items, either the dot is at the beginning of the rule, or else all the symbols to the left of the dot are nulling symbols. Either way, the origin and dot earlemes will always be the same.

Source choices are not recorded for predicted items. Prediction items are essential to get rules started in the Earley sets. But they are not actually needed when the time comes for parsing and evaluation. That means they don't need source choice entries.

Initialization

One or two Earley items are put into Earley set 0 to start things off. In our example, the initial Earley items are

S0@0-0
S1@0-0

QDFA state 0 contains the start rules, with the dot pointer at the beginning. Only Earley set 0 will contain an Earley item for QDFA state 0. S1@0-0 contains the rules predicted by S0@0-0.

HOW A SUCCESSFUL PARSE IS RECOGNIZED

As mentioned, what's usually called Earley's algorithm is just a recognizer, an algorithm to build the Earley sets. The input is recognized successfully if, at the last earleme, there is a an Earley item for a completed start rule state. A completed start rule state is a QDFA state containing a completed start rule. A completed start rule is the LR(0) item for a start rule that has its dot position at the end of that rule.

At most two of the QDFA states can be completed start rule states. One is a special case. In a null parse, the start state, QDFA state 0, is also a completed start rule state.

The other completed start rule state is not hard to spot. A completed start rule state for non-null parses contains the LR(0) item which has Marpa's internal start rule with the dot pointer at the end of the rule. The lhs of the internal start rule is a special internal start symbol. The rhs of the internal start rule is the grammar's original start symbol. QDFA state 2 is the completed start rule state for our example parse:

S2: 8
e['] ::= e .

A parse is successful at earleme N if it contains an Earley item for a completed start rule state with a origin earleme of 0 and a dot earleme of N. Here's the Earley item which makes the parse in our example successful at earleme 3:

S2@0-3 [p=S0@0-0; c=S6@0-3]
  

THE PARSE BOCAGE

After a successful parse has been recognized, it can be evaluated. The main data structure of a Marpa evaluator is a parse bocage.

Each Marpa evaluator is created with parse criteria. The parse criteria are the start and end earlemes of the parse, and the parse's top symbol. Often these parse criteria take default values. The default start earleme is earleme 0. The default end earleme is the furthest earleme reached by any token in the parse. The default top symbol is the start symbol of the grammar.

The parse bocage stores all parses meeting the evaluator's parse criteria. In other general parsers, a parse forest is been used to store ambiguous parses. Marpa's parse bocage is an adaptation of a parse forest. Parse bocages differ from parse forests in three ways: parse bocages may contain cycles; and-nodes and or-nodes strictly alternate in a parse bocage; and and-nodes are limited to two rhs symbols in parse bocages.

Parse forests and parse bocages both consist of and-nodes and or-nodes. In both parse forests and parse bocages, and-nodes contain pieces of parse trees, while or-nodes represent choices among the pieces.

In parse forests, the and-nodes directly represent productions of the original grammar. Marpa's parse bocages, on the other hand, often must represent the original grammar's production in pieces. Marpa needs to represent the original's grammars productions in pieces in order to ensure that no more than two symbols are on any rhs. I call Marpa's piecemeal representation of the original productions, Marpa Bocage Form (MBF). In restricting the rhs to at most two symbols, MBF is very similar to Chomsky Normal Form (CNF).

Marpa uses MBF for four reasons. First, MBF is essentially the form in which the grammar is found in the Earley items. Every source choice of an Earley item contains at most two links. The links correspond to symbols on the rhs of a production. So the original grammar is found in the Earley items in a forma that has at most two symbols on the right hand side.

Second, MBF is more compact than a parse forest. Third, MBF can handle parses with cycles. Fourth, MBF easily produces binary trees. Binary trees are fast and easy to traverse and manipulate. The effort spent to rewrite the parse data found in the Earley sets back into the productions of the original grammar would be counter-productive. The result would be a slower and more cumbersome data structure.

Every and-node represents either a production of the original grammar, or a piece of one. In a parse bocage, and-nodes and or-nodes strictly alternate. Every child of an and-node is an or-node. Every child of an or-node is an and-node. An or-node is trivial if it has only one child and-node. A trivial or-node represents a choice from a list containing only a single alternative. Marpa or-nodes are often trivial.

Types of Or-Nodes

Each or-node corresponds to a rule, a position in that rule, and an earleme location. Based on its rule and position, each or-node also corresponds to a symbol.

Based on its rule, rule position, and earleme location an or-node also corresponds to an Earley item. An or-node with its rule position at the end of the rule, or just before a non-nulled symbol, is called a brick or-node. The Earley item corresponding to a brick or-node has an earleme location which exactly matches the earleme location of the brick or-node and has a QDFA state which contains an LR(0) item which exactly matches the brick or-node's rule and rule position are the same as those of the or-node.

This is an or-node for every symbol of any possible parse. This means there are or-nodes which have rule positions in front of nulled symbols. These or-nodes are called mortar or-nodes, because they fill the cracks in between brick or-nodes.

No Earley items correspond exactly to mortar or-nodes. This is because Marpa has no LR(0) items with dot positions before nulled symbols.

The corresponding Earley item for a mortar or-node is the Earley item which exactly matches the rule and the earleme location of the mortar or-node, and which contains the "next" LR(0) item for the or-node's rule. The "next" LR(0) item is the one with its rule position in front of the first non-nulled symbol after the mortar or-node's rule position. If there is no non-nulled symbol in the rule after the or-node's rule position, the "next" LR(0) item is the one with its rule position at the end of the rule.

Or-nodes with their positions at the end of rules are closure or-nodes. All other or-nodes are kernel or-nodes.

Or-nodes names consist of the name of the corresponding Earley item, followed by a label indicating the selection criteria. The selection criteria depend on whether the or-node is a kernel or a closure or-node.

The selection criterion for a closure or-node is a symbol. All LR(0) items selected for and-node children of a closure or-node must be completed rules with the selection criterion symbol on their lhs. In an Earley item, several completed LR(0) items may have the same lhs.

The selection criteria for a kernel or-node is a rule, and a position in that rule. For a brick or-node, exactly one LR(0) item will match. and the child and-nodes will use that LR(0) item. For a mortar or-node, no LR(0) items will match, and a child and-node will be created using the null value of the symbol corresponding to the mortar or-node.

Naming Or-Nodes

The name of a kernel or-node is the name of its corresponding Earley item, followed by a label describing its LR(0) item selection criteria. The label consists of the letter "R", and the number of the or-node's rule and its rule position, colon-separated. The "R" in the label is a mnemonic for "rule".

For example, the kernel or-node named S3@0-1R0:1 corresponds to Earley item S3@0-1. The selection criteria are rule 0 at position 1 (e -> e . op e).

For closure or-nodes, the selection criteria label is the letter "L", followed by a symbol number. The "L" in the symbol or-node's label is a mnemonic for "left hand side".

An example of a closure or-node is S6@0-3L0. In this case the Earley item was S6@0-3, and it was the effect of completing the symbol (e), which was symbol 0. (Symbol numbers for all symbols may be found in the appendix, in the show_symbols output.)

Naming And-Nodes

An and-node is always a child of an or-node. The name of an and-node is the name of the parent or-node, followed by a numerical index in brackets. The child and-nodes are numbered starting at 0. For example, S6@0-3L0[0] is the first child of the or-node named S6@0-3L0.

The Parse Bocage Grammar

Marpa typically displays a parse bocage as a parse bocage grammar (PBG). It may seem pointless to have come this far in the parsing, only to write out another grammar. But there is a big difference between the original grammar and the parse bocage grammar. The original grammar describes all parses which are possible for any input. The PBG describes only those parses which are possible for the specific input which was accepted by the recognizer. In the parse bocage grammar, the work of parsing a specific input has actually been done.

Examples of Parse Bocage Productions

S2@0-3L3 ::= S2@0-3L3[0]

The above is the start production of the parse bocage. The left hand side is an or-node which corresponds to a start rule completion item. The completion start rule state in our example is QDFA state 2 and the input in our example was 3 earlemes long. So the start Earley item is S2@0-3. The start symbol is e['], which is symbol number 3. So the selection criterion label is "L3".

The rhs of the start or-production is the and-node which is its first and only child, It has index 0. Start or-productions are always trivial.

A PBG always starts with an or-node. Or-nodes appear in the PBG as one or more or-productions. Each or-production has the name of an or-node at its lhs. Also, each or-production has the name of one of the or-node's child and-nodes as the only symbol on its rhs. In the example above, the rhs and-node is the first child of the lhs or-node.

In many cases, the or-nodes are trivial -- they have exactly one child. Trivial or-nodes have only one or-production in the PBG. In an unambiguous parse, like the parse of our example, all the or-nodes are trivial.

Even when they are non-trivial, or-productions contain little information, and show_bocage does not show or-productions unless called with the verbose option. Or-productions are necessary for the PBG to be a real BNF grammar, but the information in them is easily deduced from the names of the and-nodes on the lhs of the and-productions. For every and-node on the lhs of an and-production, there will be always be an or-production with the unbracketed name of the and-node node as the or-production's lhs, and with the and-node itself as or-production's only rhs symbol. For example, from the name of the and-node S2@0-3L3[0], we can deduce existence of the or-production in the example above.

The start or-node is always trivial. Here is the and-production which is its only child:

S2@0-3L3[0] ::= S6@0-3L0
    rule 2: e['] ::= e .
    rhs length = 1; closure

Parse bocage and-nodes always have one or two rhs symbols. Either a token or a closure or-node will be on the right hand side. One or the other will always be present, but never both. Additionally, there may be a kernel or-node. If a kernel or-node is present, its name is always the first symbol on the rhs.

In this example, the only symbol on the rhs is a closure or-node: S6@0-3L0. The next line contains a dotted rule which shows the rule and rule position for this and-node.

The last line tells us the length of the rule's rhs, and that there is a Perl closure corresponding to this rule. Perl closures and closure or-nodes are not to be confused. The word "closure" in "closure or-node" refers to the fact that it corresponds to the closing-out of a rule, and has nothing to do with whether it might have a corresponding Perl closure.

S5@0-2R0:2 ::= S5@0-2R0:2[0]
S5@0-2R0:2[0] ::= S3@0-1R0:1 '+'
    rule 0: e ::= e op . e
    rhs length = 3

This example shows an and-node with a kernel or-node and a token on the rhs of its rule. The and-node is at position 2 of rule 0, and shown by its selection criteria label ("R0:2") and in the dotted rule on the third line. The symbol before the dot ("op") matched the token ("+"). The kernel or-node will contains information for deriving the first symbol of rule 0 ("e").

S4@0-1L0 ::= S4@0-1L0[0]
S4@0-1L0[0] ::= '2'
    rule 1: e ::= number .
    rhs length = 1; closure

The and-production above has only a token on its rhs. There is no kernel or-node, because the current symbol for this and-production is the first one of its rule -- in fact, it is the only symbol in its rule.

S4@0-1L0 is a closure or-node with "L0" as its selection criteria. Since the current symbol is final, this is a brick or-node, and there will be a corresponding LR(0) item. The LR(0) item will be in QDFA state 4, and will be a completed rule with symbol number 0 on its lhs. Symbol number 0 is e. There is, as it happens, only one LR(0) in QDFA state 4, but it is, as required, a completed rule with e as its lhs.

THE PARSE TREES

The parse tree for our example parse is shown in an appendix. Every node in a Marpa parse tree corresponds in an and-node in the parse bocage.

A parse bocage contains all possible parses, but a parse tree is for a specific parse. Only those and-nodes actually used in the specific parse be correspond to nodes of the parse tree. In the case of an unambiguous parse, every parse bocage and-node corresponds to a tree node. If the parse bocage contains an ambiguous parse, some and-nodes will not have corresponding tree nodes.

Tree Node: S6@0-3L0[0]; Depth = 1; Rhs Length = 3
    Rule: e ::= e op e .
    Kernel: S5@0-2R0:2
    Closure: S4@2-3L0
    Perl Closure: Y
Tree Node: S4@2-3L0[0]; Depth = 2; Rhs Length = 1
    Rule: e ::= number .
    Perl Closure: Y; Token: '2'

Most of the information in the tree nodes is repeated from the parse bocage and-nodes. The "depth" is the distance of the tree node from the root of the tree.

Marpa's parse trees are kept in pre-order. They can be traversed as a tree, starting from the root. But they can also be traversed from the end. Marpa does this when it evaluates a parse.

As Marpa evaluates tree nodes, it computes values and pushes them onto an evaluation stack. When a tree node needs values for its computation, perhaps because a Perl closure requires them as arguments, they are popped from the evaluation stack. The result is pushed back onto the stack. When Marpa computes the value for the root tree-node, the result will be the only value left on the stack, and will be the value of the parse.

The evaluation process can be followed by setting the trace_values option. The trace_values output for our example is in an appendix. Here's the section that evaluates the tree nodes shown above:

Popping 1 values to evaluate S4@2-3L0, rule: 1: e -> number
Calculated and pushed value: '2==2'
Popping 3 values to evaluate S6@0-3L0, rule: 0: e -> e op e
Calculated and pushed value: '(2+2)==4'

GRAMMAR REWRITING

Marpa rewrites grammars, adding internal symbols and rules in the process. This rewriting does not affect the semantics, but it does show up when you examine the internals.

Marpa's internal symbols have tags at the end, enclosed in square brackets. This means all Marpa internal symbols end in a right square bracket.

Adding a Start Rule

Many parsers add their own start rule and their own start symbol to grammars. Marpa is no exception. The new start symbol is the old one with "[']" suffixed. We saw a Marpa internal start symbol above: e[']. If the grammar allows a null parse, there will also be a nulling start symbol, with "['][]" suffixed.

Elminating Proper Nullable Symbols

Nulling symbols are those which always produce the empty sentence. Nullable symbols are those which sometimes produce the empty sentence. Non-nullable symbols are those which never produce the empty sentence.

Pedantically, all nulling symbols are also nullable symbols. A proper nullable is any nullable symbol which is not a nulling symbol. In other words, a proper nullable is a symbol that can produce either the empty sentence or a non-empty sentence.

Nullable symbols were a problem in previous versions of Earley parsers. Aycock and Horspool 2002 outlined a new approach for dealing with them. I use their ideas with modifications of my own.

Marpa rewrites its grammar to eliminate proper nullables. It does this by turning the original proper nullable into a non-nullable symbol, and creating a second symbol to serve as the original symbol's nullable variant. The non-nullable variant of the original symbol keeps the original name, but is no longer allowed to appear in places where it might be nulled. In places where the original symbol was nulled, the nulling variant is substituted. The name of the nulling variant is that of the original symbol with the nulling tag ("[]") suffixed.

The newly introduced nulling symbols will not appear on any left hand sides. Marpa marks nulling symbols internally and recognizes them directly. It does not need empty rules.

Getting rid of proper nullables on the right hand side is more difficult than getting rid of them on the left hand side. Rules with proper nullables on the rhs have to be replaced with new rules covering every possible combination of the non-nullable and nulling variants. That rewrite is described in the next section.

CHAF Rewriting

To deal with the splitting of rhs proper nullables into two symbols, one non-nullable and one nulling, Aycock and Horspool created new rules covering all possible rhs combinations of non-nullable and nulling symbols. This factoring was exponential in the worst case. I don't like leaving exponential explosions in an algorithm, even unlikely ones. And I felt the generation of all possible combinations for an arbitrarily long right hand side might be inefficient in practice.

A result due to Chomsky shows that any grammar can be rewritten as a grammar with at most two symbols on the right hand side. Relaxing Chomsky's rewrite to allow right hand sides with any number of symbols, but at most two proper nullables, produces a rewrite I call CHAF (Chomsky-Horspool-Aycock Form).

CHAF changes the worst case to linear, and in practical cases lowers the multiplier. Here's an example of a CHAF rewrite from Marpa's own self-grammar. First, the rule:

production paragraph:
    non structural production sentences,
    production sentence,
    non structural production sentences,
    optional action sentence,
    non structural production sentences.

This rule contains four proper nullables, reinforcing my fear that grammars written as test cases won't be the only ones with lots of proper nullables on the right hand side. non structural production sentences and optional action sentence are both proper nullables and non structural production sentences appears three times.

Here's is the output from show_rules, showing what Marpa did with this rule:

    12: production-paragraph
	    -> non-structural-production-sentences
	    production-sentence
	    non-structural-production-sentences
	    action-sentence:optional
	    non-structural-production-sentences /* !useful */

99: production-paragraph ->
    non-structural-production-sentences
    production-sentence
    production-paragraph[R12:2][x5e]
    /* priority=0.3 */
100: production-paragraph ->
    non-structural-production-sentences[]
    production-sentence
    production-paragraph[R12:2][x5e]
    /* priority=0.1 */
101: production-paragraph ->
    non-structural-production-sentences
    production-sentence
    production-paragraph[R12:2][x5e][]
    /* priority=0.2 */
102: production-paragraph ->
    non-structural-production-sentences[]
    production-sentence
    production-paragraph[R12:2][x5e][]

103: production-paragraph[R12:2][x5e] ->
    non-structural-production-sentences
    production-paragraph[R12:3][x60]
    /* priority=0.3 */
104: production-paragraph[R12:2][x5e] ->
    non-structural-production-sentences[]
    production-paragraph[R12:3][x60]
    /* priority=0.1 */
105: production-paragraph[R12:2][x5e] ->
    non-structural-production-sentences
    production-paragraph[R12:3][x60][]
    /* priority=0.2 */

106: production-paragraph[R12:3][x60] ->
    action-sentence:optional
    non-structural-production-sentences
    /* priority=0.3 */
107: production-paragraph[R12:3][x60] ->
    action-sentence:optional[]
    non-structural-production-sentences
    /* priority=0.1 */
108: production-paragraph[R12:3][x60] ->
    action-sentence:optional
    non-structural-production-sentences[]
    /* priority=0.2 */

Rule 12 is the original rule. Because Marpa has rewritten it, the rule is marked !useful, telling later stages in the precomputation to ignore it. Marpa breaks Rule 12 up into three pieces, each with no more than two proper nullables. Rules 99 to 102 are the first piece, with the first two symbols from Rule 12. Rules 103 to 105 are the second, with the 3rd symbol. Rules 106 to 108 are the third, with the 4th and 5th symbols from Rule 12.

Each piece is factored, so that every combination of nulling and non-nullable symbols is included. New symbols are introduced to be the left hand sides of the pieces. The tag "[R12:3]" indicates that this symbol is the left hand side for the piece of Rule 12 which begins at right hand symbol 3 (the first symbol is symbol 0). The tags beginning with an "x", like "[x5d]", are arbitrary unique hex values, inserted to guarantee that the new symbols are unique.

This rule is a worst case for CHAF, because the last three symbols of the right hand side are all proper nullables. That means that the last two pieces of the original rule can be either empty or non-empty, and therefore that both of the newly created lhs symbols are also proper nullables.

With the CHAF rewrite, there are now a total of 6 proper nullables: the original 4 plus the 2 symbols newly created to serve as left hand sides. This is why, in order to have only 2 proper nullables per piece, the original rule needed to be divided into 3 pieces. The newly created lhs symbols, because they are proper nullables, need to be split into nulling and non-nullable variants and factored, just like the proper nullables in the original rule.

Nonetheless this factoring can be done with 10 rules in CHAF, while the original Aycock-Horspool factoring (NNF) required 16. After more than 4 proper nullables, the advantage of CHAF becomes overwhelming. With 5 proper nullables, there would be 13 rules for CHAF versus 32 for NNF. With 6 proper nullables, 16 versus 64.

The user's semantics are preserved, because Marpa, while splitting the rule into pieces and factoring the pieces, inserts logic to gather and preserve the values of child nodes. These values are presented to the user's actions as if no CHAF rewrite had occurred.

Converting Sequence Productions to BNF

Internally, Marpa converts productions specified as sequences into BNF productions. The conversion is done in a standard way. For example,

paragraphs: empty line separated paragraph sequence.

becomes

    2: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5]
    3: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5] empty-line
    4: paragraph[Seq:1-*][Sep:empty_line][x5] -> paragraph
    5: paragraph[Seq:1-*][Sep:empty_line][x5] ->
	paragraph[Seq:1-*][Sep:empty_line][x5] empty-line paragraph

In the added symbol, the tag "[Seq:1-*]" indicates this is a symbol for a sequence of from 1 to an infinite number of symbols and the tag "[Sep:empty_line]" that it is empty_line separated.

Here's another example, this time of a sequence without a separator:

definition paragraph: definition sequence. concatenate lines.

is written as the following BNF:

9: definition-paragraph -> definition[Seq:1-*][xa]
10: definition[Seq:1-*][xa] -> definition
11: definition[Seq:1-*][xa] -> definition[Seq:1-*][xa] definition

SELF-TUTORIALS

If you want to investigate internals more on your own, here are two "self-tutorials", which should make you pretty much an expert.

First, go through the show_earley_sets output in the appendix. For each Earley item in it, reason out how it came to be there.

Second, take the grammar used in the example and run it on the input text "2+2*3". While the parse in this document was not ambiguous, the grammar was. The grammar's ambiguity reveals itself when there is more than one operation in the input string. Get the show_earley_sets output from any one of the ambiguous parses and reason out how all of its Earley items came to be.

APPENDIX: DATA FOR THE EXAMPLE

Below are the MDL grammar and outputs for the example used in this document. The input text was "2+2".

MDL Grammar

    semantics are perl5.  version is 0.211.9.

    start symbol is E.

	    E: E, Op, E.
    q{
		my ($right_string, $right_value)
		    = ($_[2] =~ /^(.*)==(.*)$/);
		my ($left_string, $left_value)
		    = ($_[0] =~ /^(.*)==(.*)$/);
		my $op = $_[1];
		my $value;
		if ($op eq "+") {
		   $value = $left_value + $right_value;
		} elsif ($op eq "*") {
		   $value = $left_value * $right_value;
		} elsif ($op eq "-") {
		   $value = $left_value - $right_value;
		} else {
		   croak("Unknown op: $op");
		}
		"(" . $left_string . $op . $right_string . ")==" . $value;
    }.

    E: Number.
    q{
	       my $v0 = pop @_;
	       $v0 . "==" . $v0;
    }.

    Number matches qr/\d+/.

    Op matches qr/[-+*]/.
     
    the default action is q{
	     my $v_count = scalar @_;
	     return "" if $v_count <= 0;
	     return $_[0] if $v_count == 1;
	     "(" . join(";", @_) . ")";
	}.

show_symbols Output

0: e, lhs=[0 1], rhs=[0 2]
1: op, lhs=[], rhs=[0] terminal
2: number, lhs=[], rhs=[1] terminal
3: e['], lhs=[2], rhs=[]

show_rules Output

0: e -> e op e
1: e -> number
2: e['] -> e

show_QDFA Output

Start States: S0; S1
S0: 7
e['] ::= . e
 <e> => S2
S1: predict; 1,5
e ::= . e op e
e ::= . number
 <e> => S3
 <number> => S4
S2: 8
e['] ::= e .
S3: 2
e ::= e . op e
 <op> => S1; S5
S4: 6
e ::= number .
S5: 3
e ::= e op . e
 <e> => S6
S6: 4
e ::= e op e .

show_earley_sets Output

Current Earley Set: 4; Furthest: 3
Earley Set 0
S0@0-0
S1@0-0
Earley Set 1
S4@0-1 [p=S1@0-0; t=2]
S2@0-1 [p=S0@0-0; c=S4@0-1]
S3@0-1 [p=S1@0-0; c=S4@0-1]
Earley Set 2
S5@0-2 [p=S3@0-1; t=+]
S1@2-2
Earley Set 3
S4@2-3 [p=S1@2-2; t=2]
S6@0-3 [p=S5@0-2; c=S4@2-3]
S3@2-3 [p=S1@2-2; c=S4@2-3]
S2@0-3 [p=S0@0-0; c=S6@0-3]
S3@0-3 [p=S1@0-0; c=S6@0-3]

show_bocage Output

package: Parse::Marpa::E_1; parse count: 0
S2@0-3L3 ::= S2@0-3L3[0]
S2@0-3L3[0] ::= S6@0-3L0
    rule 2: e['] ::= e .
    rhs length = 1; closure
S6@0-3L0 ::= S6@0-3L0[0]
S6@0-3L0[0] ::= S5@0-2R0:2 S4@2-3L0
    rule 0: e ::= e op e .
    rhs length = 3; closure
S5@0-2R0:2 ::= S5@0-2R0:2[0]
S5@0-2R0:2[0] ::= S3@0-1R0:1 '+'
    rule 0: e ::= e op . e
    rhs length = 3
S4@2-3L0 ::= S4@2-3L0[0]
S4@2-3L0[0] ::= '2'
    rule 1: e ::= number .
    rhs length = 1; closure
S3@0-1R0:1 ::= S3@0-1R0:1[0]
S3@0-1R0:1[0] ::= S4@0-1L0
    rule 0: e ::= e . op e
    rhs length = 3
S4@0-1L0 ::= S4@0-1L0[0]
S4@0-1L0[0] ::= '2'
    rule 1: e ::= number .
    rhs length = 1; closure

show_tree Output

Tree Node: S2@0-3L3[0]; Depth = 0; Rhs Length = 1
    Rule: e['] ::= e .
    Closure: S6@0-3L0
    Perl Closure: Y
Tree Node: S6@0-3L0[0]; Depth = 1; Rhs Length = 3
    Rule: e ::= e op e .
    Kernel: S5@0-2R0:2
    Closure: S4@2-3L0
    Perl Closure: Y
Tree Node: S4@2-3L0[0]; Depth = 2; Rhs Length = 1
    Rule: e ::= number .
    Perl Closure: Y; Token: '2'
Tree Node: S5@0-2R0:2[0]; Depth = 2; Rhs Length = 3
    Rule: e ::= e op . e
    Kernel: S3@0-1R0:1
    Perl Closure: N; Token: '+'
Tree Node: S3@0-1R0:1[0]; Depth = 3; Rhs Length = 3
    Rule: e ::= e . op e
    Closure: S4@0-1L0
    Perl Closure: N
Tree Node: S4@0-1L0[0]; Depth = 4; Rhs Length = 1
    Rule: e ::= number .
    Perl Closure: Y; Token: '2'

trace_values Output

Pushed value: '2'
Popping 1 values to evaluate S4@0-1L0, rule: 1: e -> number
Calculated and pushed value: '2==2'
Pushed value: '+'
Pushed value: '2'
Popping 1 values to evaluate S4@2-3L0, rule: 1: e -> number
Calculated and pushed value: '2==2'
Popping 3 values to evaluate S6@0-3L0, rule: 0: e -> e op e
Calculated and pushed value: '(2+2)==4'
Popping 1 values to evaluate S2@0-3L3, rule: 2: e['] -> e
Calculated and pushed value: '(2+2)==4'
(2+2)==4

SUPPORT

See the support section in the main module.

AUTHOR

Jeffrey Kegler

LICENSE AND COPYRIGHT

Copyright 2007 - 2008 Jeffrey Kegler

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.