NAME
Marpa::XS::Debug - Marpa Grammar Debugging
OVERVIEW
This document describes Marpa's more powerful general-use tracing and debugging techniques in detail. It assumes that you have written a grammar for your Marpa application, and that something is going wrong. It is a very good idea to read the overview document for tracing problems before reading this document.
DESCRIPTION
To read the show_progress
output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section.
Dotted Rules
The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.
Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's perly.y
termbinop ::= term ASSIGNOP term
In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols.
Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule.
Here's our rule with a dot location indicated:
termbinop ::= · term ASSIGNOP term
The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a prediction.
Here's another dotted rule:
termbinop ::= term · ASSIGNOP term
In this dotted rule, we are saying we have seen a term
, but have not yet recognized an ASSIGNOP
. (ASSIGNOP
is perly.y
's internal name for the assignment operator. In plain Perl terms, this is the "=
" character.)
There's another special kind of dotted rule, a completion. A completion is a dotted rule with the dot after all of the symbols. Here's the completed version of the example we've been using:
termbinop ::= term ASSIGNOP term ·
A completion indicates that a rule has been fully recognized.
Earley Items
In dotted rules, Earley's algorithm has all but one piece of the information it needs to track. The final piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.
A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location is the location in the input stream which corresponds to the dot position.
Two noteworthy consequences follow from way in which origin and current location are defined. First, if a dotted rule is a prediction, origin and current location will always be the same. Second, the input stream location where a rule ends is not tracked unless the dotted rule is a completion. In other cases, we don't know if the rule will be completed, much less at which location.
THE BASE EXAMPLE
For this example of debugging, I've taken a very common example of a grammar and deliberately introduced a problem. (All the code and the full trace outputs for this base example are in the Appendix.) I've commented out the correct start rule:
## { lhs => 'Expression', rhs => [qw/Term/] },
and replaced it with another start rule, one which will cause problems:
{ lhs => 'Expression', rhs => [qw/Factor/] },
In what follows, we'll pretend we don't already know where the problem is, and use the Marpa diagnostics and tracing facilities to "discover" it.
AN EARLY WARNING
Right off the bat, we get two warning messages:
Inaccessible symbol: Add
Inaccessible symbol: Term
If we were alert, these would be enough to tell us there is a serious problem. "Inaccessible" symbols are symbols which cannot be reached from the start symbol. This means that the grammar will never produce them, and that parses will never find them in the input.
Since Add
and Term
are both important symbols in our application, that should tell us our grammar has a serious problem. In fact, these warning messages would often be enough to point us to the error. But, in order to look at more of Marpa's tracing facilities, let's pretend we have not had our morning coffee, and that we miss the significance of these warning messages.
THE trace_terminals OUTPUT
Before looking at Marpa's progress reports, it usually best to orient yourself by looking at the output from trace_terminals
. Typically, you will be interested in what were the last tokens to be accepted, and perhaps what tokens the recognizer was looking for when it didn't find what it wanted. Sometimes that information alone is enough to make it clear where the problem is.
The full trace_terminals
output for this base example is in the Appendix. We see that the recognizer seems to accept "42*1
" but it fails when it confronts the plus sign ("+
"). The last two lines are:
Accepted "Number" at 2-3
Expecting "Multiply" at 3
A note in passing: Marpa shows the location of the tokens it accepts as a range of locations. For Number
, the range is "2-3
", indicating that earleme 2 is the start location and earleme 3 is the end location. That level of detail will seem like overkill in ordinary applications, where every token has length 1. But Marpa allows other input models, and in those models the information about start and end location of the token is important.
Returning to the problem at hand: debugging the grammar. We notice that at earleme 3 we are expecting a Multiply
operator, but not an Add
operator. That should strike us as strange, and send us back to the grammar. But for the sake of our example we will assume that we are slow on the uptake today, and that this does not clue us in. We move on.
THE show_progress OUTPUT
The last tool that you should ever need for debugging Marpa grammars are its progress reports. The progress reports show Earley items in their original form. Marpa does not actually use Earley's original items directly. But they underlie Marpa's algorithm and Marpa reconstructs them for show_progress
.
In the Appendix, progress reports for the entire parse are shown. Our base example in this document is a very small one, so that's a reasonable thing to do in this case. If a parse is at all large, you'll often need to be selective.
Of most interest is the progress report for the "current" Earley set. In our example that's earleme 3. You can find out the current earleme from the return value of either the tokens
method or the status
method. But it may be easier to simply call show_progress
without arguments. By default, show_progress
prints out only the progress reports for the current earleme.
Here are the progress reports for the current earleme, earleme 3, from our example.
BUILDING @0-3 Factor -> Factor . Multiply Factor
BUILDING @2-3 Factor -> Factor . Multiply Factor
COMPLETED @0-3 0: Expression -> Factor
COMPLETED @2-3 2: Factor -> Number
COMPLETED @0-3 4: Factor -> Factor Multiply Factor
COMPLETED @0-3 5: Expression['] -> Expression
As an aside, we see that there is a completed rule for Expression[']
, which is Marpa's special start symbol. That indicates that if our input ended at earleme 3, it would be a valid sentence in the language of our grammar.
More relevant for our purposes, we see that we have completed rules for Expression
, and Factor
, as expected. We also see two Earley items that show that we are in the process of building another Factor
, and that it is expecting a Multiply
symbol. This is not the rule we want, but it explains why the trace_terminals
output showed that the recognizer was expecting a Multiply
symbol.
What we want to know is, why is the recognizer not expecting an Add
symbol? Looking back at the grammar, we see that only one rule uses the Add
symbol: the rule "Term ::= Term Add Term
". The next step is to look at the Earley items for this rule. But there is a problem. We don't find any.
Next, we ask ourselves, what is the earliest place the "Term ::= Term Add Term
" rule should be appearing? The answer is that there should be a prediction of "Term ::= Term Add Term
" at earleme 0. So we look at the predictions at earleme 0.
PREDICTING @0 0: Expression -> Factor
PREDICTING @0 2: Factor -> Number
PREDICTING @0 4: Factor -> Factor Multiply Factor
PREDICTING @0 5: Expression['] -> Expression
No "Term ::= Term Add Term
" rule. We are not even predicting a "Term ::= Term Add Term
" rule. We look back at the grammar, and start from the beginning.
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
Our special start symbol is Expression[']
and we do see a prediction for the rule with Expression[']
on the left hand side. It in turn predicts an Expression
symbol, and we do see a prediction for the rule with Expression
on the left hand side. Expression
in turn predicts a Factor
symbol, and we do see a prediction for both of the rules with Factor
on the left hand side.
But we don't see anything predicting Term
. And then it hits us. If we start at the start symbol, and follow the rules of our grammar, we will never get to a Term
symbol. That's what the warning message was trying to tell us.
This makes us reread our grammar, and see that out Expression ::= Factor
rule is wrong. It should be Expression ::= Term
. We have found the source of our problem.
COMPLICATIONS
Internally, Marpa rewrites Earley items and grammars. show_progress
hides most of this from the user. But some aspects of Marpa's rewrites are relevant and useful to know.
Special Symbols
Marpa uses a few special symbols internally which it is useful for the the user of show_progress
to be aware of. To distinguish them, Marpa's internal symbols end in a right square bracket ("]
"). No user-defined symbol is allowed to end in a right square bracket.
One of these special symbols is Marpa's special start symbol, which always ends in "[']
". Marpa augments all of its grammars with a special start rule, which will have the special start symbol on its left hand side. We saw this above with the Expression['] ::= Expression
rule.
If the empty, or null, string is a sentence in the language of the grammar, Marpa will add a special empty start rule. The special empty start rule will have its own special null start symbol on its left hand side. The special null start symbol ends in "['][]
".
Empty Rules
If you have empty rules in your grammar, you may notice that they never appear in the traces. That is because Marpa allows at most one empty rule in any grammar, and that rule is the one Marpa adds itself: the special empty start rule. Marpa removes all empty and nulling rules in the original grammar.
Marpa's removal of nulling rules is recursive, at it needs to be. Removing rules that are nulling reveals that the left hand side symbol of those rules is also nulling. This in turn can reveal other nulling rules.
Sequences
Marpa allows the user to explicitly specify sequences, rather than write them out in BNF. Marpa is able to optimize explicitly specified sequences. For the actual parsing, Marpa rewrites sequences into BNF.
In the Earley items, the rules will have been translated into BNF, so that is how they appear in show_progress
. Marpa's rewritten sequence rules take much the same form that a programmer's rewritten rules would, if she had to do the rewrite by hand.
Here's are the rules of a Marpa grammar, with a sequence:
my $grammar = Marpa::XS::Grammar->new(
{ start => 'Document',
strip => 0,
lhs_terminals => 0,
rules => [
{ lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
],
}
);
And here is how Marpa translates this sequence:
PREDICTING @0 0: Document -> Stuff[Seq:1-*]
PREDICTING @0 1: Stuff[Seq:1-*] -> Stuff
PREDICTING @0 2: Stuff[Seq:1-*] -> Stuff[Seq:1-*] Stuff
Leo Items
In the original Earley's algorithm, left recursion was very efficient, but right recursion required a completion Earley item for every step of the recursion in every Earley set where the right recursion might end. That made right recursion take O(n**2) time and space, where n was the length of the right recursion.
Joop Leo figured out a way to get around this. Leo noted that most of the completions were trivial, in the sense that their only purpose was to create the topmost completion in a chain. He figured out how to get directly to the topmost completion and let that topmost completion stand in for the chain of trivial completions that produced it. This reduced the time and space required to O(n). Marpa uses Leo's methods.
The topmost completions are labeled "(Leo)
" as we'll see below. These topmost completions stand in for zero or more trivial completions in the same Earley set.
Here is a example. First, the right recursive grammar:
my $grammar = Marpa::XS::Grammar->new(
{ start => 'S',
strip => 0,
lhs_terminals => 0,
rules => [
{ lhs => 'S', rhs => [qw/Top_sequence/] },
{ lhs => 'Top_sequence', rhs => [qw/Top Top_sequence/] },
{ lhs => 'Top_sequence', rhs => [qw/Top/] },
{ lhs => 'Top', rhs => [ qw/Upper_Middle/ ] },
{ lhs => 'Upper_Middle', rhs => [ qw/Lower_Middle/ ] },
{ lhs => 'Lower_Middle', rhs => [ qw/Bottom/ ] },
{ lhs => 'Bottom', rhs => [ qw/T/ ] },
],
}
);
And here is the progress report at earleme 20:
COMPLETED @0-20 0: S -> Top_sequence
COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence
COMPLETED @19-20 2: Top_sequence -> Top
COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle
COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle
COMPLETED @19-20 6: Bottom -> T
COMPLETED @0-20 7: S['] -> S
The First Leo Topmost Completion
In these reports we see three Leo topmost completions. Let's look at the first one:
COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence
This is the top of a chain of 19 Earley items, all for the same rule: "Top_sequence ::= Top Top_sequence
". This topmost one spans the earleme range from 0 to 20 ("@0-20
"). The other 18 in the chain are trivial completions, completions whose only purpose is to produce the topmost completion. The next one down in the chain (not shown in the show_progress
output) is identical, except for the origin, and so would look like this:
COMPLETED @1-20 1: Top_sequence -> Top Top_sequence
Similarly, the remaining 17 Earley items would differ from the topmost one only in their origin, each one starting one earleme later than the one above it. The bottom of the chain is shown in the show_progress
output and is a completion of a different rule:
COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle
This example illustrates the improvement introduced by the Leo method. In this example, without the Leo methods, there would have been 20 items in the chain at earleme 20. At earleme 100 there would be 100 items, and so forth -- at earleme n there would be n items, all but two of them trivial.
By earleme 100, there would have been a total (in all Earley sets up to that point) of over 4800 trivial Earley items. By earleme 1000, there would have been over 498,000, which would have made parsing with this grammar impractical for many applications. With the Leo methods, the number of Earley items in each set stays below a small constant.
The Second Leo Topmost Completion
The second Leo topmost completion was the bottom of the chain for the first topmost completion. It illustrates something to watch for:
COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle
The chain of completions which the topmost item replaces can contain zero trivial completions. In other words, sometimes no completions are actually eliminated. That is the case here. This topmost completion stands in only for itself.
The Third Leo Topmost Completion
Our third Leo topmost completion illustrates a final point to watch for:
COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle
Unlike the second Leo completion, this one does stand in for an trivial completion, one that was eliminated. If this trivial completion had appeared in the show_progress
output, it would have looked like this:
COMPLETED @19-20 4: Lower_Middle -> Bottom
For the first Leo topmost completion, all the trivial completions that were eliminated were for the same rule as the topmost completion. That is not always the case. Leo's methods catches and optimizes right recursion, even when it is indirect, and passes through a series of different rules. In this example, the trivial completion which was eliminated is not for the same rule as the rule in the Leo topmost completion which stands in for it.
Futures
In future versions of Marpa, show_progress
may say more about with the missing trivial completions. It is not yet clear how much information, and what kind of information, to provide. Simply printing out all the eliminated, trivial completions does not seem to really be an option. When a right recursion is thousands of steps long, which it can easily be, there will often be millions of trivial completions. Nonetheless, more can be done in letting the user know about eliminated, trivial completions.
APPENDIX: FULL TRACES AND CODE FOR THE BASE EXAMPLE
Below are the code, the trace outputs and the progress report for the base example used in this document.
Code for the example
my $grammar = Marpa::XS::Grammar->new(
{ start => 'Expression',
actions => 'My_Actions',
default_action => 'first_arg',
strip => 0,
rules => [
## This is a deliberate error in the grammar
## The next line should be:
## { lhs => 'Expression', rhs => [qw/Term/] },
## I have changed the Term to 'Factor' which
## will cause problems.
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
],
}
);
$grammar->precompute();
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', q{*} ],
[ 'Number', 1 ],
[ 'Add', q{+} ],
[ 'Number', 7 ],
);
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
sub My_Actions::first_arg { shift; return shift; }
my $recce = Marpa::XS::Recognizer->new(
{ grammar => $grammar, trace_terminals => 2, mode => 'stream' } );
my $token_ix = 0;
my ( $current_earleme, $expected_tokens ) =
$recce->tokens( \@tokens, \$token_ix );
if ( $token_ix <= $#tokens ) {
$progress_report = $recce->show_progress( 0, $current_earleme );
}
Trace Output
Inaccessible symbol: Add
Inaccessible symbol: Term
Setting trace_terminals option
Expecting "Number" at earleme 0
Expecting "Expression" at earleme 0
Expecting "Factor" at earleme 0
Accepted "Number" at 0-1
Expecting "Multiply" at 1
Accepted "Multiply" at 1-2
Expecting "Number" at 2
Expecting "Factor" at 2
Accepted "Number" at 2-3
Expecting "Multiply" at 3
Rejected "Add" at 3
Progress Output
PREDICTING @0 0: Expression -> Factor
PREDICTING @0 2: Factor -> Number
PREDICTING @0 4: Factor -> Factor Multiply Factor
PREDICTING @0 5: Expression['] -> Expression
BUILDING @0-1 Factor -> Factor . Multiply Factor
COMPLETED @0-1 0: Expression -> Factor
COMPLETED @0-1 2: Factor -> Number
COMPLETED @0-1 5: Expression['] -> Expression
PREDICTING @2 2: Factor -> Number
PREDICTING @2 4: Factor -> Factor Multiply Factor
BUILDING @0-2 Factor -> Factor Multiply . Factor
BUILDING @0-3 Factor -> Factor . Multiply Factor
BUILDING @2-3 Factor -> Factor . Multiply Factor
COMPLETED @0-3 0: Expression -> Factor
COMPLETED @2-3 2: Factor -> Number
COMPLETED @0-3 4: Factor -> Factor Multiply Factor
COMPLETED @0-3 5: Expression['] -> Expression
LICENSE AND COPYRIGHT
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.