NAME
Marpa::XS::Debug - Marpa Grammar Debugging
OVERVIEW
This document describes Marpa's more powerful general-use tracing and debugging techniques in detail. It assumes that you have written a grammar for your Marpa application, and that something is going wrong. It is a very good idea to read the overview document for tracing problems before reading this document.
DESCRIPTION
To read the show_progress output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section.
Dotted Rules
The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.
Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's perly.y:
termbinop -> term ASSIGNOP term
ASSIGNOP is perly.y's internal name for the assignment operator. In plain Perl terms, this is the "=" character.
In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols.
Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule.
Here's our rule with a dot location indicated:
termbinop -> · term ASSIGNOP term
The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a prediction or a predicted rule.
Here's another dotted rule:
termbinop -> term · ASSIGNOP term
In this dotted rule, we are saying we have seen a term, but have not yet recognized an ASSIGNOP.
There's another special kind of dotted rule, a completion. A completion (also called a completed rule) is a dotted rule with the dot after all of the symbols. Here is the completion for the rule that we have been using as an example:
termbinop -> term ASSIGNOP term ·
A completion indicates that a rule has been fully recognized.
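The dot positions described above can be enumerated mechanically. The following is a minimal sketch in plain Perl, not part of Marpa; the function name dotted_rules is hypothetical, and the dot is printed as "." in the style of the progress reports shown later in this document.

```perl
use strict;
use warnings;

# Return every dotted rule for one grammar rule, as display strings.
# A rule with N right hand side symbols has N+1 dot positions.
sub dotted_rules {
    my ( $lhs, @rhs ) = @_;
    my @dotted;
    for my $dot ( 0 .. scalar @rhs ) {
        my @symbols = @rhs;
        splice @symbols, $dot, 0, q{.};    # insert the dot marker
        push @dotted, "$lhs -> " . join( q{ }, @symbols );
    }
    return @dotted;
}

my @dotted = dotted_rules(qw(termbinop term ASSIGNOP term));
print "$_\n" for @dotted;
```

The first string in the list is the prediction and the last is the completion.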
Earley Items
In dotted rules, Earley's algorithm has all but one piece of the information it needs to track. The final piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.
A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location (also called the dot location) is the location in the input stream which corresponds to the dot position.
Two noteworthy consequences follow from the way in which origin and current location are defined. First, if a dotted rule is a prediction, then origin and current location will always be the same. Second, the input stream location where a rule ends is not tracked unless the dotted rule is a completion. In other cases, we don't know if the rule will ever be completed, much less at which location.
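As an illustration of the triple, here is a minimal sketch in plain Perl. It is not how Marpa stores Earley items internally; the field names and helper functions are hypothetical, and the input locations are made up for the example.

```perl
use strict;
use warnings;

# Build an Earley item as a hash. The field names are illustrative only.
sub new_item {
    my ( $lhs, $rhs, $dot, $origin, $current ) = @_;
    return {
        lhs     => $lhs,
        rhs     => $rhs,        # array ref of right hand side symbols
        dot     => $dot,        # number of RHS symbols already recognized
        origin  => $origin,     # input location where the rule starts
        current => $current,    # input location corresponding to the dot
    };
}

sub is_prediction {
    my ($item) = @_;
    return $item->{dot} == 0 ? 1 : 0;
}

sub is_completion {
    my ($item) = @_;
    return $item->{dot} == scalar( @{ $item->{rhs} } ) ? 1 : 0;
}

# The location where the rule ends is known only for completions.
sub end_location {
    my ($item) = @_;
    return is_completion($item) ? $item->{current} : undef;
}

my @rhs       = qw(term ASSIGNOP term);
my $predicted = new_item( 'termbinop', \@rhs, 0, 0, 0 );
my $partial   = new_item( 'termbinop', \@rhs, 1, 0, 1 );
my $completed = new_item( 'termbinop', \@rhs, 3, 0, 3 );
```

In the prediction, origin and current location coincide, and end_location is defined only for the completed item.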
THE EXAMPLE
For this example of debugging, I've taken a very common example of a grammar and deliberately introduced a problem. (All the code and the full trace outputs for this example are in the Appendix.) I've commented out the correct start rule:
## { lhs => 'Expression', rhs => [qw/Term/] },
and replaced it with another start rule, one which will cause problems:
{ lhs => 'Expression', rhs => [qw/Factor/] },
In what follows, we'll pretend we don't already know where the problem is, and use the Marpa diagnostics and tracing facilities to "discover" it.
THE FIRST WARNING
Right off the bat, we get two warning messages:
Inaccessible symbol: Add
Inaccessible symbol: Term
If we were alert, these would be enough to tell us there is a serious problem. "Inaccessible" symbols are symbols which cannot be reached from the start symbol. This means that the grammar will never produce them, and that parses will never find them in the input.
Since Add and Term are both important symbols in our application, that should tell us our grammar has a serious problem. In fact, these warning messages would often be enough to point us to the error. But, in order to look at more of Marpa's tracing facilities, let's pretend we have not had our morning coffee, and that we miss the significance of these warning messages.
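What "inaccessible" means can be checked by hand with a short reachability search. The sketch below is plain Perl and does not use Marpa; the function name inaccessible_symbols is hypothetical, and the rules are copied from the example grammar without their actions.

```perl
use strict;
use warnings;

# Rules of the example grammar, with the deliberate error in the first rule.
my @rules = (
    { lhs => 'Expression', rhs => [qw/Factor/] },
    { lhs => 'Term',       rhs => [qw/Factor/] },
    { lhs => 'Factor',     rhs => [qw/Number/] },
    { lhs => 'Term',       rhs => [qw/Term Add Term/] },
    { lhs => 'Factor',     rhs => [qw/Factor Multiply Factor/] },
);

# Return the sorted list of symbols that cannot be reached from $start.
sub inaccessible_symbols {
    my ( $start, @grammar_rules ) = @_;
    my ( %all, %rhs_by_lhs );
    for my $rule (@grammar_rules) {
        $all{$_} = 1 for $rule->{lhs}, @{ $rule->{rhs} };
        push @{ $rhs_by_lhs{ $rule->{lhs} } }, @{ $rule->{rhs} };
    }

    # Follow the rules outward from the start symbol.
    my %reached = ( $start => 1 );
    my @work    = ($start);
    while ( defined( my $symbol = shift @work ) ) {
        for my $next ( @{ $rhs_by_lhs{$symbol} // [] } ) {
            next if $reached{$next}++;
            push @work, $next;
        }
    }
    my @unreached = grep { !$reached{$_} } keys %all;
    return sort @unreached;
}

my @inaccessible = inaccessible_symbols( 'Expression', @rules );
print "Inaccessible symbol: $_\n" for @inaccessible;
```

For the grammar as written, the search reports the same two symbols as the warnings.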
THE trace_terminals OUTPUT
Before looking at Marpa's progress reports, it is usually best to orient yourself by looking at the output from trace_terminals. Typically, you will be interested in what were the last tokens to be accepted, and perhaps what tokens the recognizer was looking for when it didn't find what it wanted. Sometimes that information alone is enough to make it clear where the problem is.
The full trace_terminals output for this example is in the Appendix. We see that the recognizer seems to accept "42*1" but it fails when it confronts the plus sign ("+"). The last two lines before the token is rejected are:
Accepted "Number" at 2-3
Expecting "Multiply" at 3
A note in passing: Marpa shows the location of the tokens it accepts as a range of locations. For Number, the range is "2-3", indicating that earleme 2 is the start location and earleme 3 is the end location. That level of detail will seem like overkill in ordinary applications, where every token has length 1. But Marpa allows other input models, and in those models the information about start and end location of the token is important.
Returning to the problem at hand: We notice that at earleme 3 we are expecting a Multiply operator, but not an Add operator. That should strike us as strange, and send us back to the grammar. But for the sake of our example we will assume that we are slow on the uptake today, and that this does not clue us in. We move on.
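When the trace is long, the same reading can be done with a small script. This is a sketch in plain Perl that assumes the trace line format is exactly the one shown in this document's Appendix; the function name summarize_trace and the hash keys are hypothetical, and other Marpa versions may format these lines differently.

```perl
use strict;
use warnings;

# The trace_terminals lines from this document's example.
my @trace = (
    'Expecting "Expression" at earleme 0',
    'Expecting "Factor" at earleme 0',
    'Expecting "Number" at earleme 0',
    'Accepted "Number" at 0-1',
    'Expecting "Multiply" at 1',
    'Accepted "Multiply" at 1-2',
    'Expecting "Factor" at 2',
    'Expecting "Number" at 2',
    'Accepted "Number" at 2-3',
    'Expecting "Multiply" at 3',
    'Rejected "Add" at 3-4',
);

# Collect what was expected at each earleme, plus accepted and rejected tokens.
sub summarize_trace {
    my (@lines) = @_;
    my ( %expected, @accepted, @rejected );
    for my $line (@lines) {
        if ( $line =~ /\A Expecting \s+ "([^"]+)" \s+ at \s+ (?:earleme \s+)? (\d+) \z/xms ) {
            push @{ $expected{$2} }, $1;
        }
        elsif ( $line =~ /\A (Accepted|Rejected) \s+ "([^"]+)" \s+ at \s+ (\d+)-(\d+) \z/xms ) {
            my $token = { name => $2, start => $3, end => $4 };
            if   ( $1 eq 'Accepted' ) { push @accepted, $token }
            else                      { push @rejected, $token }
        }
    }
    return { expected => \%expected, accepted => \@accepted, rejected => \@rejected };
}

my $summary  = summarize_trace(@trace);
my $last     = $summary->{accepted}[-1];
my $rejected = $summary->{rejected}[0];
my @wanted   = @{ $summary->{expected}{ $rejected->{start} } };
print "Last accepted: $last->{name} at $last->{start}-$last->{end}\n";
print "Rejected $rejected->{name} at $rejected->{start}; expected: @wanted\n";
```

The summary pairs the rejected token with what the recognizer was expecting at the same earleme, which is the comparison made in the text above.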
THE show_progress OUTPUT
Marpa's most powerful tool for debugging grammars is its progress report, which shows the Earley items being worked on. In the Appendix, progress reports for the entire parse are shown. Our example in this document is a very small one, so that producing progress reports for the entire parse is a reasonable thing to do in this case. If a parse is at all large, you will usually need to be selective.
The progress report that is usually of most interest is the one for the Earley set that you were working on when the error occurred. This is called the "current earleme". In our example the current earleme is earleme 3. You can find out the current earleme using the current_earleme method. But you can also simply call show_progress without arguments. By default, show_progress prints out only the progress reports for the current earleme.
Here are the progress reports for the current earleme, earleme 3, from our example.
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
F5 @0-3 Expression['] -> Expression .
Progress Report Lines
The last field of each Progress Report line shows, in fully expanded form, the dotted rule we were working on. Since that is the most important information, it may be tempting to skip the rest of this section, and move directly forward with the debugging.
In fact, you might want to do exactly that -- skip to the beginning of the next section. What follows talks about the details of the format of the first few fields in each progress report line. These first few fields, while helpful, are usually one or more of the following: obvious in their meaning, not relevant to our example, or repetitive of information which can be deduced from other fields.
F5 @0-3 Expression['] -> Expression .
Prefixed to the dotted rule are two fields: "F5 @0-3". The "F5" says that this is a completed or final rule, and that it is rule number 5. The rule number is used in other tracing and debugging output, when displaying the whole rule would take too much space. In what follows we won't need the rule number.
The "@0-3" describes the location of the dotted rule in the parse. In its simplest form, the location field is two earleme numbers, separated by a hyphen. The first earleme number is the origin, the place where Marpa first started recognizing the rule. The last earleme number is the dot earleme, the earleme location of the dot in a dotted rule. "@0-3" says that this rule began at earleme 0, and that the dot is at location 3.
Earleme 3 is the current earleme and this is no coincidence. Whenever we are displaying the progress report for an earleme, all the progress report lines will have their dot location at that earleme.
As an aside, notice that the left hand side symbol is Expression[']. That is Marpa's special start symbol. The presence of a completed start rule in our progress report indicates that if our input ended at earleme 3, it would be a valid sentence in the language of our grammar.
Let's look at another progress report line:
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
Here the "R4:1" indicates that this is rule number 4 (the "R" stands for rule number) and that its dot position is after the first symbol on the right hand side. Symbol positions are numbered using the ordinal of the symbol just before the position. Symbols are numbered starting with 1, and symbol position 1 is the position immediately after symbol 1.
The next field ("x2") is new. It is a count. A progress report can contain multiple instances of the same dotted rule, and when there is more than one, a count field is included in the progress report line. Here the "x2" indicates that there are two instances of Factor -> Factor . Multiply Factor at this earleme.
Multiple instances of a dotted rule will differ in their origin, and where they do, this is shown in the location field of the progress report line. Here the location field is "@0,2-3", which indicates that one instance of this dotted rule has its origin at earleme 0, and the other has its origin at earleme 2. All instances reported on a single progress report line will always have the same dot location, and in this case it is earleme 3.
Predicted rules also appear in progress reports:
P2 @2-2 Factor -> . Number
Here the "P" in the summary field means "predicted". As with much of the information in the summary field, this only repeats what is obvious from the full expansion of the dotted rule later in the line. But final (or completed) and predicted rules can be important and the initial "F" and "P" make these lines easy to spot.
Notice that in the predicted rule, the origin is the same as the dot earleme. This will always be the case with predicted rules.
For any given earleme, no predicted rule has more than one instance. For other dotted rules, there may be many instances of the dotted rule at a single earleme. In grammars with right recursion, the number of instances is limited only by the length of the recursion. The length of a recursion is limited primarily by the available memory.
When there are many instances of a dotted rule at a single earleme location, it may be inconvenient to show all the origins in a comma-separated list. In that case the origins in the location field are shown as a range, with the earliest separated from the most recent by a "...". The example in this document contains no lines with a large number of instances, but here is an example from another grammar. This is the progress report line for the completed rule in a right recursion of length 20.
F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .
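The field layout described in this section can be summarized as a parser for a single progress report line. This is a sketch in plain Perl based only on the line formats shown in this document; the function name parse_progress_line and the hash keys are hypothetical, and the code is not part of Marpa. A "..." range is kept as its two endpoints rather than expanded, since the report line itself only gives the earliest and most recent origin.

```perl
use strict;
use warnings;

my $line_re = qr{
    \A ([PRF]) (\d+)          # summary letter and rule number
    (?: : (\d+) )?            # dot position, present on "R" lines
    \s+
    (?: x (\d+) \s+ )?        # optional count of instances
    \@ ([\d,.]+) - (\d+)      # origin or origins, then the dot earleme
    \s+ (.+) \z               # the dotted rule in expanded form
}xms;

# Split one progress report line into its fields.
# Returns a hash ref, or nothing if the line is not in the expected format.
sub parse_progress_line {
    my ($line) = @_;
    my ( $type, $rule, $dot, $count, $origins, $dot_earleme, $dotted_rule ) =
        $line =~ $line_re
        or return;
    my %item = (
        type         => $type,
        rule         => $rule,
        dot_position => $dot,           # undef for "P" and "F" lines
        count        => $count // 1,    # no count field means one instance
        dot_earleme  => $dot_earleme,
        dotted_rule  => $dotted_rule,
    );
    if ( $origins =~ /\A (\d+) [.]{3} (\d+) \z/xms ) {
        $item{origin_range} = [ $1, $2 ];    # earliest and most recent origin
    }
    else {
        $item{origins} = [ split /,/, $origins ];
    }
    return \%item;
}

my $in_progress = parse_progress_line('R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor');
my $recursion   = parse_progress_line('F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .');
my $predicted   = parse_progress_line('P2 @2-2 Factor -> . Number');
print "Rule $in_progress->{rule}, origins @{ $in_progress->{origins} }, dot at $in_progress->{dot_earleme}\n";
```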
OK! Now to Find the Bug
Here again are progress reports at the earleme where things went wrong:
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
F5 @0-3 Expression['] -> Expression .
We see that we have completed rules for Expression and Factor, as expected. We also see two Earley items that show that we are in the process of building another Factor, and that it is expecting a Multiply symbol. This is not the rule we want, but it explains why the trace_terminals output showed that the recognizer was expecting a Multiply symbol.
What we want to know is, why is the recognizer not expecting an Add symbol? Looking back at the grammar, we see that only one rule uses the Add symbol: the rule "Term -> Term Add Term". The next step is to look at the Earley items for this rule. But there is a problem. We don't find any.
Next, we ask ourselves, what is the earliest place the "Term -> Term Add Term" rule should be appearing? The answer is that there should be a prediction of "Term -> Term Add Term" at earleme 0. So we look at the predictions at earleme 0.
P0 @0-0 Expression -> . Factor
P2 @0-0 Factor -> . Number
P4 @0-0 Factor -> . Factor Multiply Factor
P5 @0-0 Expression['] -> . Expression
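In a larger progress report, the search for a particular rule is easier to do with a script than by eye. Here is a sketch in plain Perl, assuming the progress report line format shown in this document; the function name lines_for_rule is hypothetical. The sample lines are the earleme 0 and earleme 3 reports from the example.

```perl
use strict;
use warnings;

my @progress = (
    'P0 @0-0 Expression -> . Factor',
    'P2 @0-0 Factor -> . Number',
    'P4 @0-0 Factor -> . Factor Multiply Factor',
    q{P5 @0-0 Expression['] -> . Expression},
    'F0 @0-3 Expression -> Factor .',
    'F2 @2-3 Factor -> Number .',
    'R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor',
    'F4 @0-3 Factor -> Factor Multiply Factor .',
    q{F5 @0-3 Expression['] -> Expression .},
);

# Return the progress lines whose rule, ignoring the dot, is $rule.
sub lines_for_rule {
    my ( $rule, @lines ) = @_;
    my @found;
    for my $line (@lines) {
        # The dotted rule is everything after the location field.
        my ($dotted) = $line =~ / \@ \S+ \s+ (.+) \z/xms or next;
        # Drop the dot, which appears as a lone "." between symbols.
        my $undotted = join q{ }, grep { $_ ne q{.} } split q{ }, $dotted;
        push @found, $line if $undotted eq $rule;
    }
    return @found;
}

my @term_lines   = lines_for_rule( 'Term -> Term Add Term',           @progress );
my @factor_lines = lines_for_rule( 'Factor -> Factor Multiply Factor', @progress );
print scalar(@term_lines),   " line(s) for Term -> Term Add Term\n";
print scalar(@factor_lines), " line(s) for Factor -> Factor Multiply Factor\n";
```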
No "Term -> Term Add Term" rule. We are never even predicting a "Term -> Term Add Term" rule. We look back at the grammar, and start from the beginning.
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
Our special start symbol is Expression['] and we do see a rule with Expression['] on the left hand side. This rule in turn produces an Expression symbol, and there is a rule with Expression on the left hand side. Expression in turn produces a Factor symbol, and there are two rules with Factor on the left hand side.
But none of these rules ever produce a Term. In fact, however far we follow the productions, no rule ever produces a Term. At this point we see the problem: If we start at the start symbol, and follow the rules of our grammar, we will never get to a Term symbol. Which is exactly what that first warning message was saying.
Now that we know what is wrong, we can reread our grammar, and see that our Expression -> Factor rule is wrong. It should be Expression -> Term. Change that and the problem is fixed.
COMPLICATIONS
Internally, Marpa rewrites Earley items and grammars. show_progress hides most of this from the user. But some aspects of Marpa's rewrites are relevant or useful to know.
Special Symbols
Marpa uses a few special symbols internally which it is useful for the user of show_progress to be aware of. To distinguish them, Marpa's internal symbols end in a right square bracket ("]"). No user-defined symbol is allowed to end in a right square bracket.
One of these special symbols is Marpa's special start symbol, which always ends in "[']". Marpa augments all of its grammars with a special start rule, which will have the special start symbol on its left hand side. We saw this above with the Expression['] -> Expression rule.
If the empty, or null, string is a sentence in the language of the grammar, Marpa will add a special empty start rule. The special empty start rule will have its own special null start symbol on its left hand side. The special null start symbol ends in "['][]".
Empty Rules
Marpa removes all of the empty and nulling rules in the original grammar. Internally, Marpa marks symbols as nulling and this produces the same result much more efficiently. Outwardly, the effect is the same, so much so that you might not even notice the absence of the original grammar's nulling and empty rules from the progress reports.
To be precise, Marpa allows exactly one nulling rule, but this is a rule that it creates internally, not one specified by the user. If the grammar accepts the null string as valid input, Marpa creates a nulling start rule.
Marpa's removal of nulling rules is recursive, as it needs to be. Removing rules that are nulling reveals that the left hand side symbol of those rules is also nulling. This in turn can reveal other nulling rules.
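The recursive nature of this discovery can be illustrated with a fixed-point loop. This is a simplified sketch in plain Perl and makes no claim to be Marpa's actual algorithm: the grammar is hypothetical (it is not the example grammar of this document), the function name nulling_symbols is invented, and a symbol is treated as nulling only when every one of its rules has a right hand side made up entirely of symbols already known to be nulling.

```perl
use strict;
use warnings;

# A hypothetical grammar, chosen to show the chain of discoveries.
my @rules = (
    { lhs => 'A', rhs => [] },             # an empty rule
    { lhs => 'B', rhs => [qw/A A/] },      # nulling once A is known to be
    { lhs => 'C', rhs => [qw/B/] },        # nulling once B is known to be
    { lhs => 'D', rhs => [qw/C x/] },      # not nulling: x has no rules
);

# Repeat until a full pass marks no new symbol as nulling.
sub nulling_symbols {
    my (@grammar_rules) = @_;
    my %rules_by_lhs;
    push @{ $rules_by_lhs{ $_->{lhs} } }, $_ for @grammar_rules;

    my %nulling;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        LHS: for my $lhs ( sort keys %rules_by_lhs ) {
            next LHS if $nulling{$lhs};
            for my $rule ( @{ $rules_by_lhs{$lhs} } ) {
                # Skip this symbol if any RHS symbol is not yet known
                # to be nulling. An empty RHS passes trivially.
                next LHS if grep { !$nulling{$_} } @{ $rule->{rhs} };
            }
            $nulling{$lhs} = 1;
            $changed = 1;
        }
    }
    my @nulling = sort keys %nulling;
    return @nulling;
}

my @nulling = nulling_symbols(@rules);
print "Nulling symbols: @nulling\n";
```

Marking A as nulling is what reveals that B's rule is nulling, and B in turn reveals C.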
Sequences
Marpa allows the user to explicitly specify sequences, rather than write them out in BNF. Marpa is able to optimize explicitly specified sequences. For the actual parsing, Marpa rewrites sequences into BNF.
In the Earley items, the rules will have been translated into BNF, so that is how they appear in show_progress. Marpa's rewritten sequence rules take much the same form that a programmer's rewritten rules would, if she had to do the rewrite by hand.
Here are the rules of a Marpa grammar, with a sequence:
my $grammar = Marpa::XS::Grammar->new(
    {   start         => 'Document',
        strip         => 0,
        lhs_terminals => 0,
        rules         => [
            { lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
        ],
    }
);
And here is how Marpa translates this sequence:
P1 @0-0 Document -> . Document[Subseq:0:1]
P2 @0-0 Document[Subseq:0:1] -> . Stuff
P3 @0-0 Document[Subseq:0:1] -> . Document[Subseq:0:1] Stuff
P4 @0-0 Document['] -> . Document
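A by-hand rewrite of the same shape can be sketched as follows. This is plain Perl and is not Marpa's implementation: the function name rewrite_sequence is hypothetical, it covers only a sequence with a minimum of one item and no separator, and the internal symbol name is simply copied from the progress report above and passed in as an argument, since this document does not describe how Marpa constructs that name.

```perl
use strict;
use warnings;

# Rewrite "lhs -> item+" as three BNF rules using left recursion,
# following the shape of the translated rules shown above.
sub rewrite_sequence {
    my ( $lhs, $item, $internal ) = @_;
    return (
        { lhs => $lhs,      rhs => [$internal] },
        { lhs => $internal, rhs => [$item] },
        { lhs => $internal, rhs => [ $internal, $item ] },
    );
}

my @bnf = rewrite_sequence( 'Document', 'Stuff', 'Document[Subseq:0:1]' );
my @shown = map { "$_->{lhs} -> " . join( q{ }, @{ $_->{rhs} } ) } @bnf;
print "$_\n" for @shown;
```

The special start rule, Document['] -> Document, is not produced by this sketch, because Marpa adds it to every grammar independently of sequences.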
APPENDIX: FULL CODE AND OUTPUT FOR THE EXAMPLE
Below are the code, the trace outputs and the progress report for the example used in this document.
Code
use strict;
use warnings;
use Marpa::XS;

my $grammar = Marpa::XS::Grammar->new(
    {   start          => 'Expression',
        actions        => 'My_Actions',
        default_action => 'first_arg',
        strip          => 0,
        rules          => [
            ## This is a deliberate error in the grammar
            ## The next line should be:
            ## { lhs => 'Expression', rhs => [qw/Term/] },
            ## I have changed the Term to 'Factor' which
            ## will cause problems.
            { lhs => 'Expression', rhs => [qw/Factor/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            {   lhs    => 'Term',
                rhs    => [qw/Term Add Term/],
                action => 'do_add'
            },
            {   lhs    => 'Factor',
                rhs    => [qw/Factor Multiply Factor/],
                action => 'do_multiply'
            },
        ],
    }
);
$grammar->precompute();

# The input "42*1+7" as a list of [ token name, token value ] pairs
my @tokens = (
    [ 'Number',   42 ],
    [ 'Multiply', q{*} ],
    [ 'Number',   1 ],
    [ 'Add',      q{+} ],
    [ 'Number',   7 ],
);

sub My_Actions::do_add {
    my ( undef, $t1, undef, $t2 ) = @_;
    return $t1 + $t2;
}

sub My_Actions::do_multiply {
    my ( undef, $t1, undef, $t2 ) = @_;
    return $t1 * $t2;
}

sub My_Actions::first_arg { shift; return shift; }

my $recce = Marpa::XS::Recognizer->new(
    { grammar => $grammar, trace_terminals => 2 } );

# Stop reading at the first token the recognizer does not accept
my $token_ix = 0;
TOKEN: for my $token_and_value (@tokens) {
    last TOKEN if not defined $recce->read( @{$token_and_value} );
}

# Progress reports from earleme 0 through the current earleme
my $current_earleme = $recce->current_earleme();
my $progress_report = $recce->show_progress( 0, $current_earleme );
Trace Output
Inaccessible symbol: Add
Inaccessible symbol: Term
Setting trace_terminals option
Expecting "Expression" at earleme 0
Expecting "Factor" at earleme 0
Expecting "Number" at earleme 0
Accepted "Number" at 0-1
Expecting "Multiply" at 1
Accepted "Multiply" at 1-2
Expecting "Factor" at 2
Expecting "Number" at 2
Accepted "Number" at 2-3
Expecting "Multiply" at 3
Rejected "Add" at 3-4
Progress Output
P0 @0-0 Expression -> . Factor
P2 @0-0 Factor -> . Number
P4 @0-0 Factor -> . Factor Multiply Factor
P5 @0-0 Expression['] -> . Expression
F0 @0-1 Expression -> Factor .
F2 @0-1 Factor -> Number .
R4:1 @0-1 Factor -> Factor . Multiply Factor
F5 @0-1 Expression['] -> Expression .
P2 @2-2 Factor -> . Number
P4 @2-2 Factor -> . Factor Multiply Factor
R4:2 @0-2 Factor -> Factor Multiply . Factor
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
F5 @0-3 Expression['] -> Expression .
COPYRIGHT AND LICENSE
Copyright 2011 Jeffrey Kegler
This file is part of Marpa::XS. Marpa::XS is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::XS is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::XS. If not, see
http://www.gnu.org/licenses/.