=head1 NAME Marpa::Debug - Marpa Grammar Debugging =head1 OVERVIEW This document describes Marpa's more powerful general-use tracing and debugging techniques in detail. It assumes that you have written a grammar for your Marpa application, and that something is going wrong. It is a very good idea to read L<the overview document for tracing problems|Marpa::Tracing> before reading this document. =head1 DESCRIPTION To read the L<C<show_progress>|Marpa::Recognizer/"show_progress"> output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section. =head2 Dotted Rules The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream. Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's C<perly.y> S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= term ASSIGNOP term>> In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols. Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the B<dot location>. Taken as a pair, a rule and a dot location are called a B<dotted rule>. Here's our rule with a dot location indicated: S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= E<middot> term ASSIGNOP term>> The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a B<prediction>. Here's another dotted rule: S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= term E<middot> ASSIGNOP term>> In this dotted rule, we are saying we have seen a C<term>, but have not yet recognized an C<ASSIGNOP>. (C<ASSIGNOP> is C<perly.y>'s internal name for the assignment operator. In plain Perl terms, this is the "C<=>" character.) There's another special kind of dotted rule, a completion. A B<completion> is a dotted rule with the dot after all of the symbols. Here's the completed version of the example we've been using: S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= term ASSIGNOP term E<middot>>> A completion indicates that a rule has been fully recognized. =head2 Earley Items In dotted rules, Earley's algorithm has all but one piece of the information it needs to track. The final piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items. A convenient way to think of an B<Earley item> is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The B<origin> is the location in the input stream where the dotted rule starts. The B<current location> is the location in the input stream which corresponds to the dot position. Two noteworthy consequences follow from way in which origin and current location are defined. First, if a dotted rule is a prediction, origin and current location will always be the same. Second, the input stream location where a rule ends is not tracked unless the dotted rule is a completion. In other cases, we don't know if the rule will be completed, much less at which location. =head1 THE BASE EXAMPLE For this example of debugging, I've taken a very common example of a grammar and deliberately introduced a problem. (All the code and the full trace outputs for this base example are in the Appendix.) I've commented out the correct start rule: =for Marpa::Display name: Debug Example Part 1 partial: 1 normalize-whitespace: 1 ## { lhs => 'Expression', rhs => [qw/Term/] }, =for Marpa::Display::End and replaced it with another start rule, one which will cause problems: =for Marpa::Display name: Debug Example Part 1 partial: 1 normalize-whitespace: 1 { lhs => 'Expression', rhs => [qw/Factor/] }, =for Marpa::Display::End In what follows, we'll pretend we don't already know where the problem is, and use the Marpa diagnostics and tracing facilities to "discover" it. =head1 AN EARLY WARNING Right off the bat, we get two warning messages: =for Marpa::Display name: Debug Example Trace Output partial: 1 normalize-whitespace: 1 Inaccessible symbol: Add Inaccessible symbol: Term =for Marpa::Display::End If we were alert, these would be enough to tell us there is a serious problem. "Inaccessible" symbols are symbols which cannot be reached from the start symbol. This means that the grammar will never produce them, and that parses will never find them in the input. Since C<Add> and C<Term> are both important symbols in our application, that should tell us our grammar has a serious problem. In fact, these warning messages would often be enough to point us to the error. But, in order to look at more of Marpa's tracing facilities, let's pretend we have not had our morning coffee, and that we miss the significance of these warning messages. =head1 THE trace_terminals OUTPUT Before looking at Marpa's progress reports, it usually best to orient yourself by looking at the output from L<C<trace_terminals>|Marpa::Recognizer/trace_terminals>. Typically, you will be interested in what were the last tokens to be accepted, and perhaps what tokens the recognizer was looking for when it didn't find what it wanted. Sometimes that information alone is enough to make it clear where the problem is. The full L<C<trace_terminals>|Marpa::Recognizer/trace_terminals> output for this base example is in the Appendix. We see that the recognizer seems to accept "C<42*1>" but it fails when it confronts the plus sign ("C<+>"). The last two lines are: =for Marpa::Display name: Debug Example Trace Output partial: 1 normalize-whitespace: 1 Accepted "Number" at 2-3 Expecting "Multiply" at 3 =for Marpa::Display::End A note in passing: Marpa shows the location of the tokens it accepts as a range of locations. For C<Number>, the range is "C<2-3>", indicating that earleme 2 is the start location and earleme 3 is the end location. That level of detail will seem like overkill in ordinary applications, where every token has length 1. But Marpa allows other input models, and in those models the information about start and end location of the token is important. Returning to the problem at hand: debugging the grammar. We notice that at earleme 3 we are expecting a C<Multiply> operator, but not an C<Add> operator. That should strike us as strange, and send us back to the grammar. But for the sake of our example we will assume that we are slow on the uptake today, and that this does not clue us in. We move on. =head1 THE show_progress OUTPUT The last tool that you should ever need for debugging Marpa grammars are its progress reports. The progress reports show Earley items in their original form. Marpa does not actually use Earley's original items directly. But they underlie Marpa's algorithm and Marpa reconstructs them for L<C<show_progress>|Marpa::Recognizer/"show_progress">. In the Appendix, progress reports for the entire parse are shown. Our base example in this document is a very small one, so that's a reasonable thing to do in this case. If a parse is at all large, you'll often need to be selective. Of most interest is the progress report for the "current" Earley set. In our example that's earleme 3. You can find out the current earleme from the return value of either the L<C<tokens> method|Marpa::Recognizer/"tokens"> or the L<C<status> method|Marpa::Recognizer/"status">. But it may be easier to simply call L<C<show_progress>|Marpa::Recognizer/"show_progress"> without arguments. By default, L<C<show_progress>|Marpa::Recognizer/"show_progress"> prints out only the progress reports for the current earleme. Here are the progress reports for the current earleme, earleme 3, from our example. =for Marpa::Display name: Debug Example Progress Report partial: 1 normalize-whitespace: 1 BUILDING @0-3 Factor -> Factor . Multiply Factor BUILDING @2-3 Factor -> Factor . Multiply Factor COMPLETED @0-3 0: Expression -> Factor COMPLETED @2-3 2: Factor -> Number COMPLETED @0-3 4: Factor -> Factor Multiply Factor COMPLETED @0-3 5: Expression['] -> Expression =for Marpa::Display::End As an aside, we see that there is a completed rule for C<Expression[']>, which is Marpa's special start symbol. That indicates that if our input ended at earleme 3, it would be a valid sentence in the language of our grammar. More relevant for our purposes, we see that we have completed rules for C<Expression>, and C<Factor>, as expected. We also see two Earley items that show that we are in the process of building another C<Factor>, and that it is expecting a C<Multiply> symbol. This is not the rule we want, but it explains why the C<trace_terminals> output showed that the recognizer was expecting a C<Multiply> symbol. What we want to know is, why is the recognizer B<not> expecting an C<Add> symbol? Looking back at the grammar, we see that only one rule uses the C<Add> symbol: the rule "C<Term ::= Term Add Term>". The next step is to look at the Earley items for this rule. But there is a problem. We don't find any. Next, we ask ourselves, what is the earliest place the "C<Term ::= Term Add Term>" rule should be appearing? The answer is that there should be a prediction of "C<Term ::= Term Add Term>" at earleme 0. So we look at the predictions at earleme 0. =for Marpa::Display name: Debug Example Progress Report partial: 1 normalize-whitespace: 1 PREDICTING @0 0: Expression -> Factor PREDICTING @0 2: Factor -> Number PREDICTING @0 4: Factor -> Factor Multiply Factor PREDICTING @0 5: Expression['] -> Expression =for Marpa::Display::End No "C<Term ::= Term Add Term>" rule. We are not even predicting a "C<Term ::= Term Add Term>" rule. We look back at the grammar, and start from the beginning. =for Marpa::Display: name: Debug Example Part 1 partial: 1 normalize-whitespace: 1 { lhs => 'Expression', rhs => [qw/Factor/] }, { lhs => 'Term', rhs => [qw/Factor/] }, { lhs => 'Factor', rhs => [qw/Number/] }, { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' }, { lhs => 'Factor', rhs => [qw/Factor Multiply Factor/], action => 'do_multiply' }, =for Marpa::Display::End Our special start symbol is C<Expression[']> and we do see a prediction for the rule with C<Expression[']> on the left hand side. It in turn predicts an C<Expression> symbol, and we do see a prediction for the rule with C<Expression> on the left hand side. C<Expression> in turn predicts a C<Factor> symbol, and we do see a prediction for both of the rules with C<Factor> on the left hand side. But we don't see anything predicting C<Term>. And then it hits us. If we start at the start symbol, and follow the rules of our grammar, we will never get to a C<Term> symbol. That's what the warning message was trying to tell us. This makes us reread our grammar, and see that out C<Expression ::= Factor> rule is wrong. It should be C<Expression ::= Term>. We have found the source of our problem. =head1 COMPLICATIONS Internally, Marpa rewrites Earley items and grammars. L<C<show_progress>|Marpa::Recognizer/"show_progress"> hides most of this from the user. But some aspects of Marpa's rewrites are relevant and useful to know. =head2 Special Symbols Marpa uses a few special symbols internally which it is useful for the the user of L<C<show_progress>|Marpa::Recognizer/"show_progress"> to be aware of. To distinguish them, Marpa's internal symbols end in a right square bracket ("C<]>"). No user-defined symbol is allowed to end in a right square bracket. One of these special symbols is Marpa's special start symbol, which always ends in "C<[']>". Marpa augments all of its grammars with a special start rule, which will have the special start symbol on its left hand side. We saw this above with the C<Expression['] ::= Expression> rule. If the empty, or null, string is a sentence in the language of the grammar, Marpa will add a special empty start rule. The special empty start rule will have its own special null start symbol on its left hand side. The special null start symbol ends in "C<['][]>". =head2 Empty Rules If you have empty rules in your grammar, you may notice that they never appear in the traces. That is because Marpa allows at most one empty rule in any grammar, and that rule is the one Marpa adds itself: the special empty start rule. Marpa removes all empty and nulling rules in the original grammar. Marpa's removal of nulling rules is recursive, at it needs to be. Removing rules that are nulling reveals that the left hand side symbol of those rules is also nulling. This in turn can reveal other nulling rules. =head2 Sequences Marpa allows the user to explicitly specify sequences, rather than write them out in BNF. Marpa is able to optimize explicitly specified sequences. For the actual parsing, Marpa rewrites sequences into BNF. In the Earley items, the rules will have been translated into BNF, so that is how they appear in L<C<show_progress>|Marpa::Recognizer/"show_progress">. Marpa's rewritten sequence rules take much the same form that a programmer's rewritten rules would, if she had to do the rewrite by hand. Here's are the rules of a Marpa grammar, with a sequence: =for Marpa::Display: name: Debug Sequence Example perltidy: '-dcsc -sil=0' my $grammar = Marpa::Grammar->new( { start => 'Document', strip => 0, lhs_terminals => 0, rules => [ { lhs => 'Document', rhs => [qw/Stuff/], min => 1 }, ], } ); =for Marpa::Display::End And here is how Marpa translates this sequence: =for Marpa::Display name: Debug Sequence Example Progress Report partial: 1 normalize-whitespace: 1 PREDICTING @0 0: Document -> Stuff[Seq:1-*] PREDICTING @0 1: Stuff[Seq:1-*] -> Stuff PREDICTING @0 2: Stuff[Seq:1-*] -> Stuff[Seq:1-*] Stuff =for Marpa::Display::End =head2 Leo Items In the original Earley's algorithm, left recursion was very efficient, but right recursion required a completion Earley item for every step of the recursion in every Earley set where the right recursion might end. That made right recursion take O(I<n>**2) time and space, where I<n> was the length of the right recursion. Joop Leo figured out a way to get around this. Leo noted that most of the completions were trivial, in the sense that their only purpose was to create the topmost completion in a chain. He figured out how to get directly to the topmost completion and let that topmost completion stand in for the chain of trivial completions that produced it. This reduced the time and space required to O(I<n>). Marpa uses Leo's methods. The topmost completions are labeled "C<(Leo)>" as we'll see below. These topmost completions stand in for zero or more trivial completions in the same Earley set. Here is a example. First, the right recursive grammar: =for Marpa::Display: name: Debug Leo Example perltidy: '-dcsc -sil=0' my $grammar = Marpa::Grammar->new( { start => 'S', strip => 0, lhs_terminals => 0, rules => [ { lhs => 'S', rhs => [qw/Top_sequence/] }, { lhs => 'Top_sequence', rhs => [qw/Top Top_sequence/] }, { lhs => 'Top_sequence', rhs => [qw/Top/] }, { lhs => 'Top', rhs => [ qw/Upper_Middle/ ] }, { lhs => 'Upper_Middle', rhs => [ qw/Lower_Middle/ ] }, { lhs => 'Lower_Middle', rhs => [ qw/Bottom/ ] }, { lhs => 'Bottom', rhs => [ qw/T/ ] }, ], } ); =for Marpa::Display::End And here is the progress report at earleme 20: =for Marpa::Display name: Debug Leo Example Progress Report partial: 1 normalize-whitespace: 1 COMPLETED @0-20 0: S -> Top_sequence COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence COMPLETED @19-20 2: Top_sequence -> Top COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle COMPLETED @19-20 6: Bottom -> T COMPLETED @0-20 7: S['] -> S =for Marpa::Display::End =head3 The First Leo Topmost Completion In these reports we see three Leo topmost completions. Let's look at the first one: =for Marpa::Display name: Debug Leo Example Progress Report partial: 1 normalize-whitespace: 1 COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence =for Marpa::Display::End This is the top of a chain of 19 Earley items, all for the same rule: "C<Top_sequence ::= Top Top_sequence>". This topmost one spans the earleme range from 0 to 20 ("C<@0-20>"). The other 18 in the chain are trivial completions, completions whose only purpose is to produce the topmost completion. The next one down in the chain (not shown in the L<C<show_progress>|Marpa::Recognizer/"show_progress"> output) is identical, except for the origin, and so would look like this: =for Marpa::Display ignore: 1 COMPLETED @1-20 1: Top_sequence -> Top Top_sequence =for Marpa::Display::End Similarly, the remaining 17 Earley items would differ from the topmost one only in their origin, each one starting one earleme later than the one above it. The bottom of the chain is shown in the L<C<show_progress>|Marpa::Recognizer/"show_progress"> output and is a completion of a different rule: =for Marpa::Display name: Debug Leo Example Progress Report partial: 1 normalize-whitespace: 1 COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle =for Marpa::Display::End This example illustrates the improvement introduced by the Leo method. In this example, without the Leo methods, there would have been 20 items in the chain at earleme 20. At earleme 100 there would be 100 items, and so forth -- at earleme I<n> there would be I<n> items, all but two of them trivial. By earleme 100, there would have been a total (in all Earley sets up to that point) of over 4800 trivial Earley items. By earleme 1000, there would have been over 498,000, which would have made parsing with this grammar impractical for many applications. With the Leo methods, the number of Earley items in each set stays below a small constant. =head3 The Second Leo Topmost Completion The second Leo topmost completion was the bottom of the chain for the first topmost completion. It illustrates something to watch for: =for Marpa::Display name: Debug Leo Example Progress Report partial: 1 normalize-whitespace: 1 COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle =for Marpa::Display::End The chain of completions which the topmost item replaces can contain zero trivial completions. In other words, sometimes no completions are actually eliminated. That is the case here. This topmost completion stands in only for itself. =head3 The Third Leo Topmost Completion Our third Leo topmost completion illustrates a final point to watch for: =for Marpa::Display name: Debug Leo Example Progress Report partial: 1 normalize-whitespace: 1 COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle =for Marpa::Display::End Unlike the second Leo completion, this one does stand in for an trivial completion, one that was eliminated. If this trivial completion had appeared in the L<C<show_progress>|Marpa::Recognizer/"show_progress"> output, it would have looked like this: =for Marpa::Display ignore: 1 COMPLETED @19-20 4: Lower_Middle -> Bottom =for Marpa::Display::End For the first Leo topmost completion, all the trivial completions that were eliminated were for the same rule as the topmost completion. That is B<not> always the case. Leo's methods catches and optimizes right recursion, even when it is indirect, and passes through a series of different rules. In this example, the trivial completion which was eliminated is not for the same rule as the rule in the Leo topmost completion which stands in for it. =head3 Futures In future versions of Marpa, L<C<show_progress>|Marpa::Recognizer/"show_progress"> may say more about with the missing trivial completions. It is not yet clear how much information, and what kind of information, to provide. Simply printing out all the eliminated, trivial completions does not seem to really be an option. When a right recursion is thousands of steps long, which it can easily be, there will often be millions of trivial completions. Nonetheless, more can be done in letting the user know about eliminated, trivial completions. =head1 APPENDIX: FULL TRACES AND CODE FOR THE BASE EXAMPLE Below are the code, the trace outputs and the progress report for the base example used in this document. =head2 Code for the example =for Marpa::Display: name: Debug Example Part 1 perltidy: '-dcsc -sil=0' my $grammar = Marpa::Grammar->new( { start => 'Expression', actions => 'My_Actions', default_action => 'first_arg', strip => 0, rules => [ ## This is a deliberate error in the grammar ## The next line should be: ## { lhs => 'Expression', rhs => [qw/Term/] }, ## I have changed the Term to 'Factor' which ## will cause problems. { lhs => 'Expression', rhs => [qw/Factor/] }, { lhs => 'Term', rhs => [qw/Factor/] }, { lhs => 'Factor', rhs => [qw/Number/] }, { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' }, { lhs => 'Factor', rhs => [qw/Factor Multiply Factor/], action => 'do_multiply' }, ], } ); =for Marpa::Display::End =for Marpa::Display: name: Debug Example Part 2 perltidy: '-dcsc -sil=0' $grammar->precompute(); my @tokens = ( [ 'Number', 42 ], [ 'Multiply', q{*} ], [ 'Number', 1 ], [ 'Add', q{+} ], [ 'Number', 7 ], ); sub My_Actions::do_add { my ( undef, $t1, undef, $t2 ) = @_; return $t1 + $t2; } sub My_Actions::do_multiply { my ( undef, $t1, undef, $t2 ) = @_; return $t1 * $t2; } sub My_Actions::first_arg { shift; return shift; } my $recce = Marpa::Recognizer->new( { grammar => $grammar, trace_terminals => 2, mode => 'stream' } ); my $token_ix = 0; my ( $current_earleme, $expected_tokens ) = $recce->tokens( \@tokens, \$token_ix ); if ( $token_ix <= $#tokens ) { $progress_report = $recce->show_progress( 0, $current_earleme ); } =for Marpa::Display::End =head2 Trace Output =for Marpa::Display name: Debug Example Trace Output remove-display-indent: 1 remove-blank-last-line: 1 Inaccessible symbol: Add Inaccessible symbol: Term Setting trace_terminals option Expecting "Number" at earleme 0 Expecting "Expression" at earleme 0 Expecting "Factor" at earleme 0 Accepted "Number" at 0-1 Expecting "Multiply" at 1 Accepted "Multiply" at 1-2 Expecting "Number" at 2 Expecting "Factor" at 2 Accepted "Number" at 2-3 Expecting "Multiply" at 3 Rejected "Add" at 3 =for Marpa::Display::End =head2 Progress Output =for Marpa::Display name: Debug Example Progress Report remove-display-indent: 1 remove-blank-last-line: 1 PREDICTING @0 0: Expression -> Factor PREDICTING @0 2: Factor -> Number PREDICTING @0 4: Factor -> Factor Multiply Factor PREDICTING @0 5: Expression['] -> Expression BUILDING @0-1 Factor -> Factor . Multiply Factor COMPLETED @0-1 0: Expression -> Factor COMPLETED @0-1 2: Factor -> Number COMPLETED @0-1 5: Expression['] -> Expression PREDICTING @2 2: Factor -> Number PREDICTING @2 4: Factor -> Factor Multiply Factor BUILDING @0-2 Factor -> Factor Multiply . Factor BUILDING @0-3 Factor -> Factor . Multiply Factor BUILDING @2-3 Factor -> Factor . Multiply Factor COMPLETED @0-3 0: Expression -> Factor COMPLETED @2-3 2: Factor -> Number COMPLETED @0-3 4: Factor -> Factor Multiply Factor COMPLETED @0-3 5: Expression['] -> Expression =for Marpa::Display::End =head1 LICENSE AND COPYRIGHT Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution. =cut # Local Variables: # mode: cperl # cperl-indent-level: 4 # fill-column: 100 # End: # vim: expandtab shiftwidth=4: