NAME

Marpa::PP::Debug - Marpa Grammar Debugging

OVERVIEW

This document describes the more powerful of Marpa's tracing and debugging techniques. It assumes that you have written a grammar for your Marpa application, and that something is going wrong. You should read the overview document for tracing problems before reading this document.

INTRODUCTION TO EARLEY PARSING

To read the show_progress output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section.

Dotted Rules

The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.

Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's perly.y

    termbinop -> term ASSIGNOP term

ASSIGNOP is perly.y's internal name for the assignment operator. In plain Perl terms, this is the "=" character.

In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols.

Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule.

Here's our rule with a dot location indicated:

    termbinop -> · term ASSIGNOP term

The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a prediction or a predicted rule.

Here's another dotted rule:

    termbinop -> term · ASSIGNOP term

In this dotted rule, we are saying we have seen a term, but have not yet recognized an ASSIGNOP.

There's another special kind of dotted rule, a completion. A completion (also called a completed rule) is a dotted rule with the dot after all of the symbols. Here is the completion for the rule that we have been using as an example:

    termbinop -> term ASSIGNOP term ·

A completion indicates that a rule has been fully recognized.

Earley Items

The dotted rules contain all but one piece of the information that Earley's algorithm needs to track. The missing piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.

A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location (also called the dot location) is the location in the input stream which corresponds to the dot position.

Two noteworthy consequences follow from the way in which origin and current location are defined. First, if a dotted rule is a prediction, then origin and current location will always be the same. Second, the input stream location where a rule ends is not tracked unless the dotted rule is a completion. In other cases, an Earley item does not tell us if a rule will ever be completed, much less at which location.

THE EXAMPLE

For this example of debugging, I've taken a very common example of a grammar and deliberately introduced a problem. (All the code and the full trace outputs for this example are in the Appendix.) I've commented out the correct start rule:

## { lhs => 'Expression', rhs => [qw/Term/] },

and replaced it with another start rule, one which will cause problems:

{ lhs => 'Expression', rhs => [qw/Factor/] },

In what follows, we'll pretend we don't already know where the problem is, and use the Marpa diagnostics and tracing facilities to "discover" it.

FIRST WARNING

Right off the bat, we get two warning messages:

Inaccessible symbol: Add
Inaccessible symbol: Term

If we were alert, these would be enough to tell us there is a serious problem. "Inaccessible" symbols are symbols which cannot be reached from the start symbol. This means that the grammar will never produce them, and that parses will never find them in the input.

Since Add and Term are both important symbols in our application, that should tell us our grammar has a serious problem. In fact, these warning messages would often be enough to point us to the error. But, in order to look at more of Marpa's tracing facilities, let's pretend we have not had our morning coffee, and that we miss the significance of these warning messages.

OUTPUT FROM trace_terminals

Before looking at Marpa's progress reports, it is usually best to orient yourself by looking at the output from trace_terminals. Typically, you will be interested in the last tokens to be accepted, and perhaps also in the tokens the recognizer was looking for when it didn't find what it wanted. Sometimes that information alone is enough to make it clear where the problem is.

The full trace_terminals output for this example is in the Appendix. We see that the recognizer seems to accept "42*1" but it fails when it confronts the plus sign ("+"). The last two lines are:

Accepted "Number" at 2-3
Expecting "Multiply" at 3

A note in passing: Marpa shows the location of the tokens it accepts as a range of locations. For Number, the range is "2-3", indicating that the token starts at location 2 and ends at location 3. In its default input model, all tokens have length 1, so this is clearly information overkill. But Marpa allows other input models, and in those models the information about start and end location of the token is important.

Returning to the problem at hand: We notice that at location 3 we are expecting a Multiply operator, but not an Add operator. That should strike us as strange, and send us back to the grammar. But for the sake of our example we will assume that we are slow on the uptake today, and that this does not clue us in. We move on.

OUTPUT FROM show_progress

Marpa's most powerful tool for debugging grammars is its progress report, which shows the Earley items being worked on. In the Appendix, progress reports for the entire parse are shown. Our example in this document is a very small one, so that producing progress reports for the entire parse is a reasonable thing to do in this case. If a parse is at all large, you will usually need to be selective.

The progress report that is usually of most interest is the one for the Earley set that you were working on when the error occurred. This is called the current location. In our example the current location is location 3. By default, show_progress prints out only the progress reports for the current location.

Here are the progress reports for the current location, location 3, from our example.

F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
F5 @0-3 Expression['] -> Expression .

Progress Report Lines

The last field of each Progress Report line shows, in fully expanded form, the dotted rule we were working on. Since that is the most important information, it may be tempting to skip the rest of this section, and move directly forward with the debugging.

In fact, you might want to do exactly that -- skip to the beginning of the next section. What follows talks about the details of the format of the first few fields in each progress report line. These first few fields, while helpful, are also usually one or more of obvious in their meaning, not relevant to our example, and repetitive of information which can be deduced from other fields.

F5 @0-3 Expression['] -> Expression .

Prefixed to the dotted rule are two fields: "F5 @0-3". The "F5" says that this is a completed or final rule, and that it is rule number 5. The rule number is used in other tracing and debugging output, when displaying the whole rule would take too much space. In what follows we won't need the rule number.

The "@0-3" describes the location of the dotted rule in the parse. In its simplest form, the location field is two location numbers, separated by a hyphen. The first location number is the origin, the place where Marpa first started recognizing the rule. The last location number is the dot location, the location location of the dot in a dotted rule. "@0-3" say that this rule began at location 0, and that the dot is at location 3.

The current location is location 3, and this is no coincidence. Whenever we are displaying the progress report for an location, all the progress report lines will have their dot location at that location.

As an aside, notice that the left hand side symbol is Expression[']. That is is Marpa's special start symbol. The presence of a completed start rule in our progress report indicates that if our input ended at location 3, it would be a valid sentence in the language of our grammar.

Let's look at another progress report line:

R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor

Here the "R4:1" indicates that this is rule number 4 (the "R" stands for rule number) and that its dot position is after the first symbol on the right hand side. Symbol positions are numbered using the ordinal of the symbol just before the position. Symbols are numbered starting with 1, and symbol position 1 is the position immediately after symbol 1.

The next field ("x2") is new. It is a count. A progress report can contain multiple instances of the same dotted rule, and when there is more than one, a count field is included in the progress report line. Here the "x2" indicates that there are two instances of Factor -> Factor . Multiply Factor at this location.

Multiple instances of a dotted rule will differ in their origin, and where they do, this is shown in the location field of the progress report line. Here the location field is "@0,2-3", which indicates that one instance of this dotted rule has its origin at location 0, and the other has its origin at location 2. All instances reported on a single progress report line will always have the same dot location, and in this case it is location 3.

Predicted rules also appear in progress reports:

P2 @2-2 Factor -> . Number

Here the "P" in the summary field means "predicted". As with much of the information in the summary field, this only repeats what is obvious from the full expansion of the dotted rule later in the line. But final (or completed) and predicted rules can be important and the initial "F" and "P" make these lines easy to spot.

Notice that in the predicted rule, the origin is the same as the dot location. This will always be the case with predicted rules.

For any given location, no predicted rule has more than one instance. For other dotted rules, there may be many instances of the dotted rule at a single location. In grammars with right recursion, the number of instances is limited only by the length of the recursion. The length of a recursion is limited primarily by the available memory.

When there are many instances of a dotted rule at a single location, it is inconvenient to show all the origins in a comma-separated list. In that case the origins in the location field are shown as a range, with the earliest separated from the most recent by a "...". The example in this document contains no lines with a large number of instances, but here is an example from another grammar. This is the progress report line for the completed rule in a right recursion of length 20.

F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .

OK! Now to Find the Bug

Here again are progress reports at the location where things went wrong:

F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
F5 @0-3 Expression['] -> Expression .

We see that we have completed rules for Expression, and Factor, as expected. We also see two Earley items that show that we are in the process of building another Factor, and that it is expecting a Multiply symbol. This is not the rule we want, but it explains why the trace_terminals output showed that the recognizer was expecting a Multiply symbol.

What we want to know is, why is the recognizer not expecting an Add symbol? Looking back at the grammar, we see that only one rule uses the Add symbol: the rule "Term -> Term Add Term". The next step is to look at the Earley items for this rule. But there is a problem. We don't find any.

Next, we ask ourselves, what is the earliest place the "Term -> Term Add Term" rule should be appearing? The answer is that there should be a prediction of "Term -> Term Add Term" at location 0. So we look at the predictions at location 0.

P0 @0-0 Expression -> . Factor
P2 @0-0 Factor -> . Number
P4 @0-0 Factor -> . Factor Multiply Factor
P5 @0-0 Expression['] -> . Expression

No "Term -> Term Add Term" rule. We are never even predicting a "Term -> Term Add Term" rule. We look back at the grammar, and start from the beginning.

{ lhs     => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term',       rhs => [qw/Factor/] },
{ lhs => 'Factor',     rhs => [qw/Number/] },
{   lhs    => 'Term',
    rhs    => [qw/Term Add Term/],
    action => 'do_add'
},
{   lhs    => 'Factor',
    rhs    => [qw/Factor Multiply Factor/],
    action => 'do_multiply'
},

Our special start symbol is Expression['] and we do see a rule with Expression['] on the left hand side. This rule in turn produces an Expression symbol, and there is a rule with Expression on the left hand side. Expression in turn produces a Factor symbol, and there are two rules with Factor on the left hand side.

But none of these rules ever produce a Term. In fact, however far we follow the productions, no rule ever produces a Term. At this point we see the problem: If we start at the start symbol, and follow the rules of our grammar, we will never get to a Term symbol. Which is exactly what that first warning message was saying.

Now that we know what is wrong, we can reread our grammar, and see that our Expression -> Factor rule is wrong. It should be Expression -> Term. Change that and the problem is fixed.

GRAMMAR REWRITING

Internally, Marpa rewrites Earley items and grammars. show_progress hides most of this from the user, but not all of it. Some aspects of Marpa's rewrites are useful to know. For those interested, Marpa's grammar rewrites are described in complete detail in Marpa::PP::Rewrite.

Special Symbols

Marpa uses a few special symbols internally which it is useful for the user of show_progress to be aware of. To distinguish them, Marpa's internal symbols end in a right square bracket ("]"). No user-defined symbol is allowed to end in a right square bracket.

One of these special symbols is Marpa's special start symbol, which always ends in "[']". Marpa augments all of its grammars with a special start rule, which will have the special start symbol on its left hand side. We saw this above with the Expression['] -> Expression rule.

If the empty, or null, string is a sentence in the language of the grammar, Marpa will add a special empty start rule. The special empty start rule will have its own special null start symbol on its left hand side. The special null start symbol ends in "['][]".

Empty Rules

Marpa removes all of the empty and nulling rules in the original grammar. Internally, Marpa marks symbols as nulling and this produces the same result much more efficiently. Outwardly, the effect is the same, so much so that you might not even notice the absence of the original grammar's nulling and empty rules from the progress reports.

The only nulling rule that Marpa allows is a rule that it creates internally, not one specified by the user. If the grammar accepts the null string as valid input, Marpa creates a nulling start rule.

Marpa's removal of nulling rules is recursive, as it needs to be. Removing rules that are nulling reveals that the left hand side symbol of those rules is also nulling. This in turn can reveal other nulling rules.

Sequences

Marpa allows the user to explicitly specify sequences, rather than write them out in BNF. Marpa is able to optimize explicitly specified sequences. For the actual parsing, Marpa rewrites sequences into BNF.

In the Earley items, the rules will have been translated into BNF, so that is how they appear in show_progress. Marpa's rewritten sequence rules take much the same form that a programmer's rewritten rules would, if she had to do the rewrite by hand.

Here's are the rules of a Marpa grammar, with a sequence:

my $grammar = Marpa::PP::Grammar->new(
    {   start         => 'Document',
        strip         => 0,
        lhs_terminals => 0,
        rules         => [
            { lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
        ],
    }
);

And here is how Marpa translates this sequence:

P1 @0-0 Document -> . Document[Subseq:0:1]
P2 @0-0 Document[Subseq:0:1] -> . Stuff
P3 @0-0 Document[Subseq:0:1] -> . Document[Subseq:0:1] Stuff
P4 @0-0 Document['] -> . Document

APPENDIX: FULL CODE AND OUTPUT FOR THE EXAMPLE

Below are the code, the trace outputs and the progress report for the example used in this document.

Code

my $grammar = Marpa::PP::Grammar->new(
    {   start          => 'Expression',
        actions        => 'My_Actions',
        default_action => 'first_arg',
        strip          => 0,
        rules          => [
            ## This is a deliberate error in the grammar
            ## The next line should be:
            ## { lhs => 'Expression', rhs => [qw/Term/] },
            ## I have changed the Term to 'Factor' which
            ## will cause problems.
            { lhs => 'Expression', rhs => [qw/Factor/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            {   lhs    => 'Term',
                rhs    => [qw/Term Add Term/],
                action => 'do_add'
            },
            {   lhs    => 'Factor',
                rhs    => [qw/Factor Multiply Factor/],
                action => 'do_multiply'
            },
        ],
    }
);
$grammar->precompute();

my @tokens = (
    [ 'Number',   42 ],
    [ 'Multiply', q{*} ],
    [ 'Number',   1 ],
    [ 'Add',      q{+} ],
    [ 'Number',   7 ],
);

sub My_Actions::do_add {
    my ( undef, $t1, undef, $t2 ) = @_;
    return $t1 + $t2;
}

sub My_Actions::do_multiply {
    my ( undef, $t1, undef, $t2 ) = @_;
    return $t1 * $t2;
}

sub My_Actions::first_arg { shift; return shift; }

my $recce = Marpa::PP::Recognizer->new(
    { grammar => $grammar, trace_terminals => 2 } );

my $token_ix = 0;

TOKEN: for my $token_and_value (@tokens) {
    last TOKEN if not defined $recce->read( @{$token_and_value} );
}

$progress_report = $recce->show_progress( 0, -1 );

Trace Output

Inaccessible symbol: Add
Inaccessible symbol: Term
Setting trace_terminals option
Expecting "Expression" at earleme 0
Expecting "Factor" at earleme 0
Expecting "Number" at earleme 0
Accepted "Number" at 0-1
Expecting "Multiply" at 1
Accepted "Multiply" at 1-2
Expecting "Factor" at 2
Expecting "Number" at 2
Accepted "Number" at 2-3
Expecting "Multiply" at 3
Rejected "Add" at 3-4

Note the use of the term "earleme". If you are using the default input model, you can assume that earleme means "location": the earleme and the location will always be exactly the same. Advanced users, using alternative input models, may set it up so that earleme and location are two different things, and in that case the distinction will matter.

Progress Output

P0 @0-0 Expression -> . Factor
P2 @0-0 Factor -> . Number
P4 @0-0 Factor -> . Factor Multiply Factor
P5 @0-0 Expression['] -> . Expression
F0 @0-1 Expression -> Factor .
F2 @0-1 Factor -> Number .
R4:1 @0-1 Factor -> Factor . Multiply Factor
F5 @0-1 Expression['] -> Expression .
P2 @2-2 Factor -> . Number
P4 @2-2 Factor -> . Factor Multiply Factor
R4:2 @0-2 Factor -> Factor Multiply . Factor
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
F5 @0-3 Expression['] -> Expression .

COPYRIGHT AND LICENSE

Copyright 2012 Jeffrey Kegler
This file is part of Marpa::PP.  Marpa::PP is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.

Marpa::PP is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser
General Public License along with Marpa::PP.  If not, see
http://www.gnu.org/licenses/.