=head1 NAME

Marpa::Debug - Marpa Grammar Debugging

=head1 OVERVIEW

This document describes Marpa's
more powerful general-use tracing
and debugging techniques in detail.
It assumes that you have written
a grammar for your Marpa application,
and that something is going wrong.
It is a very good idea to
read
L<the overview document
for tracing
problems|Marpa::Tracing> before
reading this document.

=head1 DESCRIPTION

To read the
L<C<show_progress>|Marpa::Recognizer/"show_progress">
output, it is important to have a
basic idea of what 
Earley items are,
and of what the information in them means.
Everything that the user needs to know
is explained in this section.

=head2 Dotted Rules
 
The idea behind Earley's algorithm is that you can
parse by building a table of rules
and where you are in those rules.
"Where" means two things: location in the rule relative to the rule's
symbols,
and location relative to the parse's input stream.

Let's look at an example of a rule in a context-free grammar.
Here's the rule for assignment from the Perl distribution's C<perly.y>

S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= term ASSIGNOP term>>

In parsing this rule, we can be at any of four possible locations.
One location is at the beginning, before all of the symbols.
The other three locations are immediately after each of the rule's
three symbols.

Within a rule, position relative to the symbols of the rule
is traditionally indicated with a dot.  In fact, the symbol-relative
rule position is very often called the B<dot location>.  Taken as
a pair, a rule and a dot location are called a B<dotted rule>.

Here's our rule with a dot location indicated:

S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= E<middot> term ASSIGNOP term>>
     
The dot location in this dotted rule is at the beginning.
A dot location at the beginning of a dotted rule means
that we have not recognized any symbols in the rule yet.
All we are doing is predicting that the rule will occur.
A dotted rule with the dot before all of its symbols is called a B<prediction>.

Here's another dotted rule:

S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= term E<middot> ASSIGNOP term>>

In this dotted rule,
we are saying we have seen a C<term>, but have not yet recognized
an C<ASSIGNOP>.  (C<ASSIGNOP> is C<perly.y>'s internal name for
the assignment operator.
In plain Perl terms, this is the "C<=>" character.)

There's another special kind of dotted rule, a completion.
A B<completion> is a dotted rule with the dot after all of the symbols.
Here's the completed version of the example we've been using:

S<C<E<nbsp>E<nbsp>E<nbsp>E<nbsp>termbinop ::= term ASSIGNOP term E<middot>>>

A completion indicates that a rule has been fully recognized.

=head2 Earley Items

In dotted rules, Earley's algorithm has all but one piece of the information it needs to track.
The final piece is the second of the two "wheres": where in the input stream.
To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.

A convenient way to think of an B<Earley item> is as a triple, or 3-tuple,
consisting of dotted rule, origin and current location.
The B<origin> is the location in the input stream where the dotted rule starts.
The B<current location> is the location in the input stream which corresponds to the
dot position.

Two noteworthy consequences
follow from way in which origin and current location are defined.
First,
if a dotted rule is a prediction, origin and current location will always be the same.
Second,
the input stream location where a rule ends is not tracked unless the
dotted rule is a completion.
In other cases,
we don't know if the rule will
be completed,
much less at which location.

=head1 THE BASE EXAMPLE

For this example of debugging,
I've taken a very common example
of a grammar
and deliberately introduced a problem.
(All the code and the full trace outputs
for this base example are in the Appendix.)
I've commented out the correct start rule:

=for Marpa::Display
name: Debug Example Part 1
partial: 1
normalize-whitespace: 1

        ## { lhs => 'Expression', rhs => [qw/Term/] },

=for Marpa::Display::End

and replaced it with another start rule,
one which will cause problems:

=for Marpa::Display
name: Debug Example Part 1
partial: 1
normalize-whitespace: 1

        { lhs => 'Expression', rhs => [qw/Factor/] },

=for Marpa::Display::End

In what follows, we'll pretend we don't already know
where the problem is,
and use the Marpa diagnostics and tracing facilities
to "discover" it.

=head1 AN EARLY WARNING

Right off the bat, we get two warning messages:

=for Marpa::Display
name: Debug Example Trace Output
partial: 1
normalize-whitespace: 1

    Inaccessible symbol: Add
    Inaccessible symbol: Term

=for Marpa::Display::End

If we were alert, these would be enough to tell us there is
a serious problem.
"Inaccessible" symbols are symbols which cannot be reached
from the start symbol.
This means that the grammar will never produce them,
and that parses will never find them in the input.

Since C<Add> and C<Term> are both important symbols
in our application,
that should tell us our grammar has a serious problem.
In fact,
these warning messages would often be enough
to point us to the error.
But, in order to look at more of Marpa's
tracing facilities, let's pretend we have not
had our morning coffee,
and that we miss the significance of these warning messages.

=head1 THE trace_terminals OUTPUT

Before looking at Marpa's progress reports,
it usually best to orient yourself by looking
at the output from
L<C<trace_terminals>|Marpa::Recognizer/trace_terminals>.
Typically, you will be interested in what were the last tokens
to be accepted, and perhaps what tokens the recognizer was
looking for when it didn't find what it wanted.
Sometimes that information alone is enough to make it clear
where the problem is.

The full 
L<C<trace_terminals>|Marpa::Recognizer/trace_terminals>
output for this base example is in the Appendix.
We see that the recognizer seems to accept "C<42*1>" but it fails
when it confronts the plus sign ("C<+>").
The last two lines are:

=for Marpa::Display
name: Debug Example Trace Output
partial: 1
normalize-whitespace: 1

    Accepted "Number" at 2-3
    Expecting "Multiply" at 3

=for Marpa::Display::End

A note in passing: Marpa shows the location of the tokens it accepts
as a range of locations.  For C<Number>, the range is "C<2-3>", indicating that
earleme 2 is the start location and earleme 3 is the end location.
That level of detail will seem like overkill in ordinary applications,
where every token has length 1.
But Marpa allows other input models, and in those models the information
about start and end location of the token is important.

Returning to the problem at hand:
debugging the grammar.
We notice that at earleme 3 we are expecting a C<Multiply> operator,
but not an C<Add> operator.
That should strike us as strange,
and send us back to the grammar.
But for the sake of our example we will
assume that we are slow on the uptake today,
and that this does not clue us in.
We move on.

=head1 THE show_progress OUTPUT

The last tool that you should ever need for debugging
Marpa grammars are its progress reports.
The progress reports show
Earley items in their original form.
Marpa does not actually use Earley's original items directly.
But they underlie Marpa's algorithm
and Marpa reconstructs them for
L<C<show_progress>|Marpa::Recognizer/"show_progress">.

In the Appendix, progress reports for the entire parse are shown.
Our base example in this document is a very small one, so that's a
reasonable thing to do in this case.
If a parse is at all large, you'll often need to be selective.

Of most interest is the progress report for the "current"
Earley set.
In our example that's earleme 3.
You can find out the current earleme from
the return value of either the
L<C<tokens> method|Marpa::Recognizer/"tokens">
or the
L<C<status> method|Marpa::Recognizer/"status">.
But it may be easier to simply call
L<C<show_progress>|Marpa::Recognizer/"show_progress">
without arguments.
By default,
L<C<show_progress>|Marpa::Recognizer/"show_progress">
prints out only the progress reports for the current earleme.

Here are the progress reports for the current earleme,
earleme 3,
from our example.

=for Marpa::Display
name: Debug Example Progress Report
partial: 1
normalize-whitespace: 1

    BUILDING @0-3 Factor -> Factor . Multiply Factor
    BUILDING @2-3 Factor -> Factor . Multiply Factor
    COMPLETED @0-3 0: Expression -> Factor
    COMPLETED @2-3 2: Factor -> Number
    COMPLETED @0-3 4: Factor -> Factor Multiply Factor
    COMPLETED @0-3 5: Expression['] -> Expression

=for Marpa::Display::End

As an aside, we see that
there is a completed rule for C<Expression[']>,
which is Marpa's special start symbol.
That indicates that if our input ended at earleme 3, it would be a valid sentence
in the language of our grammar.

More relevant for our purposes,
we see that we have completed rules for C<Expression>, and C<Factor>,
as expected.
We also see two Earley items that show
that we are in the process of building another C<Factor>,
and that it is expecting a C<Multiply> symbol.
This is not the rule we want, but it explains why the C<trace_terminals>
output showed that the recognizer was expecting a 
C<Multiply> symbol.

What we want to know is,
why is the recognizer B<not> expecting an C<Add> symbol?
Looking back at the grammar, we see that only one rule uses
the C<Add> symbol: the rule "C<Term ::= Term Add Term>".
The next step is to look at the Earley items for this rule.
But there is a problem.
We don't find any.

Next, we ask ourselves, what is the earliest place the 
"C<Term ::= Term Add Term>" rule should be appearing?
The answer is that 
there should be a prediction of "C<Term ::= Term Add Term>" at earleme 0.
So we look at the predictions at earleme 0.

=for Marpa::Display
name: Debug Example Progress Report
partial: 1
normalize-whitespace: 1

    PREDICTING @0 0: Expression -> Factor
    PREDICTING @0 2: Factor -> Number
    PREDICTING @0 4: Factor -> Factor Multiply Factor
    PREDICTING @0 5: Expression['] -> Expression

=for Marpa::Display::End

No "C<Term ::= Term Add Term>" rule.
We are not even predicting a
"C<Term ::= Term Add Term>" rule.
We look back at the grammar, and start from
the beginning.

=for Marpa::Display:
name: Debug Example Part 1
partial: 1
normalize-whitespace: 1

    { lhs     => 'Expression', rhs => [qw/Factor/] },
    { lhs => 'Term',       rhs => [qw/Factor/] },
    { lhs => 'Factor',     rhs => [qw/Number/] },
    {   lhs    => 'Term',
        rhs    => [qw/Term Add Term/],
        action => 'do_add'
    },
    {   lhs    => 'Factor',
        rhs    => [qw/Factor Multiply Factor/],
        action => 'do_multiply'
    },

=for Marpa::Display::End

Our special start symbol is C<Expression[']> and
we do see a prediction for the rule with
C<Expression[']>
on the left hand side.
It in turn predicts an C<Expression> symbol,
and we do see a prediction for the rule with C<Expression> on the left
hand side.
C<Expression> in turn predicts a C<Factor> symbol,
and we do see a prediction for both of the rules
with C<Factor> on the left
hand side.

But we don't see anything predicting C<Term>.
And then it hits us.
If we start at the start symbol,
and follow the rules of our grammar, we will never get to a C<Term>
symbol.
That's what the warning message was trying to tell us.

This makes us reread our grammar,
and see that out C<Expression ::= Factor> rule is wrong.
It should be C<Expression ::= Term>.
We have found the source of our problem.

=head1 COMPLICATIONS

Internally, Marpa rewrites Earley items and grammars.
L<C<show_progress>|Marpa::Recognizer/"show_progress">
hides most of this from the user.
But some aspects of Marpa's rewrites are relevant
and useful to know.

=head2 Special Symbols

Marpa uses 
a few special symbols internally which
it is useful
for the the user of
L<C<show_progress>|Marpa::Recognizer/"show_progress">
to be aware of.
To distinguish them,
Marpa's internal symbols end in a right square bracket ("C<]>").
No user-defined symbol is allowed to
end in a right square bracket.

One of these special symbols is Marpa's special start symbol,
which always ends in "C<[']>".
Marpa augments all of its grammars with
a special start rule, which will have the special start symbol
on its left hand side.
We saw this above with the C<Expression['] ::= Expression> rule.

If the empty, or null, string is a sentence in the language of
the grammar,
Marpa will add a special empty start rule.
The special empty start rule
will have its own special null start symbol
on its left hand side.
The special null start symbol ends in "C<['][]>".

=head2 Empty Rules


If you have empty rules in your grammar,
you may notice that they never
appear in the traces.
That is because Marpa allows at most one empty rule in any
grammar, and that rule is the one Marpa adds itself:
the special empty start rule.
Marpa removes all empty and nulling rules in the original grammar.

Marpa's removal of nulling rules is recursive,
at it needs to be.
Removing rules that are nulling
reveals that the left hand side
symbol of those rules is also nulling.
This in turn can reveal
other nulling rules.

=head2 Sequences

Marpa allows the user to explicitly specify sequences,
rather than write them out in BNF.
Marpa is able to optimize explicitly specified sequences.
For the actual parsing,
Marpa rewrites sequences into BNF.

In the Earley items,
the rules will have been translated into BNF,
so that is how they appear in
L<C<show_progress>|Marpa::Recognizer/"show_progress">.
Marpa's rewritten sequence rules
take much the same form
that a programmer's rewritten rules would,
if she had to do the rewrite by hand.

Here's are the rules of a Marpa grammar,
with a sequence:

=for Marpa::Display:
name: Debug Sequence Example
perltidy: '-dcsc -sil=0'

    my $grammar = Marpa::Grammar->new(
        {   start         => 'Document',
            strip         => 0,
            lhs_terminals => 0,
            rules         => [
                { lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
            ],
        }
    );

=for Marpa::Display::End

And here is how Marpa translates this sequence:

=for Marpa::Display
name: Debug Sequence Example Progress Report
partial: 1
normalize-whitespace: 1

    PREDICTING @0 0: Document -> Stuff[Seq:1-*]
    PREDICTING @0 1: Stuff[Seq:1-*] -> Stuff
    PREDICTING @0 2: Stuff[Seq:1-*] -> Stuff[Seq:1-*] Stuff

=for Marpa::Display::End

=head2 Leo Items

In the original Earley's algorithm,
left recursion was very efficient,
but right recursion required a completion Earley item
for every step of the recursion in every Earley set
where the right recursion might end.
That made right recursion take O(I<n>**2) time
and space, where I<n> was the length of the right
recursion.

Joop Leo figured out a way to get around this.
Leo noted that most of the completions were trivial,
in the sense
that their only purpose was to create the topmost
completion in a chain.
He figured out how to get directly
to the topmost completion
and let that topmost completion
stand in for the chain of trivial completions
that produced it.
This reduced the time and space
required to O(I<n>).
Marpa uses Leo's methods.

The topmost completions are labeled 
"C<(Leo)>" as we'll see below.
These topmost completions
stand in for
zero or more trivial completions
in the same Earley set.

Here is a example.  First, the right recursive grammar:

=for Marpa::Display:
name: Debug Leo Example
perltidy: '-dcsc -sil=0'

    my $grammar = Marpa::Grammar->new(
        {   start         => 'S',
            strip         => 0,
            lhs_terminals => 0,
            rules         => [
                { lhs => 'S', rhs => [qw/Top_sequence/] },
                { lhs => 'Top_sequence', rhs => [qw/Top Top_sequence/] },
                { lhs => 'Top_sequence', rhs => [qw/Top/] },
                { lhs => 'Top', rhs => [ qw/Upper_Middle/ ] },
                { lhs => 'Upper_Middle', rhs => [ qw/Lower_Middle/ ] },
                { lhs => 'Lower_Middle', rhs => [ qw/Bottom/ ] },
                { lhs => 'Bottom', rhs => [ qw/T/ ] },
            ],
        }
    );

=for Marpa::Display::End

And here is the progress report at earleme 20:

=for Marpa::Display
name: Debug Leo Example Progress Report
partial: 1
normalize-whitespace: 1

    COMPLETED @0-20 0: S -> Top_sequence
    COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence
    COMPLETED @19-20 2: Top_sequence -> Top
    COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle
    COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle
    COMPLETED @19-20 6: Bottom -> T
    COMPLETED @0-20 7: S['] -> S

=for Marpa::Display::End

=head3 The First Leo Topmost Completion

In these reports we see three Leo topmost completions.
Let's look at the first one:

=for Marpa::Display
name: Debug Leo Example Progress Report
partial: 1
normalize-whitespace: 1

    COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence

=for Marpa::Display::End

This is the top of a chain of 19 Earley items,
all for the same rule:
"C<Top_sequence ::= Top Top_sequence>".
This topmost one spans the earleme range from 0 to 20 ("C<@0-20>").
The other 18 in the chain are trivial completions,
completions whose only purpose is to produce the topmost completion.
The next one down in the chain (not shown in the 
L<C<show_progress>|Marpa::Recognizer/"show_progress">
output) is identical, except for the origin,
and so would look like this:

=for Marpa::Display
ignore: 1

    COMPLETED @1-20 1: Top_sequence -> Top Top_sequence

=for Marpa::Display::End

Similarly,
the remaining 17 Earley items
would differ from the topmost one
only in their origin,
each one starting one earleme later than the one above it.
The bottom of the chain is
shown in the 
L<C<show_progress>|Marpa::Recognizer/"show_progress">
output and is
a completion of a different rule:

=for Marpa::Display
name: Debug Leo Example Progress Report
partial: 1
normalize-whitespace: 1

    COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle

=for Marpa::Display::End

This example illustrates the improvement introduced by
the Leo method.
In this example, without the Leo methods,
there would have been 20 items
in the chain at earleme 20.
At earleme 100 there would be 100 items,
and so forth --
at earleme I<n> there would be I<n> items, all but
two of them trivial.

By earleme 100, there would have been a total (in all
Earley sets up to that point) of over 4800 trivial Earley
items.
By earleme 1000, there would have been over 498,000,
which would have made parsing with
this grammar
impractical for many applications.
With the Leo methods, the number of Earley items in
each set stays below a small constant.

=head3 The Second Leo Topmost Completion

The second Leo topmost completion was the bottom of the chain
for the first topmost completion.
It illustrates something to watch for:

=for Marpa::Display
name: Debug Leo Example Progress Report
partial: 1
normalize-whitespace: 1

    COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle

=for Marpa::Display::End

The chain of completions which the topmost item replaces
can contain zero trivial completions.
In other words, sometimes no completions are actually eliminated.
That is the case here.
This topmost completion stands in only for itself.

=head3 The Third Leo Topmost Completion

Our third Leo topmost completion illustrates a final point to watch for:

=for Marpa::Display
name: Debug Leo Example Progress Report
partial: 1
normalize-whitespace: 1

    COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle

=for Marpa::Display::End

Unlike the second Leo completion,
this one does stand in for an trivial
completion, one that was eliminated.
If this trivial completion had appeared in the
L<C<show_progress>|Marpa::Recognizer/"show_progress">
output, it would have looked like this:

=for Marpa::Display
ignore: 1

    COMPLETED @19-20 4: Lower_Middle -> Bottom

=for Marpa::Display::End

For the first Leo topmost completion, all the trivial completions
that were eliminated were for the same rule as the topmost completion.
That is B<not> always the case.
Leo's methods catches and optimizes right recursion, even when it is
indirect, and passes through a series of different rules.
In this example,
the trivial completion which was eliminated is not for the same rule
as the rule in the Leo topmost completion
which stands in for it.

=head3 Futures

In future versions of Marpa,
L<C<show_progress>|Marpa::Recognizer/"show_progress">
may say more about with the missing trivial completions.
It is not yet clear
how much information, and what kind of information,
to provide.
Simply printing out all the eliminated, trivial completions
does not seem to
really be an option.
When a right recursion is thousands of steps long,
which it can easily be,
there will often be millions of trivial completions.
Nonetheless, more can be done in letting the user know
about eliminated, trivial completions.

=head1 APPENDIX: FULL TRACES AND CODE FOR THE BASE EXAMPLE

Below are the code, the trace outputs
and the progress report
for the base example used in this
document.

=head2 Code for the example

=for Marpa::Display:
name: Debug Example Part 1
perltidy: '-dcsc -sil=0'

    my $grammar = Marpa::Grammar->new(
        {   start          => 'Expression',
            actions        => 'My_Actions',
            default_action => 'first_arg',
            strip          => 0,
            rules          => [
                ## This is a deliberate error in the grammar
                ## The next line should be:
                ## { lhs => 'Expression', rhs => [qw/Term/] },
                ## I have changed the Term to 'Factor' which
                ## will cause problems.
                { lhs => 'Expression', rhs => [qw/Factor/] },
                { lhs => 'Term',       rhs => [qw/Factor/] },
                { lhs => 'Factor',     rhs => [qw/Number/] },
                {   lhs    => 'Term',
                    rhs    => [qw/Term Add Term/],
                    action => 'do_add'
                },
                {   lhs    => 'Factor',
                    rhs    => [qw/Factor Multiply Factor/],
                    action => 'do_multiply'
                },
            ],
        }
    );

=for Marpa::Display::End

=for Marpa::Display:
name: Debug Example Part 2
perltidy: '-dcsc -sil=0'

    $grammar->precompute();

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', q{*} ],
        [ 'Number', 1 ],
        [ 'Add', q{+} ],
        [ 'Number', 7 ],
    );

    sub My_Actions::do_add {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 + $t2;
    }

    sub My_Actions::do_multiply {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 * $t2;
    }

    sub My_Actions::first_arg { shift; return shift; }

    my $recce = Marpa::Recognizer->new(
        { grammar => $grammar, trace_terminals => 2, mode => 'stream' } );

    my $token_ix = 0;

    my ( $current_earleme, $expected_tokens ) =
        $recce->tokens( \@tokens, \$token_ix );

    if ( $token_ix <= $#tokens ) {
        $progress_report = $recce->show_progress( 0, $current_earleme );
    }

=for Marpa::Display::End

=head2 Trace Output

=for Marpa::Display
name: Debug Example Trace Output
remove-display-indent: 1
remove-blank-last-line: 1

    Inaccessible symbol: Add
    Inaccessible symbol: Term
    Setting trace_terminals option
    Expecting "Number" at earleme 0
    Expecting "Expression" at earleme 0
    Expecting "Factor" at earleme 0
    Accepted "Number" at 0-1
    Expecting "Multiply" at 1
    Accepted "Multiply" at 1-2
    Expecting "Number" at 2
    Expecting "Factor" at 2
    Accepted "Number" at 2-3
    Expecting "Multiply" at 3
    Rejected "Add" at 3

=for Marpa::Display::End

=head2 Progress Output

=for Marpa::Display
name: Debug Example Progress Report
remove-display-indent: 1
remove-blank-last-line: 1

    PREDICTING @0 0: Expression -> Factor
    PREDICTING @0 2: Factor -> Number
    PREDICTING @0 4: Factor -> Factor Multiply Factor
    PREDICTING @0 5: Expression['] -> Expression
    BUILDING @0-1 Factor -> Factor . Multiply Factor
    COMPLETED @0-1 0: Expression -> Factor
    COMPLETED @0-1 2: Factor -> Number
    COMPLETED @0-1 5: Expression['] -> Expression
    PREDICTING @2 2: Factor -> Number
    PREDICTING @2 4: Factor -> Factor Multiply Factor
    BUILDING @0-2 Factor -> Factor Multiply . Factor
    BUILDING @0-3 Factor -> Factor . Multiply Factor
    BUILDING @2-3 Factor -> Factor . Multiply Factor
    COMPLETED @0-3 0: Expression -> Factor
    COMPLETED @2-3 2: Factor -> Number
    COMPLETED @0-3 4: Factor -> Factor Multiply Factor
    COMPLETED @0-3 5: Expression['] -> Expression

=for Marpa::Display::End

=head1 LICENSE AND COPYRIGHT

Copyright 2007-2010 Jeffrey Kegler, all rights reserved.
Marpa is free software under the Perl license.
For details see the LICENSE file in the Marpa distribution.

=cut

# Local Variables:
#   mode: cperl
#   cperl-indent-level: 4
#   fill-column: 100
# End:
# vim: expandtab shiftwidth=4: