NAME

Marpa::PP::Recognizer - Marpa Recognizer Objects

SYNOPSIS

my $recce = Marpa::Recognizer->new( { grammar => $grammar } );
$recce->read( 'Number', 42 );
$recce->read( 'Multiply', );
$recce->read( 'Number', 1 );
$recce->read( 'Add', );
$recce->read( 'Number', 7 );

DESCRIPTION

To create a recognizer object, use the new method.

To read input, use the read method.

To evaluate a parse result, based on the input, use the value method.

Token Streams

By default, Marpa uses the token-stream model of input. The token-stream model is standard -- so standard that most documents about parsing do not bother to describe it. In the token-stream model, each read adds a token at the current location, then advances the current location by one. Assuming that the location before any input is numbered 0, as it is in Marpa, a parse of N tokens will fill the locations from 1 to N.

In Marpa, locations in the input stream are also called earlemes. The current earleme means exactly the same thing as the current location.
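The location arithmetic of the token-stream model can be sketched in plain Perl. The counter and hash below are stand-ins written for illustration and are not Marpa code; they only show that N reads starting from location 0 fill locations 1 to N.

```perl
use strict;
use warnings;

# Stand-in bookkeeping for illustration only; not part of Marpa.
my $current_earleme = 0;    # the location before any input is 0
my %token_at;               # location => name of the token ending there

for my $token_name (qw(Number Multiply Number Add Number)) {

    # Each read advances the current location by one. The token is
    # recorded under the location it fills, which is the new location.
    $current_earleme++;
    $token_at{$current_earleme} = $token_name;
}

printf "%d tokens read, current earleme is %d\n",
    scalar keys %token_at, $current_earleme;
```

After the five reads, the current earleme is 5 and the filled locations are 1 through 5.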

This document will describe only the token-stream model of input. Marpa allows other models of the input, but their use requires special method calls, which are described in the document on alternative input models. Any application that restricts itself to reading input using the methods described in this document will be using the default, token-stream, model of input.

CONSTRUCTOR

new

my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

The new method creates a recognizer object. It either returns the new recognizer object or throws an exception.

The arguments to the new method are references to hashes of named arguments. In each key/value pair of these hashes, the key is the argument name, and the hash value is the value of the argument. The named arguments are described below.

ACCESSORS

current_earleme

my $current_earleme = $recce->current_earleme();

Returns the current parse location, also known as the current earleme. Not often needed.

terminals_expected

my $terminals_expected = $recce->terminals_expected();

Returns a reference to a list of strings, where the strings are the names of the terminals acceptable at the current earleme. In the default input model, the presence of a terminal in this list means that terminal will be acceptable in the next read method call. This is highly useful for Ruby Slippers parsing.

check_terminal

my $is_document_a_terminal = $recce->check_terminal('Document');

Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not often needed.

MUTATORS

set

$recce->set( { max_parses => 10, } );

The set method's arguments are references to hashes of named arguments. The set method can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.

read

$recce->read( 'Number', 42 );
$recce->read( 'Multiply', );
$recce->read( 'Number', 1 );
$recce->read( 'Add', );
$recce->read( 'Number', 7 );

The read method reads one token at the current parse location (or current earleme). It then advances the current earleme by 1.

read takes two arguments: a token name and a token value. The token name is required. It must be the name of a valid terminal symbol. The token value is optional. It defaults to a Perl undef when not specified. For details about terminal symbols, see "Terminals" in Marpa::PP::Grammar.

The parser may accept or reject the token. If the parser accepted the token, the read method returns the number of tokens which are acceptable at the new current earleme. This number may be helpful in guiding Ruby Slippers parsing.

read may return zero, which means that no tokens will be acceptable at the next earleme. This in turn means that the next read call will fail. In the default input model, where the read method is the only means of inputting tokens, a zero return from a read method means that the parse is exhausted -- that no more input is possible. More details on "exhaustion" are in a section below.

Marpa may reject a token because it is not one of those acceptable at the current earleme. When this happens, read returns a Perl undef. A rejected token need not end parsing -- it is perfectly possible to retry the read call with another token. This is, in fact, an important technique in Ruby Slippers parsing. For details, see the section on Ruby Slippers parsing.

For other failures, including an attempt to read a token into an exhausted parser, Marpa throws an exception.
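The three non-exceptional outcomes of read can be told apart by testing for definedness before testing for zero. The following is a minimal sketch; classify_read_result is a hypothetical helper, not a Marpa method, and the labels it returns are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa: label the return value of read.
sub classify_read_result {
    my ($result) = @_;

    # undef: the token was rejected; another token may be tried.
    return 'rejected' if not defined $result;

    # 0: the token was accepted, but no further token is acceptable.
    return 'exhausted' if $result == 0;

    # Positive: the token was accepted and more input is possible.
    return 'accepted';
}
```

An application might call it as classify_read_result( $recce->read( 'Number', 42 ) ). The order of the tests matters: a plain truth test would treat the zero return, which signals exhaustion, the same as a rejected token.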

value

The value mutator evaluates and returns a parse result. It is described in its own section.

TRACE ACCESSORS

show_earley_sets

print $recce->show_earley_sets()
    or die "print failed: $ERRNO";

An advanced, internals-oriented tracing method, which will not be of interest to most users. Most users will want to use the show_progress method instead.

show_earley_sets returns a multi-line string listing every Earley item in every Earley set. show_earley_sets requires knowledge of Marpa internals to interpret.

show_progress

print $recce->show_progress()
    or die "print failed: $ERRNO";

Returns a string describing the progress of the parse. With no arguments, the string contains reports for the current location. With a non-negative argument N, the string contains reports for location N.

With two numeric arguments, N and M, the arguments are interpreted as a range of locations and the returned string contains reports for all locations in the range. The first argument, N, must be a non-negative integer, and is always the number of the earleme which begins the range. If both arguments are non-negative integers, the range is from earleme N to earleme M. If the second argument is a negative integer, -M, the end of the range is the Mth location from the furthest earleme. For example, if 42 were the furthest earleme, -1 would be earleme 42 and -2 would be earleme 41. The method call $recce->show_progress(0, -1) returns progress reports for the entire parse.
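The handling of a negative second argument can be sketched in plain Perl. resolve_progress_range is a hypothetical helper, not part of Marpa, and the furthest earleme is passed in by hand; the arithmetic simply follows the example above, in which -1 names the furthest earleme itself.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa: turn the two arguments of
# show_progress into concrete start and end earlemes.
sub resolve_progress_range {
    my ( $start, $end, $furthest_earleme ) = @_;

    # A negative end counts back from the furthest earleme:
    # -1 is the furthest earleme, -2 is the one before it.
    $end = $furthest_earleme + 1 + $end if $end < 0;
    return ( $start, $end );
}

my @whole_parse = resolve_progress_range( 0, -1, 42 );    # ( 0, 42 )
my @up_to_41    = resolve_progress_range( 0, -2, 42 );    # ( 0, 41 )
my @explicit    = resolve_progress_range( 3, 5,  42 );    # ( 3, 5 )
```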

show_progress is Marpa's most powerful tool for debugging application grammars. It can also be used to track the progress of a parse or to investigate how a parse works. A much fuller description, with an example, is in the document on debugging Marpa grammars.

NAMED ARGUMENTS

grammar

The grammar named argument is required. Its value must be a precomputed Marpa grammar object.

ranking_method

The value must be a string: either "none" or "constant". When the value is "none", Marpa returns the parse results in arbitrary order. When the value is "constant", Marpa allows the user to specify ranking actions which assign values to rules and tokens. These values will control the order in which parse results are returned by the value method.

The default is for parse results to be returned in arbitrary order. For details, see the section on parse order in the semantics document.

too_many_earley_items

The too_many_earley_items argument is optional. If specified, it sets the Earley item warning threshold. If an Earley set becomes larger than the Earley item warning threshold, a warning is printed to the trace file handle.

Marpa parses from any BNF, and can handle grammars and inputs which produce large Earley sets. But parsing that involves large Earley sets can be slow. Large Earley sets are something most applications can, and will wish to, avoid.

By default, Marpa calculates an Earley item warning threshold based on the size of the grammar. The default threshold will never be less than 100. If the Earley item warning threshold is set to 0, warnings about large Earley sets are turned off.

trace_file_handle

The value is a file handle. Traces and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.

trace_terminals

Very handy in debugging, and often useful even when the problem is not in the lexing. The value is a trace level. When the trace level is 0, tracing of terminals is off. This is the default.

At a trace level of 1 or higher, Marpa produces a trace message for each terminal as it is accepted or rejected by the recognizer. At a trace level of 2 or higher, the trace messages include, for every location, a list of the terminals expected. In practical grammars, output from trace level 2 can be voluminous.

warnings

The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.

RUBY SLIPPERS PARSING

$recce =
    Marpa::Recognizer->new( { grammar => $grammar } );

my @tokens = (
    [ 'Number', 42 ],
    ['Multiply'], [ 'Number', 1 ],
    ['Add'],      [ 'Number', 7 ],
);

TOKEN: for ( my $token_ix = 0; $token_ix <= $#tokens; $token_ix++ ) {
    defined $recce->read( @{ $tokens[$token_ix] } )
        or fix_things( $recce, \@tokens )
        or die q{Don't know how to fix things};
}

Marpa is able to tell the application which symbols are acceptable as tokens at the next location in the parse. The terminals_expected method returns the list of tokens that will be accepted by the next read. The application can use this information to change the input "on the fly" so that it is acceptable to the parser.

An application which is doubtful about the acceptability of a token does not have to check the output of terminals_expected. It is a recoverable error if a token is rejected because it is not acceptable. If an application is not sure whether a token is acceptable or not, the application can simply attempt to input the dubious token using the read method. If the token is rejected, the read method call will return a Perl undef. At that point, the application can retry the read with a different token.
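The retry technique can be tried without a grammar by using a stand-in recognizer. In the sketch below, Toy::Recce is a hypothetical class written for illustration, not Marpa: it accepts only the fixed sequence Number, Op, Number and mimics just the accept/reject part of the read and terminals_expected contract. The repair strategy, supplying the first expected terminal as a new token, is likewise an assumption chosen to keep the example short.

```perl
use strict;
use warnings;

# Stand-in recognizer for illustration only; NOT Marpa.
package Toy::Recce;

my @EXPECTED_SEQUENCE = qw(Number Op Number);

sub new { my ($class) = @_; return bless { pos => 0 }, $class; }

# Reference to a list of the terminal names acceptable at this point.
sub terminals_expected {
    my ($self) = @_;
    return [ grep { defined } $EXPECTED_SEQUENCE[ $self->{pos} ] ];
}

# Returns undef on rejection, otherwise the number of terminals
# acceptable at the new location (0 once the sequence is complete).
sub read {
    my ( $self, $name, $value ) = @_;
    my $wanted = $EXPECTED_SEQUENCE[ $self->{pos} ];
    return undef if !defined $wanted || $name ne $wanted;
    $self->{pos}++;
    return scalar @{ $self->terminals_expected() };
}

package main;

my $recce  = Toy::Recce->new();
my @tokens = ( [ 'Number', 42 ], [ 'Number', 7 ] );    # operator is missing
my @virtual_tokens;

TOKEN: for my $token (@tokens) {
    next TOKEN if defined $recce->read( @{$token} );

    # The token was rejected: supply an expected token, then retry.
    my ($wanted) = @{ $recce->terminals_expected() };
    defined $wanted or die q{Don't know how to fix things};
    defined $recce->read($wanted)
        or die qq{Virtual token "$wanted" was rejected};
    push @virtual_tokens, $wanted;
    defined $recce->read( @{$token} )
        or die q{Token still rejected after the fix};
}
```

Here the second Number is rejected, an Op token is supplied in its place, and the retry of the original token then succeeds. A real application would choose the replacement token with more care than taking the first expected terminal.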

An Example

Marpa's HTML parser, Marpa::HTML, is an example of how Ruby Slippers parsing can help with a non-trivial, real-life application. When a token is rejected in Marpa::HTML, it changes the input to match the parser's expectations by

  • Modifying existing tokens, and

  • Creating new tokens.

The second technique, the creation of new, "virtual", tokens is used by Marpa::HTML to deal with omitted start and end tags. The actual HTML grammar that Marpa::HTML uses takes an oversimplified view of the HTML -- it assumes, contrary to fact, that start and end tags are always present.

Ruby Slippers parsing is used to make the grammar's over-simplistic view of the world come true for it. Whenever a token is rejected, Marpa::HTML looks at the expected tokens list. If it sees that a start or end tag is expected, Marpa::HTML creates a token for it -- a completely new "virtual" token that gives the parser exactly what it expects. Marpa::HTML then resumes input at the point in the original input stream where it left off.

EXHAUSTION, SUCCESS AND FAILURE

A parse is exhausted when it will accept no more input. In the default input model, the read method indicates this by returning zero.

While a failed parse often becomes exhausted, an exhausted parse is by no means necessarily a failed parse. Many common practical grammars succeed at exactly the point that they become exhausted. Grammars are often written so that once they "find what they are looking for", no other input is acceptable.

EVALUATION

my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';

The value method call evaluates and returns a parse result. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of the next parse result, or undef if there are no more parse results.
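Because value returns a reference, a parse result that is itself undef can be distinguished from the end of the parse results. The following sketch collects every result by calling value until it returns undef. all_parse_values is a hypothetical helper, and Toy::Results is a stand-in object used only so that the example runs without a grammar; neither is part of Marpa.

```perl
use strict;
use warnings;

# Hypothetical helper: gather all parse results from an object whose
# value method returns one reference per result, then undef.
sub all_parse_values {
    my ($recce) = @_;
    my @values;
    while ( defined( my $value_ref = $recce->value() ) ) {
        push @values, ${$value_ref};
    }
    return @values;
}

# Stand-in for a recognizer, for illustration only; NOT Marpa.
package Toy::Results;

sub new {
    my ( $class, @result_refs ) = @_;
    return bless { queue => [@result_refs] }, $class;
}

# Hands out the queued references in order, then undef.
sub value { my ($self) = @_; return shift @{ $self->{queue} }; }

package main;

# Two results: the number 49, and a result whose value is undef.
my @values = all_parse_values( Toy::Results->new( \49, \undef ) );
```

Both results are collected, including the one whose value is undef. With an ambiguous grammar, the max_parses named argument described below can be used to bound such a loop.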

These are the named arguments available to the value method call:

end

The value method's end named argument specifies the parse end location. The default is for the parse to end where the input did, so that the parse returned is of the entire input.

closures

The value method's closures named argument is a reference to a hash. In each key/value pair of this hash, the key must be an action name. The hash value must be a CODE ref.

When an action name is a key in the closures named argument, the usual action resolution mechanism of the semantics is bypassed. One common use of the closures named argument is to allow anonymous subroutines to be semantic actions. For more details, see the document on semantics.

max_parses

The value must be an integer. If it is greater than zero, the evaluator will return no more than that number of parse results. If it is zero, there will be no limit on the number of parse results returned. The default is for there to be no limit.

Marpa allows extremely ambiguous grammars. max_parses can be used if the user wants to see only the first few parse results of an ambiguous parse. max_parses is also useful to limit CPU usage and output length when testing and debugging.

trace_actions

The value method's trace_actions named argument is a boolean. If the boolean value is true, Marpa prints tracing information as it resolves action names to Perl closures. A boolean value of false turns tracing off, which is the default. Traces are written to the trace file handle.

trace_values

The value method's trace_values named argument is a numeric trace level. If the numeric trace level is 1, Marpa prints tracing information as values are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle.

COPYRIGHT AND LICENSE

Copyright 2011 Jeffrey Kegler
This file is part of Marpa::PP.  Marpa::PP is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.

Marpa::PP is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser
General Public License along with Marpa::PP.  If not, see
http://www.gnu.org/licenses/.