NAME
Marpa::R2::Recognizer - Marpa recognizers
Synopsis
my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
$recce->read( 'Number', 42 );
$recce->read('Multiply');
$recce->read( 'Number', 1 );
$recce->read('Add');
$recce->read( 'Number', 7 );
Description
To create a recognizer object, use the new method.
To read input, use the read method.
To evaluate a parse tree, based on the input, use the value method.
Token streams
By default, Marpa uses the token-stream model of input. The token-stream model is standard -- so standard that most documents about parsing do not bother to describe it. In the token-stream model, each read adds a token at the current location, then advances the current location by one. The location before any input is numbered 0, and if N tokens are parsed, they fill the locations from 1 to N.
This document will describe only the token-stream model of input. Marpa allows other models of the input, but their use requires special method calls, which are described in the document on alternative input models.
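The location numbering can be checked with a short sketch. It assumes the calculator grammar implied by the synopsis (this document does not show that grammar, so the rules here are an illustration) and uses only the read and latest_earley_set methods described in this document.

```perl
use strict;
use warnings;
use Marpa::R2;

# Assumed grammar: a small calculator matching the tokens in the synopsis.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Expression',
        rules => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            { lhs => 'Term',       rhs => [qw/Term Add Term/] },
            { lhs => 'Factor',     rhs => [qw/Factor Multiply Factor/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# Location before any input has been read.
my $location_before_input = $recce->latest_earley_set();

my @tokens = (
    [ 'Number', 42 ], ['Multiply'], [ 'Number', 1 ], ['Add'], [ 'Number', 7 ],
);

# Record the latest location after each token is read.
my @locations;
for my $token (@tokens) {
    $recce->read( @{$token} );
    push @locations, $recce->latest_earley_set();
}

print "Location before input: $location_before_input\n";
print "Locations after each read: @locations\n";
```

Each successful read should move the latest location forward by exactly one, so five tokens end at location 5.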
Constructor
new
my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
The new method creates a recognizer object. It either returns the new recognizer object or throws an exception.
The arguments to the new method are references to hashes of named arguments. In each key/value pair of these hashes, the key is the argument name, and the hash value is the value of the argument. The named arguments are described below.
Accessors
check_terminal
my $is_symbol_a_terminal = $recce->check_terminal('Document');
Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not often needed.
latest_earley_set
my $latest_earley_set = $recce->latest_earley_set();
Returns the location of the latest (in other words, the most recent) Earley set. In the places where it is most often needed, the latest Earley set is the default, and there is usually no need to request the explicit value of the latest Earley set.
progress
Given the location (Earley set ID) as its argument, returns an array that describes the parse progress at that location. Details on progress reports can be found in their own document.
terminals_expected
my $terminals_expected = $recce->terminals_expected();
Returns a reference to a list of strings, where the strings are the names of the terminals acceptable at the current location. In the default input model, the presence of a terminal in this list means that terminal will be acceptable in the next read method call. This is highly useful for Ruby Slippers parsing.
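As a minimal sketch, the following queries terminals_expected before and after one token. The calculator grammar is an assumption made for illustration; it is not shown in this document. The names are sorted because no particular order is promised for the returned list.

```perl
use strict;
use warnings;
use Marpa::R2;

# Assumed grammar: a small calculator matching the tokens in the synopsis.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Expression',
        rules => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            { lhs => 'Term',       rhs => [qw/Term Add Term/] },
            { lhs => 'Factor',     rhs => [qw/Factor Multiply Factor/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# terminals_expected returns an array reference; dereference and sort it.
my @at_start = sort @{ $recce->terminals_expected() };

$recce->read( 'Number', 42 );
my @after_number = sort @{ $recce->terminals_expected() };

print "Expected at location 0: @at_start\n";
print "Expected after a Number: @after_number\n";
```

With this grammar, only a Number can start an expression, while an operator is what can follow a Number.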
exhausted
$recce->exhausted() and die 'Recognizer exhausted';
The exhausted method returns a Perl true if parsing in a recognizer is exhausted, and a Perl false otherwise. Parsing is exhausted when the recognizer will not accept any further input. By default, a recognizer event occurs if parsing is exhausted. An attempt to read input into an exhausted parser causes an exception to be thrown. The recognizer event and the exception are all that many applications require, but this method allows the recognizer's exhaustion status to be discovered directly.
Mutators
read
$recce->read( 'Number', 42 );
$recce->read('Multiply');
$recce->read( 'Number', 1 );
$recce->read('Add');
$recce->read( 'Number', 7 );
The read method reads one token at the current parse location. It then advances the current location by 1.
read takes two arguments: a token name and a token value. The token name is required. It must be the name of a valid terminal symbol. The token value is optional. It defaults to a "whatever" value. Here "whatever" means the value may vary from instance to instance, and cannot be relied on in any way. For details about terminal symbols, see "Terminals" in Marpa::R2::Grammar.
The parser may accept or reject the token. If the parser accepts the token, the read method returns the number of recognizer events that occurred during the read. Only two events are enabled by default -- exceeding the Earley item warning threshold, and exhaustion. For more about the Earley item warning threshold, see "too_many_earley_items". "Exhaustion" means that the next read call must fail, because there is no token that will be acceptable to it. More details on "exhaustion" are in a section below.
Marpa may reject a token because it is not one of those acceptable at the current location. When this happens, read returns a Perl undef. A rejected token need not end parsing -- it is perfectly possible to retry the read call with another token. This is, in fact, an important technique in Ruby Slippers parsing. For details, see the section on Ruby Slippers parsing.
For other failures, including an attempt to read a token into an exhausted parser, Marpa throws an exception.
Note that passing an explicit undef as the token value argument is quite different from omitting it. If the token value is omitted, it is a "whatever" value, one which could be anything. If it is undef, then the token value is always a Perl undef.
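Because an accepted read returns an event count, which may well be 0, the result is best tested with defined rather than for truth. The sketch below shows a rejected and an accepted token. The calculator grammar is an assumption made for illustration and is not shown in this document.

```perl
use strict;
use warnings;
use Marpa::R2;

# Assumed grammar: a small calculator matching the tokens in the synopsis.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Expression',
        rules => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            { lhs => 'Term',       rhs => [qw/Term Add Term/] },
            { lhs => 'Factor',     rhs => [qw/Factor Multiply Factor/] },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# An expression cannot begin with 'Add', so this token is rejected
# and read returns undef.  The recognizer remains usable.
my $rejected = $recce->read('Add');

# Retrying at the same location with an acceptable token succeeds.
# The return value is an event count, so test it with defined.
my $accepted = $recce->read( 'Number', 42 );

print 'Add at location 0: ',
    ( defined $rejected ? 'accepted' : 'rejected' ), "\n";
print 'Number at location 0: ',
    ( defined $accepted ? 'accepted' : 'rejected' ), "\n";
```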
set
$recce->set( { max_parses => 10, } );
The set method's arguments are references to hashes of named arguments. The set method can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.
value
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
Because Marpa parses ambiguous grammars, every parse is a series of zero or more parse trees. There are zero parse trees if there was no valid parse of the input according to the grammar.
The value method call evaluates the next parse tree in the parse series, and returns a reference to the parse result for that parse tree. If there are no more parse trees, the value method returns undef.
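A common way to walk the whole parse series is to call value in a loop until it returns undef. The sketch below assumes a deliberately ambiguous addition grammar and two illustrative action subroutines in a hypothetical My_Actions package; none of these appear in this document. The actions build parenthesized strings so that the two parse trees can be told apart.

```perl
use strict;
use warnings;
use Marpa::R2;

# Illustrative semantics: each action receives the per-parse variable
# first, then the values of the rule's right-hand-side symbols.
sub My_Actions::first_arg { shift; return shift; }

sub My_Actions::do_add {
    my ( undef, $left, undef, $right ) = @_;
    return "($left+$right)";
}

# Assumed grammar: 'Term Add Term' makes a chain of additions ambiguous.
my $grammar = Marpa::R2::Grammar->new(
    {   start          => 'Expression',
        actions        => 'My_Actions',
        default_action => 'first_arg',
        rules          => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Number/] },
            {   lhs    => 'Term',
                rhs    => [qw/Term Add Term/],
                action => 'do_add'
            },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
$recce->read( 'Number', 1 );
$recce->read('Add');
$recce->read( 'Number', 2 );
$recce->read('Add');
$recce->read( 'Number', 3 );

# Each call to value evaluates the next parse tree; undef ends the series.
my @results;
while ( defined( my $value_ref = $recce->value() ) ) {
    push @results, ${$value_ref};
}

# With the default ranking method the order is arbitrary, so sort.
@results = sort @results;
print scalar(@results), " parse results: @results\n";
```

The input 1 + 2 + 3 can be grouped from the left or from the right, so the series should contain two parse results.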
Trace accessors
show_progress
print $recce->show_progress()
    or die "print failed: $ERRNO";
Returns a string describing the progress of the parse. With no arguments, the string contains reports for the current location. With a single integer argument N, the string contains reports for location N. With two numeric arguments, N and M, the arguments are interpreted as a range of locations and the returned string contains reports for all locations in the range. ("Location" as referred to in this section, and elsewhere in this document, is what is also called the Earley set ID.)
If an argument is negative, -N, it indicates the Nth location counting backward from the furthest location of the parse. For example, if 42 is the furthest location, -1 is location 42 and -2 is location 41. So the method call $recce->show_progress(-3, -1) returns reports for the last three locations of the parse, and the method call $recce->show_progress(0, -1) returns progress reports for the entire parse.
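The arithmetic behind negative locations can be sketched in plain Perl. The resolve_location subroutine below is a hypothetical helper written for illustration; it is not part of Marpa and only mirrors the counting rule described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Marpa method: convert a possibly negative
# location argument into an absolute location, given the furthest
# location of the parse.
sub resolve_location {
    my ( $location, $furthest ) = @_;
    return $location >= 0 ? $location : $furthest + 1 + $location;
}

my $furthest = 42;
my @resolved = map { resolve_location( $_, $furthest ) } ( -1, -2, -3, 0 );
print "Resolved locations: @resolved\n";
```

With 42 as the furthest location, the arguments -1, -2, -3 and 0 resolve to 42, 41, 40 and 0.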
show_progress is Marpa's most powerful tool for debugging application grammars. It can also be used to track the progress of a parse or to investigate how a parse works. A much fuller description, with an example, is in the document on debugging Marpa grammars.
Named arguments
The recognizer's named arguments are accepted by its new and set methods.
closures
The value of the closures named argument must be a reference to a hash. In each key/value pair of this hash, the key must be an action name. The hash value must be a CODE ref. The closures named argument is not allowed once evaluation has begun.
When an action name is a key in the closures named argument, the usual action resolution mechanism of the semantics is bypassed. One common use of the closures named argument is to allow anonymous subroutines to be semantic actions. For more details, see the document on semantics.
end
The end named argument specifies the parse end location. The default is for the parse to end where the input did, so that the parse returned is of the entire input. The end named argument is not allowed once evaluation has begun. "Location" as referred to here and elsewhere in this document is what is also called an Earley set ID.
grammar
The new method is required to have a grammar named argument. Its value must be a precomputed Marpa grammar object. The grammar named argument is not allowed anywhere else.
max_parses
The value must be an integer. If it is greater than zero, the evaluator will return no more than that number of parse results. If it is zero, there will be no limit on the number of parse results returned. The default is for there to be no limit.
Marpa allows extremely ambiguous grammars. max_parses can be used if the user wants to see only the first few parse results of an ambiguous parse. max_parses is also useful to limit CPU usage and output length when testing and debugging.
ranking_method
The value must be a string: one of "none", "rule", or "high_rule_only". When the value is "none", Marpa returns the parse results in arbitrary order. This is the default. The ranking_method named argument is not allowed once evaluation has begun.
The "rule" and "high_rule_only" ranking methods allow the user to control the order in which parse results are returned by the value method, and to exclude some parse results from the parse series. For details, see the document on parse order.
too_many_earley_items
The too_many_earley_items argument is optional. If specified, it sets the Earley item warning threshold. If an Earley set becomes larger than the Earley item warning threshold, a warning is printed to the trace file handle.
Marpa parses from any BNF, and can handle grammars and inputs which produce large Earley sets. But parsing that involves large Earley sets can be slow. Large Earley sets are something most applications can, and will wish to, avoid.
By default, Marpa calculates an Earley item warning threshold based on the size of the grammar. The default threshold will never be less than 100. If the Earley item warning threshold is set to 0, warnings about large Earley sets are turned off.
trace_actions
The trace_actions named argument is a boolean. If the boolean value is true, Marpa prints tracing information as it resolves action names to Perl closures. A boolean value of false turns tracing off, which is the default. Traces are written to the trace file handle.
trace_file_handle
The value is a file handle. Traces and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.
trace_terminals
Very handy in debugging, and often useful even when the problem is not in the lexing. The value is a trace level. When the trace level is 0, tracing of terminals is off. This is the default.
At a trace level of 1 or higher, Marpa produces a trace message for each terminal as it is accepted or rejected by the recognizer. At a trace level of 2 or higher, the trace messages include, for every location, a list of the terminals expected. In practical grammars, output from trace level 2 can be voluminous.
trace_values
The trace_values named argument is a numeric trace level. If the numeric trace level is 1, Marpa prints tracing information as values are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle.
warnings
The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.
Ruby Slippers parsing
$recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );
my @tokens = (
    [ 'Number', 42 ],
    ['Multiply'], [ 'Number', 1 ],
    ['Add'],      [ 'Number', 7 ],
);
TOKEN: for ( my $token_ix = 0; $token_ix <= $#tokens; $token_ix++ ) {
    defined $recce->read( @{ $tokens[$token_ix] } )
        or fix_things( $recce, $token_ix, \@tokens )
        or die q{Don't know how to fix things};
}
Marpa is able to tell the application which symbols are acceptable as tokens at the next location in the parse. The terminals_expected method returns the list of tokens that will be accepted by the next read. The application can use this information to change the input "on the fly" so that it is acceptable to the parser.
An application can also take a "try it and see" approach. If an application is not sure whether a token is acceptable or not, the application can try to read the dubious token using the read method. If the token is rejected, the read method call will return a Perl undef. At that point, the application can retry the read with a different token.
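The two approaches can be combined, as in the following sketch. It is an illustration only: the calculator grammar, the My_Actions package, and the repair policy (supplying a virtual Multiply token when two numbers are adjacent) are all assumptions, not something this document or Marpa prescribes.

```perl
use strict;
use warnings;
use Marpa::R2;

# Illustrative semantics in a hypothetical My_Actions package.
sub My_Actions::first_arg { shift; return shift; }
sub My_Actions::do_add { my ( undef, $l, undef, $r ) = @_; return $l + $r; }
sub My_Actions::do_multiply { my ( undef, $l, undef, $r ) = @_; return $l * $r; }

# Assumed grammar: a small calculator matching the tokens in the synopsis.
my $grammar = Marpa::R2::Grammar->new(
    {   start          => 'Expression',
        actions        => 'My_Actions',
        default_action => 'first_arg',
        rules          => [
            { lhs => 'Expression', rhs => [qw/Term/] },
            { lhs => 'Term',       rhs => [qw/Factor/] },
            { lhs => 'Factor',     rhs => [qw/Number/] },
            { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
            {   lhs    => 'Factor',
                rhs    => [qw/Factor Multiply Factor/],
                action => 'do_multiply'
            },
        ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# Input with the 'Multiply' between 42 and 1 left out.
my @tokens = ( [ 'Number', 42 ], [ 'Number', 1 ], ['Add'], [ 'Number', 7 ] );

my $virtual_tokens = 0;
TOKEN: for my $token (@tokens) {

    # "Try it and see": a defined result means the token was accepted.
    next TOKEN if defined $recce->read( @{$token} );

    # Rejected: ask the recognizer what it would accept instead.
    my %expected = map { $_ => 1 } @{ $recce->terminals_expected() };
    die q{Don't know how to fix things} if not $expected{Multiply};

    # Supply a virtual 'Multiply', then retry the original token.
    defined $recce->read('Multiply') or die 'Virtual token rejected';
    $virtual_tokens++;
    defined $recce->read( @{$token} )
        or die "Token still rejected: $token->[0]";
}

my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
print "Virtual tokens added: $virtual_tokens\n";
print "Value: $value\n";
```

The repaired token stream is the one from the synopsis, so the parse should evaluate 42 * 1 + 7.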
An example
Marpa's HTML parser, Marpa::HTML, is an example of how Ruby Slippers parsing can help with a non-trivial, real-life application. When a token is rejected in Marpa::HTML, it changes the input to match the parser's expectations by
* Modifying existing tokens, and
* Creating new tokens.
The second technique, the creation of new "virtual" tokens, is used by Marpa::HTML to deal with omitted start and end tags. The actual HTML grammar that Marpa::HTML uses takes an oversimplified view of the HTML -- it assumes, even when the HTML standards do not require it, that start and end tags are always present. For most HTML files of interest, this assumption will be contrary to fact.
Ruby Slippers parsing is used to make the grammar's over-simplistic view of the world come true for it. Whenever a token is rejected, Marpa::HTML looks at the expected tokens list. If it sees that a start or end tag is expected, Marpa::HTML creates a token for it -- a completely new "virtual" token that gives the parser exactly what it expects. Marpa::HTML then resumes input at the point in the original input stream where it left off.
Parse exhaustion
A parse is exhausted when it will accept no more input. An exhausted parse is not necessarily a failed parse. Grammars are often written so that once they "find what they are looking for", no further input is acceptable. Grammars of that kind become exhausted when they succeed.
By default, a recognizer event occurs whenever the parse is exhausted. An application can also check for exhaustion explicitly, using the recognizer's exhausted method.
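A grammar that becomes exhausted on success can be sketched as follows. The one-rule grammar and its symbol names are hypothetical, chosen only so that no input is acceptable after the first token.

```perl
use strict;
use warnings;
use Marpa::R2;

# Hypothetical grammar: a Sentence is exactly one Greeting token,
# so nothing more can be accepted once that token has been read.
my $grammar = Marpa::R2::Grammar->new(
    {   start => 'Sentence',
        rules => [ { lhs => 'Sentence', rhs => [qw/Greeting/] } ],
    }
);
$grammar->precompute();

my $recce = Marpa::R2::Recognizer->new( { grammar => $grammar } );

# Normalize the Perl true/false results to 1 and 0 for printing.
my $exhausted_before = $recce->exhausted() ? 1 : 0;
$recce->read( 'Greeting', 'hello' );
my $exhausted_after = $recce->exhausted() ? 1 : 0;

print "Exhausted before input: $exhausted_before\n";
print "Exhausted after input: $exhausted_after\n";
```

Here exhaustion signals success, not failure: the parse found what it was looking for, and a further read would throw an exception.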
Copyright and License
Copyright 2012 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.