NAME
Marpa::Recognizer - Marpa Recognizer Objects
SYNOPSIS
my $recce = Marpa::Recognizer->new( { grammar => $grammar } );
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', ],
[ 'Number', 1 ],
[ 'Add', ],
[ 'Number', 7 ],
);
$recce->tokens( \@tokens );
DESCRIPTION
The Marpa::Recognizer::new
constructor creates a recognizer object from a precomputed Marpa grammar object. The Marpa::Recognizer::tokens
method recognizes tokens.
Pedantically, recognition and evaluation are distinct things. Recognition is determining whether there is a parse. Evaluation is determining the value of a parse. The Marpa::Recognizer handles both recognition and evaluation.
Location
In traditional parsing, location is position in a token stream. This document will assume that Marpa is using the traditional, token-stream, model of the input, unless it states otherwise. Marpa supports other input models.
The current parse location, or current earleme, is the location at which the next input is expected. Intuitively, the current earleme can be thought of as the recognizer's current position. Alternative input models, and the reasons for the term earleme itself, are discussed in the document on alternative input models.
Exhaustion
Recognition may reach a point where the recognizer is exhausted. A recognizer is exhausted if there is no parse at the current location, and if no possible additional input could produce parses at any subsequent location. If a recognizer is not exhausted, it is active.
The evaluators look for a parse from the parse start location to the parse end location. By default, the parse start location is the beginning of the input. By default, the parse end location is the end of the input. In other words, the default is to parse the entire input.
When an evaluator is called using the default parse end location, the evaluator will find no parses in an exhausted recognizer. It may be possible for an exhausted recognizer to produce a valid parse if the parse end location is set to a location before the end of input.
CONSTRUCTOR
new
The new
method's arguments are references to hashes of named arguments. In each key/value pair of these hashes, the key is the argument name, and the hash value is the value of the argument. The new
method either returns a new recognizer object or throws an exception. The named arguments are described in a section below.
MUTATORS
end_input
Indicates that input is finished. Calling end_input
is not necessary or useful in the recognizer's default
mode. In stream
mode, the end_input
method is needed to indicate the end of input.
The end_input
method takes no arguments. The end_input
method returns a Perl true value on success. On failure, it throws an exception.
Calling end_input
multiple times on the same recognizer object is harmless, but useless. The second and subsequent calls will return a Perl true, but will have no effect. Once end_input
has been called, calls to tokens
on the same recognizer object will cause an exception to be thrown.
set
The set
method's arguments are references to hashes of named arguments. The set
method can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.
strip
The recognizer generates a considerable amount of internal data while building tables. These internal data are no longer needed when input is finished and the recognizer's tables are complete.
The strip
method deletes much of the recognizer's internal data, but leaves the tables and other data needed for the evaluation phase. Stripping a recognizer greatly reduces the amount of memory it uses. Attempting to strip a recognizer before input is finished will cause an exception.
tokens
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', ],
[ 'Number', 1 ],
[ 'Add', ],
[ 'Number', 7 ],
);
$recce->tokens( \@tokens );
The tokens
method takes two arguments. The first is a reference to an array of token descriptors. The second, optional, argument is a reference to an index into that array, used when the call is interactive.
Tokens descriptors are also references to arrays. The elements in each token descriptor are, in order:
Token name
Required. Must be the name of a valid terminal symbol. For details about terminal symbols, see "Terminals" in Marpa::Grammar.
Token value
Optional. Can be any Perl scalar. If omitted, the token value is a Perl
undef
.Token length
Optional, and not used in the traditional, token-stream, model of the input. Token length is described in the document on alternative input models.
Token offset
Optional, and not used in the traditional, token-stream, model of the input. Token offset is described in the document on alternative input models.
In scalar context, if the recognizer remains active, the tokens
method returns the number of the current earleme. In array context, if the recognizer remains active, the tokens
method returns an array of two elements. The first element of the returned array is the number of the current earleme. The second element is a reference to a list of expected tokens.
The list of expected tokens returned by the tokens
method is used for prediction-driven lexing. It is an array of strings. The strings are the names of the symbols that are valid as token names at the current parse location. For detail on how to use the list of expected tokens, see the section on interactive input.
When a call to the tokens
method in scalar context exhausts the recognizer, the tokens
method returns undef
. When a call to the tokens
method in array context exhausts the recognizer, the tokens
method returns the empty array.
value
The value mutator evaluates and returns a parse result. It is described in its own section.
ACCESSOR
check_terminal
Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not often needed, but in special sitations a lexer may find this the most convenient way to determine if a symbol is a terminal.
status
The return value of the status method is that same as that of the "tokens" method. For more details, see the description of tokens
method.
The status
method can be used to check the status of the recognizer without calling the tokens
method. The tokens
method is the only way to get the list of expected tokens at earleme 0, before the first call of the tokens
method.
TRACE ACCESSORS
show_earley_sets
print $recce->show_earley_sets()
or die "print failed: $ERRNO";
Returns a multi-line string listing every Earley item in every Earley set. show_earley_sets
requires knowledge of Marpa internals to interpret.
For debugging grammars, users will want to use show_progress
instead. show_progress
contains the information necessary for debugging grammars and interpreting parse progress.
show_progress
print $recce->show_progress()
or die "print failed: $ERRNO";
Returns a string describing the progress of the parse. With no arguments, the string contains reports for the last completed earleme, which is typically the one of most interest. With a non-negative argument N, the string contains reports for earleme N.
With two numeric arguments, N and M, the arguments are interpreted as a range of earlemes and the returned string contains reports for all earlemes from N to M. The first argument must be non-negative. If the second argument is a negative integer, "-M", it indicates the Mth earleme from the last. In other words, -1 is the last earleme, -2 the next to last, etc. The call $recce->show_progress(0, -1)
will print progress reports for the entire parse.
show_progress
is an important tool for debugging application grammars. It can also be used to track the progress of a parse or to investigate how a parse works. Details are in the document on debugging Marpa grammars.
NAMED ARGUMENTS
grammar
The grammar
named argument is required. Its value must be a precomputed Marpa grammar object.
mode
The mode
named argument is optional. If present, it must be a string, either "default
" or "stream
".
In stream
mode, the tokens
method may be called repeatedly. To indicate that input is finished, it is necessary to call the end_input
method.
In default
mode, only one call to the tokens
method is allowed for each recognizer object. The input is automatically finished after that one call. In default
mode, the end_input
method is not useful. As the name indicates, default
mode is the default.
ranking_method
The value must be a string: either "none
" or "constant
". When the value is "none
", Marpa returns the parse results in arbitrary order. When the value is "constant
", Marpa allows the user to control the order in which parse results are returned by specifying ranking actions which assign values to rules and tokens.
The default is for parse results to be returned in arbitrary order. For details, see the section on ranking in the document on semantics.
too_many_earley_items
The too_many_earley_items
argument is optional. If specified, it sets the Earley item warning threshold. If an Earley set becomes larger than the Earley item warning threshold, a warning is printed to the trace file handle.
Marpa parses from any BNF, and can handle grammars and inputs which produce large Earley sets. But parsing that involves large Earley sets can be slow. Large Earley sets are something most applications can, and will wish to, avoid.
By default, Marpa calculates an Earley item warning threshold based on the size of the grammar. The default threshold will never be less than 100. If the Earley item warning threshold is set to 0, warnings about large Earley sets are turned off. For details about Earley sets, see the implementation document.
trace_earley_sets
A boolean. If true, causes each Earley set to be written to the trace file handle as it is completed. For details about Earley sets, see the implementation document.
trace_file_handle
The value is a file handle. Traces and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.
trace_terminals
Very handy in debugging, and often useful even when the problem is not in the lexing. The value is a trace level. When the trace level is 0, tracing of terminals is off. This is the default.
At a trace level of 1 or higher, Marpa traces terminals as they are accepted or rejected by the recognizer. At a trace level of 2 or higher, Marpa traces the terminals expected at every earleme. Practical grammars often expect a large number of different terminals at many locations, so the output from a trace level of 2 can be voluminous.
warnings
The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.
INTERACTIVE INPUT
RECCE_RESPONSE: for ( my $token_ix = 0;; ) {
my ( $current_earleme, $expected_tokens ) =
$recce->tokens( \@tokens, \$token_ix );
last RECCE_RESPONSE if $token_ix > $#tokens;
fix_things( \@tokens, $expected_tokens )
or die q{Don't know how to fix things};
} ## end for ( my $token_ix = 0;; )
Marpa allows prediction-driven lexing. In other words, Marpa can tell the lexer which symbols are acceptable as tokens at the next location in the parse. This can be very useful.
Interactive input takes place, like all input, via the tokens
method. If a call of the tokens
method passes a second argument, that call of the tokens
method is interactive. Interactive calls to the tokens
method are allowed only on recognizers that are in stream
mode.
In interactive input, the first argument is a reference to a token descriptor array, and the second argument is a reference to a token descriptor index. The tokens
method will read token descriptors starting at the token descriptor index. Token descriptors before the token descriptor index are ignored.
An unexpected token descriptor is a token descriptor whose token name is not the name of one of the terminal symbols expected at that point in the input. The token represented by an unexpected token descriptor is an unexpected token. In non-interactive calls to the tokens
method, an unexpected token descriptor causes an exception to be thrown.
In interactive calls, the tokens
method returns leaving the token descriptor index pointing to the first unexpected token descriptor. If there were no unexpected token descriptors, the token descriptor index will be set to one past the end of the token descriptor array.
The behavior of the token descriptor index is designed to facilitate loops like the one in the example above. In that loop, the token descriptor index is initialized to zero, and is never directly modified after that. The loop ends when the token descriptor index points past the end of the token descriptor array.
The return values for interactive calls to tokens
are the same as the return values for non-interactive calls. Interactive calls to tokens
can be made in scalar context, but most often applications will want to use the list of expected tokens, which is only returned in array context.
Mixing tokens Calls
Recognizers in stream
mode are very flexible. Both interactive and non-interactive tokens
method calls can be made to the same recognizer. A single recognizer can read input from any number of token descriptor arrays. Between calls to tokens
, the application is allowed to edit the token descriptor array, and to change the token descriptor index.
This flexiblity is possible because the recognizer is stateless with respect to token descriptor arrays and their token descriptor indexes. The tokens
method does not keep track of which token descriptor arrays it has seen. It does not keep track of which token descriptors it has already read. It does not keep track of where the token descriptor indexes have been in the past. Statelessness means that, regardless of what an application does with its token descriptor arrays, it is impossible to make them inconsistent with the recognizer's internal data.
An Example
Marpa's HTML parser, Marpa::HTML, is an example of how this flexibility can help with a non-trivial, real-life application. When an interactive tokens
call returns due to an invalid token name, Marpa::HTML tries to fix things in two ways.
First, it looks at the expected tokens list for the name of a token that will work as a "virtual" token. If Marpa::HTML finds an acceptable virtual token name, it will create a token descriptor for it "on the fly". In effect, Marpa::HTML supplies tokens that the parser wants but which are missing in the physical input. Marpa::HTML inserts the virtual tokens into the input using non-interactive tokens
method calls. After inserting the virtual tokens, Marpa::HTML loops back to the main, interactive, tokens
call, resuming the original input at the point where it left off.
Second, under some circumstances Marpa::HTML will change the next token descriptor in the token descriptor array to fit the parser's expectations. Marpa::HTML then loops back to the main interactive call to tokens
. When the main interactive call to tokens
resumes, the token descriptor index is unchanged, but points to an altered token descriptor.
EVALUATION
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
The value
method call evaluates and returns a parse result. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of the next parse result, or undef if there are no more parse results.
These are the named arguments available to the value
method call:
end
The value
method's end
named argument specifies the parse end location. The default is for the parse to end where the input did, so that the parse returned is of the entire input.
closures
The value
method's closures
named argument is a reference to a hash. In each key/value pair of this hash, the key must be an action name. The hash value must be a CODE ref.
Sources of action names include
The
action
properties of rulesThe
default_action
named argument of grammarsThe
lhs
properties of rulesThe
ranking_action
properties of rulesFor its
new
method, theaction_object
named argument of grammars
When an action name is a key in the closures
named argument, the usual action resolution mechanism of the semantics is bypassed. A common use of the closures
named argument is to allow anonymous subroutines to be semantic actions. For more details, see the document on semantics.
max_parses
The value must be an integer. If it is greater than zero, the evaluator will return no more than that number of parse results. If it is zero, there will be no limit on the number of parse results returned. The default is for there to be no limit.
Marpa allows extremely ambiguous grammars. max_parses
can be used if the user only wants to see the first few parse results of an ambiguous parse. max_parses
is also useful to limit CPU usage and output length when testing and debugging.
ranking_method
The value must be a string: either "none
" or "constant
". When the value is "none
", Marpa returns the parse results in arbitrary order. When the value is "constant
", Marpa allows the user to control the order in which parse results are returned by specifying ranking actions. The default is for parse results to be returned in arbitrary order. For details, see the section on parse order in the semantics document.
trace_actions
The value
method's trace_actions
named argument is a boolean. If the boolean value is true, Marpa traces the resolution of action names to Perl closures. A boolean value of false turns tracing off, which is the default. Traces are written to the trace file handle.
trace_values
The value
method's trace_values
named argument is a numeric trace level. If the numeric trace level is 1, Marpa traces values as they are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle.
LICENSE AND COPYRIGHT
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.