NAME
Marpa::API::Recognizer - Marpa Recognizer Objects
SYNOPSIS
my $recce = Marpa::Recognizer->new( { grammar => $grammar } );
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', ],
[ 'Number', 1 ],
[ 'Add', ],
[ 'Number', 7 ],
);
$recce->tokens( \@tokens );
DESCRIPTION
The Marpa::Recognizer::new
constructor creates a recognizer object from a precomputed Marpa grammar object. The Marpa::Recognizer::tokens
method recognizes tokens.
Pedantically, recognition and evaluation are distinct things. Recognition is determining whether there is a parse. Evaluation is determining the value of a parse. But pure recognition in the pedantic sense is of little practical interest. Marpa::Recognizer incorporates many evaluation functions.
In particular, Marpa::Recognizer has a built-in "Single Parse Evaluator", which works on unambiguous grammars. The Single Parse Evaluator also works for ambiguous parses if the grammar is not infinitely ambiguous. The Single Parse Evaluator only returns one parse. For ambiguous grammars, the Single Parse Evaluator arbitrarily selects one of the parses.
The Single Parse Evaluator is usually the best choice when there is only one parse. The Single Parse Evaluator is also usually the best choice for an application that only needs one parse, and is not fussy about which one.
Location
In traditional parsing, location is position in a token stream. This document will assume that Marpa is using the traditional, token-stream, model of the input, unless it states otherwise. Marpa supports other input models, which are discussed in a separate document.
The current parse location, or current earleme is the location at which the next input is expected. Intuitively, the current earleme can be thought of as the recognizer's current position. Marpa will have completed its tables (or Earley sets) out to the current earleme, so the current earleme is often referred to as the last completed earleme.
At times, this document will refer to the location of the furthest earleme. Users sticking to the token-stream model can ignore the furthest earleme -- for them it will always be the same as the current earleme. Alternative input models in which the furthest earleme is important, and the reasons for the term earleme itself, are presented in the separate document on alternative input models.
CONSTRUCTOR
new
The new
method's arguments are references to hashes of named arguments. In the key/value pairs of these hashes, the key is the argument name, and the hash value is the value of the argument. The new
method either returns a new parse object or throws an exception. Details of the named arguments are below.
MUTATORS
end_input
Indicates that input is finished. The end_input
method takes no arguments. Calling end_input
is not necessary in the recognizer's default
mode. The end_input
method returns a Perl true value on success. On failure, it throws an exception.
After a successful call to the end_input
method, the current earleme will be the same as the furthest earleme. Any further calls to tokens
will cause an exception to be thrown. The end_input
method can only usefully be called once per recognizer. Subsequent calls to the end_input
method have no effect and return a Perl true.
set
The set
method's arguments are references to hashes of named arguments. It can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.
strip
To optimize reading input and building tables, the recognizer creates several large internal data structures. These are no longer needed when input is finished.
The strip
method removes the recognizer's internal data structures, leaving only the earley sets and other data that is needed for the evaluation phase. Stripping a recognizer greatly reduces the amount of memory it uses. Attempting to strip a recognizer before input is finished will cause an exception.
tokens
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', ],
[ 'Number', 1 ],
[ 'Add', ],
[ 'Number', 7 ],
);
$recce->tokens( \@tokens );
The tokens
method takes two arguments. The first is a reference to an array of token descriptors. The second, optional, argument is a reference to an index into that array, used when the call is interactive.
Tokens descriptors are references to arrays. The elements of the token descriptor array are, in order, the token name, the token value, the token length and the token offset. The token name must be the name of a symbol which is valid as a terminal in the grammar. In a token descriptor, only the token name is required -- the other elements may omitted.
The token value can be any Perl scalar. If omitted, the token value is a Perl undef
.
The token length and the token offset arguments are not used in the traditional, token-stream, model of the input. Their use is described in the document on alternative input models.
Parsing may reach a point where it is exhausted. "Exhausted" means that there is no parse at the current location, and no possible input could produce any parses at a subsequent location. If a recognizer is not exhausted, it is active.
In the evaluators, the default end of parsing is the end of input, and that means that if the evaluators are called using the default end of parsing, they will find no parses in an exhausted recognizer. The only way an exhausted recognizer can produce a valid parse is if end of parsing is set to a location before the end of input.
If the recognizer is exhausted, in scalar context, the tokens
method returns undef. If the recognizer is exhausted, in array context, the tokens
method returns an empty array.
If the recognizer is active, in scalar context, the tokens
method returns the number of the current earleme. If the recognizer is active, in array context, the tokens
method returns an array of two elements. The first element of the return array is the number of the current earleme. The second element is a reference to a list of expected tokens.
The list of expected tokens returned by the tokens
method is used for prediction-driven lexing. It is an array of strings. The strings are the names of the symbols that will be valid as token names at the current parse location. For more detail on how to use this list, see the section on interactive input.
value
The value mutator implements the Single Parse Evaluator. It is described in its own section.
ACCESSOR
check_terminal
Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not usually needed, but in unusual sitations a lexer may find this the easiest way to determine if a symbol is a terminal.
TRACE ACCESSORS
show_earley_sets
print $recce->show_earley_sets()
or Carp::croak "print failed: $OS_ERROR";
Returns a multi-line string listing every Earley item in every Earley set.
NAMED ARGUMENTS
grammar
The grammar
named argument is required. Its value must be a precomputed Marpa grammar object.
mode
The mode
named argument is optional. If present, it must be a string, either "default
" or "stream
".
In default mode (which is the default), only one call to the tokens
method is allowed for a recognizer object. The input is automatically finished after that call, and calling the end_input
method to indicate that input is finished is not necessary.
In stream mode, the tokens
method may be called repeatedly. To indicate that input is finished, it is necessary to call the end_input
method.
too_many_earley_items
The too_many_earley_items
argument is optional. If specified, it sets the earley items warning threshold. If an earley set becomes larger than the earley items warning threshold, a warning is printed to the trace file handle.
Marpa parses from any BNF, and can handle grammars and inputs which produce large earley sets. But parsing that involves large earley sets can be slow. Large earley sets will be something most users can work around and will wish to avoid.
By default, the earley items warning threshold is set to a number calculated based on the size of the grammar, but never less than 100. If the earley items warning threshold is set to 0, warnings about large earley sets are turned off.
trace_earley_sets
A boolean. If true, causes each earley set to be written to the trace file handle as it is completed.
trace_file_handle
The value is a file handle. Trace output and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.
trace_terminals
Traces terminals as they are accepted or rejected by the recognizer. Very handy in debugging, and often useful even when the problem is not in the lexing. Requires little knowledge of Marpa internals to interpret.
warnings
The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.
INTERACTIVE INPUT
RECCE_RESPONSE: for ( my $token_ix = 0;; ) {
my ( $current_earleme, $expected_tokens ) =
$recce->tokens( \@tokens, \$token_ix );
last RECCE_RESPONSE if $token_ix > $#tokens;
fix_things( \@tokens, $expected_tokens )
or Carp::croak(q{Don't know how to fix things});
} ## end for ( my $token_ix = 0;; )
Marpa allows prediction-driven lexing. That is, Marpa can tell the lexer what symbols will be acceptable as tokens at the next location in the parse. This can be very useful.
Interactive input takes place, like all input, via the tokens
method. When a second argument is given to the tokens
method, it is interactive. Interactive calls to the tokens
are only allowed in stream
mode.
In interactive input, the first argument is a reference to a token descriptor array, and the second argument is a reference to a token descriptor index. The tokens
method will read token descriptors starting at the token descriptor index. Elements in the token descriptor array before the token descriptor index are ignored.
In non-interactive calls, if the tokens
method sees a token descriptor for an invalid token, it throws an exception. In interactive calls, tokens
returns and leaves the token descriptor index at the location where it had a problem. If, in an interactive call, tokens
reaches the end of the token descriptor array, the token descriptor index will be set to one past the end of the token descriptor array.
The behavior of the token descriptor index is designed to facilitate loops like the one in the example above. In that loop, the token descriptor index is initialized to zero, and is never directly changed after that. The loop ends which the token descriptor index points past the end of the token descriptor array.
The return values for interactive calls to tokens
are the same as the return values for non-interactive calls. Interactive calls to tokens
can be made in scalar context, but most often applications will want to use the list of expected tokens, which is only available in array context.
Marpa's HTML parser, Marpa::UrHTML, is an example of interactive input applied to a non-trivial, real-life application. When an interactive tokens
call returns due to a problem with a token descriptor, Marpa::UrHTML tries to "fix" things in two ways.
First, it looks at the expected tokens list, trying to find a expected token that it can create as a "virtual" token. If Marpa::UrHTML finds an acceptable virtual token, it will create it "on the fly". Effectively, Marpa::UrHTML is supplying tokens that the parser wants but which are missing in the physical input. Marpa::UrHTML inserts virtual tokens into the input with separate, non-interactive, calls to tokens
. After inserting the virtual tokens, Marpa::UrHTML loops back to the main, interactive, tokens
call, resuming the original input at the point where it left off.
Second, and if the first technique fails, Marpa::UrHTML will change the next token descriptor in the token descriptor array so that it fits the parser's expectations. It then loops back to the main interactive call to tokens
, again resuming where it left off.
The recognizer keeps no state for parts of the token descriptor array which it has yet to reach. This means that an application can be aggressive about adding input on the fly. An application is also free to change the portion of the token descriptors array which has yet to be read. Multiple interactive calls to tokens
can be "in progress" at once. Those interactive calls can be interspersed with as many non-interactive calls as the user wishes.
SINGLE PARSE EVALUATOR
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
The Single Parse Evaluator is implemented as the value
method call. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of a parse, or undef if there was no parse.
If the parse is not ambiguous, the reference returned will be to the value of the only valid parse. If the parse is ambiguous, the reference returned will be to the value of a parse chosen arbitrarily.
These are the named arguments available to the value
method call:
end
value
's end
named argument is the earleme (that is, the location) where the parse should end. The default is for the parse to end where the input did, so that the parse returned is of the entire input.
closure
value
's closure
named argument is a reference to a hash. In the key/value pairs of this hashes, the key must be an action name from the action
properties of a rule; from the default_action
named argument of the grammar; from the ranking_action
property of a rule; or from the ranking_action
property of a symbol. The value in the hashes key/value pair must be a CODE ref.
When an action name is specified in the closure
named argument, the usual action resolution mechanism of the semantics is bypassed. The value of a hash entry in the closure
named argument can be an anonymous subroutine. The most common use of the closure
named argument is to allow anonymous subroutines to be semantics actions. For more details, see the document on semantics.
trace_actions
value
's trace_values
named argument is a boolean. If the boolean value is true, Marpa traces the resolution of semantic actions. A boolean value of false turns tracing off, which is the default. The traces are written to the trace file handle.
trace_values
value
's trace_values
named argument is a numeric trace level. If the numeric trace level is 1 or more, Marpa traces values as they are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. The traces are written to the trace file handle.
Tracing becomes more verbose as the levels increase. At a level of 3, trace_values
becomes extremely verbose, writing out the entire stack every time it changes.
LICENSE AND COPYRIGHT
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.