NAME

Marpa::Recognizer - Marpa Recognizer Objects

SYNOPSIS

my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

my @tokens = (
    [ 'Number', 42 ],
    [ 'Multiply', ],
    [ 'Number', 1 ],
    [ 'Add', ],
    [ 'Number', 7 ],
);

$recce->tokens( \@tokens );

DESCRIPTION

The Marpa::Recognizer::new constructor creates a recognizer object from a precomputed Marpa grammar object. The Marpa::Recognizer::tokens method recognizes tokens.

Pedantically, recognition and evaluation are distinct things. Recognition is determining whether there is a parse. Evaluation is determining the value of a parse. In practice the two are usually intertwined. A lot of what Marpa::Recognizer does can be regarded as "evaluation".

In particular, Marpa::Recognizer has a built-in Single Parse Evaluator, which works on unambiguous grammars. The Single Parse Evaluator also works for ambiguous parses if the grammar is not infinitely ambiguous. The Single Parse Evaluator only returns one parse. For ambiguous grammars, the Single Parse Evaluator arbitrarily selects a parse.

The Single Parse Evaluator is usually the best choice when there is only one parse. The Single Parse Evaluator is also usually the best choice for an application that only needs one parse, and is not fussy about which one.

Location

In traditional parsing, location is position in a token stream. This document will assume that Marpa is using the traditional, token-stream, model of the input, unless it states otherwise. Marpa supports other input models, which are discussed in a separate document.

The current parse location, or current earleme, is the location at which the next input is expected. Intuitively, the current earleme can be thought of as the recognizer's current position. Alternative input models, and the reasons for the term earleme itself, are presented in the document on alternative input models.

Exhaustion

Recognition may reach a point where the recognizer is exhausted. A recognizer is exhausted if there is no parse at the current location, and if no possible additional input could produce parses at any subsequent location. If a recognizer is not exhausted, it is active.

The evaluators look for a parse from the parse start location to the parse end location. By default, the parse start location is the beginning of the input. By default, the parse end location is the end of the input. In other words, the default is to parse the entire input.

When an evaluator is called using the default parse end location, the evaluator will find no parses in an exhausted recognizer. It may be possible for an exhausted recognizer to produce a valid parse when the parse end location is set to a location before the end of input.

Failed Parses

Failed parses do not necessarily show up in the recognizer. Despite its name, Marpa's recognizer does not actually indicate whether or not it recognized its input. Recognition is not decided until an evaluator is called. There are two reasons for this.

  • The location of the end of parsing is not known for certain until the evaluator is called. Whether an input is recognized or not cannot be decided until the parse end location is known.

  • Recognition for its own sake is an academic issue. Applications are concerned about whether there were any parse results. And producing parse results is the job of the evaluators.

CONSTRUCTOR

new

The new method's arguments are references to hashes of named arguments. In the key/value pairs of these hashes, the key is the argument name, and the hash value is the value of the argument. The new method either returns a new recognizer object or throws an exception. The named arguments are described in a section below.

MUTATORS

end_input

Indicates that input is finished. The end_input method takes no arguments. Calling end_input is not useful in the recognizer's default mode. The end_input method returns a Perl true value on success. On failure, it throws an exception.

In stream mode, the end_input method is needed to indicate the end of input. Calling end_input multiple times on the same recognizer object is harmless, but useless. The second and subsequent calls will return a Perl true, but will have no effect. Once end_input has been called, calls to tokens on the same recognizer object will cause an exception to be thrown.

set

The set method's arguments are references to hashes of named arguments. The set method can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.

strip

The recognizer generates a considerable amount of internal data while building tables. These internal data are no longer needed when input is finished and the recognizer's tables are complete.

The strip method deletes much of the recognizer's internal data, but leaves the tables and other data needed for the evaluation phase. Stripping a recognizer greatly reduces the amount of memory it uses. Attempting to strip a recognizer before input is finished will cause an exception.

tokens

my @tokens = (
    [ 'Number', 42 ],
    [ 'Multiply', ],
    [ 'Number', 1 ],
    [ 'Add', ],
    [ 'Number', 7 ],
);

$recce->tokens( \@tokens );

The tokens method takes two arguments. The first is a reference to an array of token descriptors. The second, optional, argument is a reference to an index into that array, used when the call is interactive.

Tokens descriptors are also references to arrays. The elements in each token descriptor are, in order:

  • Token name

    Required. Must be the name of a valid terminal symbol. For details about terminal symbols, see "Terminals" in Marpa::Grammar.

  • Token value

    Optional. Can be any Perl scalar. If omitted, the token value is a Perl undef.

  • Token length

    Optional, and not used in the traditional, token-stream, model of the input. Token length is described in the document on alternative input models.

  • Token offset

    Optional, and not used in the traditional, token-stream, model of the input. Token offset is described in the document on alternative input models.

In scalar context, if the recognizer remains active, the tokens method returns the number of the current earleme. In array context, if the recognizer remains active, the tokens method returns an array of two elements. The first element of the returned array is the number of the current earleme. The second element is a reference to a list of expected tokens.

The list of expected tokens returned by the tokens method is used for prediction-driven lexing. It is an array of strings. The strings are the names of the symbols that are valid as token names at the current parse location. For detail on how to use the list of expected tokens, see the section on interactive input.

When a call to the tokens method in scalar context exhausts the recognizer, the tokens method returns undef. When a call to the tokens method in array context exhausts the recognizer, the tokens method returns the empty array.

value

The value mutator implements the Single Parse Evaluator. It is described in its own section.

ACCESSOR

check_terminal

Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not usually needed, but in unusual sitations a lexer may find this the easiest way to determine if a symbol is a terminal.

TRACE ACCESSORS

show_earley_sets

print $recce->show_earley_sets()
    or Carp::croak "print failed: $OS_ERROR";

Returns a multi-line string listing every Earley item in every Earley set.

NAMED ARGUMENTS

grammar

The grammar named argument is required. Its value must be a precomputed Marpa grammar object.

mode

The mode named argument is optional. If present, it must be a string, either "default" or "stream".

In stream mode, the tokens method may be called repeatedly. To indicate that input is finished, it is necessary to call the end_input method.

In default mode, only one call to the tokens method is allowed for each recognizer object. The input is automatically finished after that one call. In default mode, the end_input method is not useful. As the name indicates, default mode is the default.

too_many_earley_items

The too_many_earley_items argument is optional. If specified, it sets the earley item warning threshold. If an earley set becomes larger than the earley item warning threshold, a warning is printed to the trace file handle.

Marpa parses from any BNF, and can handle grammars and inputs which produce large earley sets. But parsing that involves large earley sets can be slow. Large earley sets are something most users can, and will wish to, avoid.

By default, Marpa calculates an earley item warning threshold based on the size of the grammar. This default, calculated, threshold will never be less than 100. If the earley item warning threshold is set to 0, warnings about large earley sets are turned off. For details about earley sets, see the implementation document.

trace_earley_sets

A boolean. If true, causes each earley set to be written to the trace file handle as it is completed. For details about earley sets, see the implementation document.

trace_file_handle

The value is a file handle. Traces and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.

trace_terminals

Traces terminals as they are accepted or rejected by the recognizer. Very handy in debugging, and often useful even when the problem is not in the lexing. Requires little knowledge of Marpa internals to interpret.

warnings

The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.

INTERACTIVE INPUT

RECCE_RESPONSE: for ( my $token_ix = 0;; ) {

    my ( $current_earleme, $expected_tokens ) =
        $recce->tokens( \@tokens, \$token_ix );

    last RECCE_RESPONSE if $token_ix > $#tokens;

    fix_things( \@tokens, $expected_tokens )
        or Carp::croak(q{Don't know how to fix things});

} ## end for ( my $token_ix = 0;; )

Marpa allows prediction-driven lexing. That is, Marpa can tell the lexer which symbols are acceptable as tokens at the next location in the parse. This can be very useful.

Interactive input takes place, like all input, via the tokens method. If a call of the tokens method passes a second argument, that call of the tokens method is interactive. Interactive calls to the tokens method are allowed only on recognizers that are in stream mode.

In interactive input, the first argument is a reference to a token descriptor array, and the second argument is a reference to a token descriptor index. The tokens method will read token descriptors starting at the token descriptor index. Token descriptors before the token descriptor index are ignored.

In interactive calls, if a token descriptor has a token name that is not a valid terminal, the tokens method returns, leaving the token descriptor index where it was when tokens had the problem. (In non-interactive calls, an invalid token name causes an exception to be thrown.) If tokens reads the entire token descriptor array without a problem, the token descriptor index will be set to one past the end of the token descriptor array.

The behavior of the token descriptor index is designed to facilitate loops like the one in the example above. In that loop, the token descriptor index is initialized to zero, and is never directly modified after that. The loop ends when the token descriptor index points past the end of the token descriptor array.

The return values for interactive calls to tokens are the same as the return values for non-interactive calls. Interactive calls to tokens can be made in scalar context, but most often applications will want to use the list of expected tokens, which is only returned in array context.

Mixing tokens Calls

Recognizers in stream mode are very flexible. Both interactive and non-interactive tokens method calls can be made to the same recognizer. A single recognizer can read input from any number of token descriptor arrays. Between tokens's calls, the application is allowed to edit the token descriptor array, and to change the token descriptor index.

This flexiblity is possible because the recognizer is stateless with respect to token descriptor arrays and their token descriptor indexes. The tokens method does not keep track of which token descriptor arrays it has seen. It does not keep track of which token descriptors it has already read. It does not keep track of where the token descriptor indexes have been in the past. Statelessness means that, regardless of what an application does with its token descriptor arrays, it cannot make them inconsistent with the recognizer's internal data.

An Example

Marpa's HTML parser, Marpa::UrHTML, is an example of how this flexibility can help with a non-trivial, real-life application. When an interactive tokens call returns due to an invalid token name, Marpa::UrHTML tries to fix things in two ways.

First, it looks at the expected tokens list for the name of a token that will work as a "virtual" token. If Marpa::UrHTML finds an acceptable virtual token name, it will create a token descriptor for it "on the fly". In effect, Marpa::UrHTML supplies tokens that the parser wants but which are missing in the physical input. Marpa::UrHTML inserts the virtual tokens into the input using non-interactive tokens method calls. After inserting the virtual tokens, Marpa::UrHTML loops back to the main, interactive, tokens call, resuming the original input at the point where it left off.

Second, under some circumstances Marpa::UrHTML will change the next token descriptor in the token descriptor array to fit the parser's expectations. Marpa::UrHTML then loops back to the main interactive call to tokens. When the main interactive call to tokens resumes, the token descriptor index is unchanged, but points to an altered token descriptor.

SINGLE PARSE EVALUATOR

my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';

The Single Parse Evaluator is implemented as the value method call. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of a parse, or undef if there was no parse.

If the parse is not ambiguous, the reference returned will be to the value of the only valid parse. If the parse is ambiguous, the reference returned will be to the value of a parse chosen arbitrarily.

These are the named arguments available to the value method call:

end

The value method's end named argument specifies the parse end location. The default is for the parse to end where the input did, so that the parse returned is of the entire input.

closures

The value method's closures named argument is a reference to a hash. In the key/value pairs of this hash, the key must be an action name. The hash value must be a CODE ref.

Sources of action names include

  • The action properties of rules

  • The default_action named argument of grammars

  • The lhs properties of rules

  • The ranking_action properties of rules

  • The ranking_action properties of symbols

  • For its new method, the action_object named argument of grammars

When an action name is a key in the closures named argument, the usual action resolution mechanism of the semantics is bypassed. A common use of the closures named argument is to allow anonymous subroutines to be semantic actions. For more details, see the document on semantics.

trace_actions

The value method's trace_actions named argument is a boolean. If the boolean value is true, Marpa traces the resolution of action names to Perl closures. A boolean value of false turns tracing off, which is the default. Traces are written to the trace file handle.

trace_values

The value method's trace_values named argument is a numeric trace level. If the numeric trace level is 1, Marpa traces values as they are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle.

LICENSE AND COPYRIGHT

Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.