NAME

Parse::Marpa::Recognizer - Marpa Recognizer Objects

SYNOPSIS

    my $recce = new Parse::Marpa::Recognizer( { grammar => $grammar } );

    my $fail_offset = $recce->text( '2-0*3+1' );
    if ( $fail_offset >= 0 ) {
        die("Parse failed at offset $fail_offset");
    }

    # Alternatively, lex the input yourself and add one token per earleme.
    my $recce = new Parse::Marpa::Recognizer({grammar => $grammar});

    my $op = $grammar->get_symbol("Op");
    my $number = $grammar->get_symbol("Number");

    my @tokens = (
        [$number, 2, 1],
        [$op, "-", 1],
        [$number, 0, 1],
        [$op, "*", 1],
        [$number, 3, 1],
        [$op, "+", 1],
        [$number, 1, 1],
    );

    TOKEN: for my $token (@tokens) {
        next TOKEN if $recce->earleme($token);
        die("Parse exhausted at character: ", $token->[1]);
    }

    $recce->end_input();

DESCRIPTION

Marpa parsing takes place in three major phases: grammar creation, input recognition and parse evaluation. Once a grammar object has rules, a recognizer object can be created from it. The recognizer accepts input and can be used to create a Marpa evaluator object.

Tokens and Earlemes

Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Marpa tokens can be of various lengths. Marpa tokens can even overlap.

For most parsers, position is location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location. This flexibility is provided by tracking parse position in earlemes. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage.

If you do your lexing with the text method, you will use a one-character-per-earleme model. The raw input to the parse will be a string made up from the series of strings and string references provided as arguments in calls to text. Each earleme corresponds to a character in one of those strings.

Marpa is not restricted to the one-character-per-earleme model. With the earleme method, you can structure your input in almost any way you like. You can, for example, create a token stream and use a one-token-per-earleme model, and this would be equivalent to the way things are typically done in other parsers. Marpa also allows you to structure your input in special ways to suit particular applications.

There are three restrictions on mapping tokens to earlemes:

  1. Scanning always starts at earleme 0.

  2. Tokens must be scanned in earleme order. That is, all the tokens at earleme N must be scanned before any token at earleme N+1.

  3. Tokens cannot be zero or negative in earleme length.

Earleme number N, or earleme N, means the location N earlemes after earleme 0. Length in earlemes is the difference between two earleme numbers. The length from earleme 3 to earleme 6, for instance, is 3 earlemes.

When a token is scanned, the start of the token is put at the current earleme. The end of the token is at earleme number c+l, where c is the location number of the current earleme, and l is the length of the token. The length of the token must be greater than zero.
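
The arithmetic above can be tried out without Marpa itself. What follows is a minimal sketch in plain Perl. The helper name token_spans and its input layout are assumptions made for illustration and are not part of the Parse::Marpa interface. It takes one list of token lengths per earleme, in earleme order, and reports where each token starts and ends.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa: given one array ref of
# token lengths per earleme (index 0 is earleme 0), return [start, end]
# pairs, where end = c + l as described above.
sub token_spans {
    my @lengths_by_earleme = @_;
    my @spans;
    for my $c ( 0 .. $#lengths_by_earleme ) {
        for my $l ( @{ $lengths_by_earleme[$c] } ) {
            die "Token length must be greater than zero\n" if $l <= 0;
            push @spans, [ $c, $c + $l ];
        }
    }
    return @spans;
}

# Two ambiguous tokens start at earleme 0 (lengths 1 and 3). A token of
# length 2 starts at earleme 1 and overlaps the longer one. No token
# starts at earleme 2.
my @spans = token_spans( [ 1, 3 ], [2], [] );
print join( q{ }, map { "$_->[0]-$_->[1]" } @spans ), "\n";
# prints "0-1 0-3 1-3"
```

Two of the three tokens end at earleme 3 even though they start at different earlemes, which is the kind of overlap a plain token-stream position cannot express.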

The default end of parsing is tracked by each recognizer. The default end of parsing is at earleme 0 when the recognizer is created. It is incremented on calls to the text and earleme methods, as described below in the sections for those methods. When an evaluator object is created from a recognizer object, it inherits the recognizer object's default end of parsing.

Parse Exhaustion

In recognizing input, a point may come when it is clear that a successful parse is no longer possible. At this point, both the parse and the recognizer are said to be exhausted. A parse or a recognizer is active if and only if it is not exhausted.

Because tokens can span earlemes, parses in Marpa can remain active even if no token either ends or begins at the current earleme. Marpa parses often contain long stretches of earlemes where no token either begins or ends.

METHODS

new

my $recce = new Parse::Marpa::Recognizer({
   grammar=> $grammar,
   lex_preamble => $new_lex_preamble,
});

The new method's single required argument is a hash reference of named arguments. The new method either returns a new recognizer object or throws an exception. Either the compiled_grammar or the grammar named argument must be specified, but not both. A recognizer is created with the current earleme and the default end of parsing both set at earleme 0.

If the grammar option is specified, its value must be a grammar object with rules defined. If it is not precomputed, new will precompute it. A deep copy of the grammar is then made to be used in the recognizer.

If the compiled_grammar option is specified, its value must be a Perl 5 string containing a compiled Marpa grammar, as produced by Parse::Marpa::Grammar::compile. It will be decompiled for use in the recognizer.

Marpa options can also be named arguments to new. For these, see "OPTIONS" in Parse::Marpa.

text

my $fail_offset = $recce->text( '2-0*3+1' );
if ( $fail_offset >= 0 ) {
    die("Parse failed at offset $fail_offset");
}

Extends the parse using the one-character-per-earleme model. The single required argument must be a string, or a reference to a string, containing the text to be parsed. If the parse is active after the text has been processed, the default end of parsing is set to the end of the text, the current earleme is set to the earleme just after the end of text, and -1 is returned.

If the parse is exhausted by the input, the character offset at which the parse was exhausted is returned. The character offset is the offset within the string which is the current argument. This offset is not necessarily the offset within the entire raw input. A zero return means that the parse was exhausted at character offset zero. The default end of parsing remains at the last earleme at which the parse was active. Failures, other than exhausted parses, are thrown as exceptions.
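
Because the returned offset is relative to the current argument, a caller feeding the input in several pieces has to keep its own running total. The sketch below shows one way to do that. The helper text_chunks and the stand-in class Demo::TextRecce are assumptions made for illustration and are not part of Parse::Marpa. The helper relies only on the return convention described above and assumes plain strings rather than string references.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa: feed chunks to any object
# whose text method behaves as described above (-1 while the parse is
# active, otherwise the offset within the current chunk). Returns -1, or
# the offset of the failure within the concatenated input.
sub text_chunks {
    my ( $recce, @chunks ) = @_;
    my $consumed = 0;    # characters in the chunks already accepted
    for my $chunk (@chunks) {
        my $fail_offset = $recce->text($chunk);
        return $consumed + $fail_offset if $fail_offset >= 0;
        $consumed += length $chunk;
    }
    return -1;
}

# Stand-in for a recognizer, for demonstration only: it reports exhaustion
# at the first character that is not a digit or one of + - *.
package Demo::TextRecce;
sub new { return bless {}, shift }

sub text {
    my ( $self, $string ) = @_;
    return $string =~ /[^0-9+*-]/ ? $-[0] : -1;
}

package main;

my $offset = text_chunks( Demo::TextRecce->new(), '2-0', '*3?', '+1' );
print "Parse failed at overall offset $offset\n" if $offset >= 0;
# prints "Parse failed at overall offset 5"
```

With a real recognizer in place of the stand-in, the same loop would turn the per-call offset 2 of the second chunk into offset 5 of the whole input.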

When you use the text method for input, earlemes correspond one-to-one to characters in the text. The earleme number of a character always differs from its character offset. The first character of the first string argument is at offset 0 and at earleme 1. Every other character in the first string argument has an earleme number one more than its character offset. Each subsequent call to text resets the character offset to 0, but does not reset the earleme numbering. Earleme numbers within a recognizer increase with each new character, even across multiple calls to text, and are never reset.
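
The relationship between offsets and earleme numbers can be tabulated with a few lines of plain Perl. This is a sketch that simply applies the numbering rule as stated above. The helper name earleme_map is hypothetical and is not part of Parse::Marpa.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa: for a series of strings
# passed to successive text calls, return [character, offset, earleme]
# triples. Offsets restart at 0 for each string; earleme numbers start at
# 1 for the first character and are never reset.
sub earleme_map {
    my @strings = @_;
    my $earleme = 0;
    my @map;
    for my $string (@strings) {
        for my $offset ( 0 .. length($string) - 1 ) {
            $earleme++;
            push @map, [ substr( $string, $offset, 1 ), $offset, $earleme ];
        }
    }
    return @map;
}

for my $entry ( earleme_map( '2-0', '*3' ) ) {
    printf "char '%s' offset %d earleme %d\n", @{$entry};
}
```

In this run the "*" that begins the second string is reported at offset 0 and earleme 4.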

Terminals are recognized in the text using the lexers that were specified in the porcelain or the plumbing. The earleme length of each token is set to the length of the token in characters. (If a token has a "lex prefix", the length of the lex prefix counts as part of the token length.)

Terminals cannot span calls to text. If a series of characters which otherwise would be recognized as a terminal by a lexer is split between two calls to text, that terminal will not be recognized.

earleme

my $a = $grammar->get_symbol("a");
$recce->earleme([$a, "a", 1]) or die("Parse exhausted");

The earleme method adds tokens at the current earleme. Every call to the earleme method moves the current earleme forward by one earleme. Unlike text, the earleme method assumes no particular model of the input.

The earleme method takes zero or more arguments. Each argument is a token which starts at the current earleme. More than one token may be added at each earleme, because ambiguous lexing is allowed. Each token argument is a reference to a three element array. The first element is a "cookie" for the token's symbol, as returned by the Parse::Marpa::Grammar::get_symbol method or the get_symbol method of a porcelain interface. The second element is the token's value in the parse, and may be any value legal in Perl 5, including undefined. The third is the token's length in earlemes.

The earleme method first adds the tokens in the arguments, if there were any. If, after all tokens have been added, the parse is still active, the default end of parsing is set to the current earleme. The current earleme is then advanced by one and the earleme method returns 1, indicating that the parse is still active.

The earleme method may be called without any arguments. If any previously added token ends after the current earleme, the parse will remain active. If the parse remains active, both the current earleme and the default end of parsing are incremented by one.

If the earleme method results in an exhausted parse, it returns 0. The default end of parsing remains at the last earleme at which the parse was active. The earleme method throws an exception on other failures.

An earleme remains the current earleme during only one call of the earleme method. All tokens starting at that earleme must be added in that call. The first time that the earleme method is called in a recognizer, the current earleme is at earleme 0.

This is the low-level token input method, and allows maximum control over scanning. No model of the input, or of the relationship between the tokens and the earlemes, is assumed. The user is free to invent her own.
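
One possible way to drive the earleme method from a custom input model is sketched below. The helper scan_schedule, the stand-in class Demo::EarlemeRecce and the symbol names are assumptions made for illustration and are not part of Parse::Marpa. The helper calls earleme exactly once per earleme, in order, passing every token that starts there and no arguments where none start, and then calls end_input.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa: $schedule maps a start
# earleme to an array ref of [symbol, value, length] tokens.
# Returns -1 if the parse stayed active, else the earleme at which the
# earleme method reported exhaustion.
sub scan_schedule {
    my ( $recce, $schedule ) = @_;
    my @starts = sort { $a <=> $b } keys %{$schedule};
    return -1 if not @starts;
    for my $earleme ( 0 .. $starts[-1] ) {
        my @tokens = @{ $schedule->{$earleme} || [] };
        return $earleme if not $recce->earleme(@tokens);
    }
    $recce->end_input();
    return -1;
}

# Stand-in for a recognizer, for demonstration only: it never exhausts
# and records how many tokens each call received.
package Demo::EarlemeRecce;
sub new { return bless { calls => [] }, shift }

sub earleme {
    my ( $self, @tokens ) = @_;
    push @{ $self->{calls} }, scalar @tokens;
    return 1;
}
sub end_input { return 1 }

package main;

my $recce = Demo::EarlemeRecce->new();

# 'sym_a' and 'sym_b' stand in for cookies returned by get_symbol.
my $status = scan_schedule(
    $recce,
    {   0 => [ [ 'sym_a', 'a', 1 ], [ 'sym_b', 'abc', 3 ] ],
        3 => [ [ 'sym_a', 'a', 1 ] ],
    }
);
print "Tokens per call: @{ $recce->{calls} }\n";
# prints "Tokens per call: 2 0 0 1"
```

The two calls without arguments cover earlemes 1 and 2, which lie inside the three-earleme token that started at earleme 0.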

end_input

$recce->end_input();

This method takes no arguments. It is used with the earleme method in offline mode, to indicate the end of input. The input is processed out to the last earleme at which a token ends, and the default end of parsing is set to that earleme. The current earleme is then set to the earleme after the default end of parsing.

SUPPORT

See the support section in the main module.

AUTHOR

Jeffrey Kegler

LICENSE AND COPYRIGHT

Copyright 2007 - 2008 Jeffrey Kegler

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.