NAME

Parse::Marpa::Recognizer - A Marpa Recognizer Object

SYNOPSIS

    my $recce = new Parse::Marpa::Recognizer({
	grammar => $grammar,
    });

    my $fail_offset = $recce->text(\("2-0*3+1"));
    if ($fail_offset >= 0) {
       die("Parse failed at offset $fail_offset");
    }

my $recce2 = Parse::Marpa::Recognizer::new({grammar => $grammar});

my $op = $grammar->get_symbol("op");
my $number = $grammar->get_symbol("number");
$recce2->earleme([$number, 2, 1]);
$recce2->earleme([$op, "-", 1]);
$recce2->earleme([$number, 0, 1]);
$recce2->earleme([$op, "*", 1]);
$recce2->earleme([$number, 3, 1]);
$recce2->earleme([$op, "+", 1]);
$recce2->earleme([$number, 1, 1]);
$recce2->end_input();

DESCRIPTION

Marpa parsing takes place in three major phases: grammar creation, input recognition and parse evaluation. Once a grammar has rules and is precomputed, a recognizer can be created from it. The recognizer accepts input and can be used to create a Marpa evaluator object.

Tokens and Earlemes

Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Marpa tokens can be of various lengths. Marpa tokens can even overlap.

For most parsers, their idea of position is a location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location. This flexibility is provide by tracking parse position in earlemes, which are named in honor of Jay Earley, the inventor of the algorithm on which Marpa is based.

If you do your lexing with the text method, you will use a one-character-per-earleme model. That is, input will be treated as a string, and, each earleme will be a character location in that string. If you set up your terminals using MDL, Marpa assumes that you will be using the text method and one-character-per-earleme matching.

Marpa is not restricted to the one-character-per-earleme model. With the earleme method, you can structure your input in almost any way you like. You could, for example, create a token stream and use a one-token-per-earleme model, and this would be equivalent to the standard way of doing things. You can also structure your input in other, special ways to suit your application.

In creating an earleme model of your input, there are only two restrictions:

  1. Tokens must be scanned in earleme order. That is, all the tokens at earleme N must be recognized before any token at earleme N+1.

  2. Tokens cannot be zero or negative in earleme length.

A parse is said to start at earleme 0, and "earleme N" means the location N earlemes after earleme 0. Length in earlemes probably means what you expect it does. The length from earleme 3 to earleme 6, for instance, is 3 earlemes.

The tokens text() recognizes are fed to the Marpa parse engine. The earleme length of each token is set using the tokens's earleme length. (If a token has a "lex prefix", the length of the lex prefix counts as part of the token length.)

Parse Exhaustion

In conventional Earley parsing, a parse is exhausted as soon as the parser reaches a "location" without a token. Because Marpa parses in terms of earlemes and tokens can span many earlemes, parses in Marpa remain viable even if they reach an "empty earleme". In fact, Marpa parses often contain many stretches of empty earlemes, and some of these stretches can be quite long.

In Marpa, a parse remains viable if some token has been recognized which ends at or after the current earleme. A Marpa parse is not exhausted until

  • No token starts at the current earleme, and

  • No token ends at or after the current earleme.

Note to Experts

Those of you already familiar with Earley parsing and its standard terminology may find the following helpful:

  • Each "earleme" correspond to an Earley set in the usual terminology.

  • In the usual terminology, an "empty earleme" would be an Earley set with no Earley items.

METHODS

new

my $recce = new Parse::Marpa::Recognizer({
   grammar=> $g,
   preamble => $new_preamble,
});

Parse::Marpa::Recognizer::new takes as its arguments a hash reference containing named arguments. It returns a new parse object or throws an exception. Either the compiled_grammar or the grammar option must be specified, but not both. A recognizer is created with the default end of parsing set to earleme 0, which is before any input.

If the grammar option is specified, its value must be a grammar object with rules defined. If it is not precomputed, new will precompute it. A deep copy of the grammar is then made to be used in the recognizer.

If the compiled_grammar option is specified, its value must be a Perl 5 string containing a compiled Marpa grammar, as produced by Parse::Marpa::Grammar::compile. It will be decompiled for use in the recognizer.

Marpa options are also valid named arguments. For these, see "OPTIONS" in Parse::Marpa.

text

local($RS) = undef;
my $spec = <FH>;
my $fail_offset = $recce->text(\$spec);
if ($fail_offset >= 0) {
   die("Parse failed at offset $fail_offset");
}

Extends the parse in the one-character-per-earleme model. The one, required, argument must be a reference to a string containing the text to be parsed. If the parse is active after the text has been processed, the default end of parsing is set to the end of the text, and -1 is returned.

If the parse is exhausted by the input, that is if processing reaches a point where no successful parse is possible, the default end of parsing is set to the earleme at which the parse was exhausted, and the character offset at which the parse was exhausted is returned. A zero return means that the parse was exhausted at character offset zero. Failures, other than exhausted parses, are thrown as exceptions.

Earlemes correspond one-to-one with characters, and the earleme number is always one more than the character offset from the start of text. The first character is at earleme one and offset zero. Terminals are recognized in the text using the lexers that were specified in the source file or with the raw interface.

earleme

my $op = $grammar->get_symbol("op");
$recce2->earleme([$op, "-", 1]);

The earleme method adds tokens at the current earleme. The arguments to are zero or more token alternatives. There may be more than one token added at an earleme, because ambiguous parsing is allowed.

Each token alternative is a reference to a three element array. The first element is a "cookie" for the token's symbol, as returned by the Parse::Marpa::get_symbol method. The second element is the token's value in the parse. The third is the token's length in earlemes.

Every call to the earleme method must add all the alternative tokens for the current earleme. The first call time the earleme method the current earleme is the first earleme, or earleme 0. Every time the earleme method is called thereafter, the current earleme is advanced by one. Every call to the earleme method sets the default end of parsing to the current earleme. If the earleme method is called with no arguments, it advances the parse one earleme, without adding any new tokens.

Returns 1 on success. Returns 0 if the parse was exhausted at that earleme. Throws an exception on other failures.

This is the low-level token input method, and allows maximum control over the context and form of tokens. No model of the relationship between the input and the earlemes is assumed. The user is free to invent her own.

end_input

$recce2->end_input();

This method takes no arguments. It is used with the earleme method in offline mode, to signal the end of input. The input is processed to the last earleme at which a token ends, and the default end of parsing is set to that earleme.

find_complete_rule

my ($end_earleme, $symbol_names) = $recce->find_complete_rule();

The find_complete_rule method was an experiment, and will be replaced. Arguments which specify a start_earleme, symbol and end_earleme are optional. If the start earleme is not specified, it defaults to earleme 0. If the end earleme is not specified, its default wll be the default parse end earleme, that is, the default location that Parse::Marpa::Recognizer::initial() would use for the end of parsing. The symbol argument, if specified, must be the raw interface name of a symbol.

The end earleme argument must be at or before the default parse end earleme. If you specify an end earleme after the default parse end earleme, it is ignored and the default parse end earleme is used as the end earleme.

find_complete_rule() looks for parses of complete rules, that is, rules whose right hand side has been completely matched. Only parses which start at the start earleme are considered.

find_complete_rule() looks first for any parses which end at the end earleme. If it finds none, it looks for shorter and shorter parses until it reaches the start earleme and is looking at null parses.

While the parses find_complete_rule() find are always for complete rules, they can be subparses in the sense that they are not parses from the grammar's start symbol. Complete parses starting from any symbol are considered, unless a start symbol was specified as an argument. In that case only parses starting from that symbol are considered.

On failure to find a rule matching the criteria, a zero length array is returned. On success, the return value is an array of two elements. The first element of the array is the earleme at which the complete parse ends. The second element is a pointer to an array of symbol names which are start symbols of parses in the span from start earleme to end earleme. Symbol names will be raw interface names.

Multiple start symbols may be returned, because several different rules may have been completed in the span from start earleme to end earleme, and some of these rules may have different left hand sides. If a start symbol argument was specified, it will be one of the list of symbols in the return value.

In the case where no start symbol is specified, find_complete_rule() is probably useless. It returns only information from the first Earley item which matches other criteria. Other Earley items may contain complete rules for the same span, but their left hand sides may not be included in the return value's list of start symbols.

find_complete_rule() was an experiment in methods for improved diagnostics, online mode, and advanced wizardry with grammars. It is probably going to be replaced. The replacement method or methods should, given an end earleme or a range of end earlemes, be able to return all completed and expected symbols. Information about their start and end earleme should be available with the completed symbols. For the expected symbols, the earleme at which they were expected should given.

SUPPORT

See the support section in the main module.

AUTHOR

Jeffrey Kegler

COPYRIGHT

Copyright 2007 - 2008 Jeffrey Kegler

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.