NAME
Parse::Marpa::Recognizer - Marpa Recognizer Objects
SYNOPSIS
my $recce = new Parse::Marpa::Recognizer( { grammar => $grammar } );
my $fail_offset = $recce->text( '2-0*3+1' );
if ( $fail_offset >= 0 ) {
    die("Parse failed at offset $fail_offset");
}

my $recce = new Parse::Marpa::Recognizer({grammar => $grammar});
my $op = $grammar->get_symbol("Op");
my $number = $grammar->get_symbol("Number");
my @tokens = (
    [$number, 2, 1],
    [$op, "-", 1],
    [$number, 0, 1],
    [$op, "*", 1],
    [$number, 3, 1],
    [$op, "+", 1],
    [$number, 1, 1],
);
TOKEN: for my $token (@tokens) {
    next TOKEN if $recce->earleme($token);
    die("Parsing exhausted at character: ", $token->[1]);
}
$recce->end_input();
DESCRIPTION
Marpa parsing takes place in three major phases: grammar creation, input recognition and parse evaluation. Once a grammar object has rules, a recognizer object can be created from it. The recognizer accepts input and can be used to create a Marpa evaluator object.
Tokens and Earlemes
Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Marpa tokens can be of various lengths. Marpa tokens can even overlap.
For most parsers, position is location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location. This flexibility is provided by tracking parse position in earlemes. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage.
If you do your lexing with the text method, you will use a one-character-per-earleme model. The raw input to the parse will be a string made up from the series of strings and string references provided as arguments in calls to text. Every character will be treated as being exactly one earleme in length.
Marpa is not restricted to the one-character-per-earleme model. With the earleme method, you can structure your input in almost any way you like. You can, for example, create a token stream and use a one-token-per-earleme model, and this would be equivalent to the way things are typically done in other parsers. Marpa also allows you to structure your input in special ways to suit particular applications.
There are three restrictions on mapping tokens to earlemes:
Scanning always starts at earleme 0.
Tokens must be scanned in earleme order. That is, all the tokens starting at earleme N must be scanned before any token starting at earleme N+1.
Tokens cannot be zero or negative in earleme length.
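The three restrictions can be checked mechanically before any token reaches the recognizer. The following is a minimal sketch, assuming each token is described by a [start earleme, length in earlemes] pair listed in scanning order; check_token_layout is a hypothetical helper written for this illustration and is not part of Parse::Marpa.

```perl
use strict;
use warnings;

# Hypothetical checker, not part of Parse::Marpa. Each token is a
# [start_earleme, length_in_earlemes] pair, listed in scanning order.
# Returns an error message, or nothing if all three restrictions hold.
sub check_token_layout {
    my @tokens         = @_;
    my $previous_start = 0;    # scanning always starts at earleme 0
    for my $ix ( 0 .. $#tokens ) {
        my ( $start, $length ) = @{ $tokens[$ix] };
        return "token $ix starts before earleme 0" if $start < 0;
        return "token $ix is out of earleme order"
            if $start < $previous_start;
        return "token $ix has non-positive length" if $length <= 0;
        $previous_start = $start;
    }
    return;
}

# Overlapping and ambiguous tokens are fine, as long as they are in order.
my $layout_error = check_token_layout( [ 0, 3 ], [ 0, 1 ], [ 1, 1 ] );
print defined $layout_error ? "$layout_error\n" : "layout is acceptable\n";
```

Several tokens may share a start earleme, so the order check only rejects a start earleme that goes backwards.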
Earleme number N, or earleme N, means the location N earlemes after earleme 0. Length in earlemes means what you would expect it to mean. The length from earleme 3 to earleme 6, for instance, is 3 earlemes.
When a token is scanned, the start of the token is put at the current earleme. The end of the token is at earleme number c+l, where c is the location number of the current earleme, and l is the length of the token. The length of the token must be greater than zero.
The default end of parsing is tracked by each recognizer. The default end of parsing is at earleme 0 when the recognizer is created. It is incremented on calls to the text and earleme methods, as described below in the sections for those methods. When an evaluator object is created from a recognizer object, it inherits the recognizer object's default end of parsing.
Exhaustion
At the start of parsing, the furthest earleme is earleme 0. When a token is recognized, its end earleme is determined by adding the token length to the current earleme. If the new token's end earleme is after the furthest earleme, the furthest earleme is set at the new token's end earleme.
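The bookkeeping for the furthest earleme amounts to a running maximum of token end earlemes. A minimal sketch follows, assuming tokens are given as [start earleme, length] pairs; furthest_earleme is a hypothetical name used only for this illustration, not a Parse::Marpa method.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa: returns the furthest
# earleme after scanning tokens given as [start_earleme, length] pairs.
sub furthest_earleme {
    my @tokens   = @_;
    my $furthest = 0;    # at the start of parsing, the furthest earleme is 0
    for my $token (@tokens) {
        my ( $start, $length ) = @{$token};
        my $end = $start + $length;    # end earleme of this token
        $furthest = $end if $end > $furthest;
    }
    return $furthest;
}

# Two tokens start at earleme 0 (lengths 1 and 4), one at earleme 1.
my $furthest = furthest_earleme( [ 0, 1 ], [ 0, 4 ], [ 1, 1 ] );
print "furthest earleme: $furthest\n";
```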
No more successful parses are possible when Marpa reaches an empty Earley set which is either immediately before the furthest earleme, or which is anywhere after the furthest earleme. At this point, the recognizer is said to be exhausted. A recognizer is active if and only if it is not exhausted.
When the recognizer is exhausted or active, I sometimes say, more loosely, that the parser is exhausted (active), or that parsing is exhausted (active). In the context of a particular parse being worked on, we can also speak of a parse being exhausted or active. Remember, however, that an exhausted recognizer will often contain successful parses prior to the current earleme. In fact, successful parsing in offline mode always leaves the recognizer exhausted. This mechanism is used to prevent further input.
Because tokens can be more than one earleme in length, parses in Marpa can remain active even if no token is found at the current earleme. In the one-character-per-earleme model, stretches where no token either begins or ends can be many earlemes in length.
METHODS
new
my $recce = new Parse::Marpa::Recognizer({
    grammar => $grammar,
    lex_preamble => $new_lex_preamble,
});
The new method's single, required argument is a hash reference of named arguments. The new method either returns a new recognizer object or throws an exception. Either the compiled_grammar or the grammar named argument must be specified, but not both. A recognizer is created with the current earleme and the default end of parsing both set at earleme 0.
If the grammar option is specified, its value must be a grammar object with rules defined. If it is not precomputed, new will precompute it. A deep copy of the grammar is then made to be used in the recognizer.
If the compiled_grammar option is specified, its value must be a Perl 5 string containing a compiled Marpa grammar, as produced by Parse::Marpa::Grammar::compile. It will be decompiled for use in the recognizer.
Marpa options can also be named arguments to new. For these, see "OPTIONS" in Parse::Marpa.
text
my $fail_offset = $recce->text( '2-0*3+1' );
if ( $fail_offset >= 0 ) {
    die("Parse failed at offset $fail_offset");
}
Extends the parse using the one-character-per-earleme model. The one, required, argument must be a string or a reference to a string which contains text to be parsed. If the parse is active after the text has been processed, the default end of parsing is set to the end of the text, the current earleme is set to the earleme just after the end of text, and -1 is returned.
If the recognizer is exhausted by the input, the character offset at which parsing was exhausted is returned. The character offset is the offset within the string which is the current argument. This offset is not necessarily the offset within the entire raw input. A zero return means that parsing was exhausted at character offset zero. The default end of parsing remains at the last earleme at which the parse was active. Failures, other than exhausted recognizers, are thrown as exceptions.
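Because the returned offset is relative to the current argument, a caller who feeds the input in several pieces has to add the lengths of the earlier pieces to locate the failure in the whole raw input. The sketch below shows one way to do that. The feed_chunks helper and the stand-in callback are hypothetical: the callback only imitates the documented return convention of text (-1 while active, otherwise a character offset) so that the example runs without a grammar.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa. $text_call is a code
# reference that behaves like the text method: it returns -1 if the parse
# is still active, or the offset within its argument where it was exhausted.
sub feed_chunks {
    my ( $text_call, @chunks ) = @_;
    my $consumed = 0;    # characters passed in earlier calls
    for my $chunk (@chunks) {
        my $fail_offset = $text_call->($chunk);
        return $consumed + $fail_offset if $fail_offset >= 0;
        $consumed += length $chunk;
    }
    return -1;
}

# Stand-in for a recognizer, for illustration only: it "exhausts" at the
# first character that is not a digit or one of the operators + * -
my $stub_text = sub {
    my ($string) = @_;
    return $string =~ /[^0-9+*-]/ ? $-[0] : -1;
};

my $overall = feed_chunks( $stub_text, '2-0', '*3x1' );
print "exhausted at overall offset $overall\n" if $overall >= 0;
```

With a real recognizer, the callback would presumably be a thin wrapper such as sub { $recce->text( $_[0] ) }.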
When you use the text method for input, all characters will be treated as one earleme in length. The first character of the first string argument will be at character offset 0, and will start at earleme 0 and end at earleme 1. For each subsequent character of the first string argument, the character offset will increase by one. In the first string argument, the character at character offset c will always be one earleme long and end at earleme c+1.
Within each call to text, every character increases the character offset, start earleme and end earleme by one. Each call to text resets the character offset number to 0, but does not reset the earleme numbering.
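The relationship between per-call character offsets and earlemes can be tabulated directly. This is an illustrative sketch in plain Perl; earleme_spans is a hypothetical helper that only reproduces the numbering rules described above and does no parsing.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa: lists, for every character
# passed in a series of text calls, its per-call offset and its earleme span.
sub earleme_spans {
    my @strings = @_;
    my $earleme = 0;    # earleme numbering is never reset
    my @spans;
    for my $call ( 0 .. $#strings ) {
        my $string = $strings[$call];

        # The character offset restarts at 0 for every call.
        for my $offset ( 0 .. length($string) - 1 ) {
            push @spans,
                {
                call   => $call,
                offset => $offset,
                char   => substr( $string, $offset, 1 ),
                start  => $earleme,
                end    => $earleme + 1,
                };
            $earleme++;
        }
    }
    return @spans;
}

for my $span ( earleme_spans( '2-0', '*3+1' ) ) {
    printf "call %d, offset %d, '%s': earleme %d to %d\n",
        @{$span}{qw(call offset char start end)};
}
```

In this run the '*' that opens the second string sits at character offset 0 of its call, but starts at earleme 3.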
Terminals are recognized in the text using the lexers that were specified in the porcelain or the plumbing. The earleme length of each token is set to the length of the token in characters. (If a token has a "lex prefix", the length of the lex prefix counts as part of the token length.)
Terminals cannot span calls to text. If a series of characters which otherwise would be recognized as a terminal by a lexer is split between two calls to text, that terminal will not be recognized.
earleme
my $a = $grammar->get_symbol("a");
$recce->earleme([$a, "a", 1]) or die("Parsing exhausted");
The earleme method adds zero or more tokens, then moves the current earleme forward by one earleme. Unlike text, the earleme method assumes no particular model of the input.
The earleme method takes zero or more arguments. Each argument represents a token which starts at the current earleme. More than one token may start at each earleme, because ambiguous lexing is allowed. There might be no tokens which start at the current earleme, in which case earleme can be called with no arguments.
Each token argument is a reference to a three element array. The first element is a "cookie" for the token's symbol, as returned by the Parse::Marpa::Grammar::get_symbol method or the get_symbol method of a porcelain interface. The second element is the token's value in the parse, and may be any value legal in Perl 5, including undefined. The third is the token's length in earlemes.
The earleme method first adds the tokens in the arguments. If, after all tokens have been added, the parse is still active, the default end of parsing is set to the current earleme. The current earleme is then advanced by one and the earleme method returns 1.
It is possible for a call to earleme, even one with no token arguments, to exhaust the recognizer. When this happens, the earleme method returns 0. The default end of parsing remains at the last earleme at which the parse was active. The earleme method throws an exception on other failures.
An earleme remains the current earleme during only one call of the earleme method. All tokens starting at that earleme must be added in that call. The first time that the earleme method is called in a recognizer, the current earleme is at earleme 0.
This is the low-level token input method, and allows maximum control over scanning. No model of the input, or of the relationship between the tokens and the earlemes, is assumed. The user is free to invent her own.
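Since every token that starts at an earleme must be passed in the single call for that earleme, a custom input model usually needs to group its tokens by start earleme first, with empty calls for earlemes at which nothing starts. A minimal sketch of that grouping follows. The group_by_earleme helper and the four-element input rows are assumptions made for this example, and plain strings stand in for the symbol cookies that get_symbol would return.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::Marpa. Each input row is
# [start_earleme, symbol_cookie, value, length]. The result is one array
# reference per earleme, holding the [cookie, value, length] triples that
# would be passed to the earleme call for that earleme.
sub group_by_earleme {
    my @rows = @_;
    my @calls;
    for my $row (@rows) {
        my ( $start, @triple ) = @{$row};
        push @{ $calls[$start] }, [@triple];
    }

    # Earlemes at which no token starts still need a call with no arguments.
    return map { $_ // [] } @calls;
}

my @calls = group_by_earleme(
    [ 0, 'A',  'a',  1 ],
    [ 0, 'AB', 'ab', 2 ],    # ambiguous: a second token at earleme 0
    [ 2, 'C',  'c',  1 ],
);
for my $earleme ( 0 .. $#calls ) {
    printf "earleme %d: %d token(s)\n", $earleme, scalar @{ $calls[$earleme] };
}
```

With a real recognizer and real cookies, each group would then be passed along the lines of $recce->earleme( @{$group} ), checking the return value for exhaustion.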
end_input
$recce->end_input();
This method takes no arguments. It is used with the earleme method in offline mode, to indicate the end of input. The input is processed out to the furthest earleme. The default end of parsing is set to the furthest earleme.
The current earleme will be advanced to one earleme past the furthest earleme. This will exhaust the recognizer, so that any further calls to earleme will cause an exception.
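The effect of end_input on the recognizer's counters can be followed with a toy model. The sketch below tracks only the current earleme, the furthest earleme and the default end of parsing, and it assumes the parse stays active on every call, which a real recognizer decides from the grammar. All names in it are hypothetical, and it is not how Parse::Marpa::Recognizer is implemented.

```perl
use strict;
use warnings;

# Toy bookkeeping model only; hypothetical names, not Parse::Marpa code.
sub new_model {
    return { current => 0, furthest => 0, default_end => 0, exhausted => 0 };
}

# Mimics the counters touched by earleme(): each argument is a token length.
# Assumes the parse is still active after the tokens are added.
sub model_earleme {
    my ( $model, @lengths ) = @_;
    die "recognizer exhausted\n" if $model->{exhausted};
    for my $length (@lengths) {
        my $end = $model->{current} + $length;
        $model->{furthest} = $end if $end > $model->{furthest};
    }
    $model->{default_end} = $model->{current};
    $model->{current}++;
    return 1;
}

# Mimics the counters touched by end_input().
sub model_end_input {
    my ($model) = @_;
    $model->{default_end} = $model->{furthest};
    $model->{current}     = $model->{furthest} + 1;
    $model->{exhausted}   = 1;
    return;
}

my $model = new_model();
model_earleme( $model, 3 );    # one token, three earlemes long, at earleme 0
model_end_input($model);
my $accepted = eval { model_earleme( $model, 1 ); 1 };
print "further input rejected\n" if not $accepted;
```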
SUPPORT
See the support section in the main module.
AUTHOR
Jeffrey Kegler
LICENSE AND COPYRIGHT
Copyright 2007 - 2008 Jeffrey Kegler
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.