NAME
Parse::Marpa::Recognizer - Marpa Recognizer Objects
SYNOPSIS
# One-character-per-earleme model, using the text method
my $recce = new Parse::Marpa::Recognizer( { grammar => $grammar } );
my $fail_offset = $recce->text( '2-0*3+1' );
if ( $fail_offset >= 0 ) {
    croak("Parse failed at offset $fail_offset");
}
# One-token-per-earleme model, using the earleme method
my $recce = new Parse::Marpa::Recognizer({grammar => $grammar});
my $op = $grammar->get_symbol('Op');
my $number = $grammar->get_symbol('Number');
# Each token is [ symbol cookie, value, length in earlemes ]
my @tokens = (
    [$number, 2, 1],
    [$op, q{-}, 1],
    [$number, 0, 1],
    [$op, q{*}, 1],
    [$number, 3, 1],
    [$op, q{+}, 1],
    [$number, 1, 1],
);
TOKEN: for my $token (@tokens) {
    next TOKEN if $recce->earleme($token);
    croak('Parsing exhausted at character: ', $token->[1]);
}
$recce->end_input();
DESCRIPTION
Marpa parsing takes place in three major phases: grammar creation, input recognition and parse evaluation. Once a grammar object has rules, a recognizer object can be created from it. The recognizer accepts input and can be used to create a Marpa evaluator object.
Tokens and Earlemes
Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Marpa tokens can be of various lengths. Marpa tokens can even overlap.
For most parsers, position is location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location. This flexibility is provided by tracking parse position in earlemes. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage.
If you do your lexing with the text method, you will use a one-character-per-earleme model. The raw input to the parse will be a string made up from the series of strings and string references provided as arguments in calls to text. Every character will be treated as being exactly one earleme in length.
Marpa is not restricted to the one-character-per-earleme model. With the earleme method, you can structure your input in almost any way you like. You can, for example, create a token stream and use a one-token-per-earleme model, and this would be equivalent to the way things are typically done in other parsers. Marpa also allows you to structure your input in special ways to suit particular applications.
There are three restrictions on mapping tokens to earlemes:
Scanning always starts at earleme 0.
Tokens must be scanned in earleme order. That is, all the tokens starting at earleme N must be scanned before any token starting at earleme N+1.
Tokens cannot be zero or negative in earleme length.
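The three restrictions can be checked mechanically. The following is a minimal sketch, not part of Marpa: the helper name check_token_layout and the representation of a token as a [ start earleme, length ] pair are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Marpa method. Each token is a
# [ $start_earleme, $length ] pair, listed in the order it would be scanned.
# Returns a description of the first violation, or undef if there is none.
sub check_token_layout {
    my (@tokens) = @_;
    my $previous_start = 0;    # scanning always starts at earleme 0
    for my $token (@tokens) {
        my ( $start, $length ) = @{$token};
        return 'token starts before earleme 0' if $start < 0;
        return 'tokens out of earleme order'   if $start < $previous_start;
        return 'token length must be positive' if $length <= 0;
        $previous_start = $start;
    }
    return;
}

# Two tokens at earleme 0, then one at earleme 1: a legal layout.
my $problem = check_token_layout( [ 0, 1 ], [ 0, 2 ], [ 1, 1 ] );
print defined $problem ? "$problem\n" : "layout is legal\n";
```

Note that several tokens may share a start earleme; only going backwards is a violation.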
Earleme number N, or earleme N, means the location N earlemes after earleme 0. Length in earlemes means what you expect it does. The length from earleme 3 to earleme 6, for instance, is 3 earlemes.
When a token is scanned, the start of the token is put at the current earleme. The end of the token is at earleme number c+l, where c is the location number of the current earleme, and l is the length of the token. The length of the token must be greater than zero.
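The arithmetic above is simple enough to write down directly. In this sketch the helper names token_end_earleme and earleme_length are hypothetical; they only restate the c+l rule and the length example from the text.

```perl
use strict;
use warnings;

# Hypothetical helper: the earleme at which a token ends, given the
# current earleme $c and the token's length $l in earlemes.
sub token_end_earleme {
    my ( $c, $l ) = @_;
    die "token length must be greater than zero\n" if $l <= 0;
    return $c + $l;
}

# Hypothetical helper: the length in earlemes between two locations.
sub earleme_length {
    my ( $start, $end ) = @_;
    return $end - $start;
}

# A token of length 3 scanned while the current earleme is 3 ends at 6.
print token_end_earleme( 3, 3 ), "\n";
# The length from earleme 3 to earleme 6 is 3 earlemes.
print earleme_length( 3, 6 ), "\n";
```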
The default end of parsing is tracked by each recognizer. The default end of parsing is at earleme 0 when the recognizer is created. It is incremented on calls to the text and earleme methods, as described below in the sections for those methods. When an evaluator object is created from a recognizer object, it inherits the recognizer object's default end of parsing.
Exhaustion
At the start of parsing, the furthest earleme is earleme 0. When a token is recognized, its end earleme is determined by adding the token length to the current earleme. If the new token's end earleme is after the furthest earleme, the furthest earleme is set at the new token's end earleme.
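As an illustration of how the furthest earleme moves, here is a toy model. It is a sketch only: the function furthest_earleme is hypothetical, tokens are reduced to [ current earleme, length ] pairs, and nothing here consults a grammar.

```perl
use strict;
use warnings;

# Toy model, not a Marpa method. Each token is a
# [ $current_earleme, $length ] pair, in the order it was recognized.
sub furthest_earleme {
    my (@tokens) = @_;
    my $furthest = 0;    # at the start of parsing
    for my $token (@tokens) {
        my ( $current, $length ) = @{$token};
        my $end = $current + $length;
        $furthest = $end if $end > $furthest;
    }
    return $furthest;
}

# A 5-earleme token at earleme 0 overlaps two shorter tokens that start
# later; the shorter tokens end before earleme 5 and do not move it.
print furthest_earleme( [ 0, 5 ], [ 1, 1 ], [ 2, 2 ] ), "\n";
```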
No more successful parses are possible when Marpa reaches an empty Earley set which is either immediately before the furthest earleme, or which is anywhere after the furthest earleme. At this point, the recognizer is said to be exhausted. A recognizer is active if and only if it is not exhausted.
When the recognizer is exhausted or active, I sometimes say, more loosely, that the parser is exhausted (active), or that parsing is exhausted (active). In context of a particular parse being worked on, we can also speak of a parse being exhausted or active. Remember, however, that an exhausted recognizer can contain successful parses prior to the current earleme. In fact, successful parsing always leaves the recognizer exhausted.
Because tokens can be more than one earleme in length, parses in Marpa can remain active even if no token is found at the current earleme. In the one-character-per-earleme model, stretches where no token either begins or ends can be many earlemes in length.
Cloning
The new constructor requires a grammar to be specified in one of its arguments. By default, the new constructor clones the grammar object. This is done so that recognizers do not interfere with each other by modifying the same data. Cloning is the default behavior, and is always safe.
While safe, cloning does impose an overhead in memory and time. This can be avoided by setting the clone option of the new constructor to 0. Not cloning is safe if you know that the grammar object will not be shared by another recognizer or by more than one evaluator.
It is very common for a Marpa program to have a simple structure, where no more than one recognizer is created from any grammar, and no more than one evaluator is created from any recognizer. When this is the case, cloning is unnecessary.
METHODS
new
my $recce = new Parse::Marpa::Recognizer({
    grammar => $grammar,
    lex_preamble => $new_lex_preamble,
});
The new method's one, required, argument is a hash reference of named arguments. The new method either returns a new recognizer object or throws an exception. Either the stringified_grammar or the grammar named argument must be specified, but not both. A recognizer is created with the current earleme and the default end of parsing both set at earleme 0.
If the grammar option is specified, its value must be a grammar object with rules defined. By default, the grammar is cloned for use in the recognizer.
If the stringified_grammar option is specified, its value must be a Perl 5 string containing a stringified Marpa grammar, as produced by Parse::Marpa::Grammar::stringify. It will be unstringified for use in the recognizer. When the stringified_grammar option is specified, the resulting grammar is never cloned, regardless of the setting of the clone argument.
If the clone
argument is set to 1, and the grammar argument is not in stringified form, new
clones the grammar object. This prevents that multiple evaluators from interfering with each other's data. This is the default and is always safe. If clone
is set to 0, the evaluator will work directly with the grammar object which was its argument. See above for more detail.
Marpa options can also be named arguments to new. For these, see Parse::Marpa::Doc::Options.
text
my $fail_offset = $recce->text( '2-0*3+1' );
if ( $fail_offset >= 0 ) {
    croak("Parse failed at offset $fail_offset");
}
Extends the parse using the one-character-per-earleme model. The one, required, argument must be a string or a reference to a string which contains text to be parsed. If the parse is active after the text has been processed, the default end of parsing is set to the end of the text, the current earleme is set to the earleme just after the end of text, and -1 is returned.
If the recognizer is exhausted by the input, the character offset at which parsing was exhausted is returned. The character offset is the offset within the string which is the current argument. This offset is not necessarily the offset within the entire raw input. A zero return means that parsing was exhausted at character offset zero. The default end of parsing remains at the last earleme at which the parse was active. Failures, other than exhausted recognizers, are thrown as exceptions.
When you use the text method for input, all characters will be treated as one earleme in length. The first character of the first string argument will be at character offset 0, and will start at earleme 0 and end at earleme 1. For each subsequent character of the first string argument, the character offset will increase by one. In the first string argument, the character at character offset c will always be one earleme long and end at earleme c+1.
Within each call to text, every character increases the character offset, start earleme and end earleme by one. Each call to text resets the character offset number to 0, but does not reset the earleme numbering.
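Because the character offset restarts at 0 on every call while the earleme numbering continues, a caller who wants a position within the whole raw input has to add up the lengths of the earlier arguments. The sketch below shows that arithmetic. The helper raw_input_offset is hypothetical, and it assumes that all input was supplied through text, as plain strings, and that every earlier call consumed its whole argument.

```perl
use strict;
use warnings;

# Hypothetical helper, not a Marpa method.
# $chunks     : array ref of the strings passed to text, in call order
# $call_index : zero-based index of the call being examined
# $offset     : character offset within that call's argument
# Returns the offset within the whole raw input. Under the
# one-character-per-earleme model this is also the character's start earleme.
sub raw_input_offset {
    my ( $chunks, $call_index, $offset ) = @_;
    my $base = 0;
    $base += length $chunks->[$_] for 0 .. $call_index - 1;
    return $base + $offset;
}

my @chunks = ( '2-0', '*3+1' );

# Character offset 1 of the second call is the '3'.
print raw_input_offset( \@chunks, 1, 1 ), "\n";
```

The same sum can be applied to a non-negative return value from text to report a failure position relative to the entire input rather than to the current argument.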
Terminals are recognized in the text using the lexers that were specified in the porcelain or the plumbing. The earleme length of each token is set to the length of the token in characters. (If a token has a "lex prefix", the length of the lex prefix counts as part of the token length.)
Terminals cannot span calls to text. If a series of characters which otherwise would be recognized as a terminal by a lexer is split between two calls to text, that terminal will not be recognized.
earleme
my $a = $grammar->get_symbol('a');
$recce->earleme([$a, 'a', 1]) or croak('Parsing exhausted');
The earleme method adds zero or more tokens, then moves the current earleme forward by one earleme. Unlike text, the earleme method assumes no particular model of the input.
The earleme method takes zero or more arguments. Each argument represents a token which starts at the current earleme. More than one token may start at each earleme, because ambiguous lexing is allowed. There might be no tokens which start at the current earleme, in which case earleme can be called with no arguments.
Each token argument is a reference to a three element array. The first element is a "cookie" for the token's symbol, as returned by the Parse::Marpa::Grammar::get_symbol method or the get_symbol method of a porcelain interface. The second element is the token's value in the parse, and may be any value legal in Perl 5, including undefined. The third is the token's length in earlemes.
The earleme method first adds the tokens in the arguments. If, after all tokens have been added, the parse is still active, the default end of parsing is set to the current earleme. The current earleme is then advanced by one and the earleme method returns 1.
It is possible for a call to earleme with no token arguments to exhaust the recognizer. When this happens, the earleme method returns 0. The default end of parsing remains at the last earleme at which the parse was active. The earleme method throws an exception on other failures.
An earleme remains the current earleme during only one call of the earleme method. All tokens starting at that earleme must be added in that call. The first time that the earleme method is called in a recognizer, the current earleme is at earleme 0.
This is the low-level token input method, and allows maximum control over scanning. No model of the input, or of the relationship between the tokens and the earlemes, is assumed. The user is free to invent her own.
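One possible input model is one token per earleme, as in the synopsis. The sketch below builds the three-element token arrays for that model from a string. It is an illustration under stated assumptions: the function tokens_for is hypothetical, and the cookies are placeholder strings standing in for the values that get_symbol would return from a real grammar.

```perl
use strict;
use warnings;

# Placeholder cookies. In a real program these would come from
# $grammar->get_symbol('Number') and $grammar->get_symbol('Op').
my %cookie = ( Number => 'Number-cookie', Op => 'Op-cookie' );

# Hypothetical helper: split an arithmetic string into
# [ $cookie, $value, $length ] arrays, one token per earleme, so that
# every token has an earleme length of 1 however many characters it has.
# Scanning stops silently at the first character it does not recognize.
sub tokens_for {
    my ($input) = @_;
    my @tokens;
    while ( $input =~ m{ \G \s* (?: (\d+) | ([-+*]) ) }gcx ) {
        push @tokens,
            defined $1
            ? [ $cookie{Number}, $1, 1 ]
            : [ $cookie{Op}, $2, 1 ];
    }
    return @tokens;
}

my @tokens = tokens_for('2-0*3+1');
printf "%d tokens\n", scalar @tokens;
```

Each array would then be passed to earleme in its own call, since in this model exactly one token starts at each earleme.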
end_input
$recce->end_input();
Used to indicate the end of input. end_input takes no arguments. end_input processes the input out to the furthest earleme; sets the default end of parsing to the furthest earleme; and advances the current earleme to one earleme past the furthest earleme. If it does not throw an exception, end_input returns a true value.
Since positioning the current earleme past the furthest earleme leaves the recognizer exhausted, any further calls to text or earleme will throw an exception. end_input itself is idempotent. If called more than once, on subsequent calls, end_input will do nothing, successfully.
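The position bookkeeping described for end_input can be modeled in a few lines. This is a toy sketch only: the function end_input_positions and the hash keys are hypothetical, and the model knows nothing about grammars or Earley sets.

```perl
use strict;
use warnings;

# Toy model of the positions that end_input updates, not Marpa's code.
# $state is a hash ref with keys current, furthest, default_end and ended.
sub end_input_positions {
    my ($state) = @_;
    return 1 if $state->{ended};    # later calls do nothing, successfully
    $state->{default_end} = $state->{furthest};
    $state->{current}     = $state->{furthest} + 1;
    $state->{ended}       = 1;
    return 1;
}

my %state = ( current => 3, furthest => 5, default_end => 2, ended => 0 );
end_input_positions( \%state );
end_input_positions( \%state );     # a second call changes nothing
print "default end $state{default_end}, current $state{current}\n";
```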
stringify
my $stringified_recce = $recce->stringify();
The stringify method takes as its single argument a recognizer object and converts it into a string. It returns a reference to the string. The string is created using Data::Dumper. On failure, stringify throws an exception.
unstringify
$recce = Parse::Marpa::Recognizer::unstringify($stringified_recce, $trace_fh);
$recce = Parse::Marpa::Recognizer::unstringify($stringified_recce);
The unstringify static method takes a reference to a stringified recognizer as its first argument. Its second, optional, argument is a file handle. The file handle argument will be used both as the unstringified recognizer's trace file handle, and for any trace messages produced by unstringify itself. unstringify returns the unstringified recognizer object unless it throws an exception.
If the trace file handle argument is omitted, it defaults to STDERR and the unstringified recognizer's trace file handle reverts to the default for a new recognizer, which is also STDERR. The trace file handle argument is necessary because in the course of stringifying, the recognizer's original trace file handle may have been lost.
clone
my $cloned_recce = $recce->clone();
The clone method creates a usable copy of a recognizer object. It returns a successfully cloned recognizer object, or throws an exception.
SUPPORT
See the support section in the main module.
AUTHOR
Jeffrey Kegler
LICENSE AND COPYRIGHT
Copyright 2007 - 2008 Jeffrey Kegler
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.