NAME

Marpa::PP::Advanced::Models - Other Input Models

ABOUT THIS DOCUMENT

This document is under construction. As of this release, it is not 100% complete.

The alternative input models described in this document are an advanced technique. If you are starting out with Marpa, you probably want to ignore this document. If you are an experienced Marpa user, it is still safe to ignore this document, but you might find the possibilities it discusses interesting.

DESCRIPTION

What is an Alternative Input Model?

An alternative input model is anything that is not the default, token-stream model. More helpfully, Marpa allows variable-length tokens and ambiguous tokens, and an alternative input model is any input model which

  • Allows a token whose length is not exactly 1, or

  • Allows locations which have more than one token.

To do either of these things, a user must use the recognizer's alternative method. In other words, if you are not using the recognizer's alternative method call, you are not using an alternative input method.

Many concepts, such as parsing location, parse exhaustion, and the end of parsing, are somewhat more complicated when alternative input models are involved. These concepts were explained in the main document for the recognizer on the assumption that the default input model was in use. This document revises those explanations as necessary to allow for the alternative input models.

Token Streams

Marpa's default input model is the traditional one -- a token stream. Token streams are very standard in parsing applications -- so much so that most texts do not take the trouble of defining the term. A token stream is input structured as a sequence of tokens, where each token occupies one location and every location has a token. In the token stream model, all tokens are of the same length.

Conventionally, all tokens are of length 1, and the token stream starts at location 0. Following this convention, the Nth token would start at location N-1 and end at location N. For example, the first token would start at location 0 and end at location 1. The second token would start at location 1 and end at location 2.

Earlemes

For most parsers, position is location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location.

Marpa's tracks position in terms of earlemes. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage. Internally, each earleme corresponds exactly and directly to an Earley set. Every token has a start earleme and an end earleme.

The token stream model may also be called the token-per-earleme model. In the token stream model, token location and earleme location directly correspond on an one-to-one basis. If a user's application uses the token stream model, he can ignore earlemes, and think entirely in terms of tokens and position in a token stream.

The Furthest Earleme

The furthest earleme is the last earleme at which a token ends. In the default input model, the furthest earleme is not an important concept. When input is performed using the recognizer's read method, the furthest earleme and the current earleme are always the same.

In alternative input models, tokens may be longer than 1 earleme, and the furthest earleme and the current earleme may be far apart. This becomes an issue when parsing is finished. In alternative input models, calling the recognizer's earleme_complete method at the end of input is not enough to ensure that all input has been processed. The application must also call the recognizer's end_input method to ensure that processing of input catches up to the furthest earleme.

METHODS

alternative

defined $recce->alternative( 'a', 42, 1 )
    or return 'First alternative failed';

The alternative method is the most general method for reading input, and is therefore used in alternative input models. It takes three arguments, only the first of which is required.

The first two arguments are similar to those of the recognizer's read method: the token type and the token value. As with the read method, the token value is optional and, if omitted, defaults to a Perl undef.

The third argument to the alternative method is a token length. If omitted, token length defaults to 1, which is the correct value for the token stream model. Its value can be any integer greater than zero. Marpa does not allow zero length tokens in any input model.

Unlike the read method, the alternative method does not advance the current earleme on each call. This allows the application to read several tokens at the same earleme. This is how ambiguous input is created. To advance the current earleme when input is read using the alternative method, the complete_earleme method must be called.

On success, alternative returns the current earleme. On failures, other than the rejection of a token in interactive mode, alternative throws an exception.

If the token is rejected, alternative returns a Perl undef. The user checking for token rejection must be careful to test that the return value is undefined, and not merely a Perl false. When a token is accepted at earleme 0, the alternative method returns 0.

earleme_complete

$recce->earleme_complete();

Processes all tokens at the current earleme and advances the current earleme by 1. Returns the number of terminals acceptable at the new current earleme. Note that, in alternative input models, a return value of 0 does not necessarily indicate an exhausted parser.

When reading input using the alternative method, earleme_complete is used to move forward in the input stream. All tokens read using the alternative method are read at the same location -- the current earleme. A earleme_complete call causes the tokens read using the alternative method to be processed, and the current earleme advanced.

earleme_complete may be called even if the alternative method has been not called since the last call to earleme_complete. This will create an earleme with no tokens.

exhausted

$recce->exhausted() and die 'Recognizer exhausted';

The exhausted method returns a Perl true if parsing in a recognizer is exhausted, and a Perl false otherwise. Parsing is exhausted when the recognizer will not accept any further input. This method is seldom used. For most applications, the return values of the other methods provide sufficient information about the parse status, and there is no need to specifically test for parse exhaustion.

In the default input model, a recognizer was exhausted if the read method returned 0, indicating that no tokens would be accepted at the new current earleme. In alternative input models, earleme_complete will always return 0 when the parser is exhausted, but the converse is not always true: earleme_complete may return 0 even in some cases when the parser is not exhausted.

When earleme_complete returns 0, no input will be accepted at the current earleme. But in input models which allow tokens longer than 1, it is possible for input to be accepted at earlemes after the current earleme, even if no input is accepted at the current earleme.

end_input

$recce->end_input();

Indicates that input is finished. Calling end_input is not necessary or useful in the default input model, because in the default input model no token has a length greater than 1.

The end_input method takes no arguments. The end_input method returns a Perl true value on success. On failure, it throws an exception.

In alternative input models, calling the earleme_complete method once input is finished does not ensure that all input has been processed. The earleme_complete method completes the current earleme, but in alternative models, tokens may extend well past the current earleme. The end_input method ensures that all input is processed.

Calling end_input multiple times on the same recognizer object is harmless, but useless. The second and subsequent calls will return a Perl true, but will have no effect.

ALTERNATIVE MODELS AND THE read METHOD

A recognizer can mix calls to its read method with calls to its alternative method. The read method has the same effect as a single call to the alternative method, followed immediately by a call of the earleme_complete method. More precisely, the read method is equivalent to

    sub Marpa::PP::Recognizer::read {
	my $recce = shift;
	return defined $recce->alternative(@_) ? $recce->earleme_complete() : undef;
    }

AMBIGUOUS LEXING

Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Ambiguous tokens can be of various lengths. Tokens can also overlap.

Potentially ambiguous lexing occurs when more than one token starts at a single earleme. When potentially ambiguous lexing occurs, it becomes possible for there to be more than one sequence of tokens.

An actual ambiguity only occurs if more than one of the potential token sequences is consistent with the grammar. If there is no actual ambiguity, Marpa will use the only token choice that is consistent with the grammar.

When lexing is actually ambiguous, Marpa will use all the alternatives consistent with the grammar. When the lexing in a parse is actually ambiguous, the parse will be ambiguous. From the point of view of Marpa's semantics, ambiguity caused by lexing is exactly the same as ambiguity caused by an ambiguous grammar.

In the standard terminology, if a grammar produces more than one parse result, then that grammar must be ambiguous. In Marpa this is not strictly true. In Marpa, if the input is ambiguous, even an unambiguous grammar can produce than one parse.

DUPLICATE TOKENS

A duplicate token is a token of the same type and the same length as another that was read at the same earleme. Duplicate tokens are impossible in the default, token-stream, model. This is because in the token-stream model only one token can be read at each earleme.

In alternative models, more than one token may be read at an earleme, and duplicates are possible. Marpa detects duplicate tokens and treats them as "hard errors" -- Marpa throws an exception when it sees a duplicate token. Marpa's assumption is that duplicate tokens indicate an error at the application level.

An application can retry input after a duplicate token, if it catches the exception. In the future, if recovery from duplicate tokens is found to be a useful technique, Marpa may provide an option to change its behavior, so that a soft failure is returned when there is a duplicate token.

EARLEMES: THE DETAILS

While scanning, Marpa keeps track of the current earleme. Earlemes in a parse start at earleme 0 and increase numerically. The earleme immediately following earleme 0 is earleme 1, the earleme immediately following earleme 1 is earleme 2, and so on. The earleme immediately following earleme N is always earleme N+1.

Distance in the earleme stream is what you would intuitively expect it to be. The distance between earleme X and earleme Y is the absolute value of the difference between X and Y, |X-Y|. The distance from earleme 3 to earleme 6, for example, is 3 earlemes.

Whenever a token is given to Marpa to be scanned, it starts at the current earleme. In addition to the type and value of the token, Marpa must be told token's length in earlemes. The length of a Marpa token must be greater than zero.

This earleme length will become the distance from the start of the token to the end of the token. If the length of the token is L, and the number of the current earleme is C, the end of the token will be at the earleme whose number is C+L.

THE CHARACTER-PER-EARLEME MODEL

Many different models of the relationship between tokens and earlemes are possible, but two are particularly important. One is the one-token-per-earleme model, which is the default, and which has already been described. The other is the one-character-per-earleme model.

In the one-character-per-earleme model, every character will be treated as being exactly one earleme in length. If a token is more than one character in length, that token will span earlemes. When the lexing is ambiguous, tokens may overlap.

When a one-character-per-earleme model of input is used, there may be many earlemes at which no tokens start. For example, in a straightforward character-per-earleme implementation of a grammar for a language which allows comments, no tokens will start at any earlemes which corresponds to character locations inside a comment.

OTHER INPUT MODELS

While only the token-per-earleme and character-per-earleme models have seen any real use, other models are certainly possible. Using earlemes, you can structure your input in almost any way you like.

There are only three restrictions:

  1. Scanning always starts at earleme 0.

  2. All tokens starting at earleme N must be scanned before any tokens starting at earleme N+1. In other words, the tokens must be scanned in non-decreasing order by start earleme.

  3. Every token must have a length, in earlemes, which is greater than zero. In other words, token length can never be zero or negative.

COPYRIGHT AND LICENSE

Copyright 2011 Jeffrey Kegler
This file is part of Marpa::PP.  Marpa::PP is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.

Marpa::PP is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser
General Public License along with Marpa::PP.  If not, see
http://www.gnu.org/licenses/.