NAME
Marpa::XS::Advanced::Models - Other Input Models
ABOUT THIS DOCUMENT
The alternative input models desribed in this document are an advanced technique. If you are starting out with Marpa, you probably want to ignore this document. If you are an advanced Marpa user, it is still safe to ignore this document, but you might find the possibilities it discusses interesting.
DESCRIPTION
Token Streams
Marpa's default input model is the traditional one -- a token stream. Token streams are very standard in parsing applications -- so much so that most texts do not take the trouble of defining the term. A token stream is input structured as a sequence of tokens, where each token occupies one location and every location has a token. In the token stream model, all tokens are of the same length.
Conventionally, all tokens are of length 1, and the token stream starts at location 0. Following this convention, the Nth token would start at location N-1 and end at location N. For example, the first token would start at location 0 and end at location 1. The second token would start at location 1 and end at location 2.
Earlemes
For most parsers, position is location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location.
Marpa's tracks position in terms of earlemes. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage. Internally, each earleme corresponds exactly and directly to an Earley set. Every token has a start earleme and an end earleme.
The token stream model may also be called the token-per-earleme model. In the token stream model, token location and earleme location directly correspond on an one-to-one basis. That means that, in the token stream model, earleme location and token location are essentially equivalent. If a user's application uses the token stream model, he can ignore earlemes, and think entirely in terms of tokens and position in a token stream.
It can be useful to have the structure of the input relate to earleme location in other ways. The purpose of this document is to discuss alternatives to the token stream model.
INTERFACE
Alternative models are implemented using the optional third and fourth parameters of the token descriptors. Token descriptors are used in the arguments to the Marpa Recognizer's tokens method.
Token Length
Token length is the optional third element of the token descriptor. By default, it is 1, which is the correct value for the token stream model. Its value can be any integer greater than zero. Marpa does not allow zero length tokens in any input model.
Token Offset
The token offset is the fourth element of each token's descriptor. Its value is taken as the offset to be added to the current earleme location. The current earleme location is the earleme which will be the start earleme of the next token to be added. When parsing begins, the current earleme location is earleme 0.
Negative token offsets are not allowed. Zero token offsets are allowed except for the last token descriptor of a tokens() call. A zero token offset will cause the next token to start at the same location. Multiple tokens starting at the same location will cause the lexing to be ambiguous. Marpa supports ambiguous lexing.
When token offset is left undefined, it defaults to 1. This is the correct value for the token stream model of input.
The intuitive expectation seems to be that every tokens() call will advance the current earleme, with the advance taking effect after the last token descriptor in the tokens() call. Because of this, the token offset of the last token descriptor in every tokens() call is treated as a special case. The token offset of the last token descriptor cannot be zero. Even when a token offset of zero is explicitly specified for the last token descriptor of a tokens() call, it is silently changed to 1.
When laid out explicitly, this behavior is seen as non-orthogonal, but in practice it seems to adhere to the principle of least surprise. In ambiguous lexing, the typical loop makes one pass for each earleme, and gathers all the tokens that can start at that earleme into a list. A programmer first writing a loop like this will usually give all the tokens a token offset of 0, intuitively expecting that Marpa will understand that his intention is to advance the current earleme after the tokens() call is complete.
Silently ignoring a final token offset of 0 eliminates the need for some ugly special case logic, It is what the programmer intuitively expects. Therefore it is what Marpa does.
AMBIGUOUS LEXING
Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Marpa tokens can be of various lengths. Marpa tokens can even overlap.
Potentially ambiguous lexing occurs when more than one token starts at a single earleme. When more than one token starts at a single earleme, the parser cannot use more than one of them, and must choose.
When ambiguous lexing occurs, several different sequences of tokens are possible. An actual ambiguity only occurs if more than one of the potential token sequences is consistent with the grammar. If there is no actual ambiguity, Marpa will use the only token choice which is consistent with the grammar.
When lexing is actually ambiguous, Marpa will use all the alternatives consistent with the grammar. When the lexing in a parse is actually ambiguous, the parse will be ambiguous. From the point of view of Marpa's semantics, ambiguity caused by lexing is exactly the same as ambiguity caused by an ambiguous grammar.
In the standard terminology, a grammar is ambiguous if it can produce more than one parse result, and vice versa. In Marpa this is not strictly true. In Marpa, an unambiguous grammar will produce than one parse, if the lexing is actually ambiguous.
EARLEMES: THE DETAILS
While scanning, Marpa keeps track of the current earleme. Earlemes in a parse start at earleme 0 and increase numerically. The earleme immediately following earleme 0 is earleme 1, the earleme immediately following earleme 1 is earleme 2, and so on. The earleme immediately following earleme N is always earleme N+1.
Distance in the earleme stream is what you would intuitively expect it to be. The distance between earleme X and earleme Y is the absolute value of the difference between X and Y, |X-Y|. The distance from earleme 3 to earleme 6, for example, is 3 earlemes.
Whenever a token is given to Marpa to be scanned, it starts at the current earleme. In addition to the type and value of the token, Marpa must be told token's length in earlemes. The length of a Marpa token must be greater than zero. This earleme length will become the distance from the start of the token to the end of the token.
The start of the token is put at the current earleme. If the length of the token is L, and the number of the current earleme is C, the end of the token will be at the earleme number C+L.
THE CHARACTER-PER-EARLEME MODEL
Many different models of the relationship between tokens and earlemes are possible, but two are particularly important. One is the one-token-per-earleme model, which has already been described. The other is the one-character-per-earleme model.
In the one-character-per-earleme model, every character will be treated as being exactly one earleme in length. If a token is more than one character in length, that token will span earlemes. When the lexing is ambiguous, tokens may overlap.
When a one-character-per-earleme model of input is used, there may be many earlemes at which no tokens start. In a standard implementation of a grammar for a language which allows comments, for example, no tokens will start at any earlemes which corresponds to character locations inside a comment.
OTHER INPUT MODELS
While only the token-per-earleme and character-per-earleme models has seen any serious use, other models are certainly possible. Using earlemes, you can structure your input in almost any way you like.
There are only three restrictions:
Scanning always starts at earleme 0.
All tokens starting at earleme N must be scanned before any tokens starting at earleme N+1. In other words, the tokens must be scanned in non-decreasing order by start earleme.
Every token must have a length, in earlemes, which is greater than zero. In other words, token length can never be zero or negative.
Marpa parsing can remain active even if no token is found at the current earleme. The current earleme might fall in the middle of a previously recognized token. In this case, and if the input remains consistent with the grammar, parsing will remain active at least until the end earleme of that token is reached.
In an alternative model, stretches where no token either starts or ends can be many earlemes in length. This is in fact often the case for one of the models which have already seen practical use -- the one-character-per-earleme model.
COPYRIGHT AND LICENSE
Copyright 2010 Jeffrey Kegler
This file is part of Marpa::XS. Marpa::XS is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::XS is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::XS. If not, see
http://www.gnu.org/licenses/.