NAME
Marpa::PP::Parse_Terms - Standard Parsing Terms used in the Marpa Documents
DESCRIPTION
This document is intended as a reminder of the standard vocabulary of parsing. I put defining uses of terms in boldface, for easy skimming. A reader who feels comfortable with parsing terminology can skip this document entirely. A reader completely new to parsing will find this document too terse and should look elsewhere first.
As an introduction, I recommend Mark Jason Dominus's excellent chapter on parsing in the Perl context. It's available on-line. Wikipedia is also an excellent place to start.
All the definitions here are consistent with at least some of the textbook definitions, and are in that sense standard. But no effort is made to cover the full range of standard meaning, or even to give the most common meaning. The focus is on the meaning of the terms as used in the Marpa documentation.
Basic terms
A grammar is a set of rules. The rules describe a set of strings of symbols. A string of symbols is often called a symbol string. The rules of a grammar are often called productions.
Stages of Parsing
A recognizer is a program that determines whether its input is one of the symbol strings in the set described by the rules of a grammar. A parser is a program which finds the structure of the input according to the rules of a grammar.
The term parsing is used in a strict and a loose sense. Parsing in the loose sense means all the phases of finding a grammar's structure, including a separate recognition phase if the parser has one. (Marpa does.) If a parser has phases, parsing in the strict sense refers specifically to the phase that finds the structure of the input. When the Marpa documents use the term parsing in its strict sense, they will speak explicitly of "parsing in the strict sense". Otherwise, parsing will mean parsing in the loose sense.
Parsers often use a lexical analyzer to convert raw input, usually input text, into a token stream, which is a series of tokens. Each token represents a symbol of the grammar and has a value. A lexical analyzer is often called a lexer or a scanner, and lexical analysis is often called lexing or scanning.
The series of symbols represented by the series of tokens becomes the symbol string input seen by the recognizer. The symbol string input is also called the input sentence.
By default, Marpa uses the token stream model of input. Marpa also allows alternative input models. These are new to Marpa, so that their terminology is of necessity non-standard. The terminology for alternative input models is the document which describes the models themselves.
Productions
A standard way of describing rules is Backus-Naur Form, or BNF. In one common way of writing BNF, a production looks like this.
Expression ::= Term Factor
In the production above, Expression
, Term
and Factor
are symbols. A production consists of a left hand side and a right hand side. In a context-free grammar, like those Marpa parses, the left hand side of a production is always a symbol string of length 1. The right hand side of a production is a symbol string of zero or more symbols. In the example, Expression
is the left hand side, and Term
and Factor
are right hand side symbols.
Left hand side and right hand side are often abbreviated as RHS and LHS. If the RHS of a production has no symbols, the production is called an empty production or an empty rule.
Any symbol which is allowed to occur in the symbol string input is called a terminal symbol. If the symbols in a symbol string are all terminals, that symbol string is also called a sentence.
Derivations
A step of a derivation, or derivation step, is made by taking a symbol string and any production in the grammar whose LHS occurs in that symbol string, and replacing any occurrence of the LHS symbol in the symbol string with the RHS of the production. For example, if A
, B
, C
, D
, and X
are symbols, and
X ::= B C
is a production, then
A X D -> A B C D
is a derivation step, with "A X D
" as its beginning and "A B C D
" as its end or result. We say that the symbol string "A X D
" derives the symbol string "A B C D
".
A derivation is a sequence of derivation steps. The length of a derivation is its length in steps.
We say that a first symbol string directly derives a second symbol string if and only if there is a derivation of length 1 from the first symbol string to the second symbol string.
Every symbol string is said to derive itself in a derivation of length 0. Such a zero length derivation is a trivial derivation.
If a derivation is not trivial or direct, that is, if it has more than one step, then it is an indirect derivation.
A derivation which is not trivial (that is, a derivation which has one or more steps) is a non-trivial derivation.
Technically, a symbol X
and a string that consists of only that symbol are two different things. But we often say "the symbol X
" as shorthand for "the string whose only symbol is X
". For example, if the string containing only the symbol X
derives a string Y
, we will usually say simply that "X
derives Y
".
Derivations are often described as productions or symbol matches. Wherever symbol or string X
derives Y
, we may also say X
produces Y
. Wherever symbol or string X
derives Y
, we may also say that Y
matches X
or that X
matches Y
. It is particularly common to say that X
matches Y
when Y
is a sentence.
In any parse, one symbol is distinguished as the start symbol. The parse of an input is successful if and only if the start symbol produces the input sentence according to the grammar. The set of all input sentences that a grammar and a start symbol will successfully parse is their language. Traditionally, the start symbol is part of a grammar's definition. If we regard the start symbol as fixed, language of a grammar and its start symbol is the language of the grammar.
Nulling
The length of a symbol string is the number of symbols in it. The zero length symbol string is called the empty string. The empty string can be considered to be a sentence, in which case it is the empty sentence. A string of one or more symbols is non-empty. A derivation which produces the empty string is a null derivation. A derivation from the start symbol which produces the empty string is a null parse.
If in a particular grammar, a symbol has a null derivation, it is a nullable symbol. If, in a particular grammar, the only sentence produced by a symbol is the empty sentence, it is a nulling symbol. All nulling symbols are nullable symbols.
If a symbol is not nullable, it is non-nullable. If a symbol is not nulling, it is non-nulling. In any instance where a symbol produces the empty string, it is said to be nulled, or to be a null symbol.
Useless Rules
If any derivation from the start symbol uses a rule, that rule is called reachable or accessible. A rule that is not accessible is called unreachable or inaccessible. If any derivation which results in a sentence uses a rule, that rule is said to be productive. A rule that is not productive is called unproductive. For example, a rule is unproductive unless every symbol on its RHS either is a terminal or is the LHS of some other rule. A rule which is inaccessible or unproductive is called a useless rule. Marpa can handle grammars with useless rules.
A symbol is reachable if it appears in a reachable production. A symbol is productive if it appears on the LHS of a productive rule, or if it is a nullable symbol. If a symbol is not reachable or not accessible, it is unreachable or inaccessible. If a symbol is not productive, it is unproductive. A symbol which is inaccessible or unproductive is called a useless symbol. Marpa can handle grammars with useless symbols.
Recursion and Cycles
If any symbol in the grammar non-trivially produces a symbol string containing itself, the grammar is said to be recursive. If any symbol non-trivially produces a symbol string with itself on the left, the grammar is said to be left-recursive. If any symbol non-trivially produces a symbol string with itself on the right, the grammar is said to be right-recursive. Marpa can handle all recursive grammars, including grammars which are left-recursive, grammars which are right-recursive, and grammars which contain both left- and right-recursion.
A cycle is a non-trivial derivation of a string of symbols from itself. A grammar which is considered cycle-free if it is not possible for any parse using that grammar to contain a cycle.
Traditionally, in fact, almost universally, only cycle-free grammars are considered useful. This is because a cycle can be repeated over and over again, even in the derivation of a finite input sentence. This makes a cycle the parsing equivalent of an infinite loop.
A grammar which contains a cycle is also called infinitely ambiguous. Marpa can handle cycles, and can produce parses for infinitely ambiguous grammars.
Parse Structure
The structure of a parse can be represented as a series of derivation steps from the start symbol to the input. Another way to represent structure is as a parse tree. Every symbol used in the parse is represented by a node of the parse tree. Wherever a production is used in the parse, its LHS is represented by a parent node and the RHS symbols are represented by child nodes. The start symbol is the root of the tree. The node at the root of the tree is called the start node.
Traditionally, a symbol is either a terminal symbol or it is not; and a terminal symbol must always be used as a terminal symbol. Marpa is non-traditional in allowing terminal symbols to be used as non-terminals. In Marpa, a symbol is a terminal symbol if it is one that can be used as a terminal. In Marpa, by default, all symbols are terminal symbols.
Traditionally, and in Marpa as well, every node is either a inner node or a leaf node. In Marpa, leaf nodes are of two kinds:
Nodes for nulled symbols. In Marpa, nulled symbols are symbol on the left hand side of empty productions. A node for a nulled symbol is called a nulled node.
Nodes for symbols being used as terminals.
Ambiguity
Marpa allows ambiguous grammars. Traditionally we say that a parse is ambiguous if, for a given grammar and a given input, more than one derivation tree is possible. However, Marpa allows ambiguous input tokens, which the traditional definition does not take into account. If Marpa used the traditional definition, all grammars would be ambiguous except those grammars which allowed only the null parse.
It is easiest if the Marpa definition and the traditional definition were extensionally equivalent --- that is, if Marpa's set of ambiguous grammars was exactly the same as the set of traditionally ambiguous grammars. This can be accomplished by using a slightly altered definition. In the Marpa context, an ambiguous grammar is a grammar that, for some unambiguous input, will produce more than one parse result.
Semantics
In real life, the structure of a parse is usually a means to an end. Grammars usually have a semantics associated with them, and what the user actually wants is the value of the parse according to the semantics.
The tree representation is especially useful when evaluating a parse. In the traditional method of evaluating a parse tree, every node which represents a terminal symbol has a value associated with it on input. Non-null inner nodes take their semantics from the production whose LHS they represent. Nulled nodes are dealt with as special cases.
The semantics for a production describe how to calculate the value of the node which represents the LHS (the parent node) from the values of zero or more of the nodes which represent the RHS symbols (child nodes). Values are computed recursively, bottom-up. The value of a parse is the value of its start symbol.
COPYRIGHT AND LICENSE
Copyright 2011 Jeffrey Kegler
This file is part of Marpa::PP. Marpa::PP is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::PP is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::PP. If not, see
http://www.gnu.org/licenses/.