# Copyright 2011 Jeffrey Kegler
# This file is part of Marpa::XS.  Marpa::XS is free software: you can
# redistribute it and/or modify it under the terms of the GNU Lesser
# General Public License as published by the Free Software Foundation,
# either version 3 of the License, or (at your option) any later version.
#
# Marpa::XS is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser
# General Public License along with Marpa::XS.  If not, see
# http://www.gnu.org/licenses/.

=head1 NAME

Marpa::XS::Parse_Terms - Standard Parsing Terms used in the Marpa Documents

=head1 DESCRIPTION

This document is intended
as a reminder of
the standard vocabulary of parsing.
I put B<defining uses> of terms in boldface, for easy skimming.
A reader who feels comfortable with parsing terminology
can skip this document entirely.
A reader completely new to parsing will find this document too
terse and should look elsewhere first.

As an introduction, I recommend
L<Mark Jason Dominus's
excellent chapter on parsing in the Perl context|Marpa::XS::Advanced::Bibliography/"Dominus 2005">.
It's available on-line.
L<Wikipedia|Marpa::XS::Advanced::Bibliography/"Wikipedia"> is also an excellent place to start.

All the definitions here are consistent with
at least some of the textbook definitions,
and are in that sense standard.
But no effort is made to
cover the full range of standard meaning,
or even to give the most common meaning.
The focus is on the meaning of the terms as used in the Marpa documentation.


=head2 Basic terms

A B<grammar> is a set of rules.
The B<rules> describe a set of strings of B<symbols>.
A string of symbols is often called a B<symbol string>.
The rules of a grammar are often called B<productions>.

=head2 Stages of Parsing

A B<recognizer> is a program that determines whether its B<input>
is one of the symbol strings in the set described by the rules of a grammar.
A B<parser> is a program which finds the structure of the input
according to the rules of a grammar.

The term B<parsing> is used in a strict and a loose sense.
B<Parsing> in the loose sense means all the phases of finding a grammar's structure,
including a separate recognition phase if the parser has one.  (Marpa does.)
If a parser has phases,
B<parsing in the strict sense> refers specifically to the phase that finds the structure of the input.
When the Marpa documents use the term B<parsing> in its strict sense, they will
speak explicitly of "parsing in the strict sense".
Otherwise, B<parsing> will mean parsing in the loose sense.

Parsers often use a
B<lexical analyzer> to convert B<raw input>,
usually B<input text>,
into a series of B<tokens>.
Each token represents a B<symbol> of the grammar and has a B<value>.
The series of symbols represented by the series of tokens
becomes the B<symbol string input>
seen by the recognizer.
The B<symbol string input> is also called the B<input sentence>.
A lexical analyzer is often called a B<lexer> or a B<scanner>,
and B<lexical analysis> is often called B<lexing> or B<scanning>.

=head2 Productions

A standard way of describing rules is Backus-Naur Form, or B<BNF>.
In one common way of writing BNF, a production looks like this.

=for Marpa::XS::Display:
ignore: 1

    Expression ::= Term Factor

=for Marpa::XS::Display::End

In the production above, C<Expression>, C<Term> and C<Factor> are symbols.
A production consists of a B<left hand side> and a B<right hand side>.
In a B<context-free grammar>,
like those Marpa parses,
the left hand side of a production 
is always a symbol string of length 1.
The right hand side of a production is a symbol string of zero or more symbols.
In the example, C<Expression> is the left hand side, and 
C<Term> and C<Factor> are right hand side symbols.

Left hand side and right hand side are often abbreviated as B<RHS> and B<LHS>.
If the RHS of a production has no symbols,
the production is called an B<empty production>
or an B<empty rule>.

Any symbol which is allowed to occur
in the symbol string input is called a B<terminal> symbol.
If the symbols in a symbol string are all terminals,
that symbol string is also called a B<sentence>.

=head2 Derivations

A B<step> of a derivation, or B<derivation step>, is made by taking a symbol string
and any production in the grammar whose LHS occurs in that symbol string,
and replacing any occurrence of the LHS symbol in the symbol string with the
RHS of the production.  For example, if C<A>, C<B>, C<C>, C<D>, and C<X> are symbols,
and

=for Marpa::XS::Display:
ignore: 1

    X ::= B C

=for Marpa::XS::Display::End

is a production, then

=for Marpa::XS::Display:
ignore: 1

    A X D -> A B C D

=for Marpa::XS::Display::End

is a derivation step, with "C<A X D>" as its beginning and "C<A B C D>" as its end or result.
We say that the symbol string "C<A X D>"
B<derives> the symbol string
"C<A B C D>".

A B<derivation> is a sequence of derivation steps.
The B<length> of a derivation is its length in steps.

=over

=item *
We say that 
a first symbol string B<directly
derives> a second symbol string if and only if there is a derivation of length 1 from the first symbol
string to the second symbol string.

=item *
Every symbol string is said to derive itself in a derivation
of length 0.  Such a zero length derivation is a B<trivial derivation>.

=item * If a derivation is not trivial or direct, that is, if it has more than one step,
then it is an B<indirect> derivation.

=item * A derivation which is not trivial
(that is,
a derivation which has one or more steps)
is a B<non-trivial> derivation.

=back

If the symbol string of length 1
derives a second symbol string of any length,
then the symbol contained in the first string
is said to derive the second symbol string.
A derivation from a symbol is said to be
B<trivial>,
B<non-trivial>, B<direct> or B<indirect>,
according to whether
the derivation from the symbol string of length 1 containing that
symbol is B<trivial>.
B<non-trivial>, B<direct> or B<indirect>.
When a symbol produces or derives a symbol string,
we also say that the symbol B<matches> the symbol string,
or that the symbol string B<matches> the symbol.

In any parse, one symbol is distinguished as the B<start symbol>.
The parse of an input is B<successful>
if and only if the start symbol produces the input sentence
according to the grammar.

=head2 Nulling

The B<length> of a symbol string is the number of symbols in it.
The zero length symbol string is called the B<empty string>.
The empty string can be considered to be a sentence, in which
case it is the B<empty sentence>.
A string of one or more symbols is B<non-empty>.
A derivation which produces the empty string is a B<null derivation>.
A derivation from the start symbol which produces the empty string
is a B<null parse>.

If in a particular grammar, a symbol has a null derivation,
it is a B<nullable symbol>.
If, in a particular grammar,
the only sentence produced by a symbol is the empty sentence,
it is a B<nulling symbol>.
All nulling symbols are nullable symbols.

If a symbol is not nullable, it is B<non-nullable>.
If a symbol is not nulling, it is B<non-nulling>.
In any instance where a symbol produces the empty string,
it is said to be B<nulled>,
or to be a B<null symbol>.

=head2 Useless Rules

If any derivation from the start symbol uses a rule,
that rule is called B<reachable> or B<accessible>.
A rule that is not accessible
is called B<unreachable> or B<inaccessible>.
If any derivation which results in a sentence uses a rule,
that rule is said to be B<productive>.
A rule that is not productive is called B<unproductive>.
A simple case of an unproductive rule is one whose RHS contains a symbol which is not
a terminal and not on the LHS of any other rule.
A rule which is inaccessible or unproductive is called a
B<useless> rule.
Marpa can handle grammars with useless rules.

A symbol is B<reachable> if it appears in a reachable production.
A symbol is B<productive> if it appears on the LHS of a productive rule,
or if it is a nullable symbol.
If a symbol is not reachable or not accessible,
it is B<unreachable> or B<inaccessible>.
If a symbol is not productive,
it is B<unproductive>.
A symbol which is inaccessible or unproductive is called a
B<useless> symbol.
Marpa can handle grammars with useless symbols.

=head2 Recursion and Cycles

If any symbol in the grammar non-trivially produces a symbol string containing itself,
the grammar is said to be B<recursive>.
If any symbol non-trivially produces a symbol string with itself on the left,
the grammar is said to be B<left-recursive>.
If any symbol non-trivially produces a symbol string with itself on the right,
the grammar is said to be B<right-recursive>.
Marpa can handle all recursive grammars,
including
grammars which are left-recursive,
grammars which are right-recursive,
and grammars
which contain both left- and right-recursion.

A B<cycle> is a non-trivial derivation
from a string of symbols to itself.
Since the result of a cycle is exactly the same as the beginning of
a cycle, a cycle is the parsing equivalent of an infinite loop.
A grammar which contains no cycles is B<cycle-free>.

A grammar which contains a cycle is called B<infinitely ambiguous>,
because a grammar with a cycle can repeat that cycle in a derivation
an arbitrary number of times.
Marpa can handle cycles,
and can produce parses for
infinitely ambiguous grammars.

=head2 Structure

The structure of a parse can be represented as a series of derivation steps from
the start symbol to the input.
Another way to represent structure is as a B<parse tree>.
Every symbol used in the parse is
represented by a B<node> of the parse tree.
Wherever a production is used in the parse,
its LHS is represented by a B<parent node>
and the RHS symbols are represented by B<child nodes>.
The start symbol is the B<root> of the tree.
The node at the root of the tree is called the B<start node>.

Terminals are B<leaf nodes>.
Any node with a symbol being used as a terminal is a B<terminal node>.
In Marpa, symbols on the
left hand side of empty productions are also B<leaf nodes>.
If a node is not a B<leaf node>, it is an B<inner node>.
A B<nulled node> is one that represents a nulled symbol.

If, for a given grammar and a given input,
more than one derivation tree is possible,
we say that parse is B<ambiguous>.
The standard definition of an ambiguous grammar is that,
if any parse from a grammar is ambiguous,
the grammar is B<ambiguous>.

Marpa is unusual in that it allows ambiguous B<input>,
so for Marpa this definition needs to be modified.
In the Marpa context, a grammar is called ambiguous,
if for any unambiguous input,
some parse from that grammar is ambiguous.
Marpa can handle ambiguous grammars, whether ambiguity
is defined in the standard way,
or used with its Marpa-specific
meaning.

=head2 Semantics

In real life, the structure of a parse is usually a means to an end.
Grammars usually have a B<semantics> associated with them,
and what the user actually wants is the B<value> of the parse
according to the semantics.

The tree representation is especially useful when evaluating a parse.
In the traditional method of evaluating a parse tree,
every node which represents a terminal symbol
has a value associated with it on input.
Non-null inner nodes 
take their semantics from the production whose LHS they represent.
Nulled nodes are dealt with as special cases.

The semantics for a production
describe how to calculate the value of the node which represents the LHS
(the parent node)
from the values of zero or more of the nodes which represent the RHS symbols
(child nodes).
Values are computed recursively, bottom-up.
The value of a parse is the value of its start symbol.

=head1 COPYRIGHT AND LICENSE

=for Marpa::XS::Display
ignore: 1

  Copyright 2011 Jeffrey Kegler
  This file is part of Marpa::XS.  Marpa::XS is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.
  
  Marpa::XS is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.
  
  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::XS.  If not, see
  http://www.gnu.org/licenses/.

=for Marpa::XS::Display::End

=cut