Name
Marpa::R2::BNF - BNF Interface
Synopsis
use Marpa::R2;
my $grammar = Marpa::R2::Grammar->new(
{
actions => 'My_Actions',
default_action => 'do_first_arg',
source => \(<<'END_OF_SOURCE'),
:start ::= Script
Script ::= Expression+ separator => <op comma> action => do_script
Expression ::=
Number
| (<op lparen>) Expression (<op rparen>) action => do_parens assoc => group
|| Expression (<op pow>) Expression action => do_pow assoc => right
|| Expression (<op times>) Expression action => do_multiply
| Expression (<op divide>) Expression action => do_divide
|| Expression (<op add>) Expression action => do_add
| Expression (<op subtract>) Expression action => do_subtract
END_OF_SOURCE
}
);
Overview
This page is the reference for the Marpa's "BNF interface", also called its "Stuifzand interface". The Stuifzand interface allows the grammar to be specified using a string whose text is in a variation of BNF. BNF source strings are specified using the the source
named argument of Marpa::R2's grammars. BNF source strings perform the functions of the rules
and start
named arguments of Marpa grammars. When the source
named argument is used to specify a grammar, the rules
and start
named arguments should not be used, and vice versa.
The Stuifzand interface is an example of a domain-specific language (DSL). The DSL of this document can (and has been) used to specify other DSL's, and therefore can be seen as a "meta-DSL".
The source file
Structure
Whitespace separates tokens, but is otherwise ignored. Comments are equivalent to whitespace. A hash ("#
") character starts a comment, which continues to the end of the line.
The Stuifzand format consists of rule declarations. Each rule declaration consists of, in order:
A left hand side (LHS). This will be a symbol or a pseudo-symbol.
The declaration operator ("
::=
").A right hand side (RHS). For each type of rule, the RHS will be described in detail below. In the case of empty rules, the RHS will be empty.
An adverb list. Each adverbs consist of a keyword, the adverb operator ("
=>
"), and the adverb's value. The adverb list may be zero length.
Symbol names
Symbol names can be either "bare" or enclosed in angle brackets. Bare symbol names must consist entirely of Perl word characters (alphanumerics, plus the underscore). The angle brackets, if used, serve to "quote" the symbol name, and will not be part of the symbol name.
If angle brackets are used, symbol names may also contain whitespace, as in
<op comma>
A whitespace sequence inside angle brackets can include any whitespace character that is legal in Perl, including newlines. This allows very long symbol names to be line wrapped, if necessary.
Unlike angle brackets, the whitespace does become part of the symbol name, but it does so in a "normalized" form. Leading and trailing whitespace in the name is discarded, and all other whitespace sequences are converted to a single ASCII space character. This means that
< op comma >
<op comma>
< op comma>
and even
<op
comma>
will all be regarded as the same symbol name. The normalized form of that symbol name is <op comma>
. The actual name of a symbol is the string inside the brackets of its normalized form.
Types of rules
Rules are of four kinds: the start rule, empty rules, quantified rules and prioritized rules. A rule like the following,
a ::= b c d
whose RHS consists only of a sequence of symbols, is treated as a prioritized rule with only one alternative and only one priority. In a typical large grammar, most rules are prioritized rules, many of them taking a simple form like that in the display above.
Start rule
Every source file should have a start rule. The LHS of this rule is the :start
pseudo-symbol. The RHS must be a single symbol name. The start rule does not accept any adverbs.
Empty rules
An empty rule is a rule with an empty RHS. A empty right hand side is what the name suggests, no symbols, adverbs or punctuation. A empty rule makes its LHS symbol a nullable symbol. Empty rules may have an action
adverb.
Quantified rules
A rule is quantified if its RHS consists of a single symbol name, followed by a "quantifier". The quantifer is either a star ("*
"), or a plus sign ("+
") indicating, respectively, that the sequence rule has a minimum length of 0 or 1. Quantified rules allow adverbs.
Prioritized rules
Any rule which is not a start rule, and not empty or quantified, is considered prioritized. The prioritization may be simple, consisting only of a single priority level.
Each prioritized right hand side consists of one or more right hand alternatives (called "alternatives" for brevity). An alternative is a series of one or more symbols, followed by zero or more adverbs.
The alternatives are separated by single or double "or" symbols. This means the alternatives in a prioritized right hand side proceed from tightest (highest) priority to loosest. The double "or" symbol ("||
") is the "loosen" operator -- the alternatives after it have a looser (lower) priority than the alternatives before it. The single "or" symbol ("|
") is the ordinary "alternative" operator -- alternatives on each side of it have the same priority. Associativity is specified using adverbs, as described below.
Within an alternative, symbols may be enclosed in parentheses. A symbol enclosed in parentheses is hidden from Marpa's semantics. A set of parentheses may contain more than one symbol, in which case the entire sequence of symbols is hidden, as if they had been enclosed in parentheses individually.
Adverbs
Adverbs are allowed for empty rules, quantifiied rules, and for the right hand side alternatives in prioritized rules. Adverbs must occur after the symbols. Adverbs consist of a keyword, the adverb operator ("=>
"), and the adverb's value. The keyword must be one of those described in this section. The adverb's value must be as described for each keyword.
action
The value of the action
keyword must be a Perl word (a string of alphanumerics or underscores). It will be interpreted as the action name, as described for the action
named argument of rule descriptors.
assoc
The assoc
adverb is only valid in a prioritized rule. Its value must be one of left
, right
or group
. Its effect will be as described below.
separator
The separator
keyword is only valid for a quantified right side, and its value must be a symbol of the grammar. It will be used as the separator, as described for sequence rules.
proper
The proper
keyword is only valid for a quantified right side, and its value must be a boolean, in the form of a binary digit (0
or 1
). It indicates proper or Perl separation, as described for sequence rules.
Precedence
Marpa's precedence is a generalization beyond the traditional ideas of precedence. Traditional precedence parsing required the the classification of operators as postfix, infix, etc. Marpa's precedence parsing is NOT based on the special treatment of operators. Marpa's precedence parsing has no concept of an operator, except as something that is not an operand.
An occurrence of the left hand side in a right hand side alternative is called an operand. The arity of an alternative is the number of operands that it contains. All arities are allowed, from zero to the arbitrary number imposed by system limits such as memory and file size.
An alternative with arity 0 is nullary. Precedence and associativity are meaningless in this case and will be ignored.
An alternative with arity 1 is unary. Precedence will have effect, but left and right associativity will not.
An alternative with arity 2 is binary. Precedence will have effect, and left and right associativity will behave in the traditional way, The traditional behavior for binary alternatives is exactly as described next for the N-ary case.
An alternative with an arity of N, where N is 2 or greater, is N-ary. Precedence will have effect. For left associativity, only the leftmost operand of an N-ary alternative associates -- operands after the first will have the next-tightest priority level. For right associativity, only the rightmost operand of an N-ary alternative associates -- all operands except the last will have the next-tightest priority level.
Marpa also allows "group" associativity. In "group" associativity, all operands associate at the loosest (lowest) priority. That is, in an alternative with group associativity, each operand may be a full expression of the kind defined by the prioritized rule. "Group" associativity is used, for example, in implementing the traditional function of parentheses in Marpa. Group associativity is meaningless for nullary alternatives, and is ignored.
Caveats
Symbol names
The BNF interface may rewrite its symbol names internally. Here we call the name by which a symbol is specified in the BNF interface string its "external symbol name", and its name as rewritten, its "internal symbol name". In many cases, Marpa::R2 makes sure the user sees only the external names of symbols. For example, the rule()
Grammar method returns external symbol names.
However, for some purposes, such as tracing grammars, the use of internal names may sometimes be appropriate. Marpa::R2's use of external and internal symbol names is evolving. When the Stuifzand interface is used, except as documented in this section, the user may see either the internal or external form of a symbol's name. Also, the exact form of the internal name is subject to change.
Precedence: the good news
Marpa's generalization of precedence works for all grammars that can be defined by prioritized rules. It is efficient (linear) for all grammars that could be parsed by the traditional precedence parsing methods. Marpa also allows you to define alternatives not allowed by traditional methods. Many of these are useful, and most of those can also be parsed efficiently.
Precedence and ambiguous grammars
Because of the many forms of recursion allowed, it is possible to define highly ambiguous grammars using the precedence mechanism. This can occur even by accident.
The user should especially be careful with right hand side alternatives in which all the symbols are operands. These can be useful. For example, an implicit operation can be defined using an binary alternative with no non-operands, and this could implement, for example, the standard notation for concatenation or multiplication. But to do this efficiently requires either avoiding ambiguity, or controlling its use carefully.
Marpa does catch the case where an alternative consists only of a single operand -- a "unit rule". This causes a fatal error. Unit rules are easy to define by accident in the Stuifzand interface. The author knows of no practical use for them, and their presence in a grammar is usually unintentional. Note that, in the event an application does find a use for a grammar with unit rules, Marpa's standard interface can parse it.
Copyright and License
Copyright 2012 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.