NAME
Marpa::R2 - Release 2 of Marpa
Synopsis
use Marpa::R2;
my $grammar = Marpa::R2::Scanless::G->new(
{ action_object => 'My_Nodes',
default_action => '::first',
source => \(<<'END_OF_SOURCE'),
:start ::= Expression
Expression ::= Term
Term ::=
Factor
| Term '+' Term action => do_add
Factor ::=
Number
| Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_SOURCE
}
);
my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
my $input = '42 * 1 + 7';
$recce->read( \$input );
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
sub My_Nodes::new { return {}; }
sub My_Nodes::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Nodes::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
Description
Overview
Marpa parses any language whose grammar can be written in BNF. That includes recursive grammars, ambiguous grammars, infinitely ambiguous grammars and grammars with useless or empty productions. Marpa does both left- and right-recursion in linear time -- in fact if a grammar is in any class currently in practical use, Marpa will parse it in linear time.
This document centers around a short tutorial of the Scanless interface (SLIF), the interface most suitable for beginner's and for many advanced uses as well.
A simple calculator
The synopsis shows the code for a very simple calculator. It handles only addition and multiplication of integers. The sections which follow explain, line by line, how it works. The explanation will assume that the reader understands BNF and the basics of grammars, such as what a grammar's start symbol is.
Marpa::R2::Scanless::G::new
my $grammar = Marpa::R2::Scanless::G->new(
{ action_object => 'My_Nodes',
default_action => '::first',
source => \(<<'END_OF_SOURCE'),
:start ::= Expression
Expression ::= Term
Term ::=
Factor
| Term '+' Term action => do_add
Factor ::=
Number
| Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_SOURCE
}
);
The code first creates a new SLIF grammar. SLIF grammars are Marpa::R2::Scanless:G
objects. They are created with the Marpa::R2::Scanless:G::new constructor. The arguments to Marpa::R2::Scanless::G::new are references to hashes of named arguments. In the key/value pairs of these hashes, the hash key is the name of the argument, and the hash value is the value of the named argument.
In what follows, we will look at all the arguments of this call to the Marpa::R2::Scanless:G::new method in detail. First let's look at the source
argument. Its value is a reference to a string in the SLIF's domain-specific language (DSL). In this example, the DSL consists of several rules and pseudo-rules.
The start pseudo-rule
:start ::= Expression
The start
pseudo-rule is is required. Its right hand side (RHS) is the grammar's start symbol.
A G1 rule
Next follows a G1, or structural rule. Structural rules are the kinds of rules typically seen in BNF -- they describe the symbols which provide the structure of the grammar, but leave out details of whitespace.
The SLIF also handles the lexical details in this example, but it does it via G0 rules, which we will see shortly.
Expression ::= Term
As is normal for BNF rules, this rule consists of a left hand side symbol ("Expression
"), the BNF operator ("::=
") and a series of right hand side (RHS) symbols. There is always exactly one left hand side (LHS) symbol. There may be any number of RHS symbols. In the case of an empty rule, the number of RHS symbols would be zero. In this rule, there is one RHS symbol, "Term
".
The BNF operator ("::=
") is what makes this rule a G1 (structural) rule. Later we will see lexical rules, which will use the match operator ("~
").
More complicated G1 rules
Term ::=
Factor
| Term '+' Term action => do_add
This rule says that a Term
may be one of two alternatives: either a Factor
or two Term
's separated by an addition operator. The second alternative has an action
adverb associated with it. We'll explain its purpose shortly.
Immediately following is another G1 rule defining a Factor
. It is very similar in form to the one for Term
.
Factor ::=
Number
| Factor '*' Factor action => do_multiply
G0 rules
The structural rules define the high-level structure of the grammar, and ignore details of whitespace, comments, etc. Now we look at how the low-level, lexical issues are handled. This very simple calculator language does not allow comments, but it does define whitespace.
:discard ~ whitespace
whitespace ~ [\s]+
The :discard
rule is a pseudo-rule, which tells Marpa to use whatever it matches to separate G1 symbols, but otherwise to ignore it -- to "discard" it. whitespace
is defined in the next rule as a sequence of one or more spaces.
Note the match operator ("~
") in the rule defining whitespace. It tells Marpa that this rule is lexical and should be interpreted exactly as written, character by character.
The whitespace
rule is a special kind of rule in two respects. First, its RHS is following by a quanitifer ("+
"), which makes it a sequence rule. Aside from the quantifier, sequence rules may only have a single symbol or character class on their RHS. The plus quanitifer ("+
") means a sequence of one or more items. The star quanitifer ("*
") is also allowed, and it indicates a sequence of zero or more items.
The whitespace items are defined by a character class: [\s]
. Marpa supports the same character classes, and the same character class syntax, as Perl does.
The next pair of G0 rules define the Number
symbol
Number ~ digits
digits ~ [\d]+
The above two rules say that a Number
is a sequence of one or more digits. Number
is a lexeme -- a G1 symbol which is defined and recognized at the lexical (G0) level. In this example, there are three other lexemes: whitespace
, and the addition and multiplication operators.
We've already looked at the whitespace
lexeme, which will be discarded without being seen by G1. The addition and multiplication operators were defined with single quoted strings in the G1 rules. As a reminder, here's the rule for Term
again:
Term ::=
Factor
| Term '+' Term action => do_add
In the above rule, the single-quoted string '+'
implicitly defines a G0 lexeme. Something similar happens with the '*'
string in the rule for a Factor
.
The SLIF's lexer mostly "does what you mean". While the input is being read, it looks for all lexemes defined in the DSL. The search is a longest tokens match, if mulitple tokens could match, the longest match is the winner. Marpa tolerates ambiguity, so one feature special to Marpa is that it is a longest tokens match -- if more than one token is longest, all of them are considered in the parse. The logic of SLIF lexing is described with more precision in the SLIF overview document.
Marpa::R2::Scanless::R::new
my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
Marpa::R2::Scanless::R::new
creates a new SLIF recognizer. Its arguments are references to hashes of named arguments. In this example the only named argument is the required argument: "grammar
". The value of the grammar
named argument must be a Marpa SLIF grammar.
Marpa::R2::Scanless::R::read
my $input = '42 * 1 + 7';
$recce->read( \$input );
To parse a string, we use the Marpa::R2::Scanless::R::read()
method. In its simplest form, as here, the Marpa::R2::Scanless::R::read()
takes a reference to a string containing the input stream as its argument.
Marpa::R2::Scanless::R::value
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
The Marpa::R2::Scanless::R::value()
method returns a reference to the parse result's value, if there was a parse result. If there was no parse result, Marpa::R2::Scanless::R::value()
returns undef
.
We have yet to describe how the Marpa SLIF computes the value of a parse. In fact, up to this point, we have been skipping everything that had to do with semantics. Now it is time to go back to those features.
Semantics
action_object => 'My_Nodes',
default_action => '::first',
The action_object
named argument specifies the package where the semantics closures are found, in this example, the My_Nodes
package. The default_action
named argument indicates the action to be used when nothing else specifies the semantics for a closure. In this example the default semantics are a "reserved action" named "::first
". (The initial double colon indicates a reserved action.) The "::first
" action indicates that the value of a rule is the value of its first child.
The semantics for a RHS alternative are specified with the action
adverb. We've already seen the action
adverb twice, but skipped over it. Here it is again, in context:
Term ::=
Factor
| Term '+' Term action => do_add
Factor ::=
Number
| Factor '*' Factor action => do_multiply
The action for the second RHS alternative defining Term
is do_add
, and the action for the second RHS alternative defining Factor
is do_multiply
. To implement these actions, we need to "resolve" their names -- map the action names into the Perl closures which actually carry out the semantics.
The actions_object
specified the package where we can find the actions: "My_Nodes
". So, to resolve the do_multiply
action, Marpa looks for a closure whose fully qualified name is My_Nodes::do_multiply
, which it finds:
sub My_Nodes::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
The do_add
action is resolved to a Perl semantic closure in much the same way:
sub My_Nodes::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
The Perl semantic closures are callbacks. They are called as nodes in a parse tree are evaluated.
Each Perl semantic closure is called with one or more arguments. The first argument to a value action is always a per-parse-tree object, which the callbacks can use as a scratchpad. In this example, the per-parse-tree object is not used.
For a non-empty rule, the second and any subsequent arguments to the callback are the values, in lexical order, of the symbols on the right hand side of the rule. If the action is for an empty rule, the per-parse-tree object will be its only argument.
Every value action is expected to return a value. With one exception, this value is passed up to a parent node as an argument. The exception is the value for the start rule. The return value for the start rule becomes the parse result.
Finally, actions_object
requires the package it specifies to be a class, in the sense that it must have a new
constructor. The value returned by the new
constructor is the per-parse-tree object which, as mentioned, is not used in this example. Here is our definition of My_Nodes::new
:
sub My_Nodes::new { return {}; }
Errors and exceptions
Methods in the Marpa API usually do not return errors. When there are errors, Marpa API methods throw an exception.
Inheritance
Classes in Marpa's API's are not designed to be inherited.
Threads
When used in a thread-safe Perl, Marpa::R2 should be thread-safe, with one important restriction: All Marpa objects have a base grammar and all objects which share a base grammar write to it. All Marpa objects which share a base grammar must created and used in the same thread.
All Marpa objects are either grammars, or are created from other Marpa objects. When a Marpa object is created from another Marpa object, the child Marpa object inherits the base grammar of its parent object. Marpa grammars are their own base grammar.
This restriction may be lifted someday, but in practice it does not seem onerous. Note that what cannot be shared between threads is a base grammar object -- you can use the same grammar in different threads by creating base grammar that are exact copies of each other, one base grammar per thread.
The Marpa:: namespace
The Marpa::
top-level namespace is reserved. For extensions to Marpa, one appropriate place is the MarpaX::
namespace. This practice helps avoid namespace collisions, and follows a CPAN standard, as exemplified by the DBIx::
LWPx::
and MooseX::
which are for extensions of, respectively, DBI, LWP and Moose.
Other documents
This document gives a semi-tutorial overview of Marpa's Scanless interface (SLIF). For more details about the SLIF, there is an overview, and pages describing its DSL, its grammar methods, and its recognizer methods.
Marpa has two other interfaces which may be of interest. An older, more traditional, interface is the named argument inteface (NAIF). The NAIF is a middle level interface which uses Perl calls instead of a DSL. The NAIF is more low level than the SLIF.
There is also the the thin interface, which provides direct access to the underlying Libmarpa C library. Of the Perl interfaces to Marpa, the thin interface is the most low-level. The thin interface offers efficient access to the full power of the Marpa parse engine, but it requires the application to do a lot of the work itself.
Marpa::R2::Vocabulary is intended as a quick refresher in parsing terminology, emphasizing how the standard terms are used in the Marpa context. Marpa's standard semantics are fully described in the Marpa::R2::Semantics document. Techniques for tracing and for debugging your Marpa grammars are described in the Marpa::R2::Tracing document and the Marpa::R2::Progress document. For those with a theoretical bent, my sources, and other useful references, are described in Marpa::R2::Advanced::Bibliography.
Author
Jeffrey Kegler
Why is it called "Marpa"?
Marpa is the name of the greatest of the Tibetan "translators". In his time (the 11th century AD) Indian Buddhism was at its height. Marpa's generation of scholars was devoted to producing Tibetan versions of Buddhism's Sanskrit scriptures. Marpa became the greatest of them, and today is known as Marpa Lotsawa: "Marpa the Translator".
Blatant plug
Marpa is a character in my novel, The God Proof. The God Proof centers around Kurt Gödel's proof of God's existence. Yes, that Kurt Gödel, and yes, he really did work out a God Proof (it's in his Collected Works, Vol. 3, pp. 403-404). The God Proof is available as a free download (http://www.lulu.com/content/933192). It can be purchased in print form at Amazon.com: http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355.
Support
Marpa::R2 comes without warranty. Support is provided on a volunteer basis through the standard mechanisms for CPAN modules. The Support document has details.
Copyright and License
Copyright 2013 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.