NAME
Marpa::R3 - Release 3 of Marpa
Synopsis
use Marpa::R3;
my $dsl = <<'END_OF_DSL';
Calculator ::= Expression action => ::first
Factor ::= Number action => ::first
Term ::=
Term '*' Factor action => do_multiply
| Factor action => ::first
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
my $grammar = Marpa::R3::Scanless::G->new( { source => \$dsl } );
my $input = '42 * 1 + 7';
my $value_ref = $grammar->parse( \$input, 'My_Actions' );
sub My_Actions::do_add {
my ( undef, $values ) = @_;
my ( $t1, undef, $t2 ) = @{$values};
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $values ) = @_;
my ( $t1, undef, $t2 ) = @{$values};
return $t1 * $t2;
}
Description
Marpa::R3 is ALPHA and EXPERIMENTAL
Marpa::R3 is in alpha status, and is experimental. Interfaces will change without notice. Agressive refactoring is likely, and this carries the risk of introducing bugs. Some features will be removed, and some releases will contain bugs which are only fixed in subsequent releases with a different interface.
Please do not use Marpa::R3 for production.
Overview
Marpa parses any language whose grammar can be written in BNF. That includes recursive grammars, ambiguous grammars, infinitely ambiguous grammars and grammars with useless or empty productions. Marpa does both left- and right-recursion in linear time -- in fact if a grammar is in any class currently in practical use, Marpa will parse it in linear time.
This document centers around a short tutorial of the Scanless interface (SLIF). This is the interface most suitable for beginners. The SLIF is the most suitable interface for most advanced uses as well.
A simple calculator
The synopsis shows the code for an extremely simple calculator. It handles only addition and multiplication of integers. The sections which follow explain, line by line, how it works. The explanation will assume that the reader understands BNF and the basics of grammars -- what rules are, what symbols are, what the start symbol of a grammar is, etc.
Marpa::R3::Scanless::G::new
my $dsl = <<'END_OF_DSL';
Calculator ::= Expression action => ::first
Factor ::= Number action => ::first
Term ::=
Term '*' Factor action => do_multiply
| Factor action => ::first
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
my $grammar = Marpa::R3::Scanless::G->new( { source => \$dsl } );
The code first creates a new SLIF grammar. SLIF grammars are Marpa::R3::Scanless:G
objects. They are created with the Marpa::R3::Scanless:G::new constructor. The arguments to Marpa::R3::Scanless::G::new are references to hashes of named arguments. In the key/value pairs of these hashes, the hash key is the name of the argument, and the hash value is the value of the named argument.
In the example, there is only one named argument to the SLIF grammar constructor: source
. The value of source
must be a reference to a string in the SLIF's domain-specific language (DSL). In this example, the DSL consists of several rules and pseudo-rules.
A G1 rule
Next follows a G1, or structural rule. The first G1 rule in a script will usually be the start rule of the grammar. (It is also possible to specify the start rule explicitly.)
Structural rules are the kinds of rules typically seen in BNF -- they describe the symbols which provide the structure of the grammar, but leave out details of whitespace. The SLIF also handles the lexical details in this example. It does this via L0 rules, which we will see shortly.
Calculator ::= Expression action => ::first
As is normal for BNF rules, the first rule consists of a left hand side symbol ("Calculator
"), the BNF operator ("::=
") and a series of right hand side (RHS) symbols. There is always exactly one left hand side (LHS) symbol. There may be any number of RHS symbols. In the case of an empty rule, the number of RHS symbols would be zero. In this rule, there is one RHS symbol, "Expression
".
The BNF operator ("::=
") is what makes this rule a G1 (structural) rule. Later we will see lexical rules, which will use the match operator ("~
").
After the rule is an adverb: action => ::first
. We'll explain the purpose of the action
adverbs when we discuss semantics
The second rule is very similar to the first:
Factor ::= Number action => ::first
More complicated G1 rules
Term ::=
Term '*' Factor action => do_multiply
| Factor action => ::first
This rule says that an Term
may be one of two alternatives:
A
Term
and aFactor
separated by an multiplication operator; ora
Factor
.
Immediately following is another G1 rule defining a Term
. It is very similar in form to the one for Expression
.
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
L0 rules
The structural rules define the high-level structure of the grammar, and ignore details of whitespace, comments, etc. Now we look at how the low-level, lexical issues are handled. This very simple calculator language does not allow comments, but it does define whitespace.
:discard ~ whitespace
whitespace ~ [\s]+
The :discard
rule is a pseudo-rule, which tells Marpa to use whatever it matches to separate G1 symbols, but otherwise to ignore it -- to "discard" it. whitespace
is defined in the next rule as a sequence of one or more spaces.
Note the match operator ("~
") in the rule defining whitespace. It tells Marpa that this rule is lexical and should be interpreted exactly as written, character by character.
The whitespace
rule is a special kind of rule in two respects. First, its RHS is followed by a quantifier ("+
"), which makes it a sequence rule. Aside from the quantifier, sequence rules may only have a single symbol or character class on their RHS. The plus quantifier ("+
") means a sequence of one or more items. The star quantifier ("*
") is also allowed, and it indicates a sequence of zero or more items.
The whitespace items are defined by a character class: [\s]
. Marpa supports the same character classes, and the same character class syntax, as Perl does.
The next pair of L0 rules define the Number
symbol
Number ~ digits
digits ~ [\d]+
The above two rules say that a Number
is a sequence of one or more digits. Number
is a lexeme -- a G1 symbol which is defined and recognized at the lexical (L0) level. In this example, there are three other lexemes: whitespace
, and the addition and multiplication operators.
We've already looked at the whitespace
lexeme, which will be discarded without being seen by G1. The addition and multiplication operators were defined with single quoted strings in the G1 rules. As a reminder, here's the rule for Term
again:
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
In the above rule, the single-quoted string '+'
implicitly defines a L0 lexeme. Something similar happens with the '*'
string in the rule for a Term
.
Among the acceptable tokens, Marpa looks for longest matches. If more than one token is acceptable according to the grammar, then the longest match is the winner.
What if the longest match is a tie? Marpa tolerates ambiguity. If more than one token tie for longest token, then all of these ``longest'' tokens are considered in the parse. The logic of SLIF lexing is described with more precision in the SLIF overview document.
Marpa::R3::Scanless::G::parse
my $input = '42 * 1 + 7';
my $value_ref = $grammar->parse( \$input, 'My_Actions' );
To parse a string, we use the Marpa::R3::Scanless::G::parse()
method. Marpa::R3::Scanless::G::parse()
requires a reference to a string as its first argument. Optionally, the second argument is another string specifying the "semantics package". The "semantics_package
" tells Marpa the name of the Perl package that contains the closures implementing the semantics for this grammar. We will talk more about this below.
Semantics
The value of the parse result, as returned via the parse()
method, is determined by the parse's semantics. Marpa's semantics are the traditional ones: The input is seen as a tree which takes its structure from the G1 rules. (This is why the G1 rules are called structural.) The value of the parse results from repeatedly evaluating nodes of this tree, starting at the bottom, with the results of child nodes made available to their parent node when the parent node is evaluated.
Parse trees are usually drawn upside-down with their root at the top, and their "leaves" at the bottom. In Marpa::R3's SLIF, the "leaves" are the symbols that the G1 (structural) rules share with the L0 (lexical) rules. The symbols shared by L0 and G1 are those lexemes which are not discarded. In this example, the lexemes visible to G1 are Number
and two operators which are specified with a quoted string: "+
" and "*
".
Marpa assigns values to the nodes of the tree, starting with the leaves. Marpa's "leaves" will always be L0 symbols, and their value by default is the literal value at their location in the input stream. In the case of the two operators described by quoted string, the value is that quoted string. That is, the value of '+
' is '+
', and the value of '*
' is '*
'. The value of Number
will be the portion of the input that matched the [\d]+
pattern.
Starting with the values for leaves, Marpa::R3 moves recursively "up" the tree to its root, assigning a value to each node of the tree based on the value of its child nodes. Each non-leaf node corresponds to a G1 rule, and the children of the non-leaf node correspond to the RHS symbols of the rule. When the non-leaf node is valued, its value becomes the value of its LHS symbol, and this value will become the value of a RHS symbol of another node with one exception.
The one exception, the node with a LHS symbol that does not become a RHS symbol, is the value of the top (or "root") node. The value of the top node becomes the value of the parse, and this is the parse result value to which the value()
method returns a reference.
Each non-leaf node determines its value with an action. Often actions are Perl functions, which in this context are called Perl semantic closures.
my $value_ref = $grammar->parse( \$input, 'My_Actions' );
When we did the parse, we used the semantics_package
named argument. The value of the semantics_package
argument specifies the package that is used to find the Perl semantic closures.
By default, a Marpa rule returns a reference to an array consisting of the rule's name, followed by the values of its children. This semantics, which corresponds to the array descriptor "[name,values]
", is an excellent starting point when you are first developing a DSL, which is why it is the default. For more about array descriptors, see the semantics document
In this example, Marpa's default semantics is not actually used -- we always override it. In this example, we specify semantics by using an action
adverb for each RHS alternative. We've seen the action
adverb several times, but skipped over it. Now it is time to look at it.
Term ::=
Term '*' Factor action => do_multiply
| Factor action => ::first
Expression ::=
Expression '+' Term action => do_add
| Term action => ::first
The "::first
" action indicates that the value of a rule is to be the value of its first child, that is, the value corresponding to the first symbol of the rule's RHS. (In the case of an empty rule, the value would be a Perl undef
). (The initial double colon indicates a reserved action.)
The action for the first RHS alternative defining Expression
is do_add
, and the action for the first RHS alternative defining Term
is do_multiply
. To implement these actions, we need to "resolve" their names -- map the action names into the Perl closures which actually carry out the semantics.
The semantics_package
specified the package where we can find the actions: "My_Actions
". So, to resolve the do_multiply
action, Marpa looks for a closure whose fully qualified name is My_Actions::do_multiply
, which it finds:
sub My_Actions::do_multiply {
my ( undef, $values ) = @_;
my ( $t1, undef, $t2 ) = @{$values};
return $t1 * $t2;
}
The do_add
action is resolved to a Perl semantic closure in much the same way:
sub My_Actions::do_add {
my ( undef, $values ) = @_;
my ( $t1, undef, $t2 ) = @{$values};
return $t1 + $t2;
}
The Perl semantic closures are callbacks. They are called as each node in a parse tree is evaluated.
Each Perl semantic closure is called with one or more arguments. The first argument to a value action is always a per-parse-tree object, which the callbacks can use as a scratchpad. In this example, the per-parse-tree object is not used. The remaining arguments will be the values of the node's "children" -- in other words, the values computed for each of its RHS symbols, in order. If the action is for an empty rule, the per-parse-tree object will be its only argument.
Every value action is expected to return a value. With one exception, this value is passed up to a parent node as an argument. The exception is the value for the start rule. The return value for the start rule becomes the parse result.
Tainted data
Marpa::R3 exists to allow its input to alter execution in flexible and powerful ways. Marpa should not be used with untrusted input. In Perl' s taint mode, it is a fatal error to use Marpa's SLIF interface with a tainted grammar, a tainted input string, or tainted token values.
Threads
Perl interpreter-based threads are now officially "discouraged". Nonetheless, Marpa::R3 supports them, with one restriction: A Marpa grammar object, and any recognizer or other object created from that grammar, must be used within a single thread.
This restriction is less severe than it may sound: Marpa grammars are relatively inexpensive to create, and several Marpa grammar objects can be created from the same SLIF source.
The Marpa:: namespace
The Marpa::
top-level namespace is reserved. For extensions to Marpa, one appropriate place is the MarpaX::
namespace. This practice helps avoid namespace collisions, and follows a CPAN standard, as exemplified by the DBIx::
LWPx::
and MooseX::
which are for extensions of, respectively, DBI, LWP and Moose.
Other documents
This document gives a semi-tutorial overview of Marpa's Scanless interface (SLIF). For a continuation of this tutorial, which describes how to get finer control of Marpa and access more of its features, see the followup tutorial to this one. If you are beginner who wants to learn more about Marpa, you probably want to go next to the overview of the SLIF interface, and then the pages describing its DSL, its grammar methods, and its recognizer methods.
Marpa's standard semantics are fully described in the Marpa::R3::Semantics document. Techniques for tracing and for debugging your Marpa grammars are described in the Marpa::R3::Tracing document and the Marpa::R3::Progress document.
Author
Jeffrey Kegler
Why is it called "Marpa"?
Marpa is the name of the greatest of the Tibetan "translators". In his time (the 11th century AD) Indian Buddhism was at its height. Marpa's generation of scholars was devoted to producing Tibetan versions of Buddhism's Sanskrit scriptures. Marpa became the greatest of them, and today is known as Marpa Lotsawa: "Marpa the Translator".
Blatant plug
Marpa is a character in my novel, The God Proof. The God Proof centers around Kurt Gödel's proof of God's existence. Yes, that Kurt Gödel, and yes, he really did work out a God Proof (it's in his Collected Works, Vol. 3, pp. 403-404). The God Proof is available as a free download (http://www.lulu.com/content/933192). It can be purchased in print form at Amazon.com: http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355.
Support
Marpa::R3 comes without warranty. Support is provided on a volunteer basis through the standard mechanisms for CPAN modules. The Support document has details.
COPYRIGHT AND LICENSE
Marpa::R3 is Copyright (C) 2016, Jeffrey Kegler.
This module is free software; you can redistribute it and/or modify it
under the same terms as Perl 5.10.1. For more details, see the full text
of the licenses in the directory LICENSES.
This program is distributed in the hope that it will be
useful, but without any warranty; without even the implied
warranty of merchantability or fitness for a particular purpose.