NAME
Parse::Marpa - (pre-Alpha) Jay Earley's general parsing algorithm, with LR(0) precomputation
VERSION
Pre-alpha Version
This is strictly a developer's version. Nothing useful will be found here, and the documentation is also inchoate. Those not developing this module will want to wait for at least a released, beta version.
I've no personal experience with them, but Parse::Yapp
and Parse::RecDescent
are alternatives to this module which are well reviewed and much more mature and stable. Until Parse::Marpa
goes at least beta, users with a deadline or targeting anything close to a production environment should look there.
SYNOPSIS
UNDER CONSTRUCTION ...
The Easy Way
It's possible to specify the grammar and the text to be parsed all in one step
use Parse::Marpa;
my $value = Parse::Marpa::marpa(\$grammar, \$text_to_parse);
and you can even include options if you make a hash ref the third argument.
my $value = Parse::Marpa::marpa(
\$grammar,
\$text_to_parse
{
warnings => 1,
...
}
);
You can get all the values of an ambiguous parse by invoking Parse::Marpa::marpa
in list context.
my @values = Parse::Marpa::marpa(\$ambiguous_grammar, \$text_to_parse);
Exercising more control
You can also proceed step by step. For example, ...
use Parse::Marpa;
my @tests = split(/\n/, <<'EO_TESTS');
time / 25 ; # / ; die "this dies!";
localtime / 25 ; # / ; die "this dies!";
EO_TESTS
First creating the grammar object ...
my $g = new Parse::Marpa(
warnings => 1,
code_lines => -1,
);
Then setting the grammar ...
my $mini_perl_grammar; { local($RS) = undef; $mini_perl_grammar = <DATA> };
$g->set( source => \$mini_perl_grammar);
And then as many times as you like ...
TEST: while (my $test = pop @tests) {
Create a parse object ...
my $parse = new Parse::Marpa::Parse($g);
Pass text to the recognizer ...
$parse->text(\$test);
Set an initial parse ...
$parse->initial();
my @parses;
push(@parses, $parse->value);
Get others, if there are any ...
while ($parse->next) {
push(@parses, $parse->value);
}
And then go on with your other processing ...
say "I've got ", scalar @parses, " parses:";
for (my $i = 0; $i < @parses; $i++) {
say "Parse $i: ", ${$parses[$i]};
}
}
__DATA__
...
DESCRIPTION
UNDER CONSTRUCTION AND VERY INCOMPLETE
Overview
Parse::Marpa
parses text given an arbitrary context-free grammar.
Grammar may be anything which can be specified in BNF.
This includes left-recursive grammars, right-recursive grammars, grammars with empty productions, grammars with cycles and ambigious grammars.
In fact, ambiguous grammars are a Marpa specialty. They are useful even when you never want more than one parse. Human languages have an ambiguous grammars, from which the listener pulls the parse that makes most sense. An ambiguous grammar is often the easiest and most sensible way to express a language. Marpa allows the user to prioritize rules so that the preferred parse comes up first.
Of course, you can also have
Marpa
return all the parses of an ambiguous grammar.Marpa allows predictive lexing and uses LR(0) precomputation. It incorporates the latest academic research for speeding up Earley's algorithm, along with further innovations.
The Easy Way function: Parse::Marpa::marpa()
- Parse::Marpa::marpa(\$grammar, \$text_to_parse, option hash);
-
The
marpa
function takes three arguments: a ref to a string containing a Marpa source description of the grammar; a ref to a string with the text to be parsed; and (optionally) a ref to a hash with options. If there are no parses,marpa()
returns failure. If there are parses, andmarpa()
was called in list context, it returns a list of references to the values of all the parses. If there are parses, andmarpa
was called in scalar context, it returns a reference to the value of the first parse.
Marpa Grammar Source Files: The Format
Overview
Not all parsers are capable of parsing their own description languages. The source files specifying grammars to Marpa are parsed by Marpa in the same way that Marpa parses user-specified grammars. In fact, the grammar for Marpa's grammar description language is written in that grammar description language.
It is entirely possible for a Marpa user to write their own interface language to replace Marpa's. The result could be easily be an improvement on the original in expressive power or efficiency. Humbling as it is to admit, it might be an improvement in both respects. There are better language designers than me out there and my goal with Marpa is to empower them.
Marpa's self-describing grammar source file is in the distribution, and is named self.marpa
. Most of the examples of the Marpa grammar description language below are adopted from that file or previous versions of it.
Paragraphs and Sentences
The file is divided into paragraphs, separated by blank lines, that is lines containing only horizontal whitespace. Comments do not count as whitespace for the purpose of separating paragraphs. Paragraphs contain sentences, which must end in a period.
There are definition paragraphs, production paragraphs and terminal paragraphs. Definition paragraphs contain one or more definitions. For example,
semantics are perl5. version is 0.1.59. the start symbol is
grammar.
Reserved words in the Marpa Grammar Descriptions
Case matters in all the reserved words and names in the Marpa grammar description language. They must be in lower case. This is in line with the position Larry Wall took in his 2007 "State of the Onion" talk. The idea is that what the user is doing should be emphasized over the framework of the language.
User Specified Names
User specified names are case-indifferent. "Symbol" is the same name as "SYMBOL", "symbol" and even "sYmBoL". The user may use case as an expressive element, or to distinguish his names from Marpa's keywords.
Names may be more than one word and may be separated by whitespace or hyphens as the user chooses. User names are separation-indifferent as well as case-indifferent. "My symbol" and "my-symbol" are the same name.
A user specified name may be all lowercase, just as Marpa's keywords are required to be. This allows the user to reuse the Marpa description's keywords for his purposes. For example, in Marpa's self-definition, the following occurs:
rhs element: Optional rhs element.
As you'll see elsewhere "optional" is a Marpa keyword which can be meaningful in that context, and if "optional" were lower case in this example, the Marpa's preferred parse would result in this sentence being interpreted as a rule which states that a rhs element
can consist either of nothing, or else itself. Marpa allows both circular and null rules, and so this rule, which probably not a good idea, is perfectly legal.
However, since "Optional" begins with a capital letter, it must be a user name or part of one. And in fact, an "Optional rhs element" is defined later as
optional rhs element: /optional/, rhs symbol specifier.
As will be explained in more detail below, this states that an optional rhs element
is not an optional element at all. It an non-nullable and non-optional element, composed of the keyword <optional> followed by a rhs symbol specifier
.
Note that on the left hand side of this second rule, the "optional" in optional rhs element
is not capitalized. The "optional" keyword only makes sense on the left hand side, and no possible parse allows "optional" to be interpreted as a Marpa keyword, so it isn't.
Humans often use the same word to mean two different things, relying on context to resolve the ambiguity. This makes human languages powerful, compact and expressive. Other parsers allow the user to use ambiguity only to a very limited degree. In Perl 5, Larry Wall was probably about as aggressive in this respect as anyone using an LALR parser could be.
Definition Sentences
Definition sentences contain the name of a Marpa predefined and its value, separated by the word "is" or "are". The name of the predefined may be preceded by "the".
The following all define the semantics of the grammar to be Perl 5.
semantics are perl5.
perl5 is the semantics.
perl5 is semantics.
the semantics are perl5.
the semantics is perl5.
Note that "is" or "are" always works. Marpa can't be bothered figuring out whether semantics is (are?) really singular or plural, and I think it has the right attitude about this. Marpa is similarly liberal about "is" versus "are" for all names, whether Marpa predefined or user-specified.
The syntax for the other definitions of predefineds obeys the above rules, except where spefically noted otherwise.
Semantics definition
The semantics definition is not optional. Marpa is ultimately targeted to perl6, and that is considered its "default" semantics, even though it's not currently available. Currently, the only available semantics is perl5
.
I require every Marpa grammar description to contain a line explicitly stating that its semantics are Perl 5, in order to limit problems with old Marpa source files once Perl 6 becomes available and the default.
Version definition
version is 0.1.59.
This also is not optional, and as long as Marpa is in alpha, the version has to match exactly. This causes me a lot of trouble because all the test cases and examples and the bootstraping code must be edited whenever I up the version number, but nonetheless I regard it as a feature. It forces the user to be aware of version changes. This is essential while Marpa is in alpha, because versions will change frequently, and features will be volatile. There will be no attempt to maintain compatibility from version to version until Marpa goes beta.
Start symbol definition
the start symbol is grammar.
The start symbol is optional, in the sense that it may be specified later. If no start symbol has been specified by precomputation time, Marpa will fail.
String definitions
Strings are used in Marpa for several important purposes. Many of the definitions require strings. "Actions" (the semantics of the rules) are specified as strings with Perl 5 code in them. Custom lexers can also be specified, and these also are strings containing Perl 5 code.
Here's an example of a string definition:
concatenate lines is q{
my $v_count = scalar @$Parse::Marpa::This::v;
return undef if $v_count <= 0;
join("\n", grep { $_ } @$Parse::Marpa::This::v);
}.
Literal strings
Literal strings can be single quoted, double quoted, q
-quoted or qq
-quoted. The syntax is much the same in Perl 5. Marpa recognizes backslashes, and will not terminate a single- or double-quoted string at a delimiting quote when it is preceded by a backslash.
For the q
and qq
strings, Marpa allows as delimiters everything in the POSIX punct
character class, except backslash and the four right hand side bracketing symbols -- angle and square bracket, curly brace and parentheses. (These are the same restrictions that Perl 5 imposes, or are at least very close.) Backslashes escape the end delimiters in q
and qq
strings, just as they do in single and double-quoted strings.
Like Perl 5, Marpa treats a q
or qq
string with a left-hand bracketing symbol as a opening delimiter as special cases. The corresponding right hand bracketing symbol becomes the end-delimiter. Backslashes escape as usual. Nesting of the brackets within the quote is also tracked and the string will not terminate until there's an unescaped closing bracket at the same nesting level as the opening bracket.
Marpa's literal strings are often Perl 5 code, and it must be remembered that Marpa does not understand Perl 5 syntax. Treatment of brackets and string delimiters is more complex in Perl 5 than described above and Marpa's ideas of how to deal with them are limited to those described above.
This means that in complex cases, such as when an end delimiter appears in a Perl 5 character class, or within a string inside code Marpa's idea of where the closing delimiter is may not correspond to what Perl 5's would be. Users should choose from among Marpa's large variety of the quoting construct one that sidesteps potential issues.
Once a literal string is recognized, it's passed on unaltered to Perl 5 for the usual Perl 5 interpretation. The string literal is passed to Perl 5 delimited in the same way that it was in Marpa. You can, for example, use single and double quotes in the Marpa source file and expect Perl 5's interpretation of that string literal will follow the same rules as if you'd specified it directly to Perl 5.
Default action definition
You can specify a "default action", that is, an action to be used in rules which do not explicitly specify an action. By default, rules return a return a "no value" (which in Marpa is not the same as an undefined).
Marpa's evaluation code optimizes in the presence of "no value" returns, and you will seldom want to change it. The current version of self.marpa
uses the no-value "default default". Here's the specification of a default action from an earlier version.
concatenate lines is the default action.
In this case, the string specifier is concatenate lines
, the string name defined in an earlier example. In any definition which takes a string at the value, the string may be specified either by name or as a literal string.
Default null value definition
Error processing
Some of the Marpa functions and methods have error returns, as explained in the descriptions of those functions. Most often Marpa reports a problem by throwing an exception using croak()
. You can catch these using eval
, if you don't want them to be fatal.
Notes not yet properly incorporated in this document
Point out that lexing honors the pos setting and must not alter it, even on successful match. Warn user of counter-intuitive behavior.
Warn user that in the lex patterns all named captures beginning with any "MARPA_" in any capitalization variant ("marpa_", "MaRpA_", etc.) is reserved.
All Perl code supplied by the user via the Marpa source file by default is run with "use integer" in effect. If the user wants floating point arithmetic she must specify "no integer".
AUTHOR
Jeffrey Kegler
DEPENDENCIES
Requires Perl 5.10. Users who want or need to stick with Perl 5.8 or earlier probably are also best off with more mature alternatives to Marpa
.
IMPLEMENTATION NOTES
It's often and correctly said by those experienced in Perl that passing string refs instead of strings is a pointless or even counter-productive optimization and I agree. I believe that Marpa
is an exception and have designed the interface accordingly. The strings involved here can be very long and copying and recopying them a major waste of time.
BUGS
Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Parse::Marpa
You can also look for information at:
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
RT: CPAN's request tracker
Search CPAN
ACKNOWLEDGMENTS
Marpa is the parser described in John Aycock and R. Nigel Horspool's "Practical Earley Parsing", The Computer Journal, Vol. 45, No. 6, 2002, pp. 620-630. I've made significant changes to it, which are documented separately (Parse::Marpa::ALGORITHM). Aycock and Horspool, for their part, built on the algorithm discovered by Jay Earley, and described in his "An efficient context-free parsing algorithm", Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
In writing the Pure Perl version of Marpa, I benefited from studying the work of Francois Desarmenien (Parse::Yapp
), Damian Conway (Parse::RecDescent
) and Graham Barr (Scalar::Util
).
COPYRIGHT & LICENSE
Copyright 2007 Jeffrey Kegler, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.