NAME
Marpa Demonstration Language - a language for describing grammars to Marpa
BEWARE: THIS DOCUMENT IS UNDER CONSTRUCTION AND VERY INCOMPLETE
OVERVIEW
The Marpa Demonstration Language (MDL) is a language for describing grammars to Marpa. It's a high-level Marpa grammar interface -- as of this writing, the only one.
While it is Marpa's first high-level interface, it's intended not to have a privileged status within Marpa. Users can write their own high-level interfaces. It would be easy to write a more efficient one than the MDL. (Hint: don't use ambiguous lexing.) It would also be easy to write a more powerful one. (Hint: just think of some cool feature and add it.)
And, humbling as it is to admit, it may not be all that hard to write a high-level interface to Marpa which is just plain better. There are better language designers than me out there. My goal with Marpa is to empower them with a better tool than any they've had up to now.
Not all parsers can parse languages describing their own grammar. Marpa can. In fact, the Marpa Demonstration Language is parsed using the same grammar interface and methods available to the user. A bootstrap is needed to get around the chicken-and-egg issues, but the result then parses itself to produce itself. The self-generated parser is then run through a third generation, and the second and third generation results are compared to ensure that no trace of the bootstrapping is left. The Marpa parser is fully self-generated.
Marpa's self-describing grammar source file is in the distribution, and is named self.marpa. Most of the examples of the Marpa grammar description language below are adapted from that file or previous versions of it.
ELEMENTS OF MARPA GRAMMAR DESCRIPTIONS
Paragraphs and Sentences
The file is divided into paragraphs, separated by blank lines, that is, lines containing only horizontal whitespace. Comments do not count as whitespace for the purpose of separating paragraphs. Paragraphs contain sentences, which must end in a period.
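The paragraph rule can be sketched in a few lines of Perl 5. This is only an illustration of the rule as stated, not how the MDL itself splits its input; split_paragraphs is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa: split a source text into
# paragraphs at runs of lines holding only spaces and tabs.
sub split_paragraphs {
    my ($source) = @_;
    return grep { /\S/ } split /\n(?:[ \t]*\n)+/, $source;
}

my $source = "semantics are perl5.\nversion is 0.1.59.\n \t\n# a comment\ngrammar: paragraphs.\n";
my @paragraphs = split_paragraphs($source);
print scalar(@paragraphs), " paragraphs\n";
```

The comment line stays inside the second paragraph, since a comment is not a blank line and so does not separate anything.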
There are definition paragraphs, production paragraphs and terminal paragraphs. Definition paragraphs contain one or more definitions. For example,
semantics are perl5. version is 0.1.59. the start symbol is
grammar.
Reserved Words
All the reserved words and names in the Marpa grammar description language are always entirely lower case, even at the beginning of a sentence. This is in line with the position Larry Wall took in his 2007 "State of the Onion" talk. The idea is that what the user is doing should be emphasized over the framework of the language.
User Specified Names
User specified names may contain any mix of both upper- and lower-case and are case-indifferent. That is, "Symbol" is the same user-specified name as "SYMBOL", "symbol" and even "sYmBoL". The user may use case as an expressive element, or to distinguish his names from Marpa's keywords.
Names may be more than one word and may be separated by whitespace or hyphens as the user chooses. User names are separation-indifferent as well as case-indifferent. "My symbol" and "my-symbol" are the same name.
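One way to picture case- and separation-indifference is as a reduction of every name to a canonical key. The sketch below is an assumption made for illustration; canonical_name is a hypothetical helper and nothing here is taken from Marpa's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa: reduce a user-specified name
# to a canonical key so that case and separator differences vanish.
sub canonical_name {
    my ($name) = @_;
    my $key = lc $name;          # case-indifferent
    $key =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
    $key =~ s/[\s-]+/-/g;        # whitespace and hyphens are interchangeable
    return $key;
}

for my $name ( 'My symbol', 'my-symbol', 'MY   SYMBOL' ) {
    printf "%-14s => %s\n", "'$name'", canonical_name($name);
}
```

All three spellings reduce to the key my-symbol, which is what makes them the same name.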
A user specified name may be all lowercase, just like one of Marpa's keywords. This allows the user to reuse the Marpa description's keywords for his purposes. Marpa is far more aware of context than most parsers and context will often determine which is the correct choice. When the choice is ambiguous, the reserved word takes precedence. The user can force the word to be regarded as part of a user name by capitalizing one or more of its letters.
For example, in Marpa's self-definition, the following occurs:
rhs element: Optional rhs element.
As you'll see below, "optional" is a Marpa keyword which can be meaningful in that context, and if "optional" were lower case in this example, Marpa's preferred parse would result in the above sentence being interpreted as a rule which states that a rhs element can consist either of nothing, or else itself. (Circular unit rules and null rules are both legal in Marpa and so is combining them in this way. That's not to say it's a good idea for writing a useful or an efficient grammar.)
However, since "Optional" begins with a capital letter, it must be a user name or part of one. And in fact, an "Optional rhs element" is defined later as
optional rhs element: /optional/, rhs symbol specifier.
As will be explained in more detail below, this states that an optional rhs element is not an optional element at all. It is a non-nullable and non-optional element, composed of the keyword "optional" followed by a rhs symbol specifier.
Note that on the left hand side of this second rule, the "optional" in optional rhs element is not capitalized. The "optional" keyword only makes sense on the right hand side. Marpa only interprets keywords as keywords when they "make sense" in context.
This is the same communication strategy we use in human languages, and it helps make human languages compact and expressive. We often use the same word to mean two different things, relying on context to resolve any ambiguity.
The parsing tools in general use today allow the user to use ambiguity only to a very limited degree. In Perl 5, Larry Wall was probably about as aggressive in the use of overloading and ambiguity as anyone using an LALR parser could be.
Literal strings
Literal strings can be single quoted, double quoted, q-quoted or qq-quoted. The syntax is much the same as in Perl 5. Marpa recognizes backslashes, and will not terminate a single- or double-quoted string at a delimiting quote when it is preceded by a backslash.
For the q and qq strings, Marpa allows as delimiters everything in the POSIX punct character class, except backslash and the four right hand side bracketing symbols -- angle and square bracket, curly brace and parentheses. (These are the same restrictions that Perl 5 imposes, or are at least very close.) Backslashes escape the end delimiters in q and qq strings, just as they do in single and double-quoted strings.
Like Perl 5, Marpa treats a q or qq string with a left-hand bracketing symbol as an opening delimiter as a special case. The corresponding right hand bracketing symbol becomes the end-delimiter. Backslashes escape as usual. Nesting of the brackets within the quote is also tracked and the string will not terminate until there's an unescaped closing bracket at the same nesting level as the opening bracket.
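The delimiter rules above can be sketched as a small scanner. This is a minimal illustration, assuming the text handed to it starts at the opening delimiter; find_end_delimiter and %CLOSER are hypothetical names, and this is not Marpa's actual lexer.

```perl
use strict;
use warnings;

# Bracketing delimiters and their closers; any other delimiter closes itself.
my %CLOSER = ( '(' => ')', '[' => ']', '{' => '}', '<' => '>' );

# Hypothetical helper: return the offset of the end delimiter that
# matches the opening delimiter at offset 0, or -1 if there is none.
sub find_end_delimiter {
    my ($text) = @_;
    my $open  = substr $text, 0, 1;
    my $close = exists $CLOSER{$open} ? $CLOSER{$open} : $open;
    my $depth = 1;
    my $pos   = 1;
    while ( $pos < length $text ) {
        my $char = substr $text, $pos, 1;
        if ( $char eq '\\' ) { $pos += 2; next; }    # a backslash escapes the next character
        if ( $char eq $close ) {
            $depth--;
            return $pos if $depth == 0;
        }
        elsif ( $char eq $open ) { $depth++; }       # nested opening bracket
        $pos++;
    }
    return -1;
}

print find_end_delimiter('{ a { b } \} c } rest'), "\n";
print find_end_delimiter('!it\!s!'), "\n";
```

The scanner knows nothing about Perl 5 syntax inside the string, so it inherits the limitation described in the following paragraphs: a bracket inside a character class or a nested string still counts toward the nesting level.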
Marpa's literal strings are often Perl 5 code, and it must be remembered that Marpa does not understand Perl 5 syntax. Treatment of brackets and string delimiters is more complex in Perl 5 than described above and Marpa's ideas of how to deal with them are limited to those described above.
This means that in complex cases, such as when an end delimiter appears in a Perl 5 character class, or within a string inside code, Marpa's idea of where the closing delimiter is may not correspond to what Perl 5's would be. Users should choose, from among Marpa's large variety of quoting constructs, one that sidesteps potential issues.
Once a literal string is recognized, it's passed on unaltered to Perl 5 for the usual Perl 5 interpretation. The string literal is passed to Perl 5 delimited in the same way that it was in Marpa. You can, for example, use single and double quotes in the Marpa source file and expect Perl 5's interpretation of that string literal will follow the same rules as if you'd specified it directly to Perl 5.
Literal regexes
Literal regexes are delimited either by slashes, or by qr-quoting. For example, the definition sentence
the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.
contains a qr-quoted regex, and
terminal sentence: symbol phrase, /matches/, regex, period.
contains a slash delimited regex.
Their actual evaluation is done by Perl 5. All MDL regexes are qr-quoted before they are passed to Perl 5. qr-quoted regexes are passed as is. Slash delimited regexes have qr prepended to them and so are passed as qr-quoted slash-delimited regexes.
Treatment of regex delimiters in MDL follows the same rules as for strings, including acceptance as delimiters of all characters in the POSIX "punct" character class except backslashes and right bracketing characters; backslashing; and the special treatment of bracketing delimiters.
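The hand-off to Perl 5 can be pictured with a short sketch. The helper to_qr_source is hypothetical and the use of string eval is an assumption made for illustration; the text above does not say how the MDL actually passes the regex along.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the source text of an MDL regex literal
# into the source text of a qr-quoted Perl 5 regex.
sub to_qr_source {
    my ($literal) = @_;
    return $literal if $literal =~ /\Aqr[[:punct:]]/;    # already qr-quoted: pass as is
    return 'qr' . $literal;                              # slash delimited: prepend qr
}

# Let Perl 5 do the actual evaluation of the regex.
my $regex = eval( to_qr_source('/matches/') ) or die $@;
print "matched\n" if 'symbol matches regex' =~ $regex;
```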
DEFINITION SENTENCES
Definition sentences contain the name of a Marpa predefined and its value, separated by the word "is" or "are". The name of the predefined may be preceded by "the".
The following all define the semantics of the grammar to be Perl 5.
semantics are perl5.
perl5 is the semantics.
perl5 is semantics.
the semantics are perl5.
the semantics is perl5.
Note that "is" or "are" always works. Marpa can't be bothered figuring out whether semantics is (are?) really singular or plural, and I think it has the right attitude about this. Marpa is similarly liberal about "is" versus "are" for all names, whether Marpa predefined or user-specified.
The syntax for the other definitions of predefineds obeys the above rules, except where specifically noted otherwise.
Semantics Definition
The semantics definition is not optional. Marpa is ultimately targeted to perl6, and that is considered its "default" semantics, even though it's not currently available. Currently, the only available semantics is perl5.
I require every Marpa grammar description to contain a line explicitly stating that its semantics are Perl 5, in order to limit problems with old Marpa source files once Perl 6 becomes available and the default.
Version Definition
version is 0.1.59.
This also is not optional, and as long as Marpa is in alpha, the version has to match exactly. This causes me a lot of trouble because all the test cases and examples and the bootstrapping code must be edited whenever I up the version number, but nonetheless I regard it as a feature. It forces the user to be aware of version changes. This is essential while Marpa is in alpha, because versions will change frequently, and features will be volatile. There will be no attempt to maintain compatibility from version to version until Marpa goes beta.
Start Symbol Definition
the start symbol is grammar.
Pedantically speaking, the start symbol is optional, in the sense that it may be specified later using the raw interface. But if no start symbol has been specified by precomputation time, Marpa will fail.
String Definitions
Strings are used in Marpa for several important purposes. Many of the definitions require strings. "Actions" (the semantics of the rules) are specified as strings with Perl 5 code in them. Custom lexers can also be specified, and these also are strings containing Perl 5 code.
Here's an example of a string definition:
concatenate lines is q{
my $v_count = scalar @$Parse::Marpa::This::v;
return undef if $v_count <= 0;
join("\n", grep { $_ } @$Parse::Marpa::This::v);
}.
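To see what this action computes, the same logic can be run as an ordinary Perl 5 subroutine. The sketch assumes an array reference passed as an argument in place of $Parse::Marpa::This::v; concatenate_lines here is a plain sub written for illustration, not the way Marpa invokes actions.

```perl
use strict;
use warnings;

# Stand-alone version of the action body: $values stands in for
# the array reference $Parse::Marpa::This::v.
sub concatenate_lines {
    my ($values) = @_;
    my $v_count = scalar @{$values};
    return undef if $v_count <= 0;                # no child values: return undef
    return join "\n", grep { $_ } @{$values};     # drop false values, join the rest
}

print concatenate_lines( [ 'first sentence.', '', 'second sentence.' ] ), "\n";
```

Note that grep { $_ } discards every false value, so a child value of "0" is dropped along with empty strings and undefined values.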
Default Action Definition
You can specify a "default action", that is, an action to be used in rules which do not explicitly specify an action. By default, rules return a "no value" (which in Marpa is not the same as an undefined).
Usually, Marpa's "default default" of no value will be the best choice. Marpa's parse evaluations optimize in the presence of "no value" returns. The current version of self.marpa uses the no-value "default default". Here's the specification of a default action from an earlier version.
concatenate lines is the default action.
In this case, the string specifier is concatenate lines, the string name defined in an earlier example. In any definition which takes a string as the value, the string may be specified either by name or as a literal string.
Default Null Value Definition
A "null value" is the value returned by an empty production. If an empty production is explicitly specified in the source file, its action becomes the "null value" for the symbol on the left hand side, and is used whenever it is nulled, whether directly through its own empty production, or indirectly through a series of other productions which ultimately produce the empty string.
When a symbol's "null value" is not explicitly set, it defaults to a Marpa "no value". This default can be changed. For example, to have all nulled symbols without explicitly set null values evaluate to an at-sign, this definition can be used.
the default null value is q{@}.
Default Lex Prefix Definition
Terminals can be specified as patterns which the Parse::Marpa::Parse::text() method will automatically search for. Terminals are allowed to have a "prefix", another pattern which is to be checked for before the main pattern, but not treated as part of the terminal. A common use for lex prefixes is the elimination of leading whitespace.
The raw interface allows this pattern to be set separately for each terminal, and this ability will be added to the grammar description language. Where the lex prefix is not explicitly given, it defaults to //, a pattern which recognizes the empty string -- in effect, a no-op. This default can be reset with a default lex prefix definition sentence. Here's one from self.marpa:
the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.
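The effect of this prefix can be checked in plain Perl 5. The sketch below applies the same regex to the front of a string and returns what is left; skip_lex_prefix is a hypothetical helper, and anchoring the prefix with \A is an assumption about how a prefix is applied, not a description of Marpa's lexer.

```perl
use strict;
use warnings;

# The default lex prefix from self.marpa, copied from the sentence above.
my $lex_prefix = qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/;

# Hypothetical helper: drop the lex prefix from the front of the input.
sub skip_lex_prefix {
    my ($input) = @_;
    $input =~ /\A$lex_prefix/ or return $input;
    return substr $input, $+[0];    # $+[0] is the offset where the match ended
}

my $rest = skip_lex_prefix("  # a comment\n\n   symbol phrase");
print "remaining input: '$rest'\n";
```

The prefix consumes spaces, tabs, newlines and whole "#" comment lines, so the main pattern of the terminal starts matching at "symbol phrase".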
Preamble Definition
Actions for rules are evaluated in a special namespace created for each parse object. It's sometimes useful to have globals in this namespace initialized before any actions are run. Preamble definition sentences may be used for this purpose. Values in the preamble definitions are strings.
The preamble definition differs slightly from other definitions. In most definitions, a new value overwrites the old one. If there is more than one preamble definition, the values are concatenated. Also, the article used before the name of the predefined must be "a" and not "the". As with other definitions, the article may be omitted. Here's a simplified version of the self.marpa preamble:
a preamble is q{
our %strings;
}.
PRODUCTION PARAGRAPHS
Here's a self-describing example, simplified from self.marpa:
production paragraph:
non structural production sentences,
production sentence,
non structural production sentences,
optional action sentence,
non structural production sentences.
A production paragraph is characterized by a production sentence. This may optionally be followed by an action sentence. Before and after these may be other "non structural production sentences". In fact, currently the only other sentence allowed in a production paragraph is a priority sentence, but it can go anywhere.
Production Sentence
Again, let's start with a self-description:
production sentence: lhs, /:/, rhs, period.
A production sentence is a Backus-Naur Form production. (This document assumes you know what that is. If you don't, Wikipedia is a great place to start.) A production sentence consists of a left hand side and a right hand side, separated by a colon. The left hand side is the name of a single symbol.
A standard right hand side is a series of symbol names and regex literals, separated by commas, as in the above example, where lhs and rhs are symbol names, and /:/ is a regex.
Any symbol name or regex literal in a standard right hand side can be made optional by preceding it with the keyword optional. We've already seen several examples of those, but here's another:
default action setting:
optional /the/, /default/, /action/, /is/, action specifier.
In this case the regex /the/ is optional, so that the sentences
the default action is do whatever.
and
default action is do whatever.
are both valid default action settings if do whatever has been defined as a string.
NOTES NOT YET PROPERLY INCORPORATED IN THIS DOCUMENT
All Perl code supplied by the user via the Marpa source file by default is run with "use integer" in effect. If the user wants floating point arithmetic she must specify "no integer".
Also, in such code, "use strict" is in effect, and all warnings are turned on, except warnings in the "recursion" category.
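The practical consequence of "use integer" shows up most clearly in division. The following stand-alone Perl 5 sketch imitates the default with an explicit "use integer" block; the sub names int_half and float_half are hypothetical and chosen only for this illustration.

```perl
use strict;
use warnings;

{
    use integer;    # imitates the default described above for user code

    # Division is integer division while "use integer" is in effect.
    sub int_half { return $_[0] / 2 }

    {
        no integer;    # what the user writes to get floating point back

        sub float_half { return $_[0] / 2 }
    }
}

print int_half(7),   "\n";
print float_half(7), "\n";
```

With an argument of 7, int_half returns 3 and float_half returns 3.5.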