NAME
MDL - The Marpa Demonstration Language
BEWARE: THIS DOCUMENT IS UNDER CONSTRUCTION
THIS DOCUMENT IS UNDER CONSTRUCTION
OVERVIEW
The Marpa Demonstration Language (MDL) is a language for decribing grammars to Marpa. It's a high-level Marpa grammar interface -- as of this writing, the only one.
While it is Marpa's first high-level interface, it's intended not to have a privileged status within Marpa. Users can write their own high-level interfaces. It would be easy to write a more efficient one than the MDL. (Hint: don't use ambiguous lexing.) It would be also be easy to write a more powerful one. (Just think of some cool feature and add it.)
It may not be all that hard to write a high-level interface to Marpa which is all of faster, easier to use, and more powerful. There are better language designers than me out there. Maybe you're one of them. If so, Marpa was written to empower you.
Not all parsers can parse languages describing their own grammar. Marpa can. In fact, Marpa's parser starts as the MDL file self.marpa
, and the parser in Marpa is created from it, using the same grammar interface and with the same methods available to the user. I go to some trouble to make sure this claim is really true.
True, an initial bootstrap is needed to create a first generation MDL parser in order to get around chicken-and-egg issues. But the first generation MDL parser then parses self.marpa
a second time to eliminate all traces of the initial bootstrap. The second generation MDL parser then runs self.marpa
into a third generation. Only if the second and third generation parsing code are exactly identical is the Marpa parser considered to be fully self-generated, and only then is it incorporated into a Parse::Marpa distribution.
Most of the examples below are from self.marpa
or previous versions of it.
ELEMENTS OF MARPA GRAMMAR DESCRIPTIONS
Paragraphs and Sentences
An MDL file is divided into paragraphs. Paragraphs are separated by blank lines, that is lines containing only horizontal whitespace. Comments do not count as whitespace for the purpose of separating paragraphs. Paragraphs contain sentences, which must end in a period.
There are definition paragraphs, production paragraphs and terminal paragraphs. Definition paragraphs contain one or more definitions. For example,
semantics are perl5. version is 0.1.59. the start symbol is
grammar.
MDL Keywords
Keywords in MDL are always entirely lower case, even at the beginning of a sentence. This is in keeping with the position Larry Wall took in his 2007 "State of the Onion" talk. The idea is that what the user is doing should be emphasized over the framework of the language.
Note that MDL does not reserve its keywords the way that most computer languages do. A user is free to use MDL keywords for her own purposes, and where context makes the difference clear, may even use them in their all lowercase form.
User Specified Names
User specified names may contain any mix of both upper- and lower-case and are case-indifferent. That is, Symbol
is the same user-specified name as SYMBOL
, symbol
and even sYmBoL
. The user may use case as an expressive element, or to distinguish his names from MDL keywords.
Names may be more than one word and may be separated by whitespace, hyphens, or underscores as the user chooses. Whitespace in this context means any character matching Perl 5's \s
character class. User names are separation-indifferent as well as case-indifferent. My symbol
, MY_SYMBOL
and my-symbol
are all the same name.
A user specified name may be all lowercase, just like one of MDL keywords. This allows the user to reuse the MDL keywords for his purposes. Marpa is far more aware of context than most parsers and context will often determine which is the correct choice. When the choice is ambiguous, the MDL keyword takes precedence. The user can force the word to be regarded as part of a user name by capitalizing one or more of its letters.
For example, in MDL's self-definition, the following occurs:
rhs element: Optional rhs element.
As you'll see below, "optional" is an MDL keyword which can be meaningful in that context, and if "optional" were lower case in this example, the Marpa's preferred parse would result in the above sentence being interpreted as a rule which states that a rhs element
can consist either of nothing, or else itself. (Circular unit rules and null rules are both legal in Marpa, and so is combining them in this way. Which is not to say it's a good idea.)
However, since "Optional" begins with a capital letter, it must be a user name or part of one. And in fact, an "Optional rhs element" is defined later as
optional rhs element: /optional/, rhs symbol specifier.
As will be explained in more detail below, this states that an optional rhs element
is not an optional element at all. It a non-nullable and non-optional element, composed of the keyword optional
followed by a rhs symbol specifier
.
Note that on the left hand side of this second rule, the "optional" in optional rhs element
is not capitalized. The optional
keyword only makes sense on the right hand side. MDL only interprets keywords as keywords where they "make sense".
This is the same communication strategy we use in human languages. It makes human languages compact and expressive. We often use the same word to mean two different things, relying on context to resolve the ambiguity.
The parsing tools in general use today allow the user to use ambiguity only to a very limited degree. In Perl 5, Larry Wall is probably about as aggressive in the use of overloading and ambiguity as anyone using an LALR parser could be.
Literal Strings
Literal strings can be single quoted, double quoted, q
-quoted or qq
-quoted. The syntax is much the same in Perl 5. MDL recognizes backslashes, and will not terminate a single- or double-quoted string at a delimiting quote when it is preceded by a backslash.
For q
and qq
strings, MDL allows as delimiters everything in the POSIX punct
character class, except backslash and the four right hand side bracketing symbols -- angle and square bracket, curly brace and parentheses. (These are the same restrictions that Perl 5 imposes, or at least very close.) Backslashes escape the end delimiters in q
and qq
strings, just as they do in single and double-quoted strings.
Like Perl 5, MDL treats a q
or qq
string with a left-hand bracketing symbol for its opening delimiter as a special case. The corresponding right hand bracketing symbol becomes the end-delimiter. Backslashes escape as usual. Nesting of the brackets within the quote is also tracked and the string will not terminate until there's an unescaped closing bracket at the same nesting level as the opening bracket.
MDL's literal strings are often Perl 5 code, and it must be remembered that MDL does not understand Perl 5 syntax. This means that in complex cases, such as when an end delimiter appears in a Perl 5 character class, or within a string inside code, MDL's idea of where the closing delimiter is may not correspond to what Perl 5's would be. Users should choose from among MDL's large variety of quoting constructs one that sidesteps potential issues.
Once a literal string is recognized, it's passed to Perl 5, unaltered and delimited in the same way that it was in MDL. Therefore if you, for example, use double quotes in the MDL source file, you can expect Perl 5's interpretation of that string literal to follow the same rules as if you'd specified that string directly to Perl 5.
Literal Regexes
Literal regexes may are delimited either by slashes, or by qr-quoting. For example, the definition sentence
the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.
contains a qr-quoted regex, and
terminal sentence: symbol phrase, /matches/, regex, period.
contains a slash delimited regex.
The actual evaluation of regex is done by Perl 5, and it is always done with the qr-operator. MDL qr
-quoted regexes are passed as is. MDL slash-delimited regexes have qr-prepended to them, so that Perl 5 sees them as qr-quoted slash-delimited regexes.
Treatment of regex delimiters in MDL follows the same rules as for strings, including acceptance as delimiters of all characters in the POSIX "punct" character class except backslashes and right bracketing characters; backslashing; and the special treatment of bracketing delimiters.
Perl 5 regex modifiers are recognized after the closing delimiter, the same as in Perl 5. The following modifiers are recognized: msixpo
. They are passed on to Perl 5 exactly as and where found. Here's an example of the m
modifier in use:
empty line matches qr/^[ \t]*\n/m.
DEFINITION SENTENCES
Definition sentences contain the name of an MDL predefined and its value, separated by the word is
or are
. The name of the predefined may be preceded by the
.
The following all define the semantics of the grammar to be Perl 5.
semantics are perl5.
perl5 is the semantics.
perl5 is semantics.
the semantics are perl5.
the semantics is perl5.
Note that is
or are
always works. MDL can't be bothered figuring out whether semantics is (are?) really singular or plural, and I don't blame it. MDL is similarly nonchalant about is
versus are
for all names, whether they are MDL predefineds or user-specified.
The syntax for all definition sentences obeys the above rules, except as specifically noted otherwise.
Semantics Definition
The semantics definition is not optional. Marpa is ultimately targeted to Perl 6, and that is considered its "default" semantics, though it's not currently available. Currently, the only available semantics is perl5
.
I require all MDL source to contain a line explicitly stating that its semantics are Perl 5, in order to limit problems with old MDL files once Perl 6 becomes available and the default.
Version Definition
version is 0.1.59.
The version definition is also not optional, and as long as Marpa is in alpha, the version has to match exactly. This causes me a lot of trouble because all the test cases, examples, bootstrapping code, etc., etc., must be edited whenever I up the version number, but nonetheless I regard it as a feature. It forces the user to be aware of version changes. This is essential while Marpa is in alpha, because versions will change frequently, and features will be volatile. There will be no attempt to maintain compatibility from version to version until Marpa goes beta.
Start Symbol Definition
the start symbol is grammar.
Pedantically speaking, the start symbol is optional, in the sense that it may be specified later using the raw interface. But if no start symbol has been specified by precomputation time, Marpa will fail.
String Definitions
Strings are used in Marpa for several important purposes. Many of the definitions require strings. "Actions" (the semantics of the rules) are specified as strings with Perl 5 code in them. Custom lexers can also be specified, and these also are strings containing Perl 5 code.
Here's an example of a string definition:
concatenate lines is q{
my $v_count = scalar @$Parse::Marpa::Read_only::v;
return undef if $v_count <= 0;
join("\n", grep { $_ } @$Parse::Marpa::Read_only::v);
}.
Default Action Definition
You can specify a "default action", that is, an action to be used in rules which do not explicitly specify an action. By default, rules return undefined.
The current version of self.marpa
uses the default. Here's the specification of a default action from an earlier version.
concatenate lines is the default action.
In this case, the string specifier is concatenate lines
, the string name defined in an earlier example. In any definition which takes a string as its value, the string may be specified either by name or as a literal string.
Default Null Value Definition
A "null value" is the value returned when a symbol matches the empty string. Whenever an empty rule is explicitly specified to Marpa, and an action is given for it, that action becomes not only the action for that rule, but the "null value" of the symbol on the left hand side of that rule. The "null value" of the symbol is its value whenever it is nulled, whether directly as a result of its own empty rule, or indirectly through a series of other rules.
When a symbol's "null value" is not set with an empty rule, it defaults to a Perl undefined. The default can be changed. For example, the following setting has nulled symbols default to an at-sign.
the default null value is q{@}.
Default Lex Prefix Definition
Terminals can be specified as patterns which the Parse::Marpa::Parse::text()
method will automatically search for. Terminals are allowed to have a "prefix", another pattern which is to be checked for before the main pattern, but not treated as part of the terminal. A common use for lex prefixes is the elimination of leading whitespace.
The raw interface allows this pattern to be set separately for each terminal, and this ability will be added to the MDL. By default, there is not lex prefix for a terminal. (Or saying the same thing more pedantically, the default is a lex prefix is a regular expression which the empty string and only the empty string.) This default can be reset with a default lex prefix definition sentence. Here's one from self.marpa
:
the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.
Preamble Definition
Actions for rules are evaluated is a special namespace created for each parse object. It's sometimes useful to have globals in this namespace initialized before any actions are run. Preamble definition sentences may be used for this purpose. Values in the preamble definitions are strings.
The preamble definition differs slightly from other definitions. In most definitions, any new value overwrites the old one. If there is more than one preamble definiton, the values are concatenated. Also, the article to be used before the name of the predefined it must be a
and not the
. As with other definitions, the article may be omitted. Here's a simplified version of the self.marpa
preamble:
a preamble is q{
our %strings;
}.
PRODUCTION PARAGRAPHS
Here's a self-describing example, simplified from self.marpa
:
production paragraph:
non structural production sentences,
production sentence,
non structural production sentences,
optional action sentence,
non structural production sentences.
A production paragraph is characterized by a production sentence. This may optionally be followed by an action sentence. Before and after these may be other "non structural production sentences" -- sentences which can go anywhere in the paragraph. The "non structural" sentences are comment sentences and priority sentences.
Production Sentence
BNF Production Sentences
Let's start with a self-description:
production sentence: lhs, /:/, rhs, period.
The usual form of a production sentence is a BNF (Backus-Naur Form) production. [ This document assumes you know what BNF is. If you don't, Wikipedia's article on Backus Naur form (http://en.wikipedia.org/wiki/Backus-Naur_form) is a great place to start. It points out that since Panini gave a precise description of Sanskirt more than 23 centuries earlier in India using a similar notation, it might better be called Panini-Backus Form. ]
A BNF production sentence consists of a left hand side and a right hand side, separated by a colon. The left hand side is the name of a single symbol. The right hand side is a series of symbol names and regex literals, separated by commas, as in the above example, where lhs
and rhs
are symbol names, and /:/
is a regex.
Any symbol name or regex literal in a BNF right hand side can be made optional by preceding it with the keyword optional
. We've already seen several examples of those, but here's another:
default action setting:
optional /the/, /default/, /action/, /is/, action specifier.
In this case the regex /the/
is optional, so that the sentences
the default action is do whatever.
and
default action is do whatever.
are both valid default action setting
's. It is assumed that do whatever
has been defined as a string.
A nullable symbol is one which can produce the empty string. When a nullable symbol is made optional the same thing is done twice at a high cost in efficiency, and Marpa doesn't allow it. The usual fix is to settle on one strategy or the other.
Sequence Production Sentences
Sequences of symbols are extremely common in practical grammars. MDL provides four special forms of the production sentence for dealing with these.
definition paragraph: definition sequence.
non structural production sentences:
optional non structural production sentence sequence.
paragraphs: empty line separated paragraph sequence.
rhs: optional comma separated rhs element sequence.
In sequence production sentences, the right hand side of the production is a sequence description. Only one sequence description is allowed on a right hand side, and it must be the only thing on the right hand side.
This restriction of sequences to sequence productions and of sequence productions to a single sequence is not the result of any limit of the Marpa parser. It would not be hard to allow sequences and optional sequences on the right hand side of any BNF production. What's hard is figuring out how to conveniently specify the semantics in such cases, and that's the reason for the restriction. As the right hand side of a production grows more complex, the semantics becomes more complex to write, more bug-prone, and harder to debug. I'm open to revisiting this and lifting the restriction.
The right hand side of a sequence production can be either a simple sequence or a separated sequence. A simple sequence is designated by a symbol name followed by the keyword sequence
. A separated sequence is designated by the designation of a simple sequence, preceded by a separation specifier. The separation specifier consists of a symbol name for the separator followed by the keyword separated
.
A separation sequence will allow trailing separators, in Perl fashion. For example, a rhs
is specified above as a comma-separated sequence of rhs element
's. Both of the following would parse as valid rhs
's:
optional /the/, /default/, /action/, /is/, action specifier
optional /the/, /default/, /action/, /is/, action specifier,
Preceding any sequence with the keyword optional
means that the entire sequence may be omitted. It might be useful to compare the standard notation for regular expressions: X sequence
is equivalent to X+
and optional X sequence
to X*
.
The value of the child symbols in a sequence production is available to semantic actions in the array referenced by $Parse::Marpa::Read_only::v
. Unlike conventional productions, in sequence productions you don't know in advance the number of right hand side symbols or their indices in the array.
You can deal with the semantics of sequence productions by using operations like join
or map
which work on the entire array. You can also obtain a count of the symbols on the right side and the index of the last symbol in the usual way, that is, with scalar @{$Parse::Marpa::Read_only::v}
and $#$Parse::Marpa::Read_only::v
. Sequence productions work best where the semantics really are those of a sequence of items with similar semantics.
Sequence productions can always be rewritten as BNF productions. In fact, most parsers don't have sequence productions, forcing you to write your sequences as BNF productions. If your right hand side has does not have sequence semantics, you may wish to do exactly that.
Having nullable symbols (symbols which might match an empty string) in a sequence is not a good idea, and Marpa throws an exception if you try it. If you define an X
as a string of one or more non-nullable Y
's, Marpa can recognize a single parse for the string "YY"
. Now suppose Y
is nullable. How many null Y
's are there in "YY"
, and where are they? A nullable in a sequence is almost always an error in grammar design.
But not always. Marpa's most basic test case is exactly that: creation of a sequence of nullables and a tally of the exponentially increasing results. The ban on nullables is only on sequences in sequence productions. Nothing prevents you from specifying a sequence of nullables with conventional productions, either for stress testing or any other use you can find.
Action Sentences
Action sentences describe a semantic action. In MDL, actions are strings containing Perl 5 code. For example, in self.marpa
:
action sentence:
optional /the/, /action/, /is/, action specifier, period.
q{
" action => "
. $Parse::Marpa::Read_only::v->[3]
}.
the last four lines are an action sentence. You can also be more verbose.
action sentence: action specifier, period.
the action is q{
" action => "
. $Parse::Marpa::Read_only::v->[0]
}.
As a reminder, strings are delimited as MDL strings, not as Perl 5 code, even when they are destined to be compiled as Perl 5 code. Many constructs which "escape" the delimiters in Perl 5 code will not do so in an MDL string.
The last value computed in the action will be the value returned by the production. Values from child productions are available through $Parse::Marpa::Read_only::v
, a reference to an array of them. The array contains the value for the symbols of the right hand side, in order. In both the examples above, a string is prepended to the value of the action specifier
, and that becomes the value of action sentence
in any parent production.
All actions start with use integer
and use strict
in effect. All Perl 5 warnings are also in effect except those in the "recursion" category. The code is run in a special namespace special to each parse object, but common to all the user code run within it. Globals in that namespace may be initialized in preamble definitions.
Priority Sentence
A priority sentence is the keyword priority
followed by an integer value, which may be negative. The default priority is 0.
definition: predefined setting, period. q{ $Parse::Marpa::Read_Only::v->[0] }. priority 1000.
In the example above, the rule that produces a predefined setting
from a definition
is given a priority of 1000.
Priorities are used to affect the order in which ambiguous parses are returned, and they are a very important tool in Marpa. The priority setting in the above example directly affects the parsing of this sentence:
a preamble is q{ our $var; }.
This sentence has two parses in self.marpa
. It could be parsed as a string definition, defining the string named a-preamble
. But a
and preamble
could also be recognized as MDL keywords, and in that case the sentence is a setting for Marpa's preamble
predefined, and the literal string would be concatenated to the current value of preamble
predefined.
The ambiguity is resolved by giving predefined setting
's, such as the preamble
setting, a higher priority than string definitions. The above example sets the priority for definition
's which are predefined setting
's to 1000. This is the production paragraph for producing a string definition
from a definition
:
definition: string definition. concatenate lines.
There's no priority sentence here, so string definitions in self.grammar
get the default priority, which is zero.
That means that if a sentence in a definition paragraph can be interpreted both as a setting and a string definition, the setting definition will be the first parse. Only if a definition sentence cannot be interpreted as a setting, will a string definition be the first parse. In many applications, only one parse is used, and when that's the case, whichever parse is first becomes the only parse.
Comment Sentences
Comment sentences may occur anywhere in a production paragraph. They consist of a comment tag, followed by a colon and a series of whitespace-separated "comment words". Comment words may contain anything except whitespace or a period. The comment tags are "to do", "note" and "comment".
note: I may look like a production sentence; but MDL knows the difference.
To force MDL to treat something that looks like a comment sentence as a production, capitalize any letter in the comment tag. To force MDL to treat something that looks like a production as a comment sentence, put some syntax which is illegal in a production sentence anywhere in the comment. The above example has a semi-colon, which no form of any production sentence allows.
TERMINAL PARAGRAPHS
A terminal paragraph is characterized by a terminal sentence. It may also contain comment sentences, as described above. Terminal sentences come in two forms, the regex form and the closure form.
Regex Form of Terminal Sentence
The regex form of a terminal sentence is a symbol name, the keyword matches
, and a regex literal. For example,
comma matches qr/\,/.
These regexes must make no assumptions about where in a string, or where relative to the pos
position, they will be applied. This means that the only anchoring allowed is "^
" and "$
" when the "m"
modifier is in use, so that "^
" and "$
" are able to match start and end of line anywhere in a string. I don't think you'll really want any other forms of anchoring.
Captures should be avoided for efficiency reasons, but the only ones that are actually banned are named captures to buffers with names beginning with "MARPA_
" or any of its capitalization variants ("marpa_
", "MaRpA_
", etc.). These buffers are reserved for Marpa's use.
Closure Form of Terminal Sentence
The closure form of a terminal sentence is the keyword match
; a symbol name; the keyword using
; and finally, a string name or a string literal. For example,
note: change this to use lex_q_quote sometime(?).
match single quoted string using q{
my $match_start = pos $$STRING;
state $prefix_regex = qr/\G'/o;
return unless $$STRING =~ /$prefix_regex/g;
state $regex = qr/\G[^'\0134]*('|\0134')/;
MATCH: while ($$STRING =~ /$regex/gc) {
next MATCH unless defined $1;
if ($1 eq q{'}) {
my $end_pos = pos $$STRING;
my $match_length = $end_pos - $match_start;
my $lex_length = $end_pos - $START;
return (substr($$STRING, $match_start, $match_length), $lex_length);
}
}
return;
}.
The string may contain Perl 5 code specifying a lexing closure, or be one of the special strings lex_q_quote
or lex_regex
.
If the string is lex_q_quote
Marpa uses an internal routine that implements the q-quoting rules described above. If the string is lex_regex
Marpa uses as the closure an internal routine that implements the rules for literal regexes also described above. This means the terminal paragraph for the regex
terminal can be quite simple:
match regex using "lex_regex".
The comment in the previous example notes the code in that literal string could be replaced using an internal routine. It would make self.marpa
simpler. For Marpa to internalize the closures used by MDL makes sense from a number of points of view. It's slightly more efficient and the internalized routines then become conveniently available to other high-level grammar interfaces. But Marpa self-generates from self.marpa
and in that self-generation I like to have a test of lexing closures. If I internalize them all to Marpa, I'll lose that test.
The lexing closure gets the string to be matched as a pos
offset within the string referred to by $STRING
. If you don't know what a pos
offset is, or what \G
does in a Perl 5 regular expression, you will need to go back to the Perl documentation. To write a lexing closure you'll need to know it cold.
$STRING
is a reference because the strings Marpa lexes will often be entire files, and may be very long. Similarly, chopping the string up as each token is lexed would be very inefficient. So Marpa uses the same mechanism used by regexes with the g
modifier to track and move the position for matches.
In writing a lexing closure, keep the following in mind:
Simply matching
$$STRING
won't work, and will cause your parser to fail in almost every case.Always match from the current setting of
pos $STRING
, which is also accessible with\G
anchoring.$STRING
is a reference, and for efficiency should be used that way. The string may be many megabytes long. Avoid any construct that copies it, unless you feel sure your grammar will only be used in applications for which the efficiency hit will be acceptable.Don't change
$$STRING
, or$STRING
. Please.Changes to
pos $$STRING
are OK. You do not need to worry about where your code leaves thepos
position in the string. Marpa moves it around anyway. Marpa will use the length you returned for the match to setpos $$STRING
when it wants to move lexing forward.Return failure with a simple return. If you really need to throw an exception, use
croak()
.On success, return an array of two elements. The first element must be the value of the token at the match. This is often the matched substring, but it can be anything useful. The second element of the array is the length of the token in earlemes (see the CONCEPTS document about earlemes). Lexing closures in MDL can assume that they'll be called via
Parse::Marpa::Parse::text()
and that a one-character-per-earleme model is in use.Negative-length tokens are not allowed. Less obviously, zero-length tokens will mess Marpa up just as totally. Token lengths must be greater than zero.
Lexing closures run in the same namespace as semantic actions, and therefore also see any globals set up in the preamble. "use strict", "use integer" and all warnings except recursion are in effect.
The token length returned under MDL in any grammar interface with a one-character-per-earleme model, must include the length of any prefix removed. For convenience, $START is set to the position before the prefix. If a non-empty prefix was removed, returning just the length of the matched string will cause the lexing to get out of sync and parsing to fail. The code in the above example shows things being done correctly.