NAME
MDL - The Marpa Demonstration Language
OVERVIEW
The Marpa Demonstration Language (MDL) is a language for decribing grammars to Marpa. It's a high-level Marpa grammar interface -- as of this writing, the only one.
While it is Marpa's first high-level interface, it's intended not to have a privileged status within Marpa. Users can write their own high-level interfaces. It would be easy to write one faster than the MDL. (Hint: don't use ambiguous lexing.) You can also write one that's more powerful. (Just think of some cool feature and add it.)
Some of you might be able to write a high-level interface to Marpa which is just plain nicer than MDL. There are better language designers than me out there. Maybe you're one of them. If so, Marpa was written to empower you.
SELF-PARSING
Not all parsers can parse a language that describes their input grammars. Marpa can. In fact, Marpa's parser starts as the MDL file self.marpa
, and the parser in Marpa is created from it, using the same grammar interface and with the same methods available to the user. I go to some trouble to make sure this claim is really true.
True, an initial bootstrap is needed to create a first generation MDL parser in order to get around chicken-and-egg issues. But the first generation MDL parser then parses self.marpa
a second time to eliminate all traces of the initial bootstrap. The second generation MDL parser then runs self.marpa
into a third generation. Only if the second and third generation parsing code are exactly identical is the Marpa parser considered to be fully self-generated, and only then is it incorporated into a Parse::Marpa distribution.
Most of the examples below are from self.marpa
or previous versions of it.
ELEMENTS OF MARPA GRAMMAR DESCRIPTIONS
Paragraphs and Sentences
An MDL file is divided into paragraphs. Paragraphs are separated by blank lines, that is lines containing only horizontal whitespace. Comments do not count as whitespace for the purpose of separating paragraphs. Paragraphs contain sentences, which must end in a period.
There are definition paragraphs, production paragraphs and terminal paragraphs. Here's an example of a definition paragraph:
semantics are perl5. version is 0.1.59. the start symbol is
grammar.
MDL Keywords
Keywords in MDL are always entirely lower case, even at the beginning of a sentence. This is in keeping with the position Larry Wall took in his 2007 "State of the Onion" talk. The idea is that what the user is doing should be emphasized over the framework of the language.
Note that MDL does not reserve its keywords the way that most computer languages do. A user is free to use MDL keywords for her own purposes, and when context makes the difference clear, may use them in their all lowercase form.
User Specified Names
User specified names may contain any mix of both upper- and lower-case and are case-indifferent. That is, Symbol
is the same user-specified name as SYMBOL
, symbol
and even sYmBoL
. The user may use case as an expressive element, or to distinguish his names from MDL keywords.
Names may be more than one word and may be separated by whitespace, hyphens, or underscores as the user chooses. User names are separation-indifferent as well as case-indifferent. My symbol
, MY_SYMBOL
and my-symbol
are all the same name.
A user specified name may be all lowercase, just like one of the MDL keywords. This allows the user to reuse the MDL keywords for his purposes. Marpa is far more aware of context than most parsers and context can often determine which is the correct choice. When the choice is ambiguous, the MDL keyword takes precedence. The user can force the word to be regarded as part of a user name by capitalizing one or more of its letters.
For example, in MDL's self-definition, the following occurs:
rhs element: Optional rhs element.
As you'll see below, "optional" is an MDL keyword which can be meaningful in that context, and if "optional" were lower case in this example, the Marpa's preferred parse would result in the above sentence being interpreted as a rule which states that a rhs element
can consist either of itself or nothing. (Cycles are not allowed in Marpa, so this MDL would cause Marpa to fail with an error message.)
However, since "Optional" begins with a capital letter, it must be a user name or part of one. And in fact, an "Optional rhs element" is defined later as
optional rhs element: /optional/, rhs symbol specifier.
As will be explained in more detail below, this states that an optional rhs element
is not an optional element at all. It a non-nullable and non-optional element, composed of the keyword optional
followed by a rhs symbol specifier
.
Note that on the left hand side of this second rule, the "optional" in optional rhs element
is not capitalized. The optional
keyword only makes sense on the right hand side. MDL only interprets keywords as keywords where they "make sense".
In human languages, we often use the same word to mean two different things, relying on context to resolve the ambiguity. It makes human languages compact and expressive. The parsing tools in general use today allow the user to use ambiguity only to a very limited degree. In Perl 5, Larry Wall is probably about as aggressive in the use of overloading and ambiguity as anyone using an LALR parser could be.
Literal Strings
Literal strings can be single quoted, double quoted, q
-quoted or qq
-quoted. The syntax is much the same as in Perl 5. MDL recognizes backslashes, and will not terminate a single- or double-quoted string at a delimiting quote when it is preceded by a backslash.
For q
and qq
strings, MDL allows as delimiters everything in the POSIX punct
character class except backslash and the four right hand side bracketing symbols -- angle and square bracket, curly brace and parentheses. (These are the same restrictions that Perl 5 imposes, or at least very close.) Backslashes escape the end delimiters in q
and qq
strings, just as they do in single and double-quoted strings.
Like Perl 5, MDL treats a q
or qq
string with a left-hand bracketing symbol for its opening delimiter as a special case. The corresponding right hand bracketing symbol becomes the end-delimiter. Backslashes escape as usual. Nesting of the brackets within the quote is also tracked and the string will not terminate until there's an unescaped closing bracket at the same nesting level as the opening bracket.
MDL's literal strings are often Perl 5 code, and it must be remembered that MDL does not understand Perl 5 syntax. This means that in complex cases, such as when an end delimiter appears in a Perl 5 character class, or within a string inside code, MDL's idea of where the closing delimiter is may not correspond to what Perl 5's would be. Users should choose, from among MDL's large variety of quoting constructs, one that sidesteps potential issues.
Once a literal string is recognized, it's passed to Perl 5, unaltered and delimited in the same way that it was in MDL. If you, for example, use double quotes in the MDL source file, you can expect Perl 5's interpretation of that string literal to follow the same rules as if you'd specified that string directly to Perl 5.
Literal Regexes
Literal regexes are delimited either by slashes, or by qr
-quoting. For example, the definition sentence
the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.
contains a qr
-quoted regex, and
terminal sentence: symbol phrase, /matches/, regex, period.
contains a slash delimited regex.
The actual evaluation of regexes is done by Perl 5, and it is always done with the qr
-operator. MDL qr
-quoted regexes are passed as is. MDL slash-delimited regexes have qr
prepended to them, so that Perl 5 sees them as qr
-quoted slash-delimited regexes.
Delimiters in MDL regexes follows the same rules as in MDL strings. Regexes accept all characters in the POSIX "punct" character class except backslashes and right bracketing characters as delimiters. Backslashing escapes delimiters in regexes. And bracketing delimiters get the same special treatment as in strings.
Perl 5 regex modifiers are recognized after the closing delimiter, just as in Perl 5. The following modifiers are recognized: msixpo
. They are passed on to Perl 5 exactly as and where found. Here's an example of the m
modifier in use:
empty line matches qr/^[ \t]*\n/m.
DEFINITION PARAGRAPHS
Definition paragraphs contain one for more definition sentences. Definition sentences contain the name of an MDL predefined and its value, separated by the word is
or are
. The name of the predefined may be preceded by the
.
The following sentences all define the semantics of the grammar to be Perl 5.
semantics are perl5.
perl5 is the semantics.
perl5 is semantics.
the semantics are perl5.
the semantics is perl5.
Note that is
or are
always works. MDL can't be bothered figuring out whether semantics is (are?) really singular or plural, and I don't blame it. MDL is similarly nonchalant about is
versus are
for all names, whether they are MDL predefineds or user-specified.
The syntax for all definition sentences obeys the above rules, except as specifically noted otherwise. Each definition takes its own set of values, as described below.
Many definitions takes strings as their value. Where that's the case, the string may be specified either by name or as a literal string.
Semantics Definition
The semantics definition is not optional. Marpa is ultimately targeted to Perl 6, and that is considered its "default" semantics, even though it's not currently available. Currently, the only available semantics is perl5
.
All MDL source requires a line explicitly stating that its semantics are Perl 5, in order to limit problems with old MDL files once Perl 6 becomes available and the default.
Version Definition
version is 0.1.59.
The version definition is also not optional, and as long as Marpa is in alpha, the version has to match exactly. While Marpa is in alpha, versions will change frequently, and features will be volatile. There will be no attempt to maintain compatibility from version to version until Marpa goes beta.
Start Symbol Definition
the start symbol is grammar.
Pedantically speaking, the start symbol is optional, in the sense that it may be specified later using the raw interface. But if no start symbol has been specified by precomputation time, Marpa will fail.
String Definitions
Strings are important in Marpa. In Marpa, semantic actions are specified by strings with Perl 5 code in them.
Here's an example of a string definition:
concatenate lines is q{
my $v_count = scalar @$Parse::Marpa::Read_only::v;
return undef if $v_count <= 0;
join("\n", grep { $_ } @$Parse::Marpa::Read_only::v);
}.
Default Action Definition
You can specify a "default action", that is, an action to be used in rules which do not explicitly specify an action. By default, rules return undefined. The current version of self.marpa
uses this default. Here's the specification of a default action from an earlier version:
concatenate lines is the default action.
In this case, the string specifier is concatenate lines
, the string name defined above.
Default Null Value Definition
A "null value" is the value returned when a symbol matches the empty string. Whenever an empty rule is explicitly specified to Marpa, and an action is given for it, that action is used to calculate the null value of the symbol on the left hand side of that rule. The null value of a symbol is its value whenever it is nulled, whether directly as a result of its own empty rule, or indirectly through a series of other rules.
When a null symbol value is not set explicitly in an empty rule, it defaults to a Perl undefined. This default can be changed. For example, the following changes the default null symbol value to an at-sign.
the default null value is q{@}.
Default Lex Prefix Definition
Terminals can be specified as patterns which the Parse::Marpa::Parse::text()
method will automatically search for. Terminals are allowed to have a "prefix", another pattern which is to be checked for and discarded before the main pattern. A common use for lex prefixes is the elimination of leading whitespace.
The raw interface allows a separate lex prefix to be set for each terminal, and this ability will be added to the MDL. By default, there is no lex prefix for a terminal. (Saying the same thing more pedantically, the default lex prefix is a regular expression which matches the empty string and only the empty string.) This default can be reset with a default lex prefix definition sentence. Here's one from self.marpa
:
the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.
Preamble Definition
Lex actions and rule actions are evaluated is a special namespace created for each parse object. It's sometimes useful to have globals in this namespace initialized before any actions are run. Preamble definition sentences may be used for this purpose. Values in the preamble definitions are strings.
If there is more than one preamble definition in MDL, the values are concatenated. This differ from most definitions in MDL, where any new value overwrites the old one. Also, the article to be used before the name of the predefined it must be a
and not the
. As with other definitions, the article may be omitted. This is the self.marpa
preamble:
a preamble is q{
our $regex_data = [];
}.
PRODUCTION PARAGRAPHS
Here's a self-describing example of a production paragraph, simplified from self.marpa
:
production paragraph:
non structural production sentences,
production sentence,
non structural production sentences,
optional action sentence,
non structural production sentences.
A production paragraph is characterized by a production sentence. This is followed by an optional action sentence. Before and after these may be "non structural production sentences" -- sentences which can go anywhere in the paragraph. Comment sentences and priority sentences are "non structural" sentences.
Production Sentence
Production sentences come in two forms: BNF sentences and sequence sentences.
BNF Sentences
Let's start with another self-description:
production sentence: lhs, /:/, rhs, period.
The usual form of a production sentence is a BNF (Backus-Naur Form) production. (The Marpa documents assume you know what BNF is. If you don't, a good start is Wikipedia's article on Backus Naur form.)
A BNF sentence consists of a left hand side and a right hand side, separated by a colon. The left hand side is the name of a single symbol. The right hand side is a series of symbol names and regex literals, separated by commas, as in the above example, where lhs
and rhs
are symbol names, and /:/
is a regex.
Any symbol name or regex literal in a BNF right hand side can be made optional by preceding it with the keyword optional
. We've already seen several examples of those, but here's another:
default action setting:
optional /the/, /default/, /action/, /is/, action specifier.
In this case the regex /the/
is optional, so that the sentences
the default action is do whatever.
and
default action is do whatever.
are both valid default action setting
's. It is assumed that do whatever
has been defined as a string.
Marpa doesn't allow you to make a symbol optional, if it is nullable. A symbol is nullable if it can produce the empty string. In effect, a nullable symbol already is optional. Making it optional again with the optional
keyword does the same thing twice and would cause a lot of useless wheels inside Marpa.
Sequence Sentences
Sequences of symbols are extremely common in useful grammars. MDL provides four special forms of the production sentence for dealing with these.
definition paragraph: definition sequence.
non structural production sentences:
optional non structural production sentence sequence.
paragraphs: empty line separated paragraph sequence.
rhs: optional comma separated rhs element sequence.
In sequence sentences, the right hand side of the production is a sequence description. Only one sequence description is allowed on a right hand side, and it must be the only thing on the right hand side.
This restriction of sequences to sequence productions and of sequence productions to a single sequence is not the result of any limit of the Marpa parse engine. It would not be hard to allow many sequences and optional sequences on the right hand side of any BNF production. What's hard is figuring out how to conveniently specify their semantics. As the right hand side of a production grows more complex, the semantics becomes more complex to write, more bug-prone, and harder to debug. (I'm open to revisiting this issue and lifting the restriction.)
The right hand side of a sequence production can be either a simple sequence or a separated sequence. A simple sequence is designated by the symbol name of the symbol to be repeated, followed by the keyword sequence
. A separated sequence is designated by the symbol name of a separator followed by the keyword separated
, then the name of the repeated symbol followed by the keyword sequence
.
A separation sequence will allow trailing separators, in Perl fashion. For example, a rhs
is specified above as a comma-separated sequence of rhs element
's. Both of the following would parse as valid rhs
's:
optional /the/, /default/, /action/, /is/, action specifier
optional /the/, /default/, /action/, /is/, action specifier,
Preceding any sequence with the keyword optional
means that the entire sequence may be omitted. In the standard notation for regular expressions X sequence
would be X+
and optional X sequence
would be X*
.
The value of the child symbols in a sequence sentence is available to semantic actions in the array referenced by $Parse::Marpa::Read_Only::v
, just as in a BNF sentence. Unlike in a BNF sentence, in a sequence sentence you don't know in advance the number of right hand side symbols.
You can deal with the semantics of sequence sentences by using operations like join
or map
which work on the entire array. You can also obtain a count of the symbols on the right side in the usual way, with scalar @{$Parse::Marpa::Read_Only::v}
. The index of the last symbol is available as $#$Parse::Marpa::Read_Only::v
. Sequence sentences work best where the semantics really is that of a sequence of semantically equivalent items.
Sequences can always be written as BNF. In fact, most parser generators don't have explicit sequence constructors. If your semantics don't work conveniently as sequence semantics, you may wish to write your sequence out in BNF.
Having nullable symbols (symbols which might match an empty string) in a sequence is not a good idea, and Marpa throws an exception if you try it. If you define an X
as a string of one or more non-nullable Y
's, Marpa can recognize a single parse for the string "YY"
. Now suppose Y
is nullable. How many null Y
's are there in "YY"
, and where are they? A nullable in a sequence is almost always an error in grammar design.
But not always. Marpa's most basic test case is exactly that: creation of a sequence of nullables and a tally of the exponentially increasing results. The ban on nullables in sequences only applies to sequences in sequence productions. Nothing prevents you from specifying a sequence of nullables with BNF productions for stress testing or for any other use you can find.
Action Sentences
Action sentences describe a semantic action. In MDL, actions are strings containing Perl 5 code. For example, in self.marpa
:
action sentence:
optional /the/, /action/, /is/, action specifier, period.
q{
" action => "
. $Parse::Marpa::Read_only::v->[3]
}.
the last four lines are an action sentence. You can also be more verbose.
action sentence: action specifier, period.
the action is q{
" action => "
. $Parse::Marpa::Read_only::v->[0]
}.
The action can be either a rule action or a null symbol action. Which it is depends on the production sentence in the paragraph. If the production sentence is a BNF sentence with an empty right hand side, the action is the null symbol action for the symbol on the left hand side. Otherwise, the action is a rule action.
As a reminder, strings are delimited as MDL strings, not as Perl 5 code, even when they are destined to be compiled as Perl 5 code. Many constructs which "escape" the delimiters in Perl 5 code will not do so in an MDL string.
The last value computed in the action will be the value returned by the production. Values from child productions are available through $Parse::Marpa::Read_only::v
, a reference to an array of them. The array contains the values of the symbols on the right hand side, in order. In both the examples above, a string is prepended to the value of the action specifier
, and that becomes the value of action sentence
in any parent production.
All actions start with use integer
and use strict
in effect. All Perl 5 warnings are also in effect except those in the "recursion" category. The code is run in a special namespace special to each parse object, but common to all the user code run within the parse object. Globals in that namespace may be initialized in preamble definitions.
Priority Sentence
A priority sentence is the keyword priority
followed by an integer value, which may be negative. The default priority is 0.
definition: predefined setting, period.
q{ $Parse::Marpa::Read_Only::v->[0] }.
priority 1000.
In the example above, the rule that produces a predefined setting
from a definition
is given a priority of 1000.
Priorities are used to affect the order in which ambiguous parses are returned, and they are a very important tool in Marpa. The priority setting in the above example directly affects the parsing of this sentence:
a preamble is q{ our $var; }.
This sentence has two parses in self.marpa
. It could be parsed as a string definition, one defining the string named a-preamble
. But a
and preamble
could also be recognized as MDL keywords, and in that case the sentence is a setting for Marpa's preamble
predefined, and the literal string would be concatenated to the current value of the preamble.
The ambiguity is resolved by giving predefined setting
's, such as the preamble
setting, a higher priority than string definitions. The above example sets the priority for definition
's which match predefined setting
's to 1000. Here's the production paragraph for producing a string definition
from a definition
:
definition: string definition. concatenate lines.
There's no priority sentence here, so string definitions get the default priority, which is zero.
That means that if a sentence in a definition paragraph can be interpreted both as a setting and a string definition, the setting's definition will be the first parse. Only if a definition sentence cannot be interpreted as a setting, will the string definition become the first parse. In MDL, as in most applications, only one parse is used. The first parse is the only parse.
Comment Sentences
Comment sentences may occur anywhere in a production paragraph. They consist of a comment tag, followed by a colon and a series of whitespace-separated "comment words". Comment words may contain anything except whitespace or a period. The comment tags are "to do", "note" and "comment".
note: I may look like a production sentence; but MDL knows the difference.
To force MDL to treat something that looks like a comment sentence as a production, capitalize any letter in the comment tag. To force MDL to treat something that looks like a production as a comment sentence, put some syntax which is illegal in a production sentence anywhere in the comment. The above example has a semi-colon, which no form of any production sentence allows.
TERMINAL PARAGRAPHS
A terminal paragraph is characterized by a terminal sentence. It may also contain comment sentences, as described above. Terminal sentences come in two forms, the regex form and the closure form.
Regex Form of Terminal Sentence
The regex form of a terminal sentence is a symbol name, the keyword matches
, and a regex literal. For example,
comma matches qr/\,/.
These regexes must make no assumptions about where in a string, or where relative to the pos
position, they will be applied. This means that the only anchoring allowed is "^
" and "$
" when the "m"
modifier is in use, so that "^
" and "$
" are able to match start and end of line anywhere in a string. I don't think you'll really want any other forms of anchoring.
Captures should be avoided for efficiency reasons, but the only ones that are actually banned are named captures to buffers with names beginning with "MARPA_
" or any of its capitalization variants ("marpa_
", "MaRpA_
", etc.). These buffers are reserved for Marpa's use.
Closure Form of Terminal Sentence
The closure form of a terminal sentence is the keyword match
; a symbol name; the keyword using
; and finally, a string name or a string literal. For example,
note: change this to use lex_q_quote sometime(?).
match single quoted string using q{
my $match_start = pos $$STRING;
state $prefix_regex = qr/\G'/o;
return unless $$STRING =~ /$prefix_regex/g;
state $regex = qr/\G[^'\0134]*('|\0134')/;
MATCH: while ($$STRING =~ /$regex/gc) {
next MATCH unless defined $1;
if ($1 eq q{'}) {
my $end_pos = pos $$STRING;
my $match_length = $end_pos - $match_start;
my $lex_length = $end_pos - $START;
return (substr($$STRING, $match_start, $match_length), $lex_length);
}
}
return;
}.
The string may contain Perl 5 code specifying a lexing closure, or be one of the special strings lex_q_quote
or lex_regex
.
If the string is lex_q_quote
Marpa uses an internal routine that implements the q-quoting rules described above. If the string is lex_regex
Marpa uses as the closure an internal routine that implements the rules for literal regexes also described above. This means the terminal paragraph for the regex
terminal can be very simple:
match regex using "lex_regex".
The comment in the example before last mentions that the code in that literal string could be replaced using an internal routine. Doing so would make self.marpa
simpler. Internalizing the closures used by MDL makes sense from a number of points of view. It's slightly more efficient and internalized routines become available to other high-level grammar interfaces. But Marpa self-generates from self.marpa
and in that self-generation I like to test user-specified lex actions. If I internalize them all, I'll lose that test.
Lexing closures gets the string to be matched as a pos
offset within the string referred to by $STRING
. If you don't know what a pos
offset is, or what \G
does in a Perl 5 regular expression, you will need to go back to the Perl documentation. To write a lex action you'll need to know that stuff.
$STRING
is a reference because the strings Marpa lexes will often be entire files, and may be very long. Copying the string and chopping it up as each token is lexed would be costly. So Marpa uses the same mechanism used by Perl 5 regexes with the g
modifier to track and move the position for matches.
In writing a lexing closure, keep the following in mind:
Simply matching
$$STRING
won't work, and will cause your parser to fail in almost every case.Always match from the current setting of
pos $STRING
, which is accessible with\G
anchoring.$STRING
is a reference, and for efficiency should be used that way. The string may be many megabytes long. Don't use any construct that copies it, unless you feel sure it's OK to limit use of your grammar to applications where efficiency is not an issue.Don't change
$$STRING
, or$STRING
. Please.Changes to
pos $$STRING
are OK. You do not need to worry about where your code leaves thepos
position in the string. Marpa will use the length you returned for the match to setpos $$STRING
when it wants to move lexing forward.Return failure with a bare
return
. If you need to throw an exception, usecroak()
.On success, return an array of two elements. The first element must be the value of the token at the match. This is often the matched substring, but it can be anything useful. The second element of the array must be the length of the token in earlemes (see the CONCEPTS document about earlemes). Lex closures in MDL should assume that a one-character-per-earleme model is in use.
Negative-length tokens are not allowed. Zero-length tokens will mess Marpa up just as totally. Token lengths must be greater than zero.
Lex closures run in the same namespace as rule actions, and therefore also see any globals set up in the preamble.
use strict
,use integer
and all warnings except recursion are in effect.The token length returned must include the length of any prefix removed.
$START
is set to the position before the prefix. If a non-empty prefix was removed, returning just the length of the matched string will usually cause lexing to get out of sync and parsing to fail. The code in the above example shows things being done correctly.
USING MDL SYMBOLS WITH THE RAW INTERFACE
MDL symbol names behave differently from those in the raw interface. MDL symbol names are allowed to vary in capitalization and separation, while every slight variation of capitalization or separation produces a new, unique raw interface name. MDL symbol names have a "canonical form", which MDL uses with the raw interface. Users should always use the Parse::Marpa::MDL::canonical_name()
method to do the conversion.
Parse::Marpa::MDL::canonical_name(MDL_symbol_name)
Parse::Marpa::MDL::canonical_form()
takes as its one argument an MDL symbol name and returns the canonical MDL name, which is also the symbol's raw interface name.
Canonical MDL names are all lowercase, with hyphens for separation. For example, the MDL symbol whose acceptable variants include My symbol
and MY_SYMBOL
is, in canonical form, my-symbol
.
Parse::Marpa::MDL::get_symbol(MDL_symbol_name)
Given an MDL symbol name as its one argument, Parse::Marpa::MDL::get_symbol()
will return the symbol's "cookie". Symbol cookies are needed to use the Parse::Marpa::Parse::earleme()
method.