NAME

Marpa Demonstration Language - a language for describing grammars to Marpa

BEWARE: THIS DOCUMENT IS UNDER CONSTRUCTION AND VERY INCOMPLETE

THIS DOCUMENT IS UNDER CONSTRUCTION AND VERY INCOMPLETE

OVERVIEW

The Marpa Demonstration Language (MDL) is a language for decribing grammars to Marpa. It's a high-level Marpa grammar interface -- as of this writing, the only one.

While it is Marpa's first high-level interface, it's intended not to have a privileged status within Marpa. Users can write their own high-level interfaces. It would be easy to write a more efficient one than the MDL. (Hint: don't use ambiguous lexing.) It would be also be easy to write a more powerful one. (Hint: just think of some cool feature and add it.)

And, humbling as it is to admit, it may not be all that hard to write a high-level interface to Marpa which is just plain better. There are better language designers than me out there. My goal with Marpa is to empower them.

Not all parsers can parse languages describing their own grammar. Marpa can. In fact, Marpa's parser started as a Marpa Demonstration Language file self.grammar, and the parser in Marpa was parsed using the same grammar interface and with the same methods available to the user. A initial bootstrap is needed to get around chicken-and-egg issues, but the result produced from self.marpa, then parses self.marpa a second time to eliminate all signs of the initial bootstrap. The second-generation Marpa parser it then runs self.marpa through a third generation. Only if the second and third generation parsing code are identical is the Marpa parser considered to be fully self-generated, and incorporated into a Parse::Marpa distribution.

Most of the examples of the Marpa grammar description language below are adopted from self.marpa or previous versions of it.

ELEMENTS OF MARPA GRAMMAR DESCRIPTIONS

Paragraphs and Sentences

The file is divided into paragraphs, separated by blank lines, that is lines containing only horizontal whitespace. Comments do not count as whitespace for the purpose of separating paragraphs. Paragraphs contain sentences, which must end in a period.

There are definition paragraphs, production paragraphs and terminal paragraphs. Definition paragraphs contain one or more definitions. For example,

semantics are perl5.  version is 0.1.59.  the start symbol is
grammar.

Reserved Words

All the reserved words and names in the Marpa grammar description language. are always entirely lower case, even at the beginning of a sentence. This is in line with the position Larry Wall took in his 2007 "State of the Onion" talk. The idea is that what the user is doing should be emphasized over the framework of the language.

User Specified Names

User specified names may contain any mix of both upper- and lower-case and are case-indifferent. That is, "Symbol" is the same user-specified name as "SYMBOL", "symbol" and even "sYmBoL". The user may use case as an expressive element, or to distinguish his names from Marpa's keywords.

Names may be more than one word and may be separated by whitespace or hyphens as the user chooses. User names are separation-indifferent as well as case-indifferent. "My symbol" and "my-symbol" are the same name.

A user specified name may be all lowercase, just like one of Marpa's keywords. This allows the user to reuse the Marpa description's keywords for his purposes. Marpa is far more aware of context than most parsers and context will often determine which is the correct choice. When the choice is ambiguous, the reserved word takes precedence. The user can force the word to be regarded a part of a user name by capitalizing one or more of its letters.

For example, in Marpa's self-definition, the following occurs:

rhs element: Optional rhs element.

As you'll see below, "optional" is a Marpa keyword which can be meaningful in that context, and if "optional" were lower case in this example, the Marpa's preferred parse would result in the above sentence being interpreted as a rule which states that a rhs element can consist either of nothing, or else itself. (Circular units rule and null rules are both legal in Marpa and so is combining them in this way. That's not to say it's a good identify for writing a useful or an efficient grammar.)

However, since "Optional" begins with a capital letter, it must be a user name or part of one. And in fact, an "Optional rhs element" is defined later as

optional rhs element: /optional/, rhs symbol specifier.

As will be explained in more detail below, this states that an optional rhs element is not an optional element at all. It an non-nullable and non-optional element, composed of the keyword "optional" followed by a rhs symbol specifier.

Note that on the left hand side of this second rule, the "optional" in optional rhs element is not capitalized. The "optional" keyword only makes sense on the right hand side. Marpa only interprets keywords as keywords when they "make sense" in context.

This is the same communication strategy we use in human languages. and it helps make human languages compact and expressive. We often use the same word to mean two different things, relying on context to resolve any ambiguity.

The parsing tools in general use today allow the user to use ambiguity only to a very limited degree. In Perl 5, Larry Wall was probably about as aggressive in the use of overloading and ambiguity as anyone using an LALR parser could be.

Literal strings

Literal strings can be single quoted, double quoted, q-quoted or qq-quoted. The syntax is much the same in Perl 5. Marpa recognizes backslashes, and will not terminate a single- or double-quoted string at a delimiting quote when it is preceded by a backslash.

For the q and qq strings, Marpa allows as delimiters everything in the POSIX punct character class, except backslash and the four right hand side bracketing symbols -- angle and square bracket, curly brace and parentheses. (These are the same restrictions that Perl 5 imposes, or are at least very close.) Backslashes escape the end delimiters in q and qq strings, just as they do in single and double-quoted strings.

Like Perl 5, Marpa treats a q or qq string with a left-hand bracketing symbol as a opening delimiter as special cases. The corresponding right hand bracketing symbol becomes the end-delimiter. Backslashes escape as usual. Nesting of the brackets within the quote is also tracked and the string will not terminate until there's an unescaped closing bracket at the same nesting level as the opening bracket.

Marpa's literal strings are often Perl 5 code, and it must be remembered that Marpa does not understand Perl 5 syntax. Treatment of brackets and string delimiters is more complex in Perl 5 than described above and Marpa's ideas of how to deal with them are limited to those described above.

This means that in complex cases, such as when an end delimiter appears in a Perl 5 character class, or within a string inside code Marpa's idea of where the closing delimiter is may not correspond to what Perl 5's would be. Users should choose from among Marpa's large variety of the quoting construct one that sidesteps potential issues.

Once a literal string is recognized, it's passed on unaltered to Perl 5 for the usual Perl 5 interpretation. The string literal is passed to Perl 5 delimited in the same way that it was in Marpa. You can, for example, use single and double quotes in the Marpa source file and expect Perl 5's interpretation of that string literal will follow the same rules as if you'd specified it directly to Perl 5.

Literal regexes

Literal regexes may are delimited either by slashes, or by qr-quoting. For example, the definition sentence

the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.

contains a qr-quoted regex, and

terminal sentence: symbol phrase, /matches/, regex, period.

contains a slash delimited regex.

Their actual evaluation is done by Perl 5. All MDL regexes are qr-quoted before they are passed to Perl 5. qr-quoted regexes are passed as is. Slash delimited regexes have qr-prepended to them and so are passed as qr-quoted slash-delimited regexes.

Treatment of regex delimiters in MDL follows the same rules are for strings, including acceptance as delimiters of all characters in the POSIX "punct" character class except backslashes and right bracketing characters; backslashing; and the special treatment of bracketing delimiters.

DEFINITION SENTENCES

Definition sentences contain the name of a Marpa predefined and its value, separated by the word "is" or "are". The name of the predefined may be preceded by "the".

The following all define the semantics of the grammar to be Perl 5.

semantics are perl5.
perl5 is the semantics.
perl5 is semantics.
the semantics are perl5.
the semantics is perl5.

Note that "is" or "are" always works. Marpa can't be bothered figuring out whether semantics is (are?) really singular or plural, and I think it has the right attitude about this. Marpa is similarly liberal about "is" versus "are" for all names, whether Marpa predefined or user-specified.

The syntax for the other definitions of predefineds obeys the above rules, except where spefically noted otherwise.

Semantics Definition

The semantics definition is not optional. Marpa is ultimately targeted to perl6, and that is considered its "default" semantics, even though it's not currently available. Currently, the only available semantics is perl5.

I require every Marpa grammar description to contain a line explicitly stating that its semantics are Perl 5, in order to limit problems with old Marpa source files once Perl 6 becomes available and the default.

Version Definition

version is 0.1.59.

This also is not optional, and as long as Marpa is in alpha, the version has to match exactly. This causes me a lot of trouble because all the test cases and examples and the bootstraping code must be edited whenever I up the version number, but nonetheless I regard it as a feature. It forces the user to be aware of version changes. This is essential while Marpa is in alpha, because versions will change frequently, and features will be volatile. There will be no attempt to maintain compatibility from version to version until Marpa goes beta.

Start Symbol Definition

the start symbol is grammar.

Pedantically speaking, the start symbol is optional, in the sense that it may be specified later using the raw interface. But if no start symbol has been specified by precomputation time, Marpa will fail.

String Definitions

Strings are used in Marpa for several important purposes. Many of the definitions require strings. "Actions" (the semantics of the rules) are specified as strings with Perl 5 code in them. Custom lexers can also be specified, and these also are strings containing Perl 5 code.

Here's an example of a string definition:

concatenate lines is q{
    my $v_count = scalar @$Parse::Marpa::This::v;
    return undef if $v_count <= 0;
    join("\n", grep { $_ } @$Parse::Marpa::This::v);
}.

Default Action Definition

You can specify a "default action", that is, an action to be used in rules which do not explicitly specify an action. By default, rules return undefined.

The current version of self.marpa uses the default. Here's the specification of a default action from an earlier version.

concatenate lines is the default action. 

In this case, the string specifier is concatenate lines, the string name defined in an earlier example. In any definition which takes a string at the value, the string may be specified either by name or as a literal string.

Default Null Value Definition

A "null value" is the value returned by an empty production. If an empty production is explicitly specified in the source file, its action becomes the "null value" for the symbol on the left hand side, and is used whenever it is nulled, whether directly through its own empty production, or indirectly through a series of other production which ultimately produce the empty string.

When a symbol's "null value" is not explicitly set, it defaults to Perl undefined. This default can be changed. For example, to have all nulled symbols without explicitly set null values evaluate to an at-sign, this definition can be used.

the default null value is q{@}.

Default Lex Prefix Definition

Terminals can be specified as patterns which the Parse::Marpa::Parse::text() method will automatically search for. Terminals are allowed to have a "prefix", another pattern which is to be checked for before the main pattern, but not treated as part of the terminal. A common use for lex prefixes is the elimination of leading whitespace.

The raw interface allows this pattern to be set separately for each terminal, and this ability will be added to the grammar description language. Where the lex prefix is not explicitly given, it defaults is //, a pattern the pattern which recognizes the empty string -- in effect, a no-op. This default can be reset with a default lex prefix definition sentence. Here's one from self.marpa:

the default lex prefix is qr/(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*/.

Preamble Definition

Actions for rules are evaluation is a special namespace created for each parse object. It's sometimes useful to have globals in this namespace initialized before any actions are run. Preamble definition sentences may be used for this purpose. Values in the preamble definitions are strings.

The preamble definition differs slightly from other definitions. In most definitions, a new value overwrites the old one. If there is more than one preamble definiton, the values are concatenated. Also, the article to be used before the name of the predefined it must be "a" and not "the". As with other definitions, the article may be omitted. Here's a simplified version of the self.marpa preamble:

    a preamble is q{
	our %strings;
    }.

PRODUCTION PARAGRAPHS

Here's a self-describing example, simplified from self.marpa:

    production paragraph:
	non structural production sentences,
	production sentence,
	non structural production sentences,
	optional action sentence,
	non structural production sentences.

A production paragraph is characterized by a production sentence. This may optionally be followed by an action sentence. Before and after these may be other "non structural production sentences" -- sentences which can go anywhere in the paragraph. The "non structural" sentences are comment sentences and priority sentences.

Production Sentence

Conventional Production Sentences

Again, let's start with a self-description:

production sentence: lhs, /:/, rhs, period.

A production sentence is a Backus-Naur Form production. (This document assumes you know what that is. If you don't, Wikipedia is a great place to start.) A production sentence consists of a left hand side and a right hand side, separated by a colon. The left hand side is the name of a single symbol.

A standard right hand side is a series of symbol names and regex literals, separated by commas, as in the above example, where lhs and rhs are symbol names, and /:/ is a regex.

Any symbol name or regex literal in a standard right hand side can be made optional by preceding it with the keyword optional. We've already seen several examples of those, but here's another:

default action setting:
optional /the/, /default/, /action/, /is/, action specifier.

If this case the regex /the/ is optional, so that the sentences

the default action is do whatever.

and

default action is do whatever.

are both valid default action setting's if do whatever has been defined as a string.

Making a nullable symbol (one which can match the empty string) basically does the same thing twice and Marpa doesn't allow it. it's usually best in such cases to rewrite the grammar to settle on one strategy or the other.

Sequence production sentences

Sequences of symbols are common in real-life grammars. (As an aside, sequences are not just of special practical interest, but are real keys to understanding in the theory of parsing.) Marpa provides four special forms of the production sentence for dealing with these.

definition paragraph: definition sequence.

non structural production sentences:
optional non structural production sentence sequence.

paragraphs: empty line separated paragraph sequence.

rhs: optional comma separated rhs element sequence.

In sequence production sentences, the right hand side of the production is a sequence description. Only one sequence description is allowed on a right hand side, and it must be the only thing on the right hand side.

(This restriction of sequence productions to a single sequence is not due to any limit of the Marpa parser. Extending productions to allow a series of elements on the right hand side including sequences and optional sequences would not be hard. Less easy is figuring out how to specify the child values for very complex right hand sides to the semantic actions. And it seemed to me the semantic actions in such right hand sides would be so complicated and bug-prone, so that nobody would really want to use the extension. But if there's demand, the one-sequence-to-a-production restriction would not be hard to lift.)

The right hand side of a sequence production may be a sequence or a separated sequence. A simple sequence is designated by a symbol name followed by the keyword sequence. If a symbol sequence is preceded by a symbol name followed by the keyword separated, then the first symbol name is used a separator.

The separated specified allows Perl-style trailing separators. For example, a rhs is specified above as a comma-separated sequence of rhs element's. The above would parse both of the following as valid rhs's:

optional /the/, /default/, /action/, /is/, action specifier
optional /the/, /default/, /action/, /is/, action specifier,

Preceding any sequence with the keyword optional means that the entire sequence may be omitted. In the usual regular expression notation X sequence is roughly equivalent to X+ and optional X sequence to X*.

The value of the child symbols in a sequence production is available in the array referenced by $Parse::Marpa::This::v. Unlike conventional productions, in sequence productions you don't know in advance the number of right hand side symbols or their indices in the array. You can deal with this by using operations like join or map which work on the entire array. You can obtain a count of the symbols on the right side and the index of the last symbol in the usual way, with with scalar @{$Parse::Marpa::This::v} $#$Parse::Marpa::This::v. Sequence productions work best where the semantics really are those of a sequence of semantically identical items. Any sequence production can be rewritten as conventional productions. Most parsers don't have sequence productions, and force you to write your sequences as one or more conventional productions. If your right hand side has more than sequence semantics, you may wish to do exactly that.

Having nullable symbols (symbols which might match an empty string) in a sequence is not a good idea, and Marpa throws an exception if you try it. If you define an X as a string of one or more non-nullable Y's, Marpa can recognize a single parse for the string "YY". Now suppose Y is nullable. How many null Y's are there in "YY", and where are they? A nullable in a sequence is almost always an error in grammar design.

But not always. Marpa's most basic test case is exactly that: creation of a sequence of nullables and a tally of the exponentially increasing results. The ban on nullables is only on sequences in sequence productions. Nothing prevents you from specifying a sequence of nullables with conventional productions, either for stress testing or any other use, if you can find one.

Action sentences

Action sentences describe a semantic action. In the MDL, actions are string containing Perl 5 code. For example, in self.marpa:

    action sentence:
    optional /the/, /action/, /is/, action specifier, period.
    q{
	"    action => "
	. $Parse::Marpa::This::v->[3]
    }.

the last four lines are an action sentence. You can also be more verbose.

    action sentence: action specifier, period.
    the action is q{
	"    action => "
	. $Parse::Marpa::This::v->[0]
In you find sequences of  grammar useful for something other than stress testing,
use conventional productions.
    }.

As a reminder, strings are delimited as Marpa strings, not a Perl 5 code. Many Perl 5 code constructs which you might expect to "escape" the delimiters will not.

The last value computed in the action will be the value returned by the production. Values from child productions are available through $Parse::Marpa::This::v, a reference to an array of them. The array contains the value for the symbols of the right hand side, in order. In both the examples above, a string is prepended to the value of the action specifier, and that becomes the value of action sentence in any parent production.

Users should be warned that direct availability of $Parse::Marpa::This::v is a temporary substitute. A set of convenient macros will soon be provided. Once the macros are available, access to $Parse::Marpa::This::v will be removed.

All actions start with use integer, and use strict in effect. All warnings are also in effect except those in the "recursion" category. The code is run in a special namespace. Globals in that namespace may be initialized in the actions specified in preamble definitions.

Priority Sentence

A priority sentence is the keyword priority followed by an integer value, which may be negative. The default priority is 0.

definition: setting, period.  q{ $Parse::Marpa::This::v->[0] }.  priority 1000.

In the example above, setting's are given a priority of 1000.

Priorities are used to affect the order in which ambiguous parses are returned, and they are a very important tool in Marpa. The priority setting in the above example addresses the parsing of this sentence.

a preamble is q{ our $var }.

This sentence has two parses in self.marpa. a preamble could be a user name, and the above sentence could be a string definition. Or "a" and "preamble" could be parsed as Marpa keywords, and in which case the string would be concatenated to the preamble setting.

The ambiguity is resolved by giving setting's, like the preamble setting, a higher priority than string definitions. That is, if a sentence in a definition paragraph can be interpreted as a setting, that will be the first parse. Since the MDL only uses its first parse, that means a setting becomes the parse in favor of any other.

String definitions in self.grammar have the default priority, zero. Only if a definition sentence cannot be interpreted as a setting, will Marpa fall back to treating it as a string definition.

Comment sentences

Comment sentences may occur anywhere in a production paragraph. They consist of a comment tag, followed by a colon and a series of whitespace-separated "comment words". Comment words may contain anything except whitespace or a period. The comment tags are "to do", "note" and "comment".

note: I may look a lot like a production sentence; Marpa knows the difference.

To force Marpa to treat something that looks like a comment sentence as a production, capitalize any letter in it. To force Marpa to treat something that looks like a production as a comment sentence, stick some illegal syntax somewhere in it. The above example uses a semi-colon.

TERMINAL PARAGRAPHS

A terminal paragraph is characterized by a terminal sentence. It may also contain comment sentences, as described above. Terminal sentences come in two forms, the regex form and the closure form.

Regex Form of Terminal Sentence

The regex form of a terminal sentence is a symbol name, the keyword matches, and a regex literal. For example,

comma matches qr/\,/.

These regex should not use any anchoring. Why you'd need captures, much less named ones, I don't know, but if you really must do so, you must treat all named captures beginning with "MARPA_" in any capitalization variant ("marpa_", "MaRpA_", etc.) are reserved.

Closure Form of Terminal Sentence

The closure form of a terminal sentence is the keyword match, then a symbol name, followed first by the keyword using, then a string name or a string literal. For example,

    note: change this to use lex_q_quote sometime(?).
    match single quoted string using q{
	state $prefix_regex = qr/\G(?:[ \t]*(?:\n|(?:\#[^\n]*\n)))*[ \t]*'/o;
	return unless $$STRING =~ /$prefix_regex/g;
	state $regex = qr/\G[^'\0134]*('|\0134')/;
	MATCH: while ($$STRING =~ /$regex/gc) {
	    next MATCH unless defined $1;
	    if ($1 eq q{'}) {
		my $length = (pos $$STRING) - $START;
		return (substr($$STRING, $START, $length), $length);
	    }
	}
	return;
    }.

The string may contain Perl 5 code specifying a lexing closure, or be one of the special strings lex_q_quote or lex_regex.

If the string is 'lex_q_quote' Marpa uses an internal routine that implements the q-quoting rules described above If the string is 'lex_regex' Marpa uses as the closure an internal routine that implements the rules for literal regexes described above. This makes the terminal paragraph for the regex terminal quite simple:

match regex using "lex_regex".

The comment in the previous example notes the code in that literal string could be replaced using an internal routine. It would make self.marpa simpler, and making all the enclosures used in the MDL Marpa internal would make sense. But Marpa self-generates from self.marpa and I like to have in that self-generation a test of lexing closures. If I make them all internals, I won't have that.

The lexing closure gets the string to be matched as a pos offset within the string referred to by $STRING. If you don't know what a pos offset is, or what \G does in a Perl 5 regular expression, you will need to go back to the Perl documentation. To write a lexing closure you'll need to know it cold.

$STRING is a reference because the strings Marpa lexes will often be entire files, and may be very long. Passing them by copying is not realization. Similarly, repeatedly chopping the string up as each token is lexed would be very inefficient. Instead Marpa uses the mechanism used by the g modifier for regular expressions to keep track of the internal position.

In writing a lexing closure, keep the following in mind:

  • Simply matching from the beginning of $$STRING won't work, and will cause your parser to fail in almost every case.

  • Always search from the current setting of pos $STRING, also accessible with \G anchoring.

  • Return failure to match with a simple return. If you really must throw an exception, use croak().

  • On success, return an array with one or two elements. The first element must be the value of the token at the match. This is often the matched substring, but it can be anything useful.

  • The second element of the array is the length of the token in earlemes (see the CONCEPTS document about earlemes). For lexing closures in MDL, the use of Parse::Marpa::Parse::text() and the one-character-per-earleme model can be assumed. Negative-length tokens are not allowed. Less obviously, zero-length tokens will mess Marpa up just as totally. Token lengths must be greater than zero.

  • If the first element of the array returned on success was a string, the second element is optional. The length of the token will be taken to be the same as the string length.

  • You do not need to worry about where your code leaves the pos position in the string. Marpa has to move it around anyway to implement ambiguous lexing, and will use the length you returned for the match to set pos $$STRING when it wants to move lexing forward.

  • Lexing closures run in the same namespace as semantic actions, and therefore also see any globals set up in the preamble. "use strict", "use integer" and all warnings except recursion are in effect.

  • As of this writing, lexing closures do not honor the default lex prefix. Code which want to remove prefixes, like the code in the example, must explicitly do so. This behavior no longer seems like a good idea. I will probably change it shortly. Since as of that version all lexing closures from previous versions will be broken, you may want to hold off on writing any.

  • $START is set to the position at which the string started, after taking prefix removal into account. Using $START properly will become very significant once prefixes are removed prior to calling the lexing closure. When a prefix was removed, simply returning the matched string or taking its length with length will produce the wrong answer. The code in the above example shows things being done right.