NAME

Parse::Earley - Parse any Context-Free Grammar

VERSION

Parse::Earley version 0.15, July 24, 2002. I first began work on this module on July 17, 2002.

SYNOPSIS

use Parse::Earley;

$parser = new Parse::Earley;

# Set the grammar rules to those specified in $grammar
$parser->grammar($grammar);

# Initialize the parser state
$parser->start('mystartrule');

while (1) {
  # Advance the parser state to the next token of $str
  $parser->advance($str);
  print "Parse failed" if $parser->fails($str, 'mystartrule');
  print "Parse succeeded" if $parser->matches($str, 'mystartrule');
}

# Get parse graph
($tree) = $parser->matches($str, 'mystartrule');

DESCRIPTION

Overview

Parse::Earley accepts or rejects a string based on any Context-Free grammar, specified by a simplified yacc-like specification. It provides:

@

Regular expressions or literal strings as terminals,

@

Multiple (non-contiguous) productions for any rule,

@

The ability to extract all possible parse trees for a given input string (the parse graph),

@

Incremental extention of the parsing grammar (especially during a parse),

@

Boolean selection of whether a rule succeeds (but not return values).

@

(And the big win, once again) The ability to use any Context-Free Grammar you choose.

Comparison

When should you use Parse::Earley instead of Parse::RecDescent or Parse::Yapp?

@

When you need to match extremely complex and ambiguous grammars (for instance, a natural language :),

@

When you need all the possible parse trees, not just the left- or right-most.

That's surprisingly few. That's because Earley's algorithm is not economical for most causes regarding Context-Free Grammars. It runs in cubic time (that's slow) for any grammar, and quadratic time for unambiguous grammars.

However, there are no restrictions on the grammar whatsoever. This makes it one of the most popular algorithms for natural language processing. Not to mention, it runs quickly (sometimes moreso than RecDescent or Yapp) if the size of the input is sufficiently small.

Using Parse::Earley

You can create a parser object with the Parse::Earley::new function. It accepts no arguments; it just prepares the state of the parser so you can do more with it. This creation should always succeed.

Next, you should prime it with a grammar. Store the specification in a string, and pass it to the grammar method. The exact syntax of grammars will be discussed shortly. If you call grammar later, it extends the current grammar to include the new rules.

And there's just one more thing to do: Initialize the parser state with a start rule. Just call the start method, passing it the name of the rule with which to initialize the parser.

Now you can start matching the input string, by repeatedly calling the advance method. Each call advances the input by one token, in hope of someday allowing clean introspection.

The methods matched and failed check whether the parser accepts or rejects the input. Note that these can both be false (though they cannot both be true), and always are in the middle of parsing. matched_all checks whether the input match, and that no more parses are still possible; matched_all is more restrictive than matched.

The matched and matched_all functions, if true, return a list of complete states. You can use these to extract the parse graph, as discussed later.

Rules

Considering the early version, and the fact that I don't like hand-made parsers, the grammar specification is quite simple. It is mostly free-form. The only restriction is that productions must start in column 0, and nothing else (except comments) can.

Terminals can be specified using single- or double- quoted strings, as well as the q[] construct (with any delimiters). They can also be specified with a regular expression, using the usual /regex/ syntax, or the m|syntax| (with any delimiters). You cannot use # as a delimiter, because of the simple comment processing.

However, keep away from the null string ''. Because of the way Parse::Earley stores its states, this will not work the way you want. This includes regexes that will match nothing successfully. Just use a null rule alternated with an always-matching pattern.

Unlike Parse::RecDescent, no interpolation happens. But code conditions are still evaluated at runtime, however, their return value is only evaluated for truth, and not stored or pointed to. This may change in a future release.

Here's an example grammar of a simple calculator expression supporting +, -, *, /, and () grouping; it will only match values that will fit in one byte:

input: expr         # this represents the entire input

expr:  mul_expr '+' expr
    |  mul_expr '-' expr

# This is higher precedence.
mul_expr:  term q<*> mul_expr
        |  term q|/| mul_expr

term: '(' expr ')'  |  /\d+/  { $_ < 256 }

In that example you have seen all the features of the grammar specification.

Conditions

Above you saw the condition { $_ < 256 } . If the condition is after a terminal, $_ is set to the text matched by the terminal. If it's after a nonterminal, $_ is set to the parse graph rooted at the nonterminal it follows. This parse graph will only be as complete as it can at that point in the input. You should have to use the latter form.

Keep in mind that conditions are fundamentally different from actions. You can't use them to build a parse tree, because their return values are only boolean. They're there only to promote conditional matching of a rule, for instance, to see whether a particular word is an adjective.

If security is a concern, you may set the value $parser->{no_code} to disable conditions, where $parser is the parser object. You would do this if you allowed web users (like I do) to input their own grammars.

Terminal Separators

Parse::Earley now supports terminal seperators; that is, a pattern that is implicitly matched before each rule. If the directive <noskip> appears before a terminal, the terminal seperator is ignored for that match. However, this doesn't propigate (yet), so you have to put it before a terminal, otherwise it doesn't do any good.

The terminal seperator can be set with $parser->{skip}. It's a regular expression. It defaults to qr/\s*/.

The Parse Graph

Now, as useful as checking whether a certain string can be generated is, some people want to know how. Well, the parse graph tells you all ways you can possibly generate it.

First, after you've successfully matched an input string, get the matching parser state with the matched function. matched returns an array, though there's usually only one such state. So, just listify your variable:

($graph) = $parser->matched($input, 'mystartrule');

or

($graph) = $parser->matched_all($input, 'mystartrule');

if you want to be sure that you have all possible ways of matching.

Of course, you should get the whole list if you're really trying to be robust. Each element refers to a different way to match the start rule. This may change in the future to a "dummy state" that has it's down and left set up appropriately.

$graph should now have a parser state in it, but only if the input was accepted. If not, $graph will be undef.

$graph is a reference to a hash that has many things inside it, specifically everything the parser needs to know about this rule at this position. But you're probably only interested in the array refs $graph->{left}, $graph->{down}, and the scalars $graph->{tok} and $graph->{lhs}.

You traverse the parse graph backwards. If you're wondering why, it's because this is the only way to allow all possible parses in the same data structure. So you start at the end, and move $graph->{left} along it. If $graph->{left} has more than one element, those are possible alternatives at that point.

Essentially, $graph->{left} will move you backwards down the same production.

You'd also like to move down the tree, not just sideways. That's what $graph->{down} is for. It allows you to introspect inside a nonterminal to see what reduced to it. When you go $graph->{down}, you end up at the end of the nonterminal's production. If << $graph->{down} >> has more than one element, each corresponds to the same path as that index in the << $graph->{left} >> array.

$graph->{tok} is the actual text matched if the current state (or "node") refers to a terminal.

$graph->{lhs} is the name of the nonterminal that this state is attached to.

$graph->{tag} may eventually be added, to differentiate between two rules with the same lhs.

Parallel choices are given in canonical order, so if you want the leftmost tree, you always traverse $tree->{left}[0] and $tree->{down}[0]. Along the same lines, the rightmost tree is given by the last element in each array.

The Debugger

If the package variable $Parse::Earley::DEBUG is true, then after each state, Parse::Earley will output a summary of it's current state. For the following grammar:

S: E
E: E '+' T  |  T
T: /\d+/

And the following input:

1 + 2 + 3

After matching the first '+', the debugger output would look like this:

advance(): CURRENT SET
        1 + *  2 +       (3) E: E '+' * T (0)
        1 + *  2 +       (3) T: * /\d+/ (3)
advance(): NEXT SET
     1 + 2 *  + 3        (5) T: /\d+/ * (3)

If you are familiar with Earley's algorithm, this probably already makes sense to you. If not, it's pretty simple:

CURRENT SET represents what the parsers current state is. NEXT SET says what the parser will start with for the next token it matches.

The left side is a portion of the input, with a * stuck in there. The * represents where the parser is in the input. If you're tricky, you can create a grammar where rules in the same set have different input positions, but that's usually not the case.

The right side has a lot of information. The number in parentheses on the left displays the current input position, in numbers. The one on the right says where the rule in between them started (I should probably switch the two). In between, you see the definition of the rule, with another * stuck in there. This represents how much of the rule we've currently matched.

It's kinda fun to enable DEBUG, and put a <STDIN> after each advance. You can keep hitting enter and watch the parser progress.

Performance Issues

What if the parser's too slow? Well, there's one thing that I know of at this point that can speed it up: left recursion. Unlike Parse::RecDescent, Parse::Earley can handle left recursion... and it actually prefers it. If you can optimize your recursive rules to left recursive, you will see a significant performance increase, though it does change how your parse graph is ordered.

Future Versions

This is a useful module, but it is definitely minimalist. In future versions I expect to add:

@

A <commit> directive, to remove extraneous states if the current one matches. This could be a profound performance boost.

@

Perhaps change condition to action, in that, I could store return values. This could be misleading, however, because if you tried syntax-directed translation, you'd be traversing all the parse trees simultaneously.

@

Add m//X modifiers on matches. Right now it is a syntax error to specify /foo/i in a rule.

@

A lookahead string to improve performance, as specified in Jay Earley's paper.

Among many other things.

BUGS

There are undoubtedly bugs. I'm open to bug reports and patches. I'll get around to making the source more readable, so people can actually write patches.

Mail all of that to fibonaci@babylonia.flatirons.org

AUTHORS

Luke Palmer wrote the module and the documentation.

Jay Earley is the author of An Efficient Context-Free Parsing Algorithm. Communications of the ACM, 1970.

Damian Conway wrote Parse::RecDescent, which gave me the inspiration to take this on. The documentation for that module was also used as a loose template for this one.

COPYRIGHT

Copyright (C) 2002, Luke Palmer. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (http://www.perl.com/perl/misc/Artistic.html)