The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Parse::Stallion - Perl backtracking parser and resultant tree evaluator

SYNOPSIS

NOTE: this is still under the testing phase

  use Parse::Stallion;

  my %rules = (rule_name_1 => {..rule_definition..},
   rule_name_2 => {..rule_definition..}, ...);

  my $stallion = new Parse::Stallion({
    rules_to_set_up_hash => \%rules,
    start_rule => 'rule_name_1'
  });

  my $result = eval {$stallion->parse_and_evaluate({parse_this=>$given_string})};
  if ($@) {
    if ($stallion->parse_failed) {#parse failed};
  }

Rule Definitions:

  {and => ['child_rule_1', 'child_rule_2', ...], evaluation => sub{...}}

  {or => ['child_rule_1', 'child_rule_2', ...], evaluation => sub{...}}

  {multiple => 'child_rule_1', evaluation => sub{...}}

  {leaf => qr/regex/, evaluation => sub{...}}

DESCRIPTION

Stallion parses a string into a parse tree using entered grammar rules. The parsing is done in a depth first search manner, when a rule does not match the parser backtracks to a node that has another rule option. If successfully parsed, the tree is then evaluated in bottom up, left to right order, by calling each node's rule's subroutine.

Some familiarity is assumed with parsing, the grammars recognized are context free and are essentially to Extended Backus Normal Form.

The object being parsed does not need to be a string. Except for the section on non-strings, the documentation assumes strings are being parsed.

COMPLETE EXAMPLE

The following grammar reads in two unsigned integers and adds them.

  use Parse::Stallion;

   my %basic_grammar = (
    expression => {
     and => ['number', {regex_match => qr/\s*[+]\s*/},
      ['number', 'right_number'], {regex_match => qr/\z/}],
      evaluation => sub {return $_[0]->{number} + $_[0]->{right_number}}
    },
    number => {regex_match => qr/\d+/, evaluation => sub{return 0 + $_[0];}}
     #0 + $_[0] converts the matched string into a number
   );

   my $parser = new Parse::Stallion(
   {rules_to_set_up_hash => \%basic_grammar, start_rule => 'expression'});

   my $result = $parser->parse_and_evaluate({parse_this=>'7+4'});
   #$result should contain 11

RULES

There are 4 rule types: 'leaf', 'and', 'or', and 'multiple'.

One rule is the designated start rule from which parsing begins. The start rule can be of any type, though if the start rule is a 'leaf', the grammar is essentially just a regular expression.

After a successful parse, the external parse tree nodes correspond to the 'leaf' rules. The external nodes correspond to the substrings that the 'leaf' rules matched. The other rule types are matched with the internal nodes.

LEAF

A 'leaf' rule contains a regular expression that must match the beginning part of the remaining input string. During parsing, when a 'leaf' is matched, the matched substring is removed from the input string, though reattached if backtracking occurs.

Optionally, a 'leaf' rule can also contain a regular expression for which it must not match.

Examples:

  {rule_type => 'leaf', regex_match => qr/xx\w+/}

and, using a different notation,

  {'leaf' => qr/xx\w+/}

would match any perl word (\w+) starting with "xx".

  {rule_type => 'leaf', regex_match => qr/\w+/,
    regex_not_match => qr/qwerty/}

would match any perl word (\w+) except for those that begin with the string "qwerty".

AND

An 'and' rule contains a list of child rules that must be completely matched, from left to right, for the 'and' rule to match.

Examples (all are equivalent):

  {rule_type => 'and', composed_of => ['rule_1', 'rule_2', 'rule_3']}

  {composed_of => ['rule_1', 'rule_2', 'rule_3']}

  {and => ['rule_1', 'rule_2', 'rule_3']}

  {a => ['rule_1', 'rule_2', 'rule_3']}

Would verify that when the rule is first applied to the parse string that the 3rd character is 'Q'.

OR

An 'or' rule contains a list of child rules, one of which must be matched for the 'or' rule to match.

During parsing, the child rules are attempted to be matched left to right. If a child rule matches and then is subsequently backtracked, the parser will try to match the next child. If there is no next child, the rule is removed from the potential parse tree and the parser backtracks to the 'or' rule's parent.

Examples (equivalent):

  {rule_type => 'or', any_one_of => ['rule_1', 'rule_2', 'rule_3']};

  {any_one_of => ['rule_1', 'rule_2', 'rule_3']};

  {or => ['rule_1', 'rule_2', 'rule_3']};

  {o => ['rule_1', 'rule_2', 'rule_3']};

MULTIPLE (and OPTIONAL)

A 'multiple' rule contains one single child rule which must be matched repeatedly between a minimum and maximum number of times. The default minimum is 0 and the default maximum is unspecified, "infinite".

If the maximum is undef or 0 then there is no limit to how often the rule can be repeated. However, for there to be another repetition, the input string must have been shortened, else it would be considered a form of "left recursion".

Examples (equivalent):

  {rule_type => 'multiple', repeating => 'rule_1'};

  {repeating => 'rule_1'};

  {multiple => 'rule_1'};

  {m => 'rule_1', min=> 0, max => 0};

Examples (equivalent):

  {rule_type => 'multiple', repeating => 'rule_1',
   maximum => 10, minimum => 2};

  {repeating => 'rule_1', maximum => 10, minimum => 2};

  {multiple => 'rule_1', max => 10, minimum => 2};

  {repeating => ['rule_1', 2, 10]};

  {multiple => ['rule_1', 2, 10]};

  {m => ['rule_1', 2, 10]};

One can label a rule with the value 'optional' that maps to a 'multiple' rule with minimum 0 and maximum 1.

Examples (equivalent):

  {optional => 'rule_1'};

  {p => 'rule_1'};

  {rule_type => 'multiple', repeating => 'rule_1',
   min => 0, maximum => 1};

  {m => ['rule_1',0,1]};

In parsing, the child rule being matched is matched as many times as possible up to the maximum. If the parsing backtracks a child node is removed; if the number of child nodes falls below the minimum, all child nodes are removed and the 'multiple' rule node is removed from the parse tree.

Similarity between rule types.

The following rules all parse tree-wise equivalently.

  {rule_type => 'and', composed_of => ['sub_rule']};

  {a => ['sub_rule']};

  {rule_type => 'or', any_one_of => ['sub_rule']};

  {o => ['sub_rule']};

  {rule_type => 'multiple', repeating => ['sub_rule'], min => 1, max => 1};

  {m => ['sub_rule', 1, 1]};

NESTED RULES

Rules can be nested inside of other rules, cutting down on the code required. See the section EVALUATION for how nested rules affect tree evaluations.

To nest a rule, place it inside of a reference to a hash. Example:

  sum => {composed_of => ['number',
    {repeating => {composed_of => ['plus', 'number']}}]}

is equivalent parsing-wise to

  sum => {rule_type => 'and',
   composed_of => ['number', 'plus_numbers']};
  plus_numbers = {rule_type => 'multiple',
    repeating => 'plus_number'};
  plus_number => {rule_type => 'and',
   composed_of => ['plus', 'number']};

One can also use an alias for a rule. This does not affect the parsing, but does affect the names on the parse tree as well as evaluating the parse tree. Example:

  adding =  {rule_type => 'and',
   composed_of => ['number', {regex_match => qr/\s*[+]\s*/},
     ['number', 'right_number']};

RULE NAMES

Avoid naming rules with the substrings '__XX__', '__XY__', or '__XZ__', to avoid confliciting with the derived nested rules' names.

ENSURING RULES FORM COMPLETE GRAMMAR

Stallion ensures that a grammar is complete and 'croak's if the given grammar has any rules not reachable from the start rule or if within any rule a child rule does not exist.

PARSING

After a grammar has been set up, a string can be passed in to be parsed into a parse tree.

Parsing consists of copying the given string into an input string. Then a depth first search is performed following the grammar.

When a 'Leaf' rule is encountered, if the input string matches the rule, a substring is removed and the parsing continues forward; else, backtracking occurs.

It is expected that one will want to parse and evaluate a string but one may just generate a parse tree:

  my $stallion = new Parse::Stallion({
    rules_to_set_up_hash => \%rules,
    start_rule => 'rule_name_1'
  });

  my $results = $stallion->parse({parse_this=>$string_to_parse});

  $results->{parse_succeeded} is 1 if the string parses.
  $results->{parse_failed} is 1 if the string does not parse.
  $results->{tree} contains the parse tree if the string parses.

The tree is an internal object, Parse::Stallion::Talon, which has a function, that converts the parse tree into a string, each node consisting of one line:

  $results->{tree}->stringify({values=>['parse_match'],show_parent=>1});

NUMBER OF PARSE STEPS

One can set the maximum number of steps when parsing. If the parsing reaches the maximum number of steps without completing a parse tree, the parse fails.

A step is an action on a node, roughly speaking matching a regex for a 'leaf' rule, moving forward to check the next rule in and 'and' or 'or' rule, attempting to repeat the specified rule in a 'multiple' rule, or backtracking from a rule.

By default, the maximum number of steps is set to 20,000. The maximum number of steps is set by the max_steps parameter when calling parse or parse_and_evaluate:

  $stallion->parse_and_evaluate({max_steps=>100000, parse_this=>$string});

"LEFT RECURSION"

Stallion does not determine if a grammar is "left recursive" when creating the grammar. It may encounter "left recursiveness" during parsing in which case the parsing stops and a message is 'croak'ed.

"Left recursion" occurs during parsing when the same non-leaf rule shows up a next time on the parse tree and the input string has not changed when it showed up previously.

Illegal Case 1:

     expression => {and => ['expression', 'plus', 'term']};

Illegal Case 2:

     rule_with_empty => { and => ['empty', 'rule_with_empty', 'other_rule'];
     empty => {regex_match => qr//};

Illegal Case 3:

     rule_with_optional => { and => ['nothing', 'optional_rule', 'nothing'];
     nothing => { and => ['empty']};
     empty => {regex_match => qr//};
     optional_rule => {optional => 'some_other_rule'};
     some_other_rule => {regex_match => qr/x/};

The 3rd case will detect left recursiveness if the optional rule does not match.

If during the evaluation 2 'or' nodes are encountered with the the same rule and the same string, the 2nd 'or' node starts its evaluation with the any_one_of choice after the 1st 'or' node.. Brief example:,

     'or_rule' => {'or' => ['a', 'b']};
     'a' => {'and' => ['or_rule', 'x']};

the first time the or_rule is encountered, it first tries child 'a'. When parsing 'a', the 'or_rule' is encountered again, since the parsed string is the same when first encountered, the second 'or_rule' starts its evaluation at 'b'. QUESTION: should the parse grammar be illegal?

ON_START, ON_MATCH, ON_UNMATCH

Optionally, there may be a routine when creating a Stallion that, when parsing begins, is given two paremeters: the object being parsed and a hash, referred to as the match hash.

For each rule, a subroutine 'on_match' may be defined that is called whenever the rule is matched, i.e. after the last child of an 'and' rule. The parameters to the 'on_match' subroutine are the current object being parsed and the match hash. If the rule is a 'leaf', there is a third parameter, the value that is stored with the 'leaf' node.

Whenever a rule re-occurs during backtracking from its parent, the optional subroutine 'on_unmatch' is called with the parameters the current object being parsed and the match hash.

See the code in Parse::Stallion::CVSFH for an example of this.

EVALUATION

After parsing, Stallion can evaluate the parse tree in a bottom up left to right order traversal. When each node is encountered in the traversal, its subroutine is called with the parameters and the returned value of that subroutine will be used as a parameter to its parent subroutine, or in the case of a nested rule, up to the named rule containing the nested rule.

When setting up a rule, one can specify a subroutine to be executed during the evaluation, specified by the parameter 'evaluation' or 'e'.

The parameter to a leaf node's routine is the string the node matched with beginning and trailing white space removed. This removal can be overridden by setting the parameter keep_white_space when creating the object:

  $parser = new Parse::Stallion({keep_white_space => 1});

The parameter to an internal node is a hash consisting of named parameters corresponding to the child rules of a node's rule. If a child rule only occurs once in the definition of its parent rule, the hash parameter is a single value, else the hash parameter corresponds to a reference to an array of all the child values.

By nesting a rule with an alias, the alias is used for the name of the hash parameter instead of the rule name.

Default evaluation routine

If a rule does not have an evaluation routine specified, a default subroutine is used which does one of two things:

  • If the passed in hash reference has only one key, then the value of that key in the hash reference is returned.

  • If the passed in hash reference has more than one key, then the hash reference is returned.

This is the routine:

  sub new_generic_routine {
    my $parameters = shift;
    if (ref $parameters eq 'HASH') {
      if (keys %$parameters == 1) {
        my ($key) = keys %$parameters;
        return $parameters->{$key};
      }
    }
    return $parameters;
  }

To aid in assisting which rules return array refs and which are single values one can call the routine which_parameters_are_arrays.

MORE COMPLICATED EXAMPLE

  The following is a simple calculator:

   %calculator_rules = (
    start_expression => {
      and => ['expression', {regex_match => qr/\z/}],
      evaluation => sub {return $_[0]->{expression}},
     },
    expression => {
      and => ['term', 
       {repeating => {composed_of => ['plus_or_minus', 'term'],},},],
      evaluation => sub {my $to_combine = $_[0]->{term};
       my $plus_or_minus = $_[0]->{plus_or_minus};
       my $value = shift @$to_combine;
       for my $i (0..$#{$to_combine}) {
         if ($plus_or_minus->[$i] eq '+') {
           $value += $to_combine->[$i];
         }
         else {
           $value -= $to_combine->[$i];
         }
       }
       return $value;
      },
     },
    term => {
      composed_of => ['number', 
       {repeating => {composed_of => ['times_or_divide', 'number']}}],
      evaluation => sub {my $to_combine = $_[0]->{number};
       my $times_or_divide = $_[0]->{times_or_divide};
       my $value = shift @$to_combine;
       for my $i (0..$#{$to_combine}) {
         if ($times_or_divide->[$i] eq '*') {
           $value *= $to_combine->[$i];
         }
         else {
           $value /= $to_combine->[$i]; #does not check for zero
         }
       }
       return $value;
      }
    },
    number => {
      regex_match => qr/\s*[+\-]?(\d+(\.\d*)?|\.\d+)\s*/,
      evaluation => sub{ return 0 + $_[0]; }
    },
    plus_or_minus => {
      regex_match => qr/\s*[\-+]\s*/,
    },
    times_or_divide => {
      regex_match => qr/\s*[*\/]\s*/
    },
   );

   $calculator_parser = new Parse::Stallion({
     rules_to_set_up_hash => \%calculator_rules,
     start_rule => 'start_expression'
   });

   my $result = $calculator_parser->parse_and_evaluate({parse_this=>"3+7*4"});
   # $result should contain 31

  my $array_p = $calculator_parser->which_parameters_are_arrays({
    rule_name => 'term'});
  # $array_p would be {number => 'Array', times_or_divide => 'Array'}

  $array_p = $calculator_parser->which_parameters_are_arrays({
    rule_name => 'start_expression'});
  # $array_p would be {expression => 'Single Value'}

PARSING NON-STRINGS

In order to parse something other than a string, three subroutines must be provided: an increasing_value function for ensuring parsing is proceeding correctly, a leaf rule matcher/modifier for when the parser is moving forward, and a 'leaf' rule unmodifier for when the parser is backtracking.

  $stallion->set_handle_object({
    parse_forward =>
     sub {
       my $object_ref = shift;
       my $parameters = shift;
       ...
       return ($true_if_object_matches_rule,
        $value_to_store_in_leaf_node);
     },
    parse_backtrack =>
     sub {
       my ($object_ref, $value_stored_in_leaf_node) = @_;
       ...
      },
    increasing_value_function =>
     sub {
       my $object = shift;
       ...
       return $value_of_object;
     }
  })

When evaluating the parse tree, the parameters to the leaf nodes are the values returned in parse_foward, $value_to_store_in_leaf_node.

'Leaf' Leaf parse forward/backtrack

All leaf rules need to be set up such that when the parser is moving forward and reaches a 'leaf', the 'leaf' rule attempts to match the current input object. If there is a match, then the input object is modified to the object's next state and a value is stored to be called upon later during tree evaluation.

When backtracking, the input object should be reverted to the state before being matched by the 'leaf' rule.

In parsing a string, substrings are removed from the beginning of the string and reattached to the beginning when backtracked.

Increasing_value

A function, called 'increasing_value', must be provided that takes the object being parsed and returns a numeric value that either is unchanged or increases after the 'leaf' rule's match_and_modify_input_object is called.

This function is used to detect and prevent "left recursion" by not allowing a non-leaf rule to repeat at the same value. 'Multiple' rules are prevented from repeating more than once at the same value.

The function also speeds up parsing, cutting down on the number of steps by not repeating dead-end parses. If during the parse, the same rule is attempted a second time on the parse object with the same increasing_value, and the first parse did not succeed, then Stallion will note that the parsing was blocked before and begin backtracking.

In parsing a input string, the negative of the length of the input string is used as the increasing function.

Strings

By default, strings are matched, which is similar to

  $calculator_stallion->set_handle_object({
    parse_forward =>
     sub {
      my $input_string_ref = shift;
      my $rule_definition = shift;
      my $match_rule = $rule_definition->{regex_match} ||
       $rule_definition->{leaf} ||
       $rule_definition->{l};
      if ($$input_string_ref =~ /\A($match_rule)/) {
        my $matched = $1;
        my $not_match_rule = $rule_definition->{regex_not_match};
        if ($not_match_rule) {
          if (!($$input_string_ref =~ /\A$not_match_rule/)) {
            return (0, undef);
          }
        }
        $$input_string_ref = substr($$input_string_ref, length($matched));
        return (1, $matched);
      }
      return 0;
     },

    parse_backtrack =>
     sub {
      my $input_string_ref = shift;
      my $stored_value = shift;
      if (defined $stored_value) {
        $$input_string_ref = $stored_value.$$input_string_ref;
      }
     },

    increasing_value_function => sub {
      my $string = shift;
      return 0 - length($string);
    }
  });

Scanned Arrays

The following lexical analyzer/parser combination illustrates parsing a non-string. The lexical analyzer may "parse" the input string using grammar rules that identify tokens, resulting in a list. The list/array of tokens would then be "parsed". This second parsing would be of an array, not a string.

Parse::Stallion has built in support for parsing a scanned array. This is done by setting the parameter 'scanner' to true when creating the Parse::Stallion object. In addition, 'leaf' rules should be created with 'scan_leaf' instead of 'leaf'.

In parsing a scanned array, the first element of the array is shift'ed off. When backtracking, the element is unshift'ed back onto the array. In parsing a scanned array, the negative of the number of items (0-$#array) is used as the increasing_value function.

AUTHOR

Arthur Goldstein, <arthur@acm.org>

COPYRIGHT AND LICENSE

Copyright (C) 2007-8 by Arthur Goldstein Not released, testing only.

BUGS

Testing phase, please email in bug reports.

TO DO AND FUTURE POSSIBLE CHANGES

This is in the test phase, please send suggestions.

OTHERS

Determine if 'or' rules should have behavior that they cannot be repeated on the same string or they should just not repeat the chosen child on the same string (current behavior).

Document or remove other %results returned.

The code uses global variables, this is done deliberately in order to speed up the search, though not clear if this is really need. Benefits of not having global variables would be possibly cleaner code as well as possibly being thread safe.

Examples of routines: which_parameters_are_arrays, increasing_value, make_sure_all_names_covered, and make_sure_all_rules_reachable.

Have an option to turn off backtracking (on ands?)? This would speed up some grammars, and have the parser behave as a left recursive descent parser?

Simplify documentation by not saying how code works?

Right now Parse::Stallion requires Test::More and Time::Local and perl 5.6 or higher. This is due to the installation test cases and the way the makefile is set up. Should work with earlier versions of perl and neither of those modules is really required.

The test cases that come with the module include some interesting code, such as parsing dates and compiling/running a program. Should expand these out into an examples directory.

Are scan_array's so important as to warrant a separate parameter?

SEE ALSO

Look up Extended Backus-Naur Form notation and trace back from there.

Please send suggestions. What comes to mind is lex, yacc, Parse::RecDescent, ..., other parsers.