The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Parse::Stallion - Perl backtracking parser and resultant tree evaluator

SYNOPSIS

  use Parse::Stallion;

  my %rules = (rule_name_1 => {..rule_definition..},
   rule_name_2 => {..rule_definition..}, ...);

  my $stallion = new Parse::Stallion({
    rules_to_set_up_hash => \%rules,
    start_rule => 'rule_name_1', #default is rule which is not subrule
    do_evaluation_in_parsing => 0, #default is 0
    remove_white_space => 1, #default is 0
  });

  my $result = $stallion->parse_and_evaluate($given_string);

  my ($value, $parse_info) =
   $stallion->parse_and_evaluate({parse_this => $s, max_steps => 10});

Rule Definitions:

  {and => ['subrule_1', 'subrule_2', ...], evaluation => sub{...}}

  {or => ['subrule_1', 'subrule_2', ...], evaluation => sub{...}}

  {multiple => 'subrule_1', evaluation => sub{...}}

  {leaf => qr/regex/, evaluation => sub{...}}

DESCRIPTION

Stallion parses and evaluates a string using entered grammar rules. The parsing is done top-down via an initial start rule, in a depth first search forming a parse tree. When a rule does not match the parser backtracks to a node that has another option.

The evaluation subroutines are given a reference to a hash representing the returned values of the named sub-nodes. The evaluation subroutine for each node may be done while creating the parse tree and reject a match affecting which strings parse. This allows complex grammars.

If the evaluation is not done while parsing, if the string is successfully parsed, the parse tree is evaluated in bottom up, left to right order, by calling each tree node's rule's subroutine.

The grammars recognized are context free and are similar to Extended Backus Normal Form.

The object being parsed does not need to be a string. Except for the section on non-strings, the documentation assumes strings are being parsed.

COMPLETE EXAMPLES

The following examples read in two unsigned integers and adds them.

  use Parse::Stallion;

   my %basic_grammar = (
    expression => {
     and => ['number',
       {regex_match => qr/\s*\+\s*/},
      'number'],
      evaluation =>
       sub {return $_[0]->{number}->[0] + $_[0]->{number}->[1]}
    },
    number => {regex_match => qr/\d+/,
      evaluation => sub{return 0 + $_[0];}}
     #0 + $_[0] converts the matched string into a number
   );

   my $parser = new Parse::Stallion(
   {rules_to_set_up_hash => \%basic_grammar});

   my $result = $parser->parse_and_evaluate('7+4');
   #$result should contain 11

   my %grammar_2 = (
    expression => {
     and => ['number',
      {regex_match => qr/\s*\+\s*/},
      ['number', 'right_number']],
      evaluation => sub {return $_[0]->{number} + $_[0]->{right_number}}
    },
    number => {regex_match => qr/\d+/,
      evaluation => sub{return 0 + $_[0];}}
   );

   my $parser_2 = new Parse::Stallion(
   {rules_to_set_up_hash => \%grammar_2, start_rule => 'expression'});

   my $result_2 = $parser_2->parse_and_evaluate('8+5');
   #$result_2 should contain 13

RULES

There are 4 rule types: 'leaf', 'and', 'or', and 'multiple'.

One rule is the designated start rule from which parsing begins. If the 'start_rule' parameter is omitted, the one rule which is not a subrule is used as the start rule. The start rule can be of any type, though if the start rule is a 'leaf', the grammar is essentially just a regular expression.

After a successful parse, the external nodes correspond to the substrings that the 'leaf' rules matched; the other rule types correspond to the internal nodes.

LEAF

A 'leaf' rule contains a regular expression that must match the beginning part of the remaining input string. During parsing, when a 'leaf' matches, the matched substring is removed from the input string, though reattached if backtracking occurs.

Optionally, a 'leaf' rule can also contain a regular expression for which it must not match.

Examples:

  {regex_match => qr/xx\w+/}

and, using a different notation,

  {'leaf' => qr/xx\w+/}

would match any perl word (\w+) starting with "xx".

  {regex_match => qr/\w+/, regex_not_match => qr/qwerty/}

would match any perl word (\w+) except for those that begin with the string "qwerty".

AND

An 'and' rule contains a list of subrules that must be completely matched, from left to right, for the 'and' rule to match.

Examples (equivalent):

  {and => ['rule_1', 'rule_2', 'rule_3']}

  {a => ['rule_1', 'rule_2', 'rule_3']}

OR

An 'or' rule contains a list of subrules, one of which must be matched for the 'or' rule to match.

During parsing, the subrules are attempted to be matched left to right. If a subrule matches and then is subsequently backtracked, the parser will try to match the next subrule. If there is no next subrule, the rule is removed from the potential parse tree and the parser backtracks to the 'or' rule's parent.

Examples (equivalent):

  {or => ['rule_1', 'rule_2', 'rule_3']};

  {o => ['rule_1', 'rule_2', 'rule_3']};

MULTIPLE (and OPTIONAL)

A 'multiple' rule matches if each of its subrules match repeatedly between a minimum and maximum number of times. The default minimum is 0 and the default maximum is "infinite".

If the maximum is undef or 0 then there is no limit to how often the subrules can be repeated. However, for there to be another repetition, the input string must have been shortened, else it would be considered an illegal form of "left recursion".

If the parameter 'match_min_first' is not set the maximal number of allowed/possible matches of the repeating rule are tried and then if backtracking occurs, the number of matches is decremented by one. Else, if 'match_min_first' is set, the minimal number of matches is tried first and the number of matches increases when backtracking.

Examples (equivalent):

  {multiple => ['rule_1', 'rule_2']};

  {m => ['rule_1', 'rule_2']};

  {m => ['rule_1', 'rule_2], match_min_first => 0,
    minimum_children=> 0, maximum_children => 0,
    minimum_child_count => 0, maximum_child_count => 0};
  }

One can label a rule with the value 'optional' that maps to a 'multiple' rule with minimum 0 and maximum 1.

Examples (equivalent):

  {optional => 'rule_1'};

  {zero_or_one => 'rule_1'};

  {z => 'rule_1'};

  {'multiple' => 'rule_1',
   minimum_child_count => 0, maximum_child_count => 1,
   minimum_children => 0, maximum_children => 1};

SIMILARITY BETWEEN RULE TYPES.

The following rules all parse tree-wise equivalently.

  {and => ['subrule']};

  {o => ['subrule']};

  {m => 'subrule',
   minimum_child_count => 1, maximum_child_count => 1,
   minimum_children => 1, maximum_children => 1};

The following are equivalent:

  {and => ['subrule 1','subrule 2','subrule 3']};

  {'multiple' => ['subrule 1','subrule 2','subrule 3'],
   minimum_child_count => 1, maximum_child_count => 1};

The following are equivalent:

  {or => ['subrule 1','subrule 2','subrule 3']};

  {'multiple' => ['subrule 1','subrule 2','subrule 3'],
   minimum_child_count => 0, maximum_child_count => 1
   minimum_children => 1, maximum_children => 1};

NESTED RULES

Rules can be nested inside of other rules. See the section EVALUATION for how nested rules affect tree evaluations.

To nest a rule, place it inside of a reference to a hash. Example:

  sum => {and => ['number',
    {multiple => {and => ['plus', 'number']}}]}

is equivalent parsing wise to

  sum => { and => ['number', 'plus_numbers']};
  plus_numbers = { multiple => 'plus_number'};
  plus_number => { and => ['plus', 'number']};

One can also use an alias for a rule. This sets the name used when evaluating the parsed expression, but does not affect the parsing.

Example:

  adding =  {
   and => ['number', {regex_match => qr/\s*[+]\s*/},
     ['number', 'right_number'],
   e => sub {return $_[0]->{number} + $_[0]->{right_number}}
  };

RULE NAMES

Avoid naming rules with the substring '__XZ__', to avoid confliciting with the internal names.

ENSURING RULES FORM COMPLETE GRAMMAR

Stallion ensures that a grammar is complete and 'croak's if the given grammar has any rules not reachable from the start rule or if within any rule a subrule does not exist.

PARSE_AND_EVALUATE

After a Parse::Stallion is set up, strings are parsed and evaluated via parse_and_evaluate. In scalar context, the returned value is the returned value of the top node's evaluation routine.

RETURNED VALUES

If parse_and_evaluate is called in wantarray context, the first value returned in the result of the evaluation and the second is information on the parse.

  my ($value, $parse_info) =
   $stallion->parse_and_evaluate($given_string);

  $parse_info->{parse_succeeded}; # is 1 if the string parses.
  $parse_info->{parse_failed}; # is 1 if the string does not parse.
  $parse_info->{number_of_steps}; # number of steps parsing took
  $parse_info->{unparsed}; # unmatched part of string, see END_OF_PARSING
  $parse_info->{start_rule};

  $parse_info->{parse_trace};
  # reference to array of hashes showing each step, the hash keys are
  #  1) rule_name
  #  2) moving_forward (value 0 if backtracking),
  #  3) moving_down (value 0 if moving up parse tree)
  #  4) value (display_value_function of parsed object,
  #     by default the yet unparsed portion of the input string)

  $parse_info->{tree}; # the parse tree if the string parses.

The tree is an internal object having a function, that converts the tree into a string, each node consisting of one line:

  $results->{tree}->stringify({values=>['name', 'parse_match']});

This displays the names and for leaf nodes the matched string. Internal node names will show up.

NUMBER OF PARSE STEPS

If the parsing reaches the maximum number of steps without completing a parse tree, the parse fails.

A step is an action on a node, roughly speaking matching a regex for a 'leaf' rule, or moving forward or backtracking from a rule.

By default, the maximum number of steps is set to 20,000. The maximum number of steps can be changed:

  $stallion->parse_and_evaluate({max_steps=>100000, parse_this=>$string});

"PARSE_FUNCTION"

One can pass in a subroutine parse_function that is executed at every parse step and takes the current object as the parameter and its returned value is the new current object.

   my $step = 1;
   my $parser = new Parse::Stallion(
   {rules_to_set_up_hash => \%basic_grammar, start_rule => 'expression',
    parse_function => sub {my $o = shift;print "step ".$step++."\n";return $o}
   });

Parse::Stallion::CSVFH uses this to read from a file handle.

"LEFT RECURSION"

Parse::Stallion may encounter "left recursiveness" during parsing in which case the parsing stops and a message is 'croak'ed.

"Left recursion" occurs during parsing when the same non-'leaf' rule shows up a second time on the parse tree with the "same" input string, measured by the increasing_value_function.

Illegal Case 1:

     expression => {and => ['expression', 'plus', 'term']};

Illegal Case 2:

     rule_with_empty => { and => ['empty', 'rule_with_empty', 'other_rule'];
     empty => {regex_match => qr//};

Illegal Case 3:

     rule_with_optional => { and => ['nothing', 'optional_rule', 'nothing'];
     nothing => { and => ['empty']};
     empty => {regex_match => qr//};
     optional_rule => {optional => 'some_other_rule'};
     some_other_rule => {regex_match => qr/x/};

The 3rd case will detect left recursiveness if the optional rule does not match and modify the input.

EVALUATION

Evaluation can occur during or after parsing.

If after parsing, Stallion will evaluate the parse tree in a bottom up left to right order traversal. When each node is encountered in the traversal, the node's subroutine is called with the parameters and the returned value of that subroutine will be used as a parameter to the node's parent's subroutine, or in the case of a nested rule, up to the ancestral node's subroutine with the named rule containing the nested rule.

When setting up a rule, one can specify a subroutine to be executed during the evaluation, specified by the parameter 'evaluation', 'eval', or 'e'.

The parameter to a leaf node's evaluation routine is the string the node matched. The beginning and trailing white space can be removed before being passed to the routine by by setting the parameter remove_white_space when creating the parser:

  $parser = new Parse::Stallion({remove_white_space => 1});

The parameter to an internal node is a hash consisting of named parameters corresponding to the subrules of a node's rule. If a subrule can only occur once in the parse tree as a child node of a rule's node, the hash parameter is a single value, else the hash parameter corresponds to an array reference to an array.

By nesting a rule with an alias, the alias is used for the name of the hash parameter instead of the rule name.

EVALUATION DURING PARSING

If the do_evaluation_in_parsing is set when a Parse::Stallion object is created the evaluation occurs during the parsing instead of afterwards.

Every time a node is matched, its evaluation routine is called as it would be during evaluation after parsing. This is possible because a node cannot be matched until all of its children are matched.

The evaluation routine may return a second parameter that tells Parse::Stallion to reject or not reject the match. This allows more control over what can be parsed.

EVALUATION DURING PARSING EXAMPLE

The following is an example of a more complicated "grammar" being parsed.

The first statement tells a truth about the number of elements in a list, the second tells whether or not the first statement is true. If the second statement is true, the string parses.

  my %parsing_rules = (
   start_expression => {
    and => ['two_statements', {leaf => qr/\z/}],
    evaluation => sub {return $_[0]->{'two_statements'}},
   },
   two_statements => {
     and=> ['list_statement','truth_statement'],
     evaluation => sub {
       if ($_[0]->{list_statement} != $_[0]->{truth_statement}) {
         return (undef, 1);
       }
       return 1;
     }
   },
   list_statement => {
     and => ['count_statement', 'list'],
     evaluation => sub {
       if ($_[0]->{count_statement} == scalar(@{$_[0]->{list}})) {
         return 1;
       }
       return 0;
     }
   },
   count_statement => {
     and => [{leaf=>qr/there are /i},'number',{l=>qr/ elements in /}],
     evaluation => sub {
       return $_[0]->{number};
     }
    },
   number => {
    leaf=>qr/\d+/,
     evaluation => sub { return 0 + shift; }
   },
   list => {and => ['number', {multiple=>{and=>[{l=>qr/\,/}, 'number']}}],
    evaluation => sub {return $_[0]->{number}}
   },
   truth_statement => {
     or => [{l=>qr/\. that is the truth\./, alias=>'t'},
      {l=>qr/\. that is not the truth\./, alias=>'t'}],
     evaluation => sub {
       if ($_[0]->{t} =~ /not/) {
         return 0;
       }
       return 1;
     }
   },
  );
  
  my $how_many_parser = new Parse::Stallion({
    do_evaluation_in_parsing => 1,
    rules_to_set_up_hash => \%parsing_rules,
    start_rule => 'start_expression',
  });
  
  $result = $how_many_parser->parse_and_evaluate(
    "there are 5 elements in 5,4,3,2,1. that is the truth.");
  
  print "$result should be 1\n";
  
  $result = $how_many_parser->parse_and_evaluate(
    "there are 5 elements in 5,4,3,1. that is not the truth.");
  
  print "$result should be 1\n";
  
  $result = $how_many_parser->parse_and_evaluate(
    "there are 5 elements in 5,4,3,1. that is the truth.");
  
  print "$result should be undef\n";

DEFAULT EVALUATION ROUTINE

If a rule does not have an evaluation routine specified, a generic subroutine is used. Two routines are provided:

Generic Evaluation Routine 1

  • If the passed in hash reference has only one key, then the value of that key in the hash reference is returned.

  • If the passed in hash reference has more than one key, then the hash reference is returned.

This is the routine:

  sub {
    my $parameters = shift;
    if (ref $parameters eq 'HASH') {
      if (keys %$parameters == 1) {
        my ($key) = keys %$parameters;
        return $parameters->{$key};
      }
    }
    return $parameters;
  }

Generic Evaluation Routine 2

The second default evaluation routine will be used if do_not_compress_eval is set when creating the parser. It does not remove the names of the rules and the code is simply:

  sub return_self_routine {
    return shift;
  }

Example:

  my %no_eval_rules = (
   start_rule => { and => ['term',
    {multiple => {and => [[{regex_match=>qr/\s*\+\s*/},'plus'], 'term']}}]},
   term => { and => [['number','left'],
    {multiple => {and => [[{regex_match=>qr/\s*\*\s*/},'times'],
     ['number','right']]}}]},
   number => {regex_match => qr/\s*\d*\s*/},
  );                               
                      
  my $no_eval_parser = new Parse::Stallion({do_not_compress_eval => 0,
   rules_to_set_up_hash => \%no_eval_rules,
   });

  my $no_eval_parser = new Parse::Stallion({do_not_compress_eval => 0,
   rules_to_set_up_hash => \%no_eval_rules,
   });
  
  $result = $no_eval_parser->parse_and_evaluate({parse_this=>"7+4*8"});
   
  #$result contains:
  { 'plus' => [ '+' ],
    'term' => [ '7',
                { 'left' => '4',
                  'right' => [ '8' ],
                  'times' => [ '*' ] } ] }

  my $dnce_no_eval_parser = new Parse::Stallion({do_not_compress_eval => 1,
   rules_to_set_up_hash => \%no_eval_rules,
   });

  $result = $dnce_no_eval_parser->parse_and_evaluate({parse_this=>"7+4*8"});

  #$result contains:
  { 'plus' => [ '+' ],
    'term' => [ { 'left' => '7' },
                { 'left' => '4',
                  'right' => [ '8' ],
                  'times' => [ '*' ] } ] }

Parameters to Evaluation Routines

To aid in assisting which rules return array refs and which are single values one can call the routine which_parameters_are_arrays.

MORE COMPLICATED EXAMPLE

  The following is a simple calculator:

   %calculator_rules = (
    start_expression => {
      and => ['expression', {regex_match => qr/\z/}],
      evaluation => sub {return $_[0]->{expression}},
     },
    expression => {
      and => ['term', 
       {repeating => {and => ['plus_or_minus', 'term'],},},],
      evaluation => sub {my $to_combine = $_[0]->{term};
       my $plus_or_minus = $_[0]->{plus_or_minus};
       my $value = shift @$to_combine;
       for my $i (0..$#{$to_combine}) {
         if ($plus_or_minus->[$i] eq '+') {
           $value += $to_combine->[$i];
         }
         else {
           $value -= $to_combine->[$i];
         }
       }
       return $value;
      },
     },
    term => {
      and => ['number', 
       {repeating => {and => ['times_or_divide', 'number']}}],
      evaluation => sub {my $to_combine = $_[0]->{number};
       my $times_or_divide = $_[0]->{times_or_divide};
       my $value = shift @$to_combine;
       for my $i (0..$#{$to_combine}) {
         if ($times_or_divide->[$i] eq '*') {
           $value *= $to_combine->[$i];
         }
         else {
           $value /= $to_combine->[$i]; #does not check for zero
         }
       }
       return $value;
      }
    },
    number => {
      regex_match => qr/\s*[+\-]?(\d+(\.\d*)?|\.\d+)\s*/,
      evaluation => sub{ return 0 + $_[0]; }
    },
    plus_or_minus => {
      regex_match => qr/\s*[\-+]\s*/,
    },
    times_or_divide => {
      regex_match => qr/\s*[*\/]\s*/
    },
   );

   $calculator_parser = new Parse::Stallion({
     rules_to_set_up_hash => \%calculator_rules,
     start_rule => 'start_expression'
   });

   my $result = $calculator_parser->parse_and_evaluate("3+7*4");
   # $result should contain 31

  my $array_p = $calculator_parser->which_parameters_are_arrays({
    rule_name => 'term'});
  # $array_p would be {number => 'Array', times_or_divide => 'Array'}

  $array_p = $calculator_parser->which_parameters_are_arrays({
    rule_name => 'start_expression'});
  # $array_p would be {expression => 'Single Value'}

END_OF_PARSING

By default, the entire string being parsed must be matched. This can be changed with the end_of_parse_allowed parameter which is a routine that is given the string/object being parsed.

The following example would let any string that generated a parse tree match:

   $aa_parser = new Parse::Stallion({
     rules_to_set_up_hash => {start_rule => {l => qr/aa/}},
     end_of_parse_allowed => sub {return 1},
   });
  
  my ($results, $info) = $aa_parser->parse_and_evaluate('aab');
  #$info->{unparsed} contains 'b'; without end_of_parse_allowed no match

If the parse_this parameter to parse_and_evaluate is a reference to a string, after a parse, the reference will contain the unparsed string.

  my $x = 'aabb';
  my $y = $aa_parser->parse_and_evaluate($x);
  #$x will contain 'aabb', $y contains 'aa'
  $y = $aa_parser->parse_and_evaluate(\$x);
  #$x will contain 'bb', $y contains 'aa'
  $x = 'aabb';
  $y = $aa_parser->parse_and_evaluate({parse_this => \$x});
  #$x will contain 'bb', $y contains 'aa'

PARSING NON-STRINGS

Four subroutines can be provided: an increasing_value function for ensuring parsing is proceeding correctly, a leaf rule matcher/modifier for when the parser is moving forward, a 'leaf' rule unmodifier for when the parser is backtracking, and a display_value function for what is being parsed. One can parse non-strings this way.

  my $object_parser = new Parse::Stallion({
    ...
    parse_forward =>
     sub {
       my $object_ref = shift;
       my $parameters = shift;
       ...
       return ($true_if_object_matches_rule,
        $value_to_store_in_leaf_node);
     },
    parse_backtrack =>
     sub {
       my ($object_ref, $rules, $value_stored_in_leaf_node) = @_;
       ...
      },
    increasing_value_function =>
     sub {
       my $object = shift;
       ...
       return $value_of_object;
     },
    display_value_function => # for parse_trace
     sub {
      my $object = shift;
      ...
      return "Value of object now ...";
     }
  });

When evaluating the parse tree, the parameters to the leaf nodes are the values returned in parse_foward, $value_to_store_in_leaf_node.

'LEAF' LEAF PARSE FORWARD/BACKTRACK

All leaf rules need to be set up such that when the parser is moving forward and reaches a 'leaf', the 'leaf' rule attempts to match the current input object. If there is a match, then the input object is modified to the object's next state and a value is stored to be called upon later during tree evaluation.

When backtracking, the input object should be reverted to the state before being matched by the 'leaf' rule. This requirement can be overridden by setting the backtrack_can_change_value parameter when setting up a new Parse::Stallion.

In parsing a string, substrings are removed from the beginning of the string and reattached to the beginning when backtracked.

INCREASING_VALUE FUNCTION

A function, called 'increasing_value', must be provided that takes the object being parsed and returns a numeric value that either is unchanged or increases after the 'leaf' rule's match_and_modify_input_object is called.

This function is used to detect and prevent "left recursion" by not allowing a non-leaf rule to repeat at the same value. 'Multiple' rules are prevented from repeating more than once at the same value.

The function also speeds up parsing, cutting down on the number of steps by not repeating dead-end parses. If during the parse, the same rule is attempted a second time on the parse object with the same increasing_value, and the first parse did not succeed, then Stallion will note that the parsing was blocked before and begin backtracking.

In parsing a input string, the negative of the length of the input string is used as the increasing function.

STRINGS

By default, strings are matched, which is similar to

  my $calculator_stallion = new Parse::Stallion({
    ...
    parse_forward =>
     sub {
      my $input_string_ref = shift;
      my $rule_definition = shift;
      my $match_rule = $rule_definition->{regex_match} ||
       $rule_definition->{leaf} ||
       $rule_definition->{l};
      if ($$input_string_ref =~ /\A($match_rule)/) {
        my $matched = $1;
        my $not_match_rule = $rule_definition->{regex_not_match};
        if ($not_match_rule) {
          if (!($$input_string_ref =~ /\A$not_match_rule/)) {
            return (0, undef);
          }
        }
        $$input_string_ref = substr($$input_string_ref, length($matched));
        return (1, $matched);
      }
      return 0;
     },

    parse_backtrack =>
     sub {
      my $input_string_ref = shift;
      my $rule_definition = shift;
      my $stored_value = shift;
      if (defined $stored_value) {
        $$input_string_ref = $stored_value.$$input_string_ref;
      }
     },

    increasing_value_function => sub {
      my $string = shift;
      return 0 - length($string);
    },

    display_value_function => sub display_value {
      my $value = shift;
      return $value;
    }

  });

PERL Requirements

Parse::Stallion's installation uses Test::More and Time::Local requiring perl 5.6 or higher. Parse::Stallion should work with earlier versions of perl and neither of those modules is really required.

AUTHOR

Arthur Goldstein, <arthur@acm.org>

COPYRIGHT AND LICENSE

Copyright (C) 2007-8 by Arthur Goldstein

BUGS

Please email in bug reports.

TO DO AND FUTURE POSSIBLE CHANGES

Please send in suggestions.

SEE ALSO

example directory

Parsing texts, including References to Extended Backus-Naur Form notation.

Parse::Stallion::CSV and Parse::Stallion::CSVFH.

Perl 6 grammars.

lex, yacc, ..., other parsers.