The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Parse::Stallion - Backtracking parser with during or after evaluation

SYNOPSIS

  use Parse::Stallion;

  my %rules = (rule_name_1 => ..rule_definition..,
   rule_name_2 => ..rule_definition..,
   ...);

  my $stallion = new Parse::Stallion({
    rules_to_set_up_hash => \%rules,
    start_rule => 'rule_name_1', #default is the rule which is not a subrule
    do_evaluation_in_parsing => 0, #default is 0
    remove_white_space => 1, #default is 0
  });

  my $result = $stallion->parse_and_evaluate($given_string);

  my ($value, $parse_info) =
   $stallion->parse_and_evaluate({parse_this => $s, max_steps => 10});

Rule Definitions:

  AND('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))

  OR('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))

  MULTIPLE('subrule_1', EVALUATION(sub{...}))

  LEAF(qr/regex/, EVALUATION(sub{...}))

DESCRIPTION

Stallion parses and evaluates a string using entered grammar rules. The parsing is done top-down via an initial start rule, in a depth first search forming a parse tree. When a rule does not match the parser backtracks to a node that has another option.

For evaluating a tree node, the evaluation subroutine is given a reference to a hash representing the returned values of the child nodes. The evaluation may be done while creating the parse tree and reject a match affecting which strings parse. This allows complex grammars.

If the evaluation is not done while parsing, on a successful parse, the tree is evaluated in bottom up, left to right order.

The grammars recognized are context free and are similar to those expressed in Extended Backus-Naur Form.

The object being parsed does not need to be a string. Except for the section on non-strings, the documentation assumes strings are being parsed.

COMPLETE EXAMPLES

The following examples read in two unsigned integers and adds them.

  use Parse::Stallion;

   my %basic_grammar = (
    expression =>
     AND('number',
      qr/\s*\+\s*/,
      'number',
      EVALUATION(
       sub {return $_[0]->{number}->[0] + $_[0]->{number}->[1]})
    ),
    number => LEAF(qr/\d+/,
      E(sub{return 0 + $_[0];}))
     #0 + $_[0] converts the matched string into a number
   );

   my $parser = new Parse::Stallion(
   {rules_to_set_up_hash => \%basic_grammar});

   my $result = $parser->parse_and_evaluate('7+4');
   #$result should contain 11

   my %grammar_2 = (
    expression =>
     A('number',
      qr/\s*\+\s*/,
      {right_number => 'number'},
      E(sub {return $_[0]->{number} + $_[0]->{right_number}})
    ),
    number => L(qr/\d+/,
      EVALUATION(sub{return 0 + $_[0];}))
   );

   my $parser_2 = new Parse::Stallion(
   {rules_to_set_up_hash => \%grammar_2, start_rule => 'expression'});

   my $result_2 = $parser_2->parse_and_evaluate('8+5');
   #$result_2 should contain 13

RULES

There are 4 rule types: 'LEAF', 'AND', 'OR', and 'MULTIPLE'.

Parsing begins from the 'start_rule', if the 'start_rule' parameter is omitted, the rule which is not a subrule is used as the start rule. The start rule can be of any type, though if the start rule is a 'LEAF', the grammar is essentially just a regular expression.

After a successful parse, the external nodes correspond to the substrings that the 'LEAF' rules matched; the other rule types correspond to the internal nodes.

LEAF

A 'LEAF' rule contains a regexp that must match the beginning part of the remaining input string, internally \A is prepended to the regexp. During parsing, when a 'LEAF' matches, the matched substring is removed from the input string, though reattached if backtracking occurs. The text LEAF is optional, regexp's are assumed to be leaves, but then there can be no EVALUATION routine for that leaf.

Examples (equivalent):

  LEAF(qr/xx\w+/)

  L(qr/xx\w+/)

  qr/xx\w+/

would match any perl word (\w+) starting with "xx".

AND

An 'AND' rule contains a list of subrules that must be completely matched, from left to right, for the 'and' rule to match.

Examples (equivalent):

  AND('rule_1', 'rule_2', 'rule_3')

  A('rule_1', 'rule_2', 'rule_3')

OR

An 'OR' rule contains a list of subrules, one of which must be matched for the 'OR' rule to match.

During parsing, the subrules are attempted to be matched left to right. If a subrule matches and then is subsequently backtracked, the parser will try to match the next subrule.

Examples (equivalent):

  OR('rule_1', 'rule_2', 'rule_3')

  O('rule_1', 'rule_2', 'rule_3')

MULTIPLE (and OPTIONAL)

A 'MULTIPLE' rule matches if its subrule matches repeatedly between a minimum and maximum number of times. The first parameter to 'MULTIPLE' is the subrule, the next 2 optional parameters are the minium and maximum repititions allowed. The default minimum is 0 and the default maximum is "infinite", though this is represented by setting the maximum to 0.

For there to be another repetition, the input string must have been shortened, else it would be considered an illegal form of "left recursion".

By default the maximal number of possible matches of the repeating rule are tried and then if backtracking occurs, the number of matches is decremented. If the parameter 'MATCH_MIN_FIRST' is passed in, the minimal number of matches is tried first and the number of matches increases when backtracking.

Examples (equivalent):

  MULTIPLE('rule_1')

  M('rule_1')

  M('rule_1', 0, 0)

One can label a rule with the value 'OPTIONAL' that maps to a 'MULTIPLE' rule with minimum 0 and maximum 1.

Examples (equivalent):

  OPTIONAL('rule_1')

  ZERO_OR_ONE('rule_1')

  Z('rule_1')

  MULTIPLE('rule_1', 0, 1)

SIMILARITY BETWEEN RULE TYPES.

The following rules all parse tree-wise equivalently.

  AND('subrule')

  O('subrule')

  MULTIPLE('subrule', 1, 1)

  M('subrule', 1, 1, MATCH_MIN_FIRST)

NESTED RULES

Rules can be nested inside of other rules. See the section EVALUATION for how nested rules affect tree evaluations.

Example:

  sum => AND('number', MULTIPLE(AND('plus', 'number')));

is equivalent, parsing wise, to

  sum => A('number', 'plus_numbers');
  plus_numbers = M('plus_number');
  plus_number => A('plus', 'number');

ALIASES

One can also use an alias for a rule by a hash reference with a single key/value pair. This sets the name to the key when evaluating the parsed expression and parsing the subrule specified by the value.

  adding =  A(
   'number', L(qr/\s*[+]\s*/),
     {right_number => 'number'},
   E(sub {return $_[0]->{number} + $_[0]->{right_number}})
  );

RULE NAMES

Avoid naming rules with the 'separator' substring '__XZ__', to avoid confliciting with internally generated names. One can change this by using the 'separator' parameter.

ENSURING RULES FORM COMPLETE GRAMMAR

Stallion ensures that a grammar is complete and croaks if the given grammar has any rules not reachable from the start rule or if within any rule a subrule does not exist.

PARSE_AND_EVALUATE

After setting up a Parse::Stallion parser, strings are parsed and evaluated via parse_and_evaluate. In scalar context, the returned value is the returned value of the root node's evaluation routine.

RETURNED VALUES

If parse_and_evaluate is called in wantarray context a two value array is returned. The first value is the result of the evaluation, same as if in scalar. The second value is information on the parse.

  my ($value, $parse_info) =
   $stallion->parse_and_evaluate($given_string);

  $parse_info->{parse_succeeded}; # is 1 if the string parses, else 0.
  $parse_info->{number_of_steps}; # number of steps parsing took
  $parse_info->{start_rule};

  $parse_info->{parse_trace};
  # reference to array of hashes showing each step, the hash keys are
  #  1) rule_name
  #  2) moving_forward (value 0 if backtracking),
  #  3) moving_down (value 0 if moving up parse tree)
  #  4) value (display_value_function of parsed object,
  #     by default the yet unparsed portion of the input string)
  #  5) node_creation_step, uniquely identifies node in parse tree
  #  6) parent_node_creation_step, parent in parse tree
  #  7) informative message of most recent parse step

  $parse_info->{tree}; # the parse tree if the string parses.

The tree is an Parse::Stallion object having a function, that converts a tree into a string, each node consisting of one line:

  $parse_info->{tree}->stringify({values=>['name','parse_match']});

Internally generated node names, from rules generated by breaking up the entered rules into subrules, will show up. The module Parse::Stallion::EBNF shows the grammar with these generated subrules.

NUMBER OF PARSE STEPS

If the parsing reaches the maximum number of steps without completing a parse tree, the parse fails. Each step is listed in the parse_trace.

A step is an action on a node, roughly speaking matching a regex for a 'leaf' node, or moving forward or backtracking from a node.

The maximum number of steps can be changed, default 20,000:

  $stallion->parse_and_evaluate({max_steps=>100000, parse_this=>$string});

"PARSE_FUNCTION"

One can pass in a subroutine parse_function that is executed at every parse step and takes the current object as the parameter and its returned value is the new current object.

   my $step = 1;
   my $parser = new Parse::Stallion(
   {rules_to_set_up_hash => \%basic_grammar, start_rule => 'expression',
    parse_function => sub {my $o = shift;print "step ".$step++."\n";return $o}
   });

"LEFT RECURSION"

Parse::Stallion may encounter "left recursiveness" during parsing in which case the parsing stops and a message is 'croak'ed.

"Left recursion" occurs during parsing when the same non-'leaf' rule shows up a second time on the parse tree with the "same" input string, measured by the increasing_value_function.

Illegal Case 1:

     expression => AND('expression', 'plus', 'term')

Illegal Case 2:

     rule_with_empty => AND('empty', 'rule_with_empty', 'other_rule')
     empty => qr//

Illegal Case 3:

     rule_with_optional => AND('nothing', 'optional_rule', 'nothing')
     nothing => AND('empty')
     empty => L(qr//)
     optional_rule => OPTIONAL('some_other_rule')
     some_other_rule => qr/x/

The 3rd case will detect left recursiveness if optional_rule does not match and modify the input.

EVALUATION

Evaluation can occur during or after parsing.

Each rule may have an evaluation subroutine tied to it, if when defining a rule there is no subroutine, a default "do nothing" subroutine is provided. Subrules may have a subroutine tied to them though by default they have none. When setting up a rule, one can specify the subroutine by enclosing the subroutine in the parameter 'EVALUATION' or 'E'.

Each node has a computed value that is the result of calling its evaluation routine. The returned value of the parse is the computed value of the root node.

There are two parameters to an evaluation routine. the second is the object being parsed, which could be undef if evaluation occurs after parsing.

The first parameter is either the string matched by the nodes' descendants or a hash. The hash keys are the named subrules of the node's rule, the values are the computed value of the corresponding child node. If a key could repeat, the value is an array reference.

For 'leaf' nodes, the parameter is the string matched and cannot be a hash, there are no children. For non-'leaf' nodes by default the parameter is the hash, this can be changed by passing in 'USE_PARSE_MATCH' when creating the rule.

The beginning and trailing white space can be removed before being passed to a 'leaf' or USE_PARSE_MATCH node's routine by setting the parameter remove_white_space when creating the parser:

  $parser = new Parse::Stallion({remove_white_space => 1});

By nesting a rule with an alias, the alias is used for the name of the hash parameter instead of the rule name.

EVALUATION AFTER PARSING

If after parsing, Stallion will evaluate the parse tree in a bottom up left to right traversal.

EVALUATION (AND UNEVALUATION) DURING PARSING

If the do_evaluation_in_parsing is set when a Parse::Stallion object is created the evaluation occurs during the parsing instead of afterwards. Alternatively, if there exists any UNEVALUATION routine, the evaluation is done during parsing.

Every time a node is matched, its evaluation routine is called with a hash as it would be during evaluation after parsing. A second parameter is also passed in which corresponds to the current object being parsed. This allows look ahead.

The evaluation routine may return a second parameter that tells Parse::Stallion to reject or not reject the match. This allows more control over what can be parsed.

The same node may be evaluated several times due to backtracking. One must be careful either to not change the references/objects passed in to the evaluation routine or to undo changes in an unevaluation routine. An unevaluation routine is called when backtracking reverts back to a node, the parameters are the same as for the evaluation routine. UNEVALUATION (alternatively U) are used to set the subroutine.

The following example shows a 'leaf' rule that matches all words except for those marked in the hash %keywords:

  our %keywords = (...);
  my %grammar = (...
   leaf => L(
     qr/\w+/,
     E(sub {if ($keywords{$_[0]}) {return (undef, 1)} return $_[0]}),
     U(), #forces do_evaluation_in_parsing
   ),
  ...
  );

EVALUATION DURING PARSING EXAMPLE

In this example, the first statement tells a truth about the number of elements in a list, the second tells whether or not the first statement is true. If the second statement is true, the string parses.

  my %parsing_rules = (
   start_expression =>
    AND('two_statements', qr/\z/,
     E(sub {return $_[0]->{'two_statements'}})
   ),
   two_statements =>
     A('list_statement','truth_statement',
      EVALUATION(sub {
       if ($_[0]->{list_statement} != $_[0]->{truth_statement}) {
         return (undef, 1);
       }
       return 1;
     })
   ),
   list_statement =>
     A('count_statement', 'list',
      E(sub {
       if ($_[0]->{count_statement} == scalar(@{$_[0]->{list}})) {
         return 1;
       }
       return 0;
     })
   ),
   count_statement =>
     A(qr/there are /i,'number',L(qr/ elements in /),
      E(sub { return $_[0]->{number}; })
    ),
   number =>
    L(qr/\d+/,
     E(sub { return 0 + shift; })
   ),
   list => A('number', M(A(qr/\,/}, 'number')),
     EVALUATION(sub {return $_[0]->{number}})
   ),
   truth_statement =>
     O({t => qr/\. that is the truth\./},
      {t=>qr/\. that is not the truth\./},
      E(sub {
       if ($_[0]->{t} =~ /not/) {
         return 0;
       }
       return 1;
     })
   ),
  );
  
  my $how_many_parser = new Parse::Stallion({
    do_evaluation_in_parsing => 1,
    rules_to_set_up_hash => \%parsing_rules,
    start_rule => 'start_expression',
  });
  
  $result = $how_many_parser->parse_and_evaluate(
    "there are 5 elements in 5,4,3,2,1. that is the truth.");
  
  print "$result should be 1\n";
  
  $result = $how_many_parser->parse_and_evaluate(
    "there are 5 elements in 5,4,3,1. that is not the truth.");
  
  print "$result should be 1\n";
  
  $result = $how_many_parser->parse_and_evaluate(
    "there are 5 elements in 5,4,3,1. that is the truth.");
  
  print "$result should be undef\n";

DEFAULT EVALUATION ROUTINE

If a rule does not have an evaluation routine specified, a generic subroutine is used. Two routines are provided:

Generic Evaluation Routine 1

  • If the passed in hash reference has only one key, then the value of that key in the hash reference is returned.

  • If the passed in hash reference has more than one key, then the hash reference is returned.

This is the routine:

  sub {
    my $parameters = shift;
    if (ref $parameters eq 'HASH') {
      if (keys %$parameters == 1) {
        my ($key) = keys %$parameters;
        return $parameters->{$key};
      }
    }
    return $parameters;
  }

Generic Evaluation Routine 2

The second default evaluation routine will be used if do_not_compress_eval is set when creating the parser. It does not remove the names of the rules and the code is simply:

  sub return_self_routine {
    return shift;
  }

Example:

  my %no_eval_rules = (
   start_rule => A('term',
    M(A({plus=>qr/\s*\+\s*/}, 'term'))),
   term => A({left=>'number'},
    M(A({times=>qr/\s*\*\s*/},
     {right=>'number'}))),
   number => qr/\s*\d*\s*/
  );                               
                      
  my $no_eval_parser = new Parse::Stallion({do_not_compress_eval => 0,
   rules_to_set_up_hash => \%no_eval_rules,
   });

  $result = $no_eval_parser->parse_and_evaluate({parse_this=>"7+4*8"});
   
  #$result contains:
  { 'plus' => [ '+' ],
    'term' => [ '7',
                { 'left' => '4',
                  'right' => [ '8' ],
                  'times' => [ '*' ] } ] }

  my $dnce_no_eval_parser = new Parse::Stallion({do_not_compress_eval => 1,
   rules_to_set_up_hash => \%no_eval_rules,
   });

  $result = $dnce_no_eval_parser->parse_and_evaluate({parse_this=>"7+4*8"});

  #$result contains:
  { 'plus' => [ '+' ],
    'term' => [ { 'left' => '7' },
                { 'left' => '4',
                  'right' => [ '8' ],
                  'times' => [ '*' ] } ] }

Parameter types to Evaluation Routines

The routine which_parameters_are_arrays returns a hash of the possible values passed to an evaluation routine. For a given key, if the value is 1, the key would be passed to the evaluation routine as an array, if the value is 0, it would be passed as a scalar.

MORE COMPLICATED EXAMPLE

  The following is a simple calculator:

   %calculator_rules = (
    start_expression => A(
      'expression', qr/\z/,
      E(sub {return $_[0]->{expression}})
     ),
    expression => A(
      'term', 
       M(A('plus_or_minus', 'term')),
      E(sub {my $to_combine = $_[0]->{term};
       my $plus_or_minus = $_[0]->{plus_or_minus};
       my $value = $to_combine->[0];
       for my $i (1..$#{$to_combine}) {
         if ($plus_or_minus->[$i-1] eq '+') {
           $value += $to_combine->[$i];
         }
         else {
           $value -= $to_combine->[$i];
         }
       }
       return $value;
      })
     ),
    term => A(
      'number', 
       M(A('times_or_divide', 'number')),
       E(sub {my $to_combine = $_[0]->{number};
       my $times_or_divide = $_[0]->{times_or_divide};
       my $value = $to_combine->[0];
       for my $i (1..$#{$to_combine}) {
         if ($times_or_divide->[$i-1] eq '*') {
           $value *= $to_combine->[$i];
         }
         else {
           $value /= $to_combine->[$i]; #does not check for zero
         }
       }
       return $value;
      })
    ),
    number => L(
      qr/\s*[+\-]?(\d+(\.\d*)?|\.\d+)\s*/,
      E(sub{ return 0 + $_[0]; })
    ),
    plus_or_minus => qr/\s*[\-+]\s*/,
    times_or_divide => qr/\s*[*\/]\s*/
   );

   $calculator_parser = new Parse::Stallion({
     rules_to_set_up_hash => \%calculator_rules,
     start_rule => 'start_expression'
   });

   my $result = $calculator_parser->parse_and_evaluate("3+7*4");
   # $result should contain 31

  my $array_p = $calculator_parser->which_parameters_are_arrays({
    rule_name => 'term'});
  # $array_p would be {number => 1, times_or_divide => 1}

  $array_p = $calculator_parser->which_parameters_are_arrays({
    rule_name => 'start_expression'});
  # $array_p would be {expression => 0}

PARSING NON-STRINGS

In order to parse non-strings, use NON_LEAF_STRING (or N) instead of LEAF (or L) for defining the leaves.

Four subroutines should be provided: an increasing_value function for ensuring parsing is proceeding correctly, a 'leaf' rule matcher/modifier for when the parser is moving forward, a 'leaf' rule unmodifier for when the parser is backtracking, an increasing_value_function to prevent "left recursion", and a display_value function for the parse_trace.

Parsing is completed only if the object being parsed becomes undefined or equal to ''. The latter is the condition that parsing strings must match.

  my $object_parser = new Parse::Stallion({
    ...
    parse_forward =>
     sub {
       my $object_ref = shift;
       my $parameters = shift;
       ...
       return ($true_if_object_matches_rule,
        $value_to_store_in_leaf_node);
     },
    parse_backtrack =>
     sub {
       my ($object_ref, $rules, $value_stored_in_leaf_node) = @_;
       ...
      },
    increasing_value_function =>
     sub {
       my $object = shift;
       ...
       return $value_of_object;
     },
    display_value_function => # for parse_trace
     sub {
      my $object = shift;
      ...
      return "Value of object now ...";
     }
  });

When evaluating the parse tree, the parameters to the 'leaf' nodes are the values returned in parse_forward, $value_to_store_in_leaf_node. These values are joined together for parse_match.

The script object_string.pl in the example directory shows how to use this.

'LEAF' LEAF PARSE FORWARD/BACKTRACK

All 'leaf' rules need to be set up such that when the parser is moving forward and reaches a 'leaf', the 'leaf' rule attempts to match the current input object. If there is a match, then the input object is modified to the object's next state and a value is stored to be called upon later during tree evaluation.

When backtracking, the object being parsed should be reverted to the state before being matched by the 'leaf' rule.

In parsing a string, substrings are removed from the beginning of the string and reattached to the beginning when backtracked.

INCREASING_VALUE FUNCTION

A function, called 'increasing_value', must be provided that takes the object being parsed and returns a numeric value that either is unchanged or increases after the 'leaf' rule's match_and_modify_input_object is called.

This function is used to detect and prevent "left recursion" by not allowing a non-'leaf' rule to repeat at the same value. 'Multiple' rules are prevented from repeating more than once at the same value.

The function also cuts down on the number of steps by allowing the parser to not repeat dead-end parses. If during the parse, the same rule is attempted a second time on the parse object with the same increasing_value, and the first parse did not succeed, the parser will beging backtracking.

In parsing a string, the negative of the length of the input string is used as the increasing function.

STRINGS

By default, strings are matched, which is similar to

  my $calculator_stallion = new Parse::Stallion({
    ...
    parse_forward =>
     sub {
      my $input_string_ref = shift;
      my $rule_definition = shift;
      my $m = $rule_definition->{regex_match}; #regex_match eq regexp of leaf
      if ($$input_string_ref =~ s/\A$m//) {
        return (1, $&);
      }
      return 0;
     },

    parse_backtrack =>
     sub {
      my $input_string_ref = shift;
      my $rule_definition = shift;
      my $stored_value = shift;
      if (defined $stored_value) {
        $$input_string_ref = $stored_value.$$input_string_ref;
      }
     },

    increasing_value_function => sub {
      my $string = shift;
      return 0 - length($string);
    },

    display_value_function => sub display_value {
      my $value = shift;
      return $value;
    }

  });

EXPORT

The following are EXPORTED from this module:

A AND O OR LEAF L M MULTIPLE NON_STRING_LEAF N OPTIONAL ZERO_OR_ONE Z E EVALUATION U UNEVALUATION MATCH_MIN_FIRST USE_PARSE_MATCH

PERL Requirements

Parse::Stallion's installation uses Test::More and Time::Local requiring perl 5.6 or higher. Parse::Stallion should work with earlier versions of perl, neither of those modules is required outside of the test cases for installation.

AUTHOR

Arthur Goldstein, <arthur@acm.org>

COPYRIGHT AND LICENSE

Copyright (C) 2007-8 by Arthur Goldstein

BUGS

Please email in bug reports.

TO DO AND FUTURE POSSIBLE CHANGES

Please send in suggestions.

SEE ALSO

example directory

Parsing texts, including references to Extended Backus-Naur Form notation.

Parse::Stallion::CSV, Parse::Stallion::CSVFH, Parse::Stallion::EBNF.

Perl 6 grammars.

lex, yacc, ..., other parsers.