NAME
Parse::Stallion - Backtracking parser with during or after evaluation
SYNOPSIS
use Parse::Stallion;
my %rules = (rule_name_1 => ..rule_definition..,
rule_name_2 => ..rule_definition..,
...);
my $stallion = new Parse::Stallion({
rules_to_set_up_hash => \%rules,
start_rule => 'rule_name_1', #default is the rule which is not a subrule
do_evaluation_in_parsing => 0, #default is 0
remove_white_space => 1, #default is 0
max_steps => 20000, #default is 20000
});
my $result = $stallion->parse_and_evaluate($given_string);
my ($value, $parse_info) =
$stallion->parse_and_evaluate({parse_this => $s});
Rule Definitions:
AND('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))
OR('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))
MULTIPLE('subrule_1', EVALUATION(sub{...}))
LEAF(qr/regex/, EVALUATION(sub{...}))
DESCRIPTION
Stallion parses and evaluates a string using entered grammar rules. The parsing is done top-down via an initial start rule, in a depth first search forming a parse tree. When a rule does not match the parser backtracks to a node that has another option.
For evaluating a tree node, the evaluation subroutine is given a reference to a hash representing the returned values of the child nodes. The evaluation may be done while creating the parse tree and reject a match affecting which strings parse. This allows complex grammars.
If the evaluation is not done while parsing, on a successful parse, the tree is evaluated in bottom up, left to right order.
The grammars recognized are context free and are similar to those expressed in Extended Backus-Naur Form.
The object being parsed does not need to be a string. Except for the section on non-strings, the documentation assumes strings are being parsed.
COMPLETE EXAMPLES
The following examples read in two unsigned integers and adds them.
use Parse::Stallion;
my %basic_grammar = (
expression =>
AND('number',
qr/\s*\+\s*/,
'number',
EVALUATION(
sub {return $_[0]->{number}->[0] + $_[0]->{number}->[1]})
),
number => LEAF(qr/\d+/,
E(sub{return 0 + $_[0];}))
#0 + $_[0] converts the matched string into a number
);
my $parser = new Parse::Stallion(
{rules_to_set_up_hash => \%basic_grammar});
my $result = $parser->parse_and_evaluate('7+4');
#$result should contain 11
my %grammar_2 = (
expression =>
A('number',
qr/\s*\+\s*/,
{right_number => 'number'},
E(sub {return $_[0]->{number} + $_[0]->{right_number}})
),
number => L(qr/\d+/,
EVALUATION(sub{return 0 + $_[0];}))
);
my $parser_2 = new Parse::Stallion(
{rules_to_set_up_hash => \%grammar_2, start_rule => 'expression'});
my $result_2 = $parser_2->parse_and_evaluate('8+5');
#$result_2 should contain 13
RULES
There are 4 rule types: 'LEAF', 'AND', 'OR', and 'MULTIPLE'.
Parsing begins from the 'start_rule', if the 'start_rule' parameter is omitted, the rule which is not a subrule is used as the start rule. The start rule can be of any type, though if the start rule is a 'LEAF', the grammar is essentially just a regular expression.
After a successful parse, the external nodes correspond to the substrings that the 'LEAF' rules matched; the other rule types correspond to the internal nodes.
LEAF
A 'LEAF' rule contains a regexp that must match the beginning part of the remaining input string, internally \A is prepended to the regexp. During parsing, when a 'LEAF' matches, the matched substring is removed from the input string, though reattached if backtracking occurs. The text LEAF is optional, regexp's are assumed to be leaves, but then there can be no EVALUATION routine for that leaf.
If the regexp matches but is empty, the matched value is set to an empty string as opposed to an undefined value.
Examples (equivalent):
LEAF(qr/xx\w+/)
L(qr/xx\w+/)
qr/xx\w+/
would match any perl word (\w+) starting with "xx".
See the section 'LEAF DETAILS' for other ways to handle leaves.
AND
An 'AND' rule contains a list of subrules that must be completely matched, from left to right, for the 'and' rule to match.
Examples (equivalent):
AND('rule_1', 'rule_2', 'rule_3')
A('rule_1', 'rule_2', 'rule_3')
OR
An 'OR' rule contains a list of subrules, one of which must be matched for the 'OR' rule to match.
During parsing, the subrules are attempted to be matched left to right. If a subrule matches and then is subsequently backtracked, the parser will try to match the next subrule.
Examples (equivalent):
OR('rule_1', 'rule_2', 'rule_3')
O('rule_1', 'rule_2', 'rule_3')
MULTIPLE (and OPTIONAL)
A 'MULTIPLE' rule matches if its subrule matches repeatedly between a minimum and maximum number of times. The first parameter to 'MULTIPLE' is the subrule, the next 2 optional parameters are the minium and maximum repititions allowed. The default minimum is 0 and the default maximum is "infinite", though this is represented by setting the maximum to 0.
For there to be another repetition, the input string must have been shortened, else it would be considered an illegal form of "left recursion".
By default the maximal number of possible matches of the repeating rule are tried and then if backtracking occurs, the number of matches is decremented. If the parameter 'match_min_first' is passed in, the minimal number of matches is tried first and the number of matches increases when backtracking.
Examples (equivalent):
MULTIPLE('rule_1')
M('rule_1')
M('rule_1', 0, 0)
One can label a rule with the value 'OPTIONAL' that maps to a 'MULTIPLE' rule with minimum 0 and maximum 1.
Examples (equivalent):
OPTIONAL('rule_1')
ZERO_OR_ONE('rule_1')
Z('rule_1')
MULTIPLE('rule_1', 0, 1)
to get one or more:
MULTIPLE('rule_2', 1, 0)
SIMILARITY BETWEEN RULE TYPES.
The following rules all parse tree-wise equivalently.
AND('subrule')
O('subrule')
MULTIPLE('subrule', 1, 1)
M('subrule', 1, 1, 'match_min_first')
NESTED RULES
Rules can be nested inside of other rules. See the section EVALUATION for how nested rules affect tree evaluations.
Example:
sum => AND('number', MULTIPLE(AND('plus', 'number')));
is equivalent, parsing wise, to
sum => A('number', 'plus_numbers');
plus_numbers = M('plus_number');
plus_number => A('plus', 'number');
ALIASES
One can also use an alias for a rule by a hash reference with a single key/value pair. This sets the name to the key when evaluating the parsed expression and parsing the subrule specified by the value.
adding = A(
'number', L(qr/\s*[+]\s*/),
{right_number => 'number'},
E(sub {return $_[0]->{number} + $_[0]->{right_number}})
);
RULE NAMES
Avoid naming rules with the 'separator' substring '__XZ__', to avoid confliciting with internally generated rule names. One can change this by using the 'separator' parameter.
ENSURING RULES FORM COMPLETE GRAMMAR
Stallion ensures that a grammar is complete and croaks if the given grammar has any rules not reachable from the start rule or if within any rule a subrule does not exist.
PARSE_AND_EVALUATE
After setting up a Parse::Stallion parser, strings are parsed and evaluated via parse_and_evaluate. In scalar context, the returned value is the returned value of the root node's evaluation routine.
RETURNED VALUES
If parse_and_evaluate is called in wantarray context a two value array is returned. The first value is the result of the evaluation, same as if in scalar. The second value is information on the parse.
my ($value, $parse_info) =
$stallion->parse_and_evaluate($given_string);
$parse_info->{parse_succeeded}; # is 1 if the string parses, else 0.
$parse_info->{number_of_steps}; # number of steps parsing took
$parse_info->{start_rule};
$parse_info->{parse_trace};
# reference to array of hashes showing each step, the hash keys are
# 1) rule_name
# 2) moving_forward (value 0 if backtracking),
# 3) moving_down (value 0 if moving up parse tree)
# 4) value (length of string parsed or from increasing_value_function)
# 5) node_creation_step, uniquely identifies node in parse tree
# 6) parent_node_creation_step, parent in parse tree
# 7) informative message of most recent parse step
$parse_info->{tree}; # the parse tree if the string parses.
The tree is an Parse::Stallion object having a function, that converts a tree into a string, each node consisting of one line:
$parse_info->{tree}->stringify({values=>['name','parse_match']});
Internally generated node names, from rules generated by breaking up the entered rules into subrules, will show up. The module Parse::Stallion::EBNF shows the grammar with these generated subrules.
NUMBER OF PARSE STEPS
If the parsing reaches the maximum number of steps without completing a parse tree, the parse fails. Each step is listed in the parse_trace.
A step is an action on a node, roughly speaking matching a regex for a 'leaf' node, or moving forward or backtracking from a node.
The maximum number of steps can be changed, default 20,000:
$stallion->parse_and_evaluate({max_steps=>100000, parse_this=>$string});
"LEFT RECURSION"
Parse::Stallion may encounter "left recursiveness" during parsing in which case the parsing stops and a message is 'croak'ed.
"Left recursion" occurs during parsing when the same non-'leaf' rule shows up a second time on the parse tree with the at the same position in the input string.
Illegal Case 1:
expression => AND('expression', 'plus', 'term')
Illegal Case 2:
rule_with_empty => AND('empty', 'rule_with_empty', 'other_rule')
empty => qr//
Illegal Case 3:
rule_with_optional => AND('nothing', 'optional_rule', 'nothing')
nothing => AND('empty')
empty => L(qr//)
optional_rule => OPTIONAL('some_other_rule')
some_other_rule => qr/x/
The 3rd case will detect left recursiveness if optional_rule does not match and modify the input.
EVALUATION
Evaluation can occur during or after parsing.
Each rule may have an evaluation subroutine tied to it, if when defining a rule there is no subroutine, a default "do nothing" subroutine is provided. Subrules may have a subroutine tied to them though by default they have none. When setting up a rule, one can specify the subroutine by enclosing the subroutine in the parameter 'EVALUATION' or 'E'.
Each node has a computed value that is the result of calling its evaluation routine. The returned value of the parse is the computed value of the root node.
There are two parameters to an evaluation routine. the second is the object being parsed, which could be undef if evaluation occurs after parsing.
The first parameter is either the string matched by the nodes' descendants or a hash. The hash keys are the named subrules of the node's rule, the values are the computed value of the corresponding child node. If a key could repeat, the value is an array reference.
For 'leaf' nodes, the parameter is the string matched and cannot be a hash, there are no children. For non-'leaf' nodes by default the parameter is the hash, this can be changed by passing in 'USE_PARSE_MATCH()' when creating the rule.
The beginning and trailing white space can be removed before being passed to a 'leaf' or USE_PARSE_MATCH node's routine by setting the parameter remove_white_space when creating the parser:
$parser = new Parse::Stallion({remove_white_space => 1});
By nesting a rule with an alias, the alias is used for the name of the hash parameter instead of the rule name.
EVALUATION AFTER PARSING
If after parsing, Stallion will evaluate the parse tree in a bottom up left to right traversal.
EVALUATION (AND UNEVALUATION) DURING PARSING
If the do_evaluation_in_parsing is set when a Parse::Stallion object is created the evaluation occurs during the parsing instead of afterwards. Alternatively, if there exists any UNEVALUATION routine, the evaluation is done during parsing.
Every time a node is matched, its evaluation routine is called with a hash as it would be during evaluation after parsing. A second parameter is also passed in which corresponds to the current object being parsed. This allows look ahead.
The evaluation routine may return a second parameter that tells Parse::Stallion to reject or not reject the match. This allows more control over what can be parsed.
The same node may be evaluated several times due to backtracking. One must be careful either to not change the references/objects passed in to the evaluation routine or to undo changes in an unevaluation routine. An unevaluation routine is called when backtracking reverts back to a node, the parameters are the same as for the evaluation routine. UNEVALUATION (alternatively U) are used to set the subroutine.
The following example shows a 'leaf' rule that matches all words except for those marked in the hash %keywords:
our %keywords = (...);
my %grammar = (...
leaf => L(
qr/\w+/,
E(sub {if ($keywords{$_[0]}) {return (undef, 1)} return $_[0]}),
U(), #forces do_evaluation_in_parsing
),
...
);
EVALUATION DURING PARSING EXAMPLE
In this example, the first statement tells a truth about the number of elements in a list, the second tells whether or not the first statement is true. If the second statement is true, the string parses.
my %parsing_rules = (
start_expression =>
AND('two_statements', qr/\z/,
E(sub {return $_[0]->{'two_statements'}})
),
two_statements =>
A('list_statement','truth_statement',
EVALUATION(sub {
if ($_[0]->{list_statement} != $_[0]->{truth_statement}) {
return (undef, 1);
}
return 1;
})
),
list_statement =>
A('count_statement', 'list',
E(sub {
if ($_[0]->{count_statement} == scalar(@{$_[0]->{list}})) {
return 1;
}
return 0;
})
),
count_statement =>
A(qr/there are /i,'number',L(qr/ elements in /),
E(sub { return $_[0]->{number}; })
),
number =>
L(qr/\d+/,
E(sub { return 0 + shift; })
),
list => A('number', M(A(qr/\,/}, 'number')),
EVALUATION(sub {return $_[0]->{number}})
),
truth_statement =>
O({t => qr/\. that is the truth\./},
{t=>qr/\. that is not the truth\./},
E(sub {
if ($_[0]->{t} =~ /not/) {
return 0;
}
return 1;
})
),
);
my $how_many_parser = new Parse::Stallion({
do_evaluation_in_parsing => 1,
rules_to_set_up_hash => \%parsing_rules,
start_rule => 'start_expression',
});
$result = $how_many_parser->parse_and_evaluate(
"there are 5 elements in 5,4,3,2,1. that is the truth.");
print "$result should be 1\n";
$result = $how_many_parser->parse_and_evaluate(
"there are 5 elements in 5,4,3,1. that is not the truth.");
print "$result should be 1\n";
$result = $how_many_parser->parse_and_evaluate(
"there are 5 elements in 5,4,3,1. that is the truth.");
print "$result should be undef\n";
DEFAULT EVALUATION ROUTINE
If a rule does not have an evaluation routine specified, a generic subroutine is used. Two routines are provided:
Generic Evaluation Routine 1
If the passed in hash reference has only one key, then the value of that key in the hash reference is returned.
If the passed in hash reference has more than one key, then the hash reference is returned.
This is the routine:
sub {
my $parameters = shift;
if (ref $parameters eq 'HASH') {
if (keys %$parameters == 1) {
my ($key) = keys %$parameters;
return $parameters->{$key};
}
}
return $parameters;
}
Generic Evaluation Routine 2
The second default evaluation routine will be used if do_not_compress_eval is set when creating the parser. It is simply:
sub return_self_routine {
return shift;
}
Example:
my %no_eval_rules = (
start_rule => A('term',
M(A({plus=>qr/\s*\+\s*/}, 'term'))),
term => A({left=>'number'},
M(A({times=>qr/\s*\*\s*/},
{right=>'number'}))),
number => qr/\s*\d*\s*/
);
my $no_eval_parser = new Parse::Stallion({do_not_compress_eval => 0,
rules_to_set_up_hash => \%no_eval_rules,
});
$result = $no_eval_parser->parse_and_evaluate({parse_this=>"7+4*8"});
#$result contains:
{ 'plus' => [ '+' ],
'term' => [ '7',
{ 'left' => '4',
'right' => [ '8' ],
'times' => [ '*' ] } ] }
my $dnce_no_eval_parser = new Parse::Stallion({do_not_compress_eval => 1,
rules_to_set_up_hash => \%no_eval_rules,
});
$result = $dnce_no_eval_parser->parse_and_evaluate({parse_this=>"7+4*8"});
#$result contains:
{ 'plus' => [ '+' ],
'term' => [ { 'left' => '7' },
{ 'left' => '4',
'right' => [ '8' ],
'times' => [ '*' ] } ] }
Parameter types to Evaluation Routines
The routine which_parameters_are_arrays returns a hash of the possible values passed to an evaluation routine. For a given key, if the value is 1, the key would be passed to the evaluation routine as an array, if the value is 0, it would be passed as a scalar.
MORE COMPLICATED EXAMPLE
The following is a simple calculator:
%calculator_rules = (
start_expression => A(
'expression', qr/\z/,
E(sub {return $_[0]->{expression}})
),
expression => A(
'term',
M(A('plus_or_minus', 'term')),
E(sub {my $to_combine = $_[0]->{term};
my $plus_or_minus = $_[0]->{plus_or_minus};
my $value = $to_combine->[0];
for my $i (1..$#{$to_combine}) {
if ($plus_or_minus->[$i-1] eq '+') {
$value += $to_combine->[$i];
}
else {
$value -= $to_combine->[$i];
}
}
return $value;
})
),
term => A(
'number',
M(A('times_or_divide', 'number')),
E(sub {my $to_combine = $_[0]->{number};
my $times_or_divide = $_[0]->{times_or_divide};
my $value = $to_combine->[0];
for my $i (1..$#{$to_combine}) {
if ($times_or_divide->[$i-1] eq '*') {
$value *= $to_combine->[$i];
}
else {
$value /= $to_combine->[$i]; #does not check for zero
}
}
return $value;
})
),
number => L(
qr/\s*[+\-]?(\d+(\.\d*)?|\.\d+)\s*/,
E(sub{ return 0 + $_[0]; })
),
plus_or_minus => qr/\s*[\-+]\s*/,
times_or_divide => qr/\s*[*\/]\s*/
);
$calculator_parser = new Parse::Stallion({
rules_to_set_up_hash => \%calculator_rules,
start_rule => 'start_expression'
});
my $result = $calculator_parser->parse_and_evaluate("3+7*4");
# $result should contain 31
my $array_p = $calculator_parser->which_parameters_are_arrays({
rule_name => 'term'});
# $array_p would be {number => 1, times_or_divide => 1}
$array_p = $calculator_parser->which_parameters_are_arrays({
rule_name => 'start_expression'});
# $array_p would be {expression => 0}
LEAF DETAILS
Leafs can be set up as follows:
LEAF($leaf_arg, PARSE_FORWARD(sub{...}), PARSE_BACKTRACK(sub{...}),
EVALUATION(sub{...}), UNEVALUATION(sub{...}), DISPLAY($display));
If $leaf_arg is a Regexp, it is converted into a hash ref: {regex_match => $leaf_arg} for internal purposes.
If PARSE_FORWARD and PARSE_BACKTRACK are not provided, they use the default parse_forward and parse_backtrack subroutines.
The subroutine in PARSE_FORWARD (or PF) is called when moving forwards during the parse. It is given 3 arguments, a reference to the object being parsed, $leaf_arg, and the current value. It should return 1 and a "parse match" to store if the parsing should continue forward. Else it should return 0.
The subroutine in PARSE_BACKTRACK (or PB) is called when backtracking through a leaf. It is given 4 arguments: a reference to the object being parsed, $leaf_arg (or {regex_match => $leaf_arg}), the current value, and the "parse match" that was stored when moving forward. It should return false. If it returns true, then the parsing immediately ends in failure. This can be used to set up a rule
pass_this_no_backtrack => L(qr//,PB(sub{return 1}))
that if encountered during parsing means that no backtracking will occur previous to this rule.
The string $display is used in the related module Parse::Stallion::EBNF as to the string to show for the leaf rule.
EVALUATION and UNEVALUATION are explained in the section 'EVALUATION'.
PARSING NON-STRINGS
Three subroutines should be provided: an increasing_value function for ensuring parsing is proceeding correctly, a default 'leaf' rule matcher/modifier for when the parser is moving forward, a default 'leaf' rule unmodifier for when the parser is backtracking, and an increasing_value_function to prevent "left recursion".
Parsing is completed only if the object being parsed becomes undefined or equal to ''. The latter is the condition that parsing strings must match.
my $object_parser = new Parse::Stallion({
...
parse_forward =>
sub {
my ($object_ref, $parameters, $current_value) = @_;
...
return ($true_if_object_matches_rule,
$value_to_store_in_leaf_node);
},
parse_backtrack =>
sub {
my ($object_ref, $rules, $current_value, $value_stored_in_leaf_node)
= @_;
...
},
increasing_value_function =>
sub {
my $object = shift;
...
return $value_of_object;
},
});
When evaluating the parse tree, the parameters to the 'leaf' nodes are the values returned in parse_forward, $value_to_store_in_leaf_node. These values are joined together for parse_match.
The script object_string.pl in the example directory shows how to use this.
'LEAF' LEAF PARSE FORWARD/BACKTRACK
All 'leaf' rules need to be set up such that when the parser is moving forward and reaches a 'leaf', the 'leaf' rule attempts to match the current input object. If there is a match, then the input object is modified to the object's next state and a value is stored to be called upon later during tree evaluation.
When backtracking, the object being parsed should be reverted to the state before being matched by the 'leaf' rule.
In parsing a string, substrings are removed from the beginning of the string and reattached to the beginning when backtracked.
INCREASING_VALUE FUNCTION
A function, called 'increasing_value', must be provided that takes the object being parsed and returns a numeric value that either is unchanged or increases after the 'leaf' rule's match_and_modify_input_object is called.
This function is used to detect and prevent "left recursion" by not allowing a non-'leaf' rule to repeat at the same value. 'Multiple' rules are prevented from repeating more than once at the same value.
The function also cuts down on the number of steps by allowing the parser to not repeat dead-end parses. If during the parse, the same rule is attempted a second time on the parse object with the same increasing_value, and the first parse did not succeed, the parser will begin backtracking.
In parsing a string, the length of the string so far parsed acts as the increasing function.
STRINGS
By default, strings are matched, which is similar to
my $calculator_stallion = new Parse::Stallion({
...
parse_forward =>
sub {
my $input_string_ref = shift;
my $rule_definition = shift;
my $m = $rule_definition->{regex_match}; #regex_match eq regexp of leaf
if ($$input_string_ref =~ s/\A($m)//) {
return (1, $1);
}
return 0;
},
parse_backtrack =>
sub {
my $input_string_ref = shift;
my $rule_definition = shift;
my $stored_value = shift;
if (defined $stored_value) {
$$input_string_ref = $stored_value.$$input_string_ref;
}
return;
},
increasing_value_function => sub {
my $string = shift;
return 0 - length($string);
},
});
EXPORT
The following are EXPORTED from this module:
A AND O OR LEAF L M MULTIPLE OPTIONAL ZERO_OR_ONE Z
E EVALUATION U UNEVALUATION PF PARSE_FORWARD PB PARSE_BACKTRACK
LEAF_DISPLAY USE_PARSE_MATCH
PERL Requirements
Parse::Stallion's installation uses Test::More and Time::Local requiring perl 5.6 or higher. Parse::Stallion should work with earlier versions of perl, neither of those modules is required outside of the test cases for installation.
AUTHOR
Arthur Goldstein, <arthur@acm.org>
ACKNOWLEDGEMENTS
Damian Conway and Greg London.
COPYRIGHT AND LICENSE
Copyright (C) 2007-8 by Arthur Goldstein
BUGS
Please email in bug reports.
TO DO AND FUTURE POSSIBLE CHANGES
Please send in suggestions.
SEE ALSO
example directory
Parsing texts, including references to Extended Backus-Naur Form notation.
Parse::Stallion::CSV, Parse::Stallion::CSVFH, Parse::Stallion::EBNF.
Perl 6 grammars.
lex, yacc, ..., other parsers.