NAME
Parse::Stallion - EBNF based regexp backtracking parser and tree evaluator.
SYNOPSIS
use Parse::Stallion;
my %rules = (rule_name_1 => ..rule_definition.. ,
rule_name_2 => ..rule_definition.. ,
...);
my $stallion = new Parse::Stallion(
\%rules,
# the following parameters are optional
{start_rule => 'rule_name_1', #default the rule which is not a subrule
do_evaluation_in_parsing => 0, #default 0
no_evaluation => 0, #default 0
max_steps => 200000, #default 1000000;
do_not_compress_eval => 0, #default 0
separator => '__XZ__', #default '__XZ__'
need_not_match_whole_string => 0, #default 0
parse_forward => sub {...}, #default no sub
parse_backtrack => sub {...}, #default no sub
traversal_only => 0, #default 0
unreachable_rules_allowed => 0, #default 0
fast_move_back => 1, #default 1 unless any unevaluation/parse_backtrack
});
my $parse_info = {}; # optional, little impact on performance
my $parse_hash = {}; # optional, little impact on performance
my $parse_trace = []; # optional, some impact on performance
my $result = $stallion->parse_and_evaluate($given_string,
# usually omit the following
{max_steps => 30000, #default from parser's creation
parse_info => $parse_info, #if provided, parse info returned
parse_trace => $parse_trace, # if provided, trace returned
start_position => 0, #default 0
start_rule => $start_rule, # default from parser creation
parse_hash => $parse_hash, #used as parse_hash in called routines
});
# returns undef if unable to parse
Rule Definitions (may be abbreviated to first letter):
AND('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))
OR('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))
MULTIPLE('subrule_1', EVALUATION(sub{...}))
LEAF(qr/regex/, EVALUATION(sub{...}))
DESCRIPTION
Stallion parses and evaluates a string using entered grammar rules. The parsing is done top-down via a start rule, in a depth first search forming a parse tree. When a rule does not match the parser backtracks to a node that has another option.
For evaluating a tree node, the evaluation subroutine is given a reference to a hash representing the returned values of the child nodes. The evaluation may be done while creating the parse tree and reject a match affecting which strings parse; this allows complex grammars.
If the evaluation is not done while parsing, on a successful parse, the tree is evaluated in bottom up, left to right order.
The grammars recognized are context free and are similar to those expressed in Extended Backus-Naur Form (EBNF).
The object being parsed does not need to be a string. Except for the section on non-strings, the documentation assumes strings are being parsed.
COMPLETE EXAMPLES
The following examples read in two unsigned integers and adds them.
use Parse::Stallion;
my %basic_grammar = (
expression =>
AND('number',
qr/\s*\+\s*/,
'number',
EVALUATION(
sub {return $_[0]->{number}->[0] + $_[0]->{number}->[1]})
),
number => LEAF(qr/\d+/)
);
my $parser = new Parse::Stallion(\%basic_grammar);
my $result = $parser->parse_and_evaluate('7+4');
#$result should contain 11
Using more terse function names:
my %grammar_2 = (
expression =>
A('number',
qr/\s*\+\s*/,
{right_number => 'number'},
E(sub {return $_[0]->{number} + $_[0]->{right_number}})
),
number => L(qr/\d+/)
);
my $parser_2 = new Parse::Stallion(
\%grammar_2, {start_rule => 'expression'});
my $result_2 = $parser_2->parse_and_evaluate('8 + 5');
#$result_2 should contain 13
Reading a grammar from a string instead of functions:
use Parse::Stallion::EBNF; #see documentation on Parse::Stallion::EBNF
my $grammar_3 = 'start = (left.number qr/\s*\+\s*/ right.number)
S{return $_[0]->{left} + $_[0]->{right}}S;
number = qr/\d+/;';
my $parser_3 = ebnf_new Parse::Stallion::EBNF($grammar_3);
my $result_3 = $parser_3->parse_and_evaluate('1 + 6');
#$result_3 should contain 7
RULES
There are 4 rule types: 'LEAF', 'AND', 'OR', and 'MULTIPLE'.
Parsing begins from the start rule, if the 'start_rule' parameter is omitted, the rule which is not a subrule is used as the start rule. The start rule can be of any type, though if the start rule is a 'LEAF', the grammar is essentially just a regular expression.
After a successful parse, the external nodes correspond to the substrings that the 'LEAF' rules matched; the other rule types correspond to the internal nodes.
LEAF
A 'LEAF' rule contains a regexp that must match the input string starting from the current position. During parsing, when a 'LEAF' matches, the parser's position is moved to the end of the match. The matched value is stored in the parse tree. The text LEAF is optional, regexp's are assumed to be leaves.
One may use the keyword TOKEN or TERMINAL instead of LEAF.
If the regexp matches but is empty, the matched value is set to an empty string as opposed to an undefined value.
Examples (equivalent):
LEAF(qr/xx\w+/)
L(qr/xx\w+/)
qr/xx\w+/
TOKEN(qr/xx\w+/)
would match any perl word (\w+) starting with "xx".
See the section 'LEAF DETAILS' for more details on 'LEAF' rules.
AND
An 'AND' rule contains a list of subrules that must be completely matched, from left to right, for the 'and' rule to match.
Examples (equivalent):
AND('rule_1', 'rule_2', 'rule_3')
A('rule_1', 'rule_2', 'rule_3')
OR
An 'OR' rule contains a list of subrules, one of which must be matched for the 'OR' rule to match.
During parsing, the subrules are attempted to be matched left to right. If a subrule matches and then is subsequently backtracked, the parser will try to match the next subrule.
Examples (equivalent):
OR('rule_1', 'rule_2', 'rule_3')
O('rule_1', 'rule_2', 'rule_3')
MULTIPLE (and OPTIONAL)
A 'MULTIPLE' rule matches if its subrule matches repeatedly between a minimum and maximum number of times. The first parameter to 'MULTIPLE' is the subrule, the next 2 optional parameters are the minimum and maximum repetitions allowed. The default minimum is 0 and the default maximum is "infinite", though this is represented by setting the maximum to 0.
Each occurrence of the subrule must increase the position, this prevents "left recursion".
By default the maximal number of possible matches of the repeating rule are tried and then if backtracking occurs, the number of matches is decremented.
Examples (equivalent):
MULTIPLE('rule_1')
M('rule_1')
M('rule_1', 0, 0)
One can label a rule with the value 'OPTIONAL' that maps to a 'MULTIPLE' rule with minimum 0 and maximum 1.
Examples (equivalent):
OPTIONAL('rule_1')
ZERO_OR_ONE('rule_1')
Z('rule_1')
MULTIPLE('rule_1', 0, 1)
to get one or more:
MULTIPLE('rule_2', 1, 0)
MATCH_MIN_FIRST
If the parameter MATCH_MIN_FIRST is passed in to a MULTIPLE rule, the minimal number of matches is tried first and the number of matches increases when backtracking.
my $mmf_parser = new Parse::Stallion({
start_expression => A(M(qr/i/,1,0,MATCH_MIN_FIRST),{rest=>qr/.*/},
E(sub {return $_[0]->{rest}}))
});
my $mmf_result = $mmf_parser->parse_and_evaluate('ii');
# $mmf_result should contain 'i'
my $nmmf_parser = new Parse::Stallion({
start_expression => A(M(qr/i/,1,0),{rest=>qr/.*/},
E(sub {return $_[0]->{rest}}))
});
my $nmmf_result = $nmmf_parser->parse_and_evaluate('ii');
# $nmmf_result should contain ''
SIMILARITY BETWEEN RULE TYPES.
The following rules all parse tree-wise equivalently and evaluate equivalently since a MULTIPLE rule that has at most 1 child does not return an array ref.
AND('subrule')
O('subrule')
MULTIPLE('subrule', 1, 1)
M('subrule', 1, 1)
NESTED RULES
Rules can be nested inside of other rules. See the section EVALUATION for how nested rules affect tree evaluations.
Example:
sum => AND('number', MULTIPLE(AND('plus', 'number')));
is equivalent, parsing wise, to
sum => A('number', 'plus_numbers');
plus_numbers = M('plus_number');
plus_number => A('plus', 'number');
ALIASES
One can also use an alias for a rule by a hash reference with a single key/value pair. This sets the name to the key when evaluating the parsed expression and parsing the subrule specified by the value.
adding = A(
'number', L(qr/\s*[+]\s*/),
{right_number => 'number'},
E(sub {return $_[0]->{number} + $_[0]->{right_number}})
);
RULE NAMES
Rule names cannot contain the 'separator' substring '__XZ__', to avoid conflicting with internally generated rule names. This can be changed by using the 'separator' parameter.
ENSURING RULES FORM COMPLETE GRAMMAR
Stallion ensures that a grammar is complete and croaks if the given grammar has any rules not reachable from the start rule or if within any rule a subrule does not exist.
The parameter unreachable_rules_allowed can override the reachability condition.
MATCH_ONCE
If the parameter MATCH_ONCE is part of a rule, only one match is tried; there is no backtracking to other possible matches involving descendants of a node created by that rule. This is for making matching faster by cutting down the number of steps though it also does affect what can be matched.
In the parser below 'x' can be matched by the rule but 'xx' will not be because of the MATCH_ONCE, 'OR' rules are tried left to right.
$p = new Parse::Stallion({r => O(qr/x/, qr/xx/, qr/yy/, MATCH_ONCE)});
$p->parse_and_evaluate('x'); #returns 'x'
$p->parse_and_evaluate('yy'); #returns 'yy'
$p->parse_and_evaluate('xx'); #does not parse, returns undef
In a 'MULTIPLE' rule with MATCH_ONCE, if MATCH_MIN_FIRST is used, then the maximum repetition number is meaningless. It makes almost no sense to have MATCH_ONCE, MATCH_MIN_FIRST, and a minimum repetition of 0 in the same rule, the rule will be matched with no children.
If there are no parse_backtrack and no unevaluation routines then a MATCH_ONCE routine can be done in fast mode, just chopping off the node from the tree without backtracking through the subtree at the node. The check for parse_backtrack and unevaluation routines can be overridden by creating the parser with the fast_move_back parameter.
MATCH_ONCE in MULTIPLE Examples
The following parsers all parse the same strings but can take different amounts of steps on the same input.
my $pi = {};
my $mo_1 = new Parse::Stallion(
{rule1 => A(M(qr/t/), M(qr/t/), qr/u/)});
$result = $mo_1->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 1 parse steps ".$pi->{number_of_steps}."\n"; # should be 157
my $mo_2 = new Parse::Stallion(
{rule2 => A(M(qr/t/, MATCH_ONCE), M(qr/t/, MATCH_ONCE), qr/u/)});
$result = $mo_2->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 2 parse steps ".$pi->{number_of_steps}."\n"; # should be 15
my $mo_3 = new Parse::Stallion(
{rule2 => A(M(qr/t/, MATCH_ONCE), M(qr/t/, MATCH_ONCE),
L(qr/u/, PB(sub {return 0})), MATCH_ONCE)});
$result = $mo_3->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 3 parse steps ".$pi->{number_of_steps}."\n"; # should be 27
my $mo_4 = new Parse::Stallion(
{rule2 => A(M(qr/t/, MATCH_ONCE), M(qr/t/, MATCH_ONCE),
L(qr/u/, PB(sub {return 0})), MATCH_ONCE)}, {fast_move_back => 1});
$result = $mo_4->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 4 parse steps ".$pi->{number_of_steps}."\n"; # should be 15
PARSE_AND_EVALUATE
After setting up a Parse::Stallion parser, strings are parsed and evaluated via parse_and_evaluate. The returned value is the returned value of the root node's evaluation routine.
In a typical parse, the current position corresponds to how much of the string from left to right has been parsed, moving from 0 to its length.
PARTIAL STRING PARSES
The parser completes a parse only if the grammar matches the whole string unless the parameter need_not_match_whole_string is set.
Example showing differences
my $one_grammar = {start => qr/1/};
my $parser = new Parse::Stallion($one_grammar);
my $partial_parser = new Parse::Stallion($one_grammar,
{need_not_match_whole_string => 1});
$parser->parse_and_evaluate('12'); # does not parse
$partial_parser->parse_and_evaluate('12'); # parses (returns '1')
START POSITION AND REFERENCE INPUT
The default start position of $input_string is pos $input_string which is most likely 0. One can specify the start_position as a parameter of parse_and_evaluate or by creating an initial_position_routine.
Example showing how to loop on input
my $parser = new Parse::Stallion({n => L(qr/(\d+)\;/,E(sub{$_[0]+1}))},
{need_not_match_whole_string => 1});
my $input = '342;234;532;444;3;23;';
my $pi = {final_position => 0};
my $input_length = length($input);
while ($pi->{final_position} != $input_length) {
push @results, $parser->parse_and_evaluate($input,
{parse_info=> $pi, start_position => $pi->{final_position}});
}
# @results should contain (343, 235, 533, 445, 4, 24)
However, by noting that the pos of a string is set by parse_and_evaluate, one can rewrite the loop above as:
my $parser = new Parse::Stallion({n => L(qr/(\d+)\;/,E(sub{$_[0]+1}))},
{need_not_match_whole_string => 1});
my $input = '342;234;532;444;3;23;';
while (my $result = $parser->parse_and_evaluate($input)) {
push @results, $result;
}
# @results should contain (343, 235, 533, 445, 4, 24)
# pos $input would be undef before loop and 21 after loop
RETURNED VALUES
One can check if the returned value is 'undef' to determine if an evaluation failed; an empty string is returned if parsing succeeds but evaluation results in an undef. Also one can look at $parse_info->{parse_succeeded} which has a value even if there is no evaluation.
To get details on the parsing from parse_and_evaluate, there are two optional parameters: parse_info that receives a hash ref and parse_trace that receives an array ref. These ref's are filled in during parse_and_evaluate.
my $parse_info = {};
my $parse_trace = [];
my $value = $parser->parse_and_evaluate($string_to_parse,
{parse_info=>$parse_info, parse_trace => $parse_trace});
$parse_info->{parse_succeeded}; # true (1) if string parses
$parse_info->{tree}; # Root of resultant parse tree
$parse_info->{number_of_steps}; # Number of steps taken
$parse_info->{start_rule}; # Start rule used for this parse
$parse_info->{start_position}; # Initial position of parse
$parse_info->{final_position}; # 0 if parse failed
$parse_info->{final_position_rule}; # Last rule looked at
$parse_info->{maximum_position}; # Maximum position in parse
$parse_info->{maximum_position_rule}; # First rule at maximum position
$parse_info->{parse_backtrack_value};
# 0 unless parse backtrack call ends parse
An entry in $parse_trace looks like:
$parse_trace->[$step]->{rule_name}
$parse_trace->[$step]->{moving_forward} # 0 if backtracking
$parse_trace->[$step]->{moving_down} # 0 if moving up parse tree
$parse_trace->[$step]->{current_position}
$parse_trace->[$step]->{node_creation_step} # for current node
$parse_trace->[$step]->{parent_node_creation_step}
$parse_trace->[$step]->{message} # informative message on previous step
$parse_trace->[$step]->{tree} # stringified snapshot of parse tree
STRINGIFIED PARSE TREE
The tree is a Parse::Stallion object having a function, that converts a tree into a string, each node consisting of one line:
$parse_info->{tree}->stringify({values=>['name','parse_match']});
Internally generated node names, from rules generated by breaking up the entered rules into subrules, will show up. The module Parse::Stallion::EBNF shows the grammar with these generated subrules. The node keys described in the section Parse Tree Nodes can be passed to stringify.
PARSE STEPS
A step is an action on a node, roughly speaking matching a regex for a 'leaf' node, or moving forward or backtracking from a node. Each step is listed in the optional parse_trace.
If the parsing reaches the maximum number of steps the parse fails (croak). The maximum number of steps can be changed via max_steps. If max_steps is set to a negative number, there is no limit on the number of steps.
$stallion->parse_and_evaluate($string, {max_steps=>200000});
"LEFT RECURSION"
Parse::Stallion checks the grammar for "left recursion" and will not build the grammar if left recursion is detected (croaks). However, there are some cases which are not possible to detect, i.e. whether a parse forward routine will change the position or a case where a regexp matches but returns an empty string such as qr/\B/ .
Parse::Stallion may encounter "left recursion" during parsing in which case the parsing stops and a message is 'croak'ed.
"Left recursion" occurs during parsing when the same non-'leaf' rule shows up a second time on the parse tree at the same position.
Illegal Case 1:
expression => AND('expression', 'plus', 'term')
Expression leads to expression leads to expression ....
Illegal Case 2:
rule_with_empty => AND('empty', 'rule_with_empty', 'other_rule')
empty => qr//
The second case is detected while building the grammar, regexp's are checked to see if they match the empty string.
Illegal Case 3:
rule_with_pf => A(L(PF(sub {return 1}))', 'rule_with_pf',
'other_rule')
The third case will be detected during parsing.
EVALUATION
Evaluation can occur during or after parsing.
Each rule may have an evaluation subroutine tied to it. When defining a rule if there is no subroutine, a default "do nothing" subroutine is provided. Nested subrules may have a subroutine tied to them though by default they have none.
When setting up a rule, one can specify the EVALUATION subroutine by the parameter 'EVALUATION' or 'E' which in turn takes one parameter, the evaluation subroutine for that node.
Each node has a computed value that is the result of calling its evaluation routine. The returned value of the parse is the computed value of the root node.
There are two parameters to a node's evaluation routine.
The first parameter to the evaluation routine is either a hash or a string.
If the node is a leaf regexp that has a parenthesized match inside, what is matched by the first parenthesized match is the parameter. Else if the node is a leaf then what is matched by the leaf is the first parameter. Else if 'USE_STRING_MATCH()' has been set for the node's rule, the substring equivalent to a join of all the matched strings of the nodes' descendants is the parameter.
For other internal nodes, the first parameter is a hash. The hash's keys are the named subrules of the node's rule, the values are the computed value of the corresponding children nodes. If a key could repeat, the value is an array reference.
The second parameter to an evaluation routine is the "parse_hash", see the section entitled Parse Hash.
By nesting a rule with an alias, the alias is used for the name of the hash parameter instead of the rule name.
Parse Hash
The parse hash is a hash ref that is passed to the evaluation, unevaluation, parse_forward, and parse_backtrack routines. It is the same hash ref throughout a specific parse so one can store values there to pass among the routines. One can pass in a hash ref, with some preset keys, to be used as the parse_hash for a given parse_and_evaluate call.
There are several keys that are set
parse_this_ref # Reference to object being parsed
current_position # The current position of the object being parsed
#, not set for evaluation after parsing
parent_node # Node in tree for child to be created for parse_forward
# and parse_backtrack, child has been removed before call
# to parse_backtrack; see Parse Tree Nodes
current_node # active node in parse tree for evaluation, unevaluation,
# see section Parse Tree Nodes
parse_match # for parse_backtrack routine, match from parse_forward
rule_name # Name of rule, may be internally generated
rule_info # Hash of rule names which have RULE_INFO set
Any future parameters will be given double underscores at the start of their names.
Parse Tree Nodes
A parse tree node is a hash reference composed of:
name, # Name of rule, this may be internally generated
alias, # Alias of rule in evaluation phase parameter list
steps, # Number of steps in parse when node created
parent, # Node's parent hash reference
position_when_entered, # Position when node created
position_when_completed, # Position when node completely parsed
children, # Array reference to nodes of children (undef if leaf)
child_count, # scalar(@children)
parse_match, # matched leaf string or match returned by parse_forward
In addition, there will be some keys beginning with double underscore (__).
Parse tree nodes are also passed to the parse forward and parse backtrack subroutines.
One should not change the values of a parse node hash. However, one is allowed to add key/values to the hash (avoid key names with double underscores).
Evaluation Examples
Comments refer to the parameters of the evaluation subroutine.
LEAF(qr/d\w+/, E(sub {...})) # word starting with d
L(qr/(d)\w+/, E(sub {...})) # $_[0] = 'd'
L(qr/\s*(\w+)\s*/, E(sub {...})) # white space trimmed
A(qr/a/, qr/b/, E(sub {...})) #$_[0]->{''}->['a','b']
A(qr/a/, qr/b/, E(sub {...}), USE_STRING_MATCH()) # $_[0] = 'ab'
A({f=>qr/a/}, {f=>qr/b/}, E(sub {...})) # $_[0]->{'f'} = ['a','b']
A('rule1', 'rule2', 'rule3', E(sub {...}))
#params are $_[0]->{rule1}, $_[0]->{rule2}, $_[0]->{rule3}
LOCATION
A function, LOCATION is provided that takes a string reference and a position in the string and computes the line number and tab value of the given position.
my ($line, $tab);
my $loc_grammar = {
start =>
A(qr/....../s,
L(qr//, E(
sub {
($line, $tab) = LOCATION($_[1]->{parse_this_ref},
$_[1]->{current_node}->{position_when_entered})
})),
qr/.*/s)
};
my $loc_parser = new Parse::Stallion($loc_grammar);
$loc_parser->parse_and_evaluate("ab\nd\nfghi");
# $line == 3, $tab == 2
One can also use this function after a parse to determine where the
maximum position occured in the input string.
EVALUATION AFTER PARSING
In evaluation after parsing, the default, Parse::Stallion will evaluate the parse tree in a bottom up left to right traversal.
EVALUATION (AND UNEVALUATION) DURING PARSING
If the do_evaluation_in_parsing is set when a Parse::Stallion object is created the evaluation occurs during the parsing instead of afterwards. Alternatively, if there exists any UNEVALUATION routine, the evaluation is done during parsing.
Every time a node is matched, its evaluation routine is called. The parse_hash will have the key current_position set and the node will not yet have the key position_when_completed.
The evaluation routine may return a second parameter that tells Parse::Stallion to reject or not reject the match. This allows more control over what can be parsed.
The same node may be evaluated several times due to backtracking. One must be careful either to not change the references/objects passed in to the evaluation routine or to undo changes in an unevaluation routine. An unevaluation routine is called when backtracking reverts back to a node, the parameters are the same as for the evaluation routine. UNEVALUATION (alternatively U) are used to set the subroutine.
The following example shows a 'leaf' rule that matches all words except for those marked in the hash %keywords:
our %keywords = ('key1'=> 1, 'key2' => 1);
my %g = (
start => A('leaf', qr/\;/),
leaf => L(
qr/\w+/,
E(sub {if ($keywords{$_[0]}) {return (undef, 1)} return $_[0]}),
),
);
my $parser = new Parse::Stallion(\%g, {do_evaluation_in_parsing=>1});
$parser->parse_and_evaluate('key1;'); #should return false
$parser->parse_and_evaluate('key3;'); #should return true
# ( {''=>';',leaf=>'key3'} )
EVALUATION DURING PARSING EXAMPLE
In this example, the first statement tells a truth about the number of elements in a list, the second tells whether or not the first statement is true. If the second statement is true, the string parses.
my %parsing_rules = (
start_expression =>
AND('two_statements', qr/\z/,
E(sub {return $_[0]->{'two_statements'}})
),
two_statements =>
A('list_statement','truth_statement',
EVALUATION(sub {
if ($_[0]->{list_statement} != $_[0]->{truth_statement}) {
return (undef, 1);
}
return 1;
})
),
list_statement =>
A('count_statement', 'list',
E(sub {
if ($_[0]->{count_statement} == scalar(@{$_[0]->{list}})) {
return 1;
}
return 0;
})
),
count_statement =>
A(qr/there are /i,'number',L(qr/ elements in /),
E(sub { return $_[0]->{number}; })
),
number => qr/\d+/ ,
list => A('number', M(A(qr/\,/}, 'number')),
EVALUATION(sub {return $_[0]->{number}})
),
truth_statement =>
O({t => qr/\. that is the truth\./},
{t=>qr/\. that is not the truth\./},
E(sub {
if ($_[0]->{t} =~ /not/) {
return 0;
}
return 1;
})
),
);
my $how_many_parser = new Parse::Stallion(
\%parsing_rules,
{do_evaluation_in_parsing => 1,
start_rule => 'start_expression',
});
$result = $how_many_parser->parse_and_evaluate(
"there are 5 elements in 5,4,3,2,1. that is the truth.");
print "$result should be 1\n";
$result = $how_many_parser->parse_and_evaluate(
"there are 5 elements in 5,4,3,1. that is not the truth.");
print "$result should be 1\n";
$result = $how_many_parser->parse_and_evaluate(
"there are 5 elements in 5,4,3,1. that is the truth.");
print "$result should be undef\n";
DEFAULT EVALUATION ROUTINE
If a rule does not have an evaluation routine specified, it is as if a generic subroutine is used.
Generic Evaluation Routine 1
If the passed in hash reference has only one key, then the value of that key in the hash reference is returned.
If the passed in hash reference has more than one key, then the hash reference is returned.
It is as if this routine were run:
sub {
my $parameters = shift;
if (ref $parameters eq 'HASH') {
if (keys %$parameters == 1) {
my ($key) = keys %$parameters;
return $parameters->{$key};
}
}
return $parameters;
}
Generic Evaluation Routine 2
If do_not_compress_eval is set when creating the parser, it is as if this routine were run:
sub {
return shift;
}
Example:
my %n_eval_rules = (
start_rule => A('term',
M(A({plus=>qr/\s*\+\s*/}, 'term'))),
term => A({left=>'number_or_x'},
M(A({times=>qr/\s*\*\s*/},
{right=>'number'}))),
number_or_x => O('number',qr/x/),
number => qr/\s*\d*\s*/
);
my $n_eval_parser = new Parse::Stallion(\%n_eval_rules,
{do_not_compress_eval => 0});
$result = $n_eval_parser->parse_and_evaluate("7+4*8");
#$result contains:
{ 'plus' => [ '+' ],
'term' => [ '7',
{ 'left' => '4',
'right' => [ '8' ],
'times' => [ '*' ] } ] }
my $dnce_n_eval_parser = new Parse::Stallion(\%n_eval_rules,
{do_not_compress_eval => 1});
$result = $dnce_n_eval_parser->parse_and_evaluate("7+4*8");
#$result contains:
{ 'plus' => [ '+' ],
'term' => [ { 'left' => {number => '7'} },
{ 'left' => {number => '4'},
'right' => [ {number => '8'} ],
'times' => [ '*' ] } ] }
Parameter types to Evaluation Routines
If a named parameter could appear more than once, it is passed as an array, else as a scalar. Being passed as an array could be caused by either:
the name being within a 'MULTIPLE' rule which does not have maximum children of 1
occurring more than once within the subrules of an 'AND' rule.
The routine which_parameters_are_arrays returns a hash of the possible values passed to an evaluation routine. For a given key, if the value is 1, the key would be passed to the evaluation routine as an array, if the value is 0, it would be passed as a scalar.
MORE COMPLICATED EVALUATION EXAMPLE
The following is a simple calculator:
%calculator_rules = (
start_expression => A(
'expression', qr/\z/,
E(sub {return $_[0]->{expression}})
),
expression => A(
'term',
M(A('plus_or_minus', 'term')),
E(sub {my $to_combine = $_[0]->{term};
my $plus_or_minus = $_[0]->{plus_or_minus};
my $value = $to_combine->[0];
for my $i (1..$#{$to_combine}) {
if ($plus_or_minus->[$i-1] eq '+') {
$value += $to_combine->[$i];
}
else {
$value -= $to_combine->[$i];
}
}
return $value;
})
),
term => A(
'number',
M(A('times_or_divide', 'number')),
E(sub {my $to_combine = $_[0]->{number};
my $times_or_divide = $_[0]->{times_or_divide};
my $value = $to_combine->[0];
for my $i (1..$#{$to_combine}) {
if ($times_or_divide->[$i-1] eq '*') {
$value *= $to_combine->[$i];
}
else {
$value /= $to_combine->[$i]; #does not check for zero
}
}
return $value;
})
),
number => L(
qr/\s*[+\-]?(\d+(\.\d*)?|\.\d+)\s*/,
E(sub{ return $_[0]; })
),
plus_or_minus => qr/\s*([\-+])\s*/,
times_or_divide => qr/\s*([*\/])\s*/
);
$calculator_parser = new Parse::Stallion(\%calculator_rules,
{start_rule => 'start_expression'
});
my $result = $calculator_parser->parse_and_evaluate("3+7*4");
# $result should contain 31
my $array_p = $calculator_parser->which_parameters_are_arrays('term');
# $array_p would be {number => 1, times_or_divide => 1}
$array_p = $calculator_parser->which_parameters_are_arrays(
'start_expression');
# $array_p would be {expression => 0}
STRING_EVALUATION
Instead of passing an evaluation subroutine, one can pass in the string of a subroutine via STRING_EVALAUATION (or SE). The string is modified so that the passed in parameters become local variables.
Example:
a_rule => A('x_1', 'x_2', 'x_3', 'x_1'),
SE('#comment'));
results in the evaluation routine equivalent to:
sub {
$x_1 = $_[0]->{x_1};
$x_2 = $_[0]->{x_2};
$x_3 = $_[0]->{x_3};
#comment
}
If the rule is a leaf then $_ is set to the parameter $_[0]; else, $_ is the one unnamed rule or an array ref of the unnamed rules:
$_ = $_[0]->{''};
NO EVALUATION
Evaluation can be prevented by setting no_evaluation when the grammar is created. The return of parse_and_evaluate is always undef in this case but parse_info can be used to view the parse results.
TRAVERSAL ONLY
There is some overhead to computing the parameters with the names of the subrules. If traversal_only is set when the grammar is created, this is not computed and the first parameter to each evaluation subroutine is undef.
LEAF DETAILS
Leaf rules can be set up as follows:
LEAF($leaf_arg, PARSE_FORWARD(sub{...}), PARSE_BACKTRACK(sub{...}),
EVALUATION(sub{...}), UNEVALUATION(sub{...}), LEAF_DISPLAY($display));
If $leaf_arg is a Regexp, it is converted into a hash ref: {regex_match => $leaf_arg} for internal purposes.
If a default PARSE_FORWARD is not provided then a regexp match is attempted on the string being parsed at the current_position. Default parse_forward and parse_backtrack subroutines can be provided for leaves.
The subroutine in PARSE_FORWARD (or PF) is called when moving forwards during the parse. Its one parameter is a hash described in the section entitled Parse Hash above.
If the parsing should continue forward it should return an array with the first argument true (1), the second argument the "parse_match" to store at the node, and an optional third argument being the new parse position. The new parse position must be greater than or equal to the previous position. If parsing should not continue, parse_forward should return 0.
The subroutine in PARSE_BACKTRACK (or PB) is called when backtracking has destroyed a (leaf) node. Its one parameter is the pares_hash.
The parse backtrack routine should normally return a false value. If it returns true, the parsing immediately ends in failure. The value ending the parse is returned in parse_backtrack_value. This can be used to set up a rule
pass_this_no_backtrack => L(qr//,PB(sub{return 1}))
that if encountered during parsing during a backtrack means that the parsing will end.
EVALUATION and UNEVALUATION are explained in the section 'EVALUATION'.
LEAF_DISPLAY
See Parse::Stallion::EBNF for this exported keyword.
PARSE_FORWARD and PARSE_BACKWARD on non-leaf nodes
It is possible to create PARSE_FORWARD and PARSE_BACKWARD routines on non-leaves. These routines are executed before the node is created and after it is destroyed. They may change the position of the parse and the returned value of the match is stored in the node's hash "parse_match" key.
OTHER PARSING NOTES
PARSING NON-STRINGS
Four subroutines may be provided: a default 'leaf' rule matcher/modifier for when the parser is moving forward and a default 'leaf' rule "unmodifier" for when the parser is backtracking. A third optional subroutine, initial_position_routine, sets the initial current value else it is 0.
The fourth subroutine, final_position_routine, should return the final position of a successful parse for a given object. This subroutine is similar to parsing strings ensuring, or not ensuring, that the entire string is matched instead of matching only a portion.
my $object_parser = new Parse::Stallion(\%grammar, {
...
parse_forward =>
sub {
my $parameters = shift;
...
return ($true_if_object_matches_rule,
$value_to_store_in_leaf_node,
$value_equal_or_greater_than_current_position);
},
parse_backtrack =>
sub {
my $parameters = shift;
...
return; #else parsing halts
},
initial_position_routine => sub {my ($object_ref, $parse_hash) = @_;
...
return $initial_position;
},
final_position_routine =>
sub {my ($object_ref, $current_position, $parse_hash) = @_;
...
return $final_position;
# parse ends if $final_position==$current_position
},
});
By default parsing only ends if the entire string is parsed and the start rule is matched. To allow parsing to end regardless of position in the string when the start rule is matched:
final_position_routine => sub {return $_[1];}
This is also done with the parameter need_not_match_whole_string.
When evaluating the parse tree, the parameters to the 'leaf' nodes are the values returned in parse_forward.
The script object_string.pl in the example directory demonstrates this section.
'LEAF' LEAF PARSE FORWARD/BACKTRACK
All 'leaf' rules need to be set up such that when the parser is moving forward and reaches a 'leaf', the 'leaf' rule attempts to match the object being parsed at the current position. If there is a match, then the current_position may increase.
When backtracking, the object being parsed should be reverted, if changed, to the state before being matched by the 'leaf' rule.
NON_DECREASING POSITION
The third value returned from parse_forward should be equal or greater than the $current_position that was passed in.
The position is used to detect and prevent "left recursion" by not allowing a non-'leaf' rule to repeat at the same position. 'MULTIPLE' rules are prevented from repeating more than once at the same position.
The position also cuts down on the number of steps by allowing the parser to not repeat dead-end parses. If during the parse, the same rule is attempted a second time on the parse object at the same position, and the first parse did not succeed, the parser will begin backtracking.
STRINGS
By default, strings are matched, which, if a reference to the string instead of the string is passed in to parse_and_evaluate, is similar to that found in the test case object_string.t:
my $calculator_stallion = new Parse::Stallion({
...
parse_forward =>
sub {
my $parameters = shift;
my $input_string_ref = $parameters->{parse_this_ref};
my $rule_info = $parameters->{rule_info};
my $rule_name = $parameters->{rule_name};
my $rule_definition = $rule_info->{$rule_name};
my $m = $rule_definition->{nsl_regex_match};
if ($$input_string_ref =~ s/\A($m)//) {
return (1, $1, 0 - length($string));
}
return 0;
},
parse_backtrack =>
sub {
my $parameters = shift;
my $input_string_ref = $parameters->{parse_this_ref};
my $stored_value = $parameters->{parse_match};
if (defined $stored_value) {
$$input_string_ref = $stored_value.$$input_string_ref;
}
return;
},
initial_position_routine =>
sub {
my $input_string_ref = shift;
return 0 - length($$input_string_ref);
},
final_position_routine =>
sub {
return 0;
}
});
PARSING VERY LARGE INPUTS
In a normal course of parsing, the input string is split up and stored at each leaf node. If the input size is very large, this could cause memory problems. Note, the input string is not copied, space will be consumed by the leaves containing all the substrings and the parse tree itself.
Below are some mostly untested tips on how this can be handled:
NOT STORING VALUES IN LEAFS
By allowing the first parenthesized expression of a leaf to be empty, i.e.: qr/()leaf_reg_ex/, then the empty string will be stored at that leaf. When evaluating the leaf, the first parameter will be the empty string. One can make use of USE_STRING_MATCH.
REMOVE NODES FROM TREE DURING PARSE
One need be very careful to try this. If the parse is always moving forward and will never backtrack then after a node is evaluated, i.e. if do_parsing_in_evaluation=1, then the children of that node are no longer needed. Since the evaluation routine has access to the node and the node's children, one can delete the {children} key from the node when a node is completed. This would keep the parse tree small.
EXPORT
The following are EXPORTED from this module:
A AND E EVALUATION L LEAF LEAF_DISPLAY LOCATION MATCH_MIN_FIRST
MATCH_ONCE M MULTIPLE O OPTIONAL OR PARSE_FORWARD PARSE_BACKTRACK PB
PF RULE_INFO R SE STRING_EVALUATION TERMINAL TOKEN U
UNEVALUATION USE_STRING_MATCH Z ZERO_OR_ONE
PERL Requirements
Parse::Stallion's installation uses Test::More and Time::Local requiring perl 5.6 or higher. Parse::Stallion should work with earlier versions of perl, neither of those modules is required outside of the test cases for installation.
VERSION
1.00
AUTHOR
Arthur Goldstein, <arthur@acm.org>
ACKNOWLEDGEMENTS
Damian Conway, Christopher Frenz, and Greg London.
COPYRIGHT AND LICENSE
Copyright (C) 2007-9 by Arthur Goldstein. All Rights Reserved.
This module is free software. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)
BUGS
Please email in bug reports.
TO DO AND FUTURE POSSIBLE CHANGES
Please send in suggestions.
SEE ALSO
example directory (includes stallion.html, a javascript translation of an earlier version of Parse::Stallion) and test case directory t
Parse::Stallion::EBNF. Outputs grammars in more readable form. Also contains an example of how one could input a grammar from a string, similar to how many parsers take their input.
Parse::Stallion::CSV. Example of how to create a parser from specification.
Perl 6 grammars.
lex, yacc, ..., other parsers.