NAME

Parse::Stallion - EBNF based regexp backtracking parser and tree evaluator.

SYNOPSIS

use Parse::Stallion;

my %rules = (rule_name_1 => ..rule_definition.. ,
 rule_name_2 => ..rule_definition.. ,
 ...);

my $stallion = new Parse::Stallion(
  \%rules,
   # the following parameters are optional
 {start_rule => 'rule_name_1', #default the rule which is not a subrule
  do_evaluation_in_parsing => 0, #default 0
  no_evaluation => 0, #default 0
  max_steps => 200000, #default 1000000;
  do_not_compress_eval => 0, #default 0
  separator => '__XZ__', #default '__XZ__'
  parse_forward => sub {...}, #default no sub
  parse_backtrack => sub {...}, #default no sub
  unreachable_rules_allowed => 0, #default 0
  fast_move_back => 1, #default 1 unless any unevaluation/parse_backtrack
  parse_trace_routine => sub {...}, #default undef
  incorporate => [{grammar_source => $other_parse_stallion,
                   prefix =>$p}, ...], #default undef
});

my $parse_info = {}; # optional, little impact on performance
my $parse_hash = {}; # optional, little impact on performance
my $parse_trace = []; # optional, some impact on performance
my $result = $stallion->parse_and_evaluate($given_string,
  # usually omit the following
 {max_steps => 30000, #default from parser's creation
  parse_info => $parse_info, # optional, parse info returned
  parse_trace => $parse_trace, # optional, trace returned
  parse_trace_routine => sub {...}, # optional, default from parser
  start_position => 0, #default 0
  start_rule => $start_rule, # default from parser creation
  parse_hash => $parse_hash, #used as parse_hash in called routines
  match_start => 1, #default 1
  match_length => 1, #default 1
  match_maximum => 0, #default 0
  match_minimum => 0, #default 0
  global => 0, #default 0
  find_all => 0, #default 0 unless global == 1 and wantarray
  substitute => 0, #default 0
  substitution => undef, #default undef
 });
# returns undef if unable to parse
my @result = $stallion->parse_and_evaluate($given_string, {global=>1,...})
# returns array of parsed results

$result = $stallion->search($given_string, {...});
$result = $stallion->search_and_substitute($given_string, {...});

Rule Definitions (may be abbreviated to first letter):

AND('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))

OR('subrule_1', 'subrule_2', ..., EVALUATION(sub{...}))

MULTIPLE('subrule_1', EVALUATION(sub{...}))

LEAF(qr/regex/, EVALUATION(sub{...}))

DESCRIPTION

Stallion parses and evaluates a string using entered grammar rules. The parsing is done top-down via a start rule, in a depth first search forming a parse tree. When a rule does not match the parser backtracks to a node that has another option.

For evaluating a tree node, the evaluation subroutine is given a reference to a hash representing the returned values of the child nodes. The evaluation may be done while creating the parse tree and reject a match affecting which strings parse; this allows complex grammars.

If the evaluation is not done while parsing, on a successful parse, the tree is evaluated in bottom up, left to right order.

The grammars recognized are context free and are similar to those expressed in Extended Backus-Naur Form (EBNF).

The object being parsed does not need to be a string. Except for the section on non-strings, the documentation assumes strings are being parsed.

COMPLETE EXAMPLES

The following examples read in two unsigned integers and adds them.

use Parse::Stallion;

 my %basic_grammar = (
  expression =>
   AND('number',
    qr/\s*\+\s*/,
    'number',
    EVALUATION(
     sub {return $_[0]->{number}->[0] + $_[0]->{number}->[1]})
  ),
  number => LEAF(qr/\d+/)
 );

 my $parser = new Parse::Stallion(\%basic_grammar);

 my $result = $parser->parse_and_evaluate('7+4');
 #$result should contain 11

Using more terse function names:

my %grammar_2 = (
 expression =>
  A('number',
   qr/\s*\+\s*/,
   {right_number => 'number'},
   E(sub {return $_[0]->{number} + $_[0]->{right_number}})
 ),
 number => L(qr/\d+/)
);

my $parser_2 = new Parse::Stallion(
 \%grammar_2, {start_rule => 'expression'});

my $result_2 = $parser_2->parse_and_evaluate('8 + 5');
#$result_2 should contain 13

Reading a grammar from a string instead of functions:

use Parse::Stallion::EBNF; #see documentation on Parse::Stallion::EBNF

my $grammar_3 = 'start = (left.number qr/\s*\+\s*/ right.number)
   S{return $left + $right}S;
  number = qr/\d+/;';

my $parser_3 = ebnf_new Parse::Stallion::EBNF($grammar_3);

my $result_3 = $parser_3->parse_and_evaluate('1 + 6');
#$result_3 should contain 7

RULES

There are 4 rule types: 'LEAF', 'AND', 'OR', and 'MULTIPLE'.

Parsing begins from the start rule, if the 'start_rule' parameter is omitted, the rule which is not a subrule, except possibly of itself, is used as the start rule. The start rule can be of any type, though if the start rule is a 'LEAF', the grammar is essentially just a regular expression.

After a successful parse, the external nodes correspond to the substrings that the 'LEAF' rules matched; the other rule types correspond to the internal nodes.

LEAF

A 'LEAF' rule contains a regexp that must match the input string starting from the current position. During parsing, when a 'LEAF' matches, the parser's position is moved to the end of the match. The matched value is stored in the parse tree. The text LEAF is optional, regexp's are assumed to be leaves.

One may use the keyword TOKEN or TERMINAL instead of LEAF.

If the regexp matches but is empty, the matched value is set to an empty string as opposed to an undefined value.

Examples (equivalent):

LEAF(qr/xx\w+/)

L(qr/xx\w+/)

qr/xx\w+/

TOKEN(qr/xx\w+/)

would match any perl word (\w+) starting with "xx".

See the section 'LEAF DETAILS' for more details on 'LEAF' rules.

AND

An 'AND' rule contains a list of subrules that must be completely matched, from left to right, for the 'and' rule to match.

Examples (equivalent):

AND('rule_1', 'rule_2', 'rule_3')

A('rule_1', 'rule_2', 'rule_3')

OR

An 'OR' rule contains a list of subrules, one of which must be matched for the 'OR' rule to match.

During parsing, the subrules are attempted to be matched left to right. If a subrule matches and then is subsequently backtracked, the parser will try to match the next subrule.

Examples (equivalent):

OR('rule_1', 'rule_2', 'rule_3')

O('rule_1', 'rule_2', 'rule_3')

MULTIPLE (and OPTIONAL)

A 'MULTIPLE' rule matches if its subrule matches repeatedly between a minimum and maximum number of times. The first parameter to 'MULTIPLE' is the subrule, the next 2 optional parameters are the minimum and maximum repetitions allowed. The default minimum is 0 and the default maximum is "infinite", though this is represented by setting the maximum to 0.

Each occurrence of the subrule must increase the position, this prevents "left recursion".

By default the maximal number of possible matches of the repeating rule are tried and then if backtracking occurs, the number of matches is decremented.

Examples (equivalent):

MULTIPLE('rule_1')

M('rule_1')

M('rule_1', 0, 0)

One can label a rule with the value 'OPTIONAL' that maps to a 'MULTIPLE' rule with minimum 0 and maximum 1.

Examples (equivalent):

OPTIONAL('rule_1')

ZERO_OR_ONE('rule_1')

Z('rule_1')

MULTIPLE('rule_1', 0, 1)

to get one or more:

MULTIPLE('rule_2', 1, 0)

MATCH_MIN_FIRST

If the parameter MATCH_MIN_FIRST is passed in to a MULTIPLE rule, the minimal number of matches is tried first and the number of matches increases when backtracking.

my $mmf_parser = new Parse::Stallion({
 start_expression => A(M(qr/i/,1,0,MATCH_MIN_FIRST),{rest=>qr/.*/},
  E(sub {return $_[0]->{rest}}))
});

my $mmf_result = $mmf_parser->parse_and_evaluate('ii');
# $mmf_result should contain 'i'

my $nmmf_parser = new Parse::Stallion({
 start_expression => A(M(qr/i/,1,0),{rest=>qr/.*/},
  E(sub {return $_[0]->{rest}}))
});

my $nmmf_result = $nmmf_parser->parse_and_evaluate('ii');
# $nmmf_result should contain ''

SIMILARITY BETWEEN RULE TYPES.

The following rules all parse tree-wise equivalently and evaluate equivalently since a MULTIPLE rule that has at most 1 child does not return an array ref.

AND('subrule')

O('subrule')

MULTIPLE('subrule', 1, 1)

M('subrule', 1, 1)

NESTED RULES

Rules can be nested inside of other rules. See the section EVALUATION for how nested rules affect tree evaluations.

Example:

sum => AND('number', MULTIPLE(AND('plus', 'number')));

is equivalent, parsing wise, to

sum => A('number', 'plus_numbers');
plus_numbers = M('plus_number');
plus_number => A('plus', 'number');

ALIASES

One can also use an alias for a rule by a hash reference with a single key/value pair. This sets the name to the key when evaluating the parsed expression and parsing the subrule specified by the value.

adding =  A(
 'number', L(qr/\s*[+]\s*/),
   {right_number => 'number'},
 E(sub {return $_[0]->{number} + $_[0]->{right_number}})
);

RULE NAMES

Rule names cannot contain the 'separator' substring '__XZ__', to avoid conflicting with internally generated rule names. This can be changed by using the 'separator' parameter.

ENSURING RULES FORM COMPLETE GRAMMAR

Stallion ensures that a grammar is complete and croaks if the given grammar has any rules not reachable from the start rule or if within any rule a subrule does not exist.

The parameter unreachable_rules_allowed can override the reachability condition.

MATCH_ONCE

If the parameter MATCH_ONCE is part of a rule, only one match is tried; there is no backtracking to other possible matches involving descendants of a node created by that rule. This is for making matching faster by cutting down the number of steps though it also does affect what can be matched.

In the parser below 'x' can be matched by the rule but 'xx' will not be because of the MATCH_ONCE, 'OR' rules are tried left to right.

$p = new Parse::Stallion({r => O(qr/x/, qr/xx/, qr/yy/, MATCH_ONCE)});
$p->parse_and_evaluate('x'); #returns 'x'
$p->parse_and_evaluate('yy'); #returns 'yy'
$p->parse_and_evaluate('xx'); #does not parse, returns undef

In a 'MULTIPLE' rule with MATCH_ONCE, if MATCH_MIN_FIRST is used, then the maximum repetition number is meaningless. It makes almost no sense to have MATCH_ONCE, MATCH_MIN_FIRST, and a minimum repetition of 0 in the same rule, the rule will be matched with no children.

If there are no parse_backtrack and no unevaluation routines then a MATCH_ONCE routine can be done in fast mode, just chopping off the node from the tree without backtracking through the subtree at the node. The check for parse_backtrack and unevaluation routines can be overridden by creating the parser with the fast_move_back parameter.

MATCH_ONCE in MULTIPLE Examples

The following parsers all parse the same strings but can take different amounts of steps on the same input.

my $pi = {};

my $mo_1 = new Parse::Stallion(
 {rule1 => A(M(qr/t/), M(qr/t/), qr/u/)});
$result = $mo_1->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 1 parse steps ".$pi->{number_of_steps}."\n"; # should be 157

my $mo_2 = new Parse::Stallion(
 {rule2 => A(M(qr/t/, MATCH_ONCE), M(qr/t/, MATCH_ONCE), qr/u/)});
$result = $mo_2->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 2 parse steps ".$pi->{number_of_steps}."\n"; # should be 15

my $mo_3 = new Parse::Stallion(
 {rule2 => A(M(qr/t/, MATCH_ONCE), M(qr/t/, MATCH_ONCE),
   L(qr/u/, PB(sub {return 0})), MATCH_ONCE)});
$result = $mo_3->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 3 parse steps ".$pi->{number_of_steps}."\n"; # should be 27

my $mo_4 = new Parse::Stallion(
 {rule2 => A(M(qr/t/, MATCH_ONCE), M(qr/t/, MATCH_ONCE),
   L(qr/u/, PB(sub {return 0})), MATCH_ONCE)}, {fast_move_back => 1});
$result = $mo_4->parse_and_evaluate('ttttt',{parse_info => $pi});
print "case 4 parse steps ".$pi->{number_of_steps}."\n"; # should be 15

PARSE_AND_EVALUATE

After setting up a Parse::Stallion parser, strings are parsed and evaluated via parse_and_evaluate. The returned value is the returned value of the root node's evaluation routine.

During a parse, the current position corresponds to how much of the string from left to right has been parsed.

SEARCH, SEARCH_AND_SUBSTITUTE

Similar to perl's regexp match (m/regexp/) and search (s/regexp/new/), two subroutines, search and search_and_substitute, are provided that perform similar functionality with using Parse::Stallion grammars. They contain calls to parse_and_evaluate with certain parameters set, match_length and match_start set to false.

If global is true, search_and_substitute returns the count of the number of substitutions.

EXAMPLES

my $search_parser = new Parse::Stallion(
  {start => A(qr/b+/, qr/c+/, E(sub{return 'x'}))}
);

my $search_result;
$search_result = $search_parser->search('abd');
#search_result is false ('')
$search_result = $search_parser->search('abcd');
#search_result is true (1)
my $search_string = 'abcd';
$search_result = $search_parser->search_and_substitute($search_string);
#$search_result is true (1), $search_string contains 'axd'

PARTIAL STRING PARSES

The parser completes a parse only if the position equals the length unless the parameter match_length is set to false.

Example showing differences

my $one_grammar = {start => qr/1/};
my $parser = new Parse::Stallion($one_grammar);
my $partial_parser = new Parse::Stallion($one_grammar);
$parser->parse_and_evaluate('12');  # does not parse
$partial_parser->parse_and_evaluate('12',
 {match_length=>0});  # parses (returns '1')

START POSITION AND REFERENCE INPUT

The default start position of $input_string is pos $input_string, likely to be 0. One can specify the start_position as a parameter of parse_and_evaluate.

Examples:

my $parser = new Parse::Stallion({n => L(qr/(\d+)\;/,E(sub{$_[0]+1}))});
my $input = '342;234;532;444;3;23;';

my $pi = {final_position => 0};
my $input_length = length($input);
my @results;
while ($pi->{final_position} != $input_length) {
  push @results, $parser->parse_and_evaluate($input,
   {parse_info=> $pi, start_position => $pi->{final_position},
    match_length => 0});
}
# @results should contain (343, 235, 533, 445, 4, 24)

# Making use of Global:

@results = ();
pos $input = 0;
while (my $result = $parser->parse_and_evaluate($input,
 {global => 1})) {
  push @results, $result;
}
# @results should contain (343, 235, 533, 445, 4, 24)

#By making use of list context:
pos $input = 0;
@results = $parser->parse_and_evaluate($input, {global => 1,
 match_length => 0});
# @results should contain (343, 235, 533, 445, 4, 24)

OTHER PARAMETERS

See the examples at the end of this section.

MATCH_START, MATCH_LENGTH

By default the string from the initial position onwards must match the grammar. By setting match_start to 0, the parse can be matched from the start position or greater; this search can take more steps.

Likewise, a parse by default can end only if the position reaches the length. By setting match_length to 0, this needs not be the case; this search can take fewer steps.

match_start set to false is similar to a parse that has a start rule: start_rule => A(M(qr/./,MATCH_MIN_FIRST),rest_of_grammar)

Using match_minimum the strings can be shorter and fewer lines of code are needed to manipulate the result.

MATCH_MINIMUM, MATCH_MAXIMUM

match_minimum = 1 finds the successful parse with the lowest start position with the lowest final position. match_maximum = 1 finds the successful parse with the lowest start position with the greatest final position.

Setting match_minimum or match_maximum sets match_length to false.

SUBSTITUTE, SUBSTITUTION

If the parameter substitute is set, the parsed string is modified from the matched start to matched end with the evaluated result.

By default the substr function is used for this, but one may use the parameter substitution for a different function. Provided so non-string objects can substitute.

GLOBAL, FIND_ALL

global = 1 preserves the position of the string being marched at its final position. One probably wants to set match_length to false if global is true. If parse_and_evaluate has global = 1 used within a list context, the parse and evaluate is repeated until the parse fails and an array of results is returned.

EXAMPLES

my $plus_times_parser = new Parse::Stallion({
  start => A('term',M(A(qr/\+/,'term')),
       E(sub {my $i = 0; $i += $_ for @{$_[0]->{term}}; return $i;})),
  term => A('expression',M(A(qr/\*/,'expression')),
       E(sub {my $i = 1; $i *= $_ for @{$_[0]->{expression}}; return $i;})),
  expression => O(A(qr/\(/,'start',qr/\)/, E(sub {return $_[0]->{start}})),
   L(qr/\d+/,E(sub {return $_[0]}))),
 },
  {start_rule => 'start'});

my $result = $plus_times_parser->parse_and_evaluate('(3+5*2)*2+4*3');
#$result should contain 38

$result = $plus_times_parser->parse_and_evaluate('example:(3+5*2)*2+4*3');
#$result should be undef

$result = $plus_times_parser->parse_and_evaluate('example:(3+5*2)*2+4*3',
 {match_start => 0});
#$result should contain 38

$result = $plus_times_parser->parse_and_evaluate('(3+5*2)*2+4*3',
 {match_length => 0});
#$result should contain 38 because of how grammar is declared

$result = $plus_times_parser->parse_and_evaluate('(3+5*2)*2+4*3ttt',
 {match_length => 0});
#$result should contain 38

$result = $plus_times_parser->parse_and_evaluate('(3+5*2)*2+4*3',
 {match_minimum =>1});
#$result should contain 13

my $string_with_numbers = '7*8 is greater than 4+3+4 greater than 2*5';
$plus_times_parser->search_and_substitute($string_with_numbers);
#$string_with_numbers should be '56 is greater than 4+3+4 greater than 2*5';

$string_with_numbers = '7*8 is greater than 4+3+4 greater than 2*5';
$plus_times_parser->search_and_substitute($string_with_numbers,
 {global=>1});
#$string_with_numbers should be '56 is greater than 11 greater than 10';

my $choice_parser = new Parse::Stallion({
  start => O(qr'bc', qr'abcdef', qr'abcd', qr'abcde', qr'abc', qr'de')});

$result = $choice_parser->parse_and_evaluate('abcd');
#result should be "abcd"

$result = $choice_parser->parse_and_evaluate('abcdex');
#result should be undef

$result = $choice_parser->parse_and_evaluate('abcdex', {match_minimum => 1});
#result should be "abc"

$result = $choice_parser->parse_and_evaluate('abcdex', {match_maximum => 1});
#result should be "abcde"

$result = $choice_parser->parse_and_evaluate('abcdex', {match_minimum => 1,
 match_start => 0});
#result should be "abc"

@result = $choice_parser->parse_and_evaluate('abcdex', {match_minimum => 1,
  global => 1});
#@result should contain ("abc", "de")

my $p_and_e_string = 'abcdex';
my $result_1 = $choice_parser->parse_and_evaluate($p_and_e_string,
 {match_minimum => 1, global => 1});
my $result_2 = $choice_parser->parse_and_evaluate($p_and_e_string,
 {match_minimum => 1, global => 1});
#$result_1 should contain "abc", $result_2 should contain "de"

$result = $choice_parser->parse_and_evaluate('tabcdex', {match_minimum => 1});
#result should be undef

$result = $choice_parser->parse_and_evaluate('tabcdex', {match_maximum => 1});
#result should be undef

$result = $choice_parser->parse_and_evaluate('tabcdex', {match_minimum => 1,
 match_start => 0});
#result should be "abc"

$result = $choice_parser->parse_and_evaluate('tabcdex', {match_maximum => 1,
 match_start => 0});
#result should be "abcde"

$result = $choice_parser->parse_and_evaluate('abcdex',
 {match_length => 0});
#result should be undef

RETURNED VALUES

One can check if the returned value is 'undef' to determine if an evaluation failed; an empty string is returned if parsing succeeds but evaluation results in an undef. Also one can look at $parse_info->{parse_succeeded} which has a value even if there is no evaluation.

PARSE_INFO and PARSE_TRACE

To get details on the parsing from parse_and_evaluate, there are two optional parameters: parse_info that receives a hash ref and parse_trace that receives an array ref. These ref's are filled in during parse_and_evaluate.

my $parse_info = {};
my $parse_trace = [];
my $value = $parser->parse_and_evaluate($string_to_parse,
 {parse_info=>$parse_info, parse_trace => $parse_trace});

$parse_info->{parse_succeeded}; # true (1) if string parses
$parse_info->{tree}; # Root of resultant parse tree
$parse_info->{number_of_steps}; # Number of steps taken
$parse_info->{start_rule}; # Start rule used for this parse
$parse_info->{final_position};
$parse_info->{final_position_rule}; # Last rule looked at
$parse_info->{maximum_position}; # Maximum position in parse
$parse_info->{maximum_position_rule}; # First rule at maximum position
$parse_info->{parse_backtrack_value};
 # 0 unless parse backtrack call ends parse

An entry in $parse_trace contains:

$parse_trace->[$step]->{rule_name}
$parse_trace->[$step]->{moving_forward} # 0 if backtracking
$parse_trace->[$step]->{moving_down} # 0 if moving up parse tree
$parse_trace->[$step]->{current_position}
$parse_trace->[$step]->{node_creation_step} # for current node
$parse_trace->[$step]->{parent_node_creation_step}
$parse_trace->[$step]->{message} # informative message on previous step
$parse_trace->[$step]->{tree} # stringified snapshot of parse tree

PARSE_TRACE_ROUTINE

A parse_trace_routine can be passed in as a parameter that is called once at the beginning of every parse step. This overrides the parse_trace parameter, because this uses a default parse_trace_routine to record the information for each parse step.

Example:

my $a_grammar = new Parse::Stallion({ start => M(qr/a/) });

my $result = $a_grammar->parse_and_evaluate('aab',
  {parse_trace_routine => sub {
    print 'at step '.${$_[0]->{__steps_ref}}."\n";
    print 'moving forward is '.${$_[0]->{__moving_forward_ref}}."\n";
    print 'position is '.${$_[0]->{__current_position_ref}}."\n";
    }
   }
 );
 # will print trace of parse

STRINGIFIED PARSE TREE

The tree is a Parse::Stallion object having a function, that converts a tree into a string, each node consisting of one line:

$parse_info->{tree}->stringify({values=>['name','parse_match']});

Internally generated node names, from rules generated by breaking up the entered rules into subrules, will show up. The module Parse::Stallion::EBNF shows the grammar with these generated subrules. The node keys described in the section Parse Tree Nodes can be passed to stringify.

PARSE STEPS

A step is an action on a node, roughly speaking matching a regex for a 'leaf' node, or moving forward or backtracking from a node. Each step is listed in the optional parse_trace.

If the parsing reaches the maximum number of steps the parse fails (croak). The maximum number of steps can be changed via max_steps. If max_steps is set to a negative number, there is no limit on the number of steps.

$stallion->parse_and_evaluate($string, {max_steps=>200000});

"LEFT RECURSION"

Parse::Stallion checks the grammar for "left recursion" and will not build the grammar if left recursion is detected (croaks). However, there are some cases which are not possible to detect, i.e. whether a parse forward routine will change the position or a case where a regexp matches but returns an empty string such as qr/\B/ .

Parse::Stallion may encounter "left recursion" during parsing in which case the parsing stops and a message is 'croak'ed.

"Left recursion" occurs during parsing when the same non-'leaf' rule shows up a second time on the parse tree at the same position.

Illegal Case 1:

expression => AND('expression', 'plus', 'term')

Expression leads to expression leads to expression ....

Illegal Case 2:

rule_with_empty => AND('empty', 'rule_with_empty', 'other_rule')
empty => qr//

The second case is detected while building the grammar, regexp's are checked to see if they match the empty string.

Illegal Case 3:

rule_with_pf => A(L(PF(sub {return 1}))', 'rule_with_pf',
 'other_rule')

The third case will be detected during parsing.

EVALUATION

Evaluation can occur during or after parsing.

Each rule may have an evaluation subroutine tied to it. When defining a rule if there is no subroutine, a default "do nothing" subroutine is provided. Nested subrules may have a subroutine tied to them though by default they have none.

When setting up a rule, one can specify the EVALUATION subroutine by the parameter 'EVALUATION' or 'E' which in turn takes one parameter, the evaluation subroutine for that node.

Each node has a computed value that is the result of calling its evaluation routine. The returned value of the parse is the computed value of the root node.

There are two parameters to a node's evaluation routine.

The first parameter to the evaluation routine is either a hash or a string.

If the node is a leaf regexp that has a parenthesized match inside, what is matched by the first parenthesized match is the parameter. Else if the node is a leaf then what is matched by the leaf is the first parameter. Else if 'USE_STRING_MATCH' has been set for the node's rule, the substring equivalent to a join of all the matched strings of the nodes' descendants is the parameter.

For other internal nodes, the first parameter is a hash. The hash's keys are the named subrules of the node's rule, the values are the computed value of the corresponding children nodes. If a key could repeat, the value is an array reference.

The second parameter to an evaluation routine is the "parse_hash", see the section entitled Parse Hash.

By nesting a rule with an alias, the alias is used for the name of the hash parameter instead of the rule name.

Parse Hash

The parse hash is a hash ref that is passed to the evaluation, unevaluation, parse_forward, and parse_backtrack routines. It is the same hash ref throughout a specific parse so one can store values there to pass among the routines. One can pass in a hash ref, with some preset keys, to be used as the parse_hash for a given parse_and_evaluate call.

The following keys are set by the module, changing them may break the parse,

parse_this_ref # Reference to object being parsed
current_position # The current position of the object being parsed
                 #, not set for evaluation after parsing
parent_node # Node in tree for child to be created for parse_forward
            # and parse_backtrack, child has been removed before call
            # to parse_backtrack; see Parse Tree Nodes
current_node # active node in parse tree for evaluation, unevaluation,
             # see section Parse Tree Nodes
parse_match # for parse_backtrack routine, match from parse_forward
rule_name # Name of rule, may be internally generated
rule_info # Hash of rule names which have RULE_INFO set
#The following are used by the default parse_trace_routine when
#given the parameter parse_trace
__current_node_ref # Reference to current node
__current_node_name_ref # Reference to current node's rule name
__moving_forward_ref # Reference to if parse is moving forward
__moving_down_ref # Reference to if parse is moving down
__current_position_ref # Reference to current position
__message_ref # Reference to notes on previous parse step
__tree # Currently created tree
__steps_ref # Reference to number of steps taken

Any future parameters will be given double underscores at the start of their names and internal ones are not listed here.

The keys parent_node, current_node, parse_match, current_position, and rule_name are not set if only a parse_trace routine is used.

Parse Tree Nodes

A parse tree node is a hash reference composed of:

name, # Name of rule, this may be internally generated
alias, # Alias of rule in evaluation phase parameter list
steps, # Number of steps in parse when node created
parent, # Node's parent hash reference
position_when_entered, # Position when node created
position_when_completed, # Position when node completely parsed
children, # Array reference to nodes of children (undef if leaf)
child_count, # scalar(@children)
parse_match, # matched leaf string or match returned by parse_forward

In addition, there will be some keys beginning with double underscore (__).

Parse tree nodes are also passed to the parse forward and parse backtrack subroutines.

One should not change the values of a parse node hash. However, one is allowed to add key/values to the hash (avoid key names with double underscores).

Evaluation Examples

Comments refer to the parameters of the evaluation subroutine.

LEAF(qr/d\w+/, E(sub {...})) # word starting with d

L(qr/(d)\w+/, E(sub {...})) # $_[0] = 'd'

L(qr/\s*(\w+)\s*/, E(sub {...})) # white space trimmed

A(qr/a/, qr/b/, E(sub {...})) #$_[0]->{''}->['a','b']

A(qr/a/, qr/b/, E(sub {...}), USE_STRING_MATCH) # $_[0] = 'ab'

A({f=>qr/a/}, {f=>qr/b/}, E(sub {...})) # $_[0]->{'f'} = ['a','b']

A('rule1', 'rule2', 'rule3', E(sub {...}))
 #params are $_[0]->{rule1}, $_[0]->{rule2}, $_[0]->{rule3}

MATCHED_STRING

A function MATCHED_STRING is provided that returns the string matched by the node/node's descendents. It takes as an argument the $parse_hash, the second parameter.

rule => A({sub_rule_1 => qr/art/}, {sub_rule_2 => qr/hur/},
 E(sub {$matched_string = MATCHED_STRING($_[1]); ...
  # $matched_string == 'arthur' == $_[0]->{sub_rule_1} . $_[0]->{sub_rule_2}
  }));

LOCATION

A function, LOCATION is provided that takes a string reference and a position in the string and computes the line number and tab value of the given position.

my ($line, $tab);
my $loc_grammar = {
  start =>
   A(qr/....../s,
     L(qr//, E(
       sub {
        ($line, $tab) = LOCATION($_[1]->{parse_this_ref},
         $_[1]->{current_node}->{position_when_entered})
       })),
     qr/.*/s)
};
my $loc_parser = new Parse::Stallion($loc_grammar);
$loc_parser->parse_and_evaluate("ab\nd\nfghi");
# $line == 3,  $tab == 2

One can also use this function after a parse to determine where the
maximum position occured in the input string.

EVALUATION AFTER PARSING

In evaluation after parsing, the default, Parse::Stallion will evaluate the parse tree in a bottom up left to right traversal.

EVALUATION (AND UNEVALUATION) DURING PARSING

If the do_evaluation_in_parsing is set when a Parse::Stallion object is created the evaluation occurs during the parsing instead of afterwards. Alternatively, if there exists any UNEVALUATION routine, the evaluation is done during parsing.

Every time a node is matched, its evaluation routine is called. The parse_hash will have the key current_position set and the node will not yet have the key position_when_completed.

The evaluation routine may return a second parameter that tells Parse::Stallion to reject or not reject the match. This allows more control over what can be parsed.

The same node may be evaluated several times due to backtracking. One must be careful either to not change the references/objects passed in to the evaluation routine or to undo changes in an unevaluation routine. An unevaluation routine is called when backtracking reverts back to a node, the parameters are the same as for the evaluation routine. UNEVALUATION (alternatively U) are used to set the subroutine.

The following example shows a 'leaf' rule that matches all words except for those marked in the hash %keywords:

our %keywords = ('key1'=> 1, 'key2' => 1);
my %g = (
 start => A('leaf', qr/\;/),
 leaf => L(
   qr/\w+/,
   E(sub {if ($keywords{$_[0]}) {return (undef, 1)} return $_[0]}),
 ),
);
my $parser = new Parse::Stallion(\%g, {do_evaluation_in_parsing=>1});
$parser->parse_and_evaluate('key1;'); #should return false
$parser->parse_and_evaluate('key3;'); #should return true
 # ( {''=>';',leaf=>'key3'} )

EVALUATION DURING PARSING EXAMPLE

In this example, the first statement tells a truth about the number of elements in a list, the second tells whether or not the first statement is true. If the second statement is true, the string parses.

my %parsing_rules = (
 start_expression =>
  AND('two_statements', qr/\z/,
   E(sub {return $_[0]->{'two_statements'}})
 ),
 two_statements =>
   A('list_statement','truth_statement',
    EVALUATION(sub {
     if ($_[0]->{list_statement} != $_[0]->{truth_statement}) {
       return (undef, 1);
     }
     return 1;
   })
 ),
 list_statement =>
   A('count_statement', 'list',
    E(sub {
     if ($_[0]->{count_statement} == scalar(@{$_[0]->{list}})) {
       return 1;
     }
     return 0;
   })
 ),
 count_statement =>
   A(qr/there are /i,'number',L(qr/ elements in /),
    E(sub { return $_[0]->{number}; })
  ),
 number => qr/\d+/ ,
 list => A('number', M(A(qr/\,/}, 'number')),
   EVALUATION(sub {return $_[0]->{number}})
 ),
 truth_statement =>
   O({t => qr/\. that is the truth\./},
    {t=>qr/\. that is not the truth\./},
    E(sub {
     if ($_[0]->{t} =~ /not/) {
       return 0;
     }
     return 1;
   })
 ),
);

my $how_many_parser = new Parse::Stallion(
 \%parsing_rules,
 {do_evaluation_in_parsing => 1,
  start_rule => 'start_expression',
});

$result = $how_many_parser->parse_and_evaluate(
  "there are 5 elements in 5,4,3,2,1. that is the truth.");

print "$result should be 1\n";

$result = $how_many_parser->parse_and_evaluate(
  "there are 5 elements in 5,4,3,1. that is not the truth.");

print "$result should be 1\n";

$result = $how_many_parser->parse_and_evaluate(
  "there are 5 elements in 5,4,3,1. that is the truth.");

print "$result should be undef\n";

DEFAULT EVALUATION ROUTINE

If a rule does not have an evaluation routine specified, it is as if a generic subroutine is used.

Generic Evaluation Routine 1

  • If the passed in hash reference has only one key, then the value of that key in the hash reference is returned.

  • If the passed in hash reference has more than one key, then the hash reference is returned.

It is as if this routine were run:

sub {
  my $parameters = shift;
  if (ref $parameters eq 'HASH') {
    if (keys %$parameters == 1) {
      my ($key) = keys %$parameters;
      return $parameters->{$key};
    }
  }
  return $parameters;
}

Generic Evaluation Routine 2

If do_not_compress_eval is set when creating the parser, it is as if this routine were run:

sub {
  return shift;
}

Example:

my %n_eval_rules = (
 start_rule => A('term',
  M(A({plus=>qr/\s*\+\s*/}, 'term'))),
 term => A({left=>'number_or_x'},
  M(A({times=>qr/\s*\*\s*/},
   {right=>'number'}))),
 number_or_x => O('number',qr/x/),
 number => qr/\s*\d*\s*/
);

my $n_eval_parser = new Parse::Stallion(\%n_eval_rules,
 {do_not_compress_eval => 0});

$result = $n_eval_parser->parse_and_evaluate("7+4*8");

#$result contains:
{ 'plus' => [ '+' ],
  'term' => [ '7',
              { 'left' => '4',
                'right' => [ '8' ],
                'times' => [ '*' ] } ] }

my $dnce_n_eval_parser = new Parse::Stallion(\%n_eval_rules,
 {do_not_compress_eval => 1});

$result = $dnce_n_eval_parser->parse_and_evaluate("7+4*8");

#$result contains:
{ 'plus' => [ '+' ],
  'term' => [ { 'left' => {number => '7'} },
              { 'left' => {number => '4'},
                'right' => [ {number => '8'} ],
                'times' => [ '*' ] } ] }

Parameter types to Evaluation Routines

If a named parameter could appear more than once, it is passed as an array, else as a scalar. Being passed as an array could be caused by either:

  1. the name being within a 'MULTIPLE' rule which does not have maximum children of 1

  2. occurring more than once within the subrules of an 'AND' rule.

The routine which_parameters_are_arrays returns a hash of the possible values passed to an evaluation routine. For a given key, if the value is 1, the key would be passed to the evaluation routine as an array, if the value is 0, it would be passed as a scalar.

MORE COMPLICATED EVALUATION EXAMPLE

The following is a simple calculator:

 %calculator_rules = (
  start_expression => A(
    'expression', qr/\z/,
    E(sub {return $_[0]->{expression}})
   ),
  expression => A(
    'term',
     M(A('plus_or_minus', 'term')),
    E(sub {my $to_combine = $_[0]->{term};
     my $plus_or_minus = $_[0]->{plus_or_minus};
     my $value = $to_combine->[0];
     for my $i (1..$#{$to_combine}) {
       if ($plus_or_minus->[$i-1] eq '+') {
         $value += $to_combine->[$i];
       }
       else {
         $value -= $to_combine->[$i];
       }
     }
     return $value;
    })
   ),
  term => A(
    'number',
     M(A('times_or_divide', 'number')),
     E(sub {my $to_combine = $_[0]->{number};
     my $times_or_divide = $_[0]->{times_or_divide};
     my $value = $to_combine->[0];
     for my $i (1..$#{$to_combine}) {
       if ($times_or_divide->[$i-1] eq '*') {
         $value *= $to_combine->[$i];
       }
       else {
         $value /= $to_combine->[$i]; #does not check for zero
       }
     }
     return $value;
    })
  ),
  number => L(
    qr/\s*[+\-]?(\d+(\.\d*)?|\.\d+)\s*/,
    E(sub{ return $_[0]; })
  ),
  plus_or_minus => qr/\s*([\-+])\s*/,
  times_or_divide => qr/\s*([*\/])\s*/
 );

 $calculator_parser = new Parse::Stallion(\%calculator_rules,
   {start_rule => 'start_expression'
 });

 my $result = $calculator_parser->parse_and_evaluate("3+7*4");
 # $result should contain 31

my $array_p = $calculator_parser->which_parameters_are_arrays('term');
# $array_p would be {number => 1, times_or_divide => 1}

$array_p = $calculator_parser->which_parameters_are_arrays(
 'start_expression');
# $array_p would be {expression => 0}

STRING_EVALUATION

Instead of passing an evaluation subroutine, one can pass in the string of a subroutine via STRING_EVALAUATION (or SE). The string is modified so that the passed in parameters become local variables. The second parameter is optional.

Example:

a_rule => A('x_1', 'x_2', 'x_3', 'x_1'),
  SE('#comment', 'xx'));

results in the evaluation routine equivalent to:

sub {
$x_1 = $_[0]->{x_1};
$x_2 = $_[0]->{x_2};
$x_3 = $_[0]->{x_3};
$xx = MATCHED_STRING($_[1]);
#comment
}

If the rule is a leaf then $_ is set to the parameter $_[0]; else, $_ is the one unnamed rule or an array ref of the unnamed rules:

$_ = $_[0]->{''};

NO EVALUATION

Evaluation can be prevented by setting no_evaluation when the grammar is created. The return of parse_and_evaluate is always undef in this case but parse_info can be used to view the parse results.

LEAF DETAILS

Leaf rules can be set up as follows:

LEAF($leaf_arg, PARSE_FORWARD(sub{...}), PARSE_BACKTRACK(sub{...}),
 EVALUATION(sub{...}), UNEVALUATION(sub{...}), LEAF_DISPLAY($display));

If $leaf_arg is a Regexp, it is converted into a hash ref: {regex_match => $leaf_arg} for internal purposes.

If a default PARSE_FORWARD is not provided then a regexp match is attempted on the string being parsed at the current_position. Default parse_forward and parse_backtrack subroutines can be provided for leaves.

The subroutine in PARSE_FORWARD (or PF) is called when moving forwards during the parse. Its one parameter is a hash described in the section entitled Parse Hash above.

If the parsing should continue forward it should return an array with the first argument true (1), the second argument the "parse_match" to store at the node, and an optional third argument being the non-negative change in the parse position. The second and third arguments correspond to the substring matched and the length of the substring matched when parsing a string. If parsing should not continue, parse_forward should return 0.

The subroutine in PARSE_BACKTRACK (or PB) is called when backtracking has destroyed a (leaf) node. Its one parameter is the pares_hash.

The parse backtrack routine should normally return a false value. If it returns true, the parsing immediately ends in failure. The value ending the parse is returned in parse_backtrack_value. This can be used to set up a rule

pass_this_no_backtrack => L(qr//,PB(sub{return 1}))

that if encountered during parsing during a backtrack means that the parsing will end.

EVALUATION and UNEVALUATION are explained in the section 'EVALUATION'.

LEAF_DISPLAY

See Parse::Stallion::EBNF for this exported keyword.

PARSE_FORWARD and PARSE_BACKWARD on non-leaf nodes

It is possible to create PARSE_FORWARD and PARSE_BACKWARD routines on non-leaves. The PARSE_FORWARD routine executes before node creation, while the PARSE_BACKWARD routine executes after node destruction. They may change the position of the parse and the returned value of the match is stored in the node's hash "parse_match" key.

INCORPORATE

Rules from other parsers' grammars may be added to a new grammar by the parameter INCORPORATE. The prefix, which may be an empty string, is prepended to every rule name in the incorporated grammar, this can avoid conflicting rule names. Either the start rule of the incorporated grammar is a subrule in the new grammar or unreachable rules should be set.

Example:

 my %number = (number => qr/[+\-]?(\d+(\.\d*)?|\.\d+)/);

 my $number_parser = new Parse::Stallion(\%number);

 my %grammar_i = (
  expression =>
   A('number', qr/\s*\+\s*/, decimal_number,
    E(sub {return $_[0]->{number} + $_[0]->{decimal_number}})
  ),
  number => qr/\d+/
 );

 my $parser_i = new Parse::Stallion(
  \%grammar_i,
  {incorporate => [{grammar_source=>$number_parser, prefix=>'decimal_'}]}
 );

my $results_i = $parser_i->parse_and_evaluate('4 + 5.6');
#$results_i should contain 9.6

OTHER PARSING NOTES

PARSING NON-STRINGS

Four subroutines may be provided: a default 'leaf' rule matcher/modifier for when the parser is moving forward and a default 'leaf' rule "unmodifier" for when the parser is backtracking.

The third subroutine, length_routine, should return the final position of a successful parse for a given object.

my $object_parser = new Parse::Stallion(\%grammar, {
  ...
  parse_forward =>
   sub {
     my $parameters = shift;
     ...
     return ($true_if_object_matches_rule,
      $value_to_store_in_leaf_node,
      $value_equal_or_greater_than_current_position);
   },
  parse_backtrack =>
   sub {
     my $parameters = shift;
     ...
     return; #else parsing halts
    },
  length_routine =>
   sub {my ($object_ref) = @_;
      ...
     return $length_of_object_ref;
   },
});

When evaluating the parse tree, the parameters to the 'leaf' nodes are the values returned in parse_forward.

The script object_string.pl in the example directory demonstrates this section.

'LEAF' LEAF PARSE FORWARD/BACKTRACK

All 'leaf' rules need to be set up such that when the parser is moving forward and reaches a 'leaf', the 'leaf' rule attempts to match the object being parsed at the current position. If there is a match, then the current_position may increase.

When backtracking, the object being parsed should be reverted, if changed, to the state before being matched by the 'leaf' rule.

NON_DECREASING POSITION

The third value returned from parse_forward is the change in position, it must be non-negative.

The position is used to detect and prevent "left recursion" by not allowing a non-'leaf' rule to repeat at the same position. 'MULTIPLE' rules are prevented from repeating more than once at the same position.

The position also cuts down on the number of steps by allowing the parser to not repeat dead-end parses. If during the parse, the same rule is attempted a second time on the parse object at the same position, and the first parse did not succeed, the parser will begin backtracking.

STRINGS

By default, strings are matched, which, if a reference to the string instead of the string is passed in to parse_and_evaluate, is similar to that found in the test case object_string.t.

my $calculator_stallion = new Parse::Stallion({
  ...
  parse_forward =>
   sub {
    my $parameters = shift;
    my $input_string_ref = $parameters->{parse_this_ref};
    my $rule_info = $parameters->{rule_info};
    my $rule_name = $parameters->{rule_name};
    my $rule_definition = $rule_info->{$rule_name};
    my $m = $rule_definition->{nsl_regex_match};
    if ($$input_string_ref =~ s/\A($m)//) {
      return (1, $1, length($1));
    }
    return 0;
   },

  parse_backtrack =>
   sub {
    my $parameters = shift;
    my $input_string_ref = $parameters->{parse_this_ref};
    my $stored_value = $parameters->{parse_match};
    if (defined $stored_value) {
      $$input_string_ref = $stored_value.$$input_string_ref;
    }
    return;
   },

  length_routine =>
   sub {
     return length(${$_[0]});
   }
});

PARSING VERY LARGE INPUTS

In a normal course of parsing, the input string is split up and stored at each leaf node. If the input size is very large, this could cause memory problems. Note, the input string is not copied, space will be consumed by the leaves containing all the substrings and the parse tree itself.

Below are some mostly untested tips on how this can be handled:

NOT STORING VALUES IN LEAFS

By allowing the first parenthesized expression of a leaf to be empty, i.e.: qr/()leaf_reg_ex/, then the empty string will be stored at that leaf. When evaluating the leaf, the first parameter will be the empty string. One can make use of USE_STRING_MATCH.

REMOVE NODES FROM TREE DURING PARSE

One need be very careful to try this. If the parse is always moving forward and will never backtrack then after a node is evaluated, i.e. if do_parsing_in_evaluation=1, then the children of that node are no longer needed. Since the evaluation routine has access to the node and the node's children, one can delete the {children} key from the node when a node is completed. This would keep the parse tree small.

EXPORT

The following are EXPORTED from this module:

A AND E EVALUATION L LEAF LEAF_DISPLAY LOCATION MATCHED_STRING
MATCH_MIN_FIRST MATCH_ONCE M MULTIPLE O OPTIONAL OR PARSE_FORWARD
PARSE_BACKTRACK PB PF RULE_INFO R SE STRING_EVALUATION TERMINAL TOKEN U
UNEVALUATION USE_STRING_MATCH Z ZERO_OR_ONE

PERL Requirements

Parse::Stallion's installation uses Test::More and Time::Local requiring perl 5.6 or higher. Parse::Stallion should work with earlier versions of perl, neither of those modules is required outside of the test cases for installation.

VERSION

2.01

AUTHOR

Arthur Goldstein, <arthur@acm.org>

NAME

The name Stallion is a mashing of the words 'STRING' and 'TREE' of 'AND', 'LEAF', and 'OR' 'NODES'.

ACKNOWLEDGEMENTS

Damian Conway, Christopher Frenz, and Greg London.

COPYRIGHT AND LICENSE

Copyright (C) 2007-12 by Arthur Goldstein. All Rights Reserved.

This module is free software. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)

BUGS

Please email in bug reports.

TO DO AND FUTURE POSSIBLE CHANGES

Unordered And.

Remove parsing of non-strings?

If unknown parameter is passed in, flag it?

SEE ALSO

example directory (includes stallion.html, a javascript translation of an earlier version of Parse::Stallion) and test case directory t

Parse::Stallion::EBNF. Outputs grammars in more readable form. Also contains an example of how one could input a grammar from a string, similar to how many parsers take their input.

Parse::Stallion::CSV. Example of how to create a parser from specification.

Perl 6 grammars.

lex, yacc, ..., other parsers.