NAME
Language::FormulaEngine::Parser - Create parse tree from an input string
VERSION
version 0.08
SYNOPSIS
my $parse_tree= Language::FormulaEngine::Parser->new->parse($string);
DESCRIPTION
This class scans tokens from an input string and builds a parse tree. In compiler terminology, it is both a Scanner and Parser. It performs a top-down recursive descent parse, because this is easy and gives good error messages. It only parses strings, but leaves room for subclasses to implement streaming. By default, the parser simply applies a Grammar to the input, without checking whether the functions or variables exist, but can be subclassed to do more detailed analysis during the parse.
The generated parse tree is made up of Function nodes (each infix operator is converted to a named function) and each Function node may contain Symbols, Strings, Numbers, and other Function nodes. The parse tree can be passed to the Evaluator for instant execution, or passed to the Compiler to generate an optimized perl coderef. The parse tree is lightweight, and does not include token/context information; this could also be added by a subclass.
PUBLIC API
parse
Parse a new input text, updating all derived attributes with the result of the operation. It returns the value of "parse_tree" (which is undef if the parse failed). On failure, the exception is stored in "error" and other attributes like "token_pos" may contain useful diagnostic information.
parse_tree
This holds the generated parse tree, or undef
if the parse failed. See "Parse Nodes".
error
This is undef
if the parse succeeded, else an error message describing the syntax that ended the parse.
functions
A set (hashref) of all function names encountered during the parse.
symbols
A set (hashref) of all non-function symbols encountered. (variables, constnts, etc.)
reset
Clear the results of the previous parse, to re-use the object. Returns $self
for chaining.
deparse
my $formula_text= $parser->deparse($tree);
Return a canonical formula text for the parse tree, or a parse tree that you supply.
EXTENSIBLE API
These methods and attributes are documented for purposes of subclassing the parser.
input
The input string being scanned. Code within the parser should access this as $self->{input}
for efficiency.
input_pos
Shortcut for pos($self->{input})
.
token_type
Type of current token scanned from input
. Code within the parser should access this as $self->{token_type}
for efficiency.
token_value
Value of current token scanned from input
, with escape sequences and etc resolved to a sensible perl value. Code within the parser should access this as $self->{token_value}
for efficiency.
token_pos
An offset within input
where this token started. Code within the parser should access this as $self->{token_pos}
for efficiency.
next_token
Advance to the next token, replacing the values of token_
variables and updating input_pos
. Returns the token_type, of which all are true except EOF which has a type of 0
, so this also means the function returns true if it parsed a token and false if it reached EOF. It dies if no token could be parsed. If you call next_token again after the eof token, it throws an exception.
This method is a wrapper around "scan_token". Override that method to add new token types.
scan_token
Pattern-match the next token, and either return $type => $value
or an empty list if the syntax is invalid. This is intended to be overridden by subclasses.
consume_token
return $self->consume_token if $self->{token_type} eq $desired_type;
This is a shorthand for returning the current token_value
while also calling next_token
.
token_context
my $text= $self->token_context(%options);
Default behavior generates a string like:
"'blah blah' on line 15, char 12"
Passing token_context(multiline => 1)
generates a string like
"Expected something else at line 15, char 16\n" .
"blah blah blah token blah blah\n" .
" ^^^^^\n"
Multiline additionally takes arguments as described in "format_context_multiline" in Language::FormulaEngine::Parser::ContextUtil.
GRAMMAR
Parse Rules
The default grammar implements the following rules:
expr ::= or_expr
or_expr ::= and_expr ( 'or' and_expr )*
and_expr ::= not_expr ( 'and' not_expr )*
not_expr ::= ( 'not' | '!' ) cmp_expr | cmp_expr
cmp_expr ::= sum_expr ( ( '=' | '==' | '<>' | '\u2260' | '<' | '<=' | '>' | '>=' ) sum_expr )*
sum_expr ::= prod_expr ( ('+' | '-') prod_expr )*
prod_expr ::= ( unit_expr ('*' | '/') )* unit_expr
unit_expr ::= '-' unit_expr | Identifier '(' list ')' | '(' (expr|list) ')' | Identifier | Number | String
list ::= expr ( ',' expr )* ','?
ident
, num
, str
, and all the punctuation symbols are tokens.
The parser uses a Recursive Descent algorithm implemented as the following method calls. Each method consumes tokens from $self
and return a "PARSE NODES":
- parse_expr
- parse_or_expr
- parse_and_expr
- parse_not_expr
- parse_cmp_expr
- parse_sum_expr
- parse_prod_expr
- parse_unit_expr
- parse_list
Token Types
'Number'
-
All the common decimal representations of integers and floating point numbers which perl can parse. Optional decimals and decimal point followed by decimals and optional exponent, ending at either the end of the input or a non-alphanumeric.
'String'
-
A single-quoted or double-quoted string, treating a double occurrence of the quote character to mean a literal quote character. ("Pascal style")
'apostrophes are''nt hard'
There are no escape sequences though, so to get control characters or awkward unicode into a string you need something like:
concat("smile ",char(0x263A))
which depends on those functions being available in the namespace.
- Keywords...
-
Keywords include the "word" tokens like 'OR', but also every text literal seen in a parse rule such as operators and punctuation. The
token_type
of the keyword is the canonical version of the keyword, and thetoken_value
is the actual text that was captured. The pattern matches the longest keyword possible. 'Identifier'
-
Any alpha (or underscore) followed by any run of alphanumerics, (including underscore and period).
Customizing the Token Scanner
The tokens are parsed using a series of regex tests. The regexes and the code that handles a match of that regex are found in package attribute "scanner_rules". These regexes and code fragments get lazily compiled into a package method on the first use (per package). Meanwhile, several of those regex are built from other package attributes.
- scanner_rules
-
This package method returns a list (not arrayref) of ordered elements of the form
[ $name, $regex, $code_fragment, \%vars ]
. You can subclass this method to inspect the rules (probably based on$name
) and replace the regexes, or alter the handler code, or add/remove your own rules. The regexes are attempted in the order they appear in this list. You do not need to use "\G" or "/gc" on these regexes because those are added automatically during compilation. - keyword_map
-
This package method returns a hashref of all known keywords, mapped to their canonical form. So for instance, a key of
'<>'
with a value of'!='
. These tokens automatically become the scanner rule namedKeywords
. In turn, the contents of this hashref include the "cmp_operators", "math_operators", "logic_operators", and "list_operators" which can be overridden separately.This method is called once during the compilation of "scan_token", and the result is then made into a constant and referenced by the compiled method, so dynamic changes to the output of this method will be ignored.
- cmp_operators
-
Package method that returns a list of comparison operators, like '<', '>=', etc.
- math_operators
-
Package method that returns a list of math operators, like '*', '+', etc.
- logic_operators
-
Package method that returns a list of keywords like 'and', 'or', etc.
- list_operators
-
Package method that returns a list of '(', ')', ','
Parse Nodes
The parse tree takes a minimalist approach to node classification. In this default implementation, number values, string values, and symbolic references have just a simple wrapper around the value, and function calls are just a pair of function name and list of arguments. All language operators are represented as function calls.
A blessed node only needs to support one method: ->evaluate($namespace)
.
The class name of the blessed nodes should be ignored. A function is anything which can("function_name")
, a string is anything which can("string_value")
, a number is anything which can("number_value")
and a symbolic reference is anything which can("symbolic_name")
.
Subclasses of Parser should implemnt new node types as needed. You probable also need to update "deparse".
The parser rules (parse_X_expr
methods) create nodes by the following methods on the Parser class, so that you can easily subclass Parser
and override which class of node is getting created.
- new_call
-
$node= $parser->new_call( $function_name, $parameters );
Generate a node for a function call. The returned node has attributes
function_name
andparameters
- new_symbol
-
$node= $parser->new_symbol($symbol_name);
A reference to a symbolic value (i.e. variable or constant). It has one attribute
symbol_name
. - new_string
-
$node= $parser->new_string($string_value);
A string literal. It has an attribute
string_value
holding the raw value. - new_number
-
$plain_scalar= $parser->new_number($value);
A numeric constant. It has an attribute
number_value
holding the raw value. - get_negative
-
$negative_node= $parser->get_negative( $node );
Utility method to get the "opposite of" a parse node. By default, this wraps it with the function
'negative'
, unless it already was that function then it unwraps the parameter. It performs simple negation on numbers.
AUTHOR
Michael Conrad <mconrad@intellitree.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2023 by Michael Conrad, IntelliTree Solutions llc.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.