NAME
Parse::FSM::Lexer - Companion Lexer for the Parse::FSM parser
SYNOPSIS
use Parse::FSM::Lexer;
$lex = Parse::FSM::Lexer->new;
$lex = Parse::FSM::Lexer->new(@files);
$lex->add_path(@dirs); @dirs = $lex->path;
$full_path = $lex->path_search($file);
$lex->from_file($filename);
$lex->from_list(@input); 
$lex->from_list(sub {});
$lex->get_token;
$lex->error($message); 
$lex->warning($message); 
$lex->file; 
$lex->line_nr;
# in a nearby piece of code
use MyParser; # isa Parse::FSM::Driver;
my $parser = MyParser->new;
$parser->input(sub {$lex->get_token});
eval {$parser->parse}; $@ and $lex->error($@);
DESCRIPTION
This module implements a generic tokenizer that can be used by Parse::FSM parsers, or on its own, independently of the parser.
It supports recursive file includes and keeps track of the current file name and line number. It also maintains a list of directories to search for input files.
The get_token method can be called by the input method of the parser to retrieve the next input token to parse.
The module can be used directly if the supplied tokenizer is enough for the application, but usually a derived class has to be written implementing a custom version of the tokenizer method.
METHODS - SETUP
new
Creates a new object. If an argument list is given, calls from_file for each of the files, starting from the last, so that the files are read in the given order.
METHODS - SEARCH PATH FOR FILES
path
Returns the list of directories to search in sequence for source files.
add_path
Adds the given directories to the path searched for include files.
path_search
Searches for the given file name in the path created by add_path and returns the first full path name where the file is found.
Returns the given input file name unchanged if:
the file is found in the current directory; or
the file is not found in any of the path directories.
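The lookup rules above can be sketched in plain Perl. This is only an illustration of the described behaviour, not the module's code: search_in_path is a hypothetical helper, and the use of File::Spec and the -f file test are assumptions about how such a search might be written.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of Parse::FSM::Lexer): mimics the rules
# described for path_search over an explicit list of directories.
sub search_in_path {
    my ($file, @dirs) = @_;

    # Found relative to the current directory: name returned unchanged.
    return $file if -f $file;

    # Otherwise the first directory that contains the file wins.
    for my $dir (@dirs) {
        my $full = File::Spec->catfile($dir, $file);
        return $full if -f $full;
    }

    # Not found anywhere: name returned unchanged, so that the caller
    # can report the failure when it tries to open the file.
    return $file;
}
```

With the module itself, the equivalent call would be $lex->path_search($file) after the directories have been registered with $lex->add_path(@dirs).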
METHODS - INPUT STREAM
from_file
Saves the current input context, searches for the given input file name in the path, opens the file and sets up the object to read each line in sequence. At the end of the file, input resumes at the point where it was when from_file was called.
Dies if the input file cannot be read, or if a file is included recursively, to avoid an infinite include loop.
from_list
Saves the current input context and sets up the object to read each element of the passed input list. Each element is either a text string or a code reference to an iterator that returns text strings. The iterator returns undef at the end of input.
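As an illustration of the iterator form, the following sketch builds a closure that hands out one text string per call and undef once the input is exhausted. make_line_iterator is a hypothetical helper introduced here for the example, not something the module provides.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps a list of text strings in an iterator closure.
sub make_line_iterator {
    my @lines = @_;
    # shift returns undef once @lines is empty, which signals end of input.
    return sub { return shift @lines };
}

my $iter = make_line_iterator("a = 1\n", "b = 2\n");
while (defined(my $text = $iter->())) {
    print $text;
}
```

An iterator of this shape is what $lex->from_list($iter) expects when it is given a code reference instead of plain strings.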
METHODS - INPUT
get_token
Retrieves the next token from the input as an array reference containing token type and token value.
Returns undef on end of input.
tokenizer
Method responsible for matching the next token from the given input string.
This method can be overridden by a child class in order to implement a different set of tokens to be retrieved from the input.
It is implemented with features from the Perl 5.010 regex engine:
- one big regex with /\G.../gc to match from where the last match ended; the string to match is passed as a scalar reference, so that the position of the last match, pos(), is preserved;
- one sequence of (?:...|...) alternations for each token to be matched;
- using (?>...) for each token to make sure there is no backtracking;
- using capturing parentheses and embedded code evaluation (?{ [TYPE => $^N] }) to return the token value from the regex match;
- using $^R as the value of the matched token;
- as the regex engine is not reentrant, any operation that may call another regex match (e.g. recursive file include) cannot be done inside the (?{ ... }) code block, and is done after the regex match by checking $^R for special tokens;
- using undef as the return of $^R to ignore a token, e.g. white space.
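The technique listed above can be sketched with a much smaller token set. The code below is not the module's tokenizer: next_token is a hypothetical stand-alone function, and its three token rules (NUM, NAME, single character) are simplified assumptions chosen only to show how \G, /gc, (?>...), (?{ ... }), $^N and $^R fit together.

```perl
use strict;
use warnings;

# Hypothetical mini-tokenizer. The input is passed as a scalar reference
# so that pos() of the caller's string survives between calls.
sub next_token {
    my ($rtext) = @_;
    while (1) {
        # \G anchors at pos(); /c keeps pos() unchanged when the match
        # fails, which here happens only at the end of the string.
        $$rtext =~ m{
            \G
            (?:
                (?> \s+ )                          (?{ undef })
              | (?> ( [0-9]+ )                     (?{ [NUM  => $^N] }) )
              | (?> ( [A-Za-z_][A-Za-z_0-9]* )     (?{ [NAME => $^N] }) )
              | (?> ( . )                          (?{ [$^N  => $^N] }) )
            )
        }gcxs or return;

        # $^R holds the result of the last code block that was run.
        my $token = $^R;
        return $token if defined $token;   # undef means "ignored", e.g. white space
    }
}

my $text = "x1 = 42";
while (my $token = next_token(\$text)) {
    print "$token->[0] => $token->[1]\n";
}
```

Because every alternative ends in a code block, $^R is always set by the branch that matched, and anything that needs a further regex match can be done safely after the match returns.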
The default tokenizer recognizes and returns the following token types:
- [STR => $value]
  Perl-like single or double quoted string; $value contains the string without the quotes and with any backslash escapes resolved. The string cannot span multiple input lines.
- [NUM => $value]
  Perl-like integer in decimal, hexadecimal, octal or binary notation; $value contains the decimal value of the integer.
- [NAME => $name]
  Perl-like identifier name, i.e. a word starting with a letter or underscore and followed by letters, underscores or digits.
- [$token => $token]
  All other characters except white space are returned in the form [$token => $token], where $token is a single character or one of the following composed tokens: << >> == != >= <=
- white space
  All white space is ignored, i.e. the tokenizer returns undef.
- [INCLUDE => $file]
  Returned when a #include statement is recognized; causes the lexer to recursively include the file at the current input stream location.
- [INPUT_POS => $file, $line_nr, $line_inc]
  Returned when a #line statement is recognized; causes the lexer to set the current input location to the given $file, $line_nr and $line_inc.
- [ERROR => $message]
  Causes the lexer to call error with the given error message; can be used when the input cannot be tokenized.
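For the NUM token, the conversion from a Perl-like integer literal to its decimal value might look like the sketch below. integer_value is a hypothetical helper, and relying on Perl's built-in oct() is an assumption made for the example; the module may compute the value differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the module: converts the text of a
# Perl-like integer literal to its numeric value.
sub integer_value {
    my ($text) = @_;
    # oct() understands the 0x (hexadecimal) and 0b (binary) prefixes and
    # treats a plain leading zero as octal; anything else is decimal.
    return $text =~ /^0[0-7xb]/ ? oct($text) : $text + 0;
}

printf "%s => %d\n", $_, integer_value($_) for qw(255 0xFF 0377 0b11111111);
```

All four literals in the usage line denote the same number, so each printed line shows the same decimal value.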
METHODS - INPUT LOCATION AND ERRORS
file
Returns the current input file, or undef if reading from a list.
line_nr
Returns the current input line number, starting at 1.
line_inc
Returns the increment applied to the line number on each newline found, usually 1.
error
Dies with the given error message, indicating the place in the input source file where the error occurred.
warning
Warns with the given warning message, indicating the place in the input source file where the warning occurred.
AUTHOR, BUGS, FEEDBACK, LICENSE, COPYRIGHT
See Parse::FSM