NAME
Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.
VERSION
version 0.927
SYNOPSIS
use Text::Parser;
my $parser = Text::Parser->new();
$parser->read(shift);
print $parser->get_records, "\n";
The above code prints the content of the file (named in the first argument) to STDOUT.
my $parser = Text::Parser->new();
$parser->add_rule(do => 'print');
$parser->read(shift);
This example does the same as the earlier one. For more complex examples, see the manual.
OVERVIEW
The need for this class stems from the fact that text parsing is one of the most common things programmers do, and yet there is no lean, simple way to do it efficiently. Most programmers still write boilerplate code with a while loop.
Instead, Text::Parser allows programmers to parse text with terse, self-explanatory rules whose structure is very similar to AWK, but which extend beyond the capability of AWK. Incidentally, AWK is one of the ancestors of Perl! One would have expected Perl to extend the capabilities of AWK, although that's not really the case. Command-line perl -lane or even perl -lan script.pl are very limited in what they can do. Programmers cannot use them for serious projects. And parsing text files in regular Perl involves writing the same while loop again. This website summarizes the options available in Perl so far.
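For reference, the boilerplate in question typically looks like the sketch below. It is plain Perl and does not use Text::Parser; the sub name collect_lines and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the hand-written loop that line-oriented
# parsing in plain Perl usually boils down to.
sub collect_lines {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        push @records, $line;        # "parse" step: keep the line as-is
    }
    return @records;
}

# Read from an in-memory string so the sketch is self-contained.
my $text = "first line\n\nsecond line\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @records = collect_lines($fh);
close $fh;
print scalar(@records), " records\n";
```

Every new text format tends to repeat the same open, loop, chomp, and push steps; only the body of the loop changes.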
With Text::Parser, a developer can focus on specifying a grammar and then simply read the file. The read method automatically runs each rule, collecting records from the text input into an array internally. Finally, get_records can retrieve the records. Thus the programmer now has the power of Perl to create complex data structures, along with the elegance of AWK to parse text files. The manuals illustrate this with examples.
CONSTRUCTOR
new
Takes optional attributes, as in the example below. See the ATTRIBUTES section for a list of the attributes and their descriptions.
my $parser = Text::Parser->new(
auto_chomp => 0,
multiline_type => 'join_last',
auto_trim => 'b',
auto_split => 1,
FS => qr/\s+/,
);
ATTRIBUTES
The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.
auto_chomp
Read-write attribute. Takes a boolean value as parameter. Defaults to 0.
print "Parser will chomp lines automatically\n" if $parser->auto_chomp;
auto_split
Read-write boolean attribute. Defaults to 0 (false). Indicates if the parser will automatically split every line into fields.
If it is set to a true value, each line will be split into fields, and a set of methods (a quick list here) becomes accessible within the save_record method. These methods are documented in Text::Parser::AutoSplit.
auto_trim
Read-write attribute. The values it can take are also shown under the new constructor. Defaults to 'n' (spaces are trimmed from neither side).
$parser->auto_trim('l'); # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)
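The effect of the four settings can be pictured with a few lines of plain Perl. This is only an illustration: trim_line is a hypothetical helper, not part of Text::Parser, and how the module itself treats a trailing newline under each setting is not modeled here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the four auto_trim settings.
sub trim_line {
    my ( $line, $mode ) = @_;
    $line =~ s/^\s+// if $mode eq 'l' || $mode eq 'b';    # left side
    $line =~ s/\s+$// if $mode eq 'r' || $mode eq 'b';    # right side
    return $line;                                         # 'n' changes nothing
}

for my $mode (qw(l r b n)) {
    printf "%s => [%s]\n", $mode, trim_line( '  some text  ', $mode );
}
```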
FS
Read-write attribute that can be used to specify the field separator to be used by the auto_split feature. It must be a regular expression reference created with the qr operator, like qr/\s+|[,]/ which will split across either spaces or commas. The default value for this argument is qr/\s+/.
The name for this attribute comes from the built-in FS variable in the popular GNU Awk program.
$parser->FS( qr/\s+\(*|\s*\)/ );
FS can be changed in your implementation of save_record, but the change takes effect only from the next line.
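To get a feel for what a given FS value does to a line, one can try it with Perl's built-in split. The sketch below is plain Perl, not the module's code; split_fields is a hypothetical helper, and trimming the line before splitting is an assumption made to avoid a leading empty field.

```perl
use strict;
use warnings;

# Split one line on a field-separator regex, as a rough model of
# what auto_split does with FS.
sub split_fields {
    my ( $line, $fs ) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim both ends first (an assumption)
    return split $fs, $line;
}

my @fields = split_fields( "alpha beta,gamma  delta\n", qr/\s+|[,]/ );
print join( '|', @fields ), "\n";
```

One thing to watch for with qr/\s+|[,]/: a comma followed by a space matches as two separate separators, so split returns an empty field between them. A pattern such as qr/\s*,\s*|\s+/ treats the comma and its surrounding spaces as a single separator.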
multiline_type
If the target text format allows line-wrapping with a continuation character, the multiline_type option tells the parser to join the wrapped lines into a single line. When setting this attribute, one must re-define two more methods: is_line_continued and join_last_line.
By default, the read-write multiline_type attribute has a value of undef, i.e., the target text format will not have wrapped lines. It can be set to either 'join_next' or 'join_last'.
$parser->multiline_type(undef);
$parser->multiline_type('join_next');
my $mult = $parser->multiline_type;
print "Parser is a multi-line parser of type: $mult" if defined $mult;
If the target format allows line-wrapping to the next line, set multiline_type to join_next.
If the target format allows line-wrapping from the last line, set multiline_type to join_last.
To "slurp" a file into a single string, set multiline_type to join_last. In this special case, you don't need to re-define the is_line_continued and join_last_line methods.
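The difference between the two types is where the continuation marker sits. The plain-Perl sketch below illustrates both styles on already-chomped lines. It is not the module's implementation: the sub names are made up, and the markers (a trailing backslash for join_next, a leading '+' for join_last) are just common conventions chosen for the example.

```perl
use strict;
use warnings;

# join_next style: a trailing backslash means "continued on the next line".
sub unwrap_join_next {
    my @lines = @_;
    my ( @out, $pending );
    for my $line (@lines) {
        $line = $pending . $line if defined $pending;
        if ( $line =~ s/\\$// ) {    # continuation marker found and removed
            $pending = $line;
        }
        else {
            push @out, $line;
            undef $pending;
        }
    }
    # A dangling continuation on the final line is kept as-is (a choice
    # made for this sketch only).
    push @out, $pending if defined $pending;
    return @out;
}

# join_last style: a leading '+' means "continues the previous line".
sub unwrap_join_last {
    my @lines = @_;
    my @out;
    for my $line (@lines) {
        if ( @out and $line =~ s/^\+// ) {
            $out[-1] .= $line;
        }
        else {
            push @out, $line;
        }
    }
    return @out;
}

print join( '|', unwrap_join_next( 'a \\', 'b', 'c' ) ), "\n";
print join( '|', unwrap_join_last( 'a', '+b', 'c' ) ),   "\n";
```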
METHODS
These are meant to be called from the ::main program or within subclasses. In general, don't override them - just use them.
add_rule
Takes a hash as input. The keys of this hash must be the attributes of the Text::Parser::Rule class constructor and the values should also meet the requirements of that constructor.
$parser->add_rule(do => '', dont_record => 1); # Empty rule: does nothing
$parser->add_rule(if => 'm/li/', do => 'print', dont_record => 1); # Prints lines with 'li'
$parser->add_rule( do => 'uc($3)' ); # Saves records of upper-cased third elements
Calling this method without any arguments will throw an exception. The method internally sets the auto_split attribute.
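For comparison, the m/li/ and uc($3) rules above do roughly what the hand-written Perl below does. This is a sketch for intuition only, not how the module evaluates rules: the sub name and sample lines are made up, splitting on whitespace mirrors the default FS, and skipping lines with fewer than three fields is an assumption.

```perl
use strict;
use warnings;

sub third_fields_upcased {
    my @lines = @_;
    my @records;
    for my $line (@lines) {
        my @field = split ' ', $line;    # fields, split on whitespace
        next if @field < 3;              # assumption: skip lines lacking a 3rd field
        push @records, uc $field[2];     # like: do => 'uc($3)'
    }
    return @records;
}

my @lines = ( 'one two three', 'list of lines here' );
print "$_\n" for grep { m/li/ } @lines;    # like: if => 'm/li/', do => 'print'
print join( ',', third_fields_upcased(@lines) ), "\n";
```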
clear_rules
Takes no arguments, returns nothing. Clears the rules that were added to the object.
$parser->clear_rules;
This is useful for re-using the parser after a read call, to parse another text with another set of rules. The clear_rules method also clears the rules set up by BEGIN_rule and END_rule.
BEGIN_rule
Takes a hash input like add_rule, but the if and continue_to_next keys will be ignored.
$parser->BEGIN_rule(do => '~count = 0;');
Since any if key is ignored, the do key is always evaluated. Multiple calls to BEGIN_rule will append to the previous calls; meaning, the actions of previous calls will be included.
The BEGIN block is mainly used to initialize some variables. So by default dont_record is set true. The user can change this and set dont_record to false, thus forcing a record to be saved.
END_rule
Takes a hash input like add_rule, but the if and continue_to_next keys will be ignored. Similar to BEGIN_rule, but the actions in the END_rule will be executed at the end of the read method.
$parser->END_rule(do => 'print ~count, "\n";');
Since any if key is ignored, the do key is always evaluated. Multiple calls to END_rule will append to the previous calls; meaning, the actions of previous calls will be included.
The END block is mainly used to do final processing of collected records. So by default dont_record is set true. The user can change this and set dont_record to false, thus forcing a record to be saved.
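The order in which BEGIN, per-line, and END actions run can be modeled with plain closures. The sketch below is a mental model only and does not use Text::Parser: run_rules is a hypothetical function, and the %stash hash stands in for variables such as ~count.

```perl
use strict;
use warnings;

sub run_rules {
    my ( $lines, %rules ) = @_;
    my %stash;
    $_->( \%stash ) for @{ $rules{begin} };    # once, before the first line
    for my $line (@$lines) {
        $_->( \%stash, $line ) for @{ $rules{line} };    # once per line
    }
    $_->( \%stash ) for @{ $rules{end} };      # once, after the last line
    return \%stash;
}

my $stash = run_rules(
    [ 'a', 'b', 'c' ],
    begin => [ sub { $_[0]{count} = 0 } ],              # initialize
    line  => [ sub { $_[0]{count}++ } ],                # count each line
    end   => [ sub { print $_[0]{count}, "\n" } ],      # report
);
```

Each phase holds a list of actions, so adding another action to a phase appends to that list rather than replacing it, which matches the appending behavior described above.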
read
Takes a single optional argument that can be either a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class.
$parser->read($filename); # Read the file
$parser->read(\*STDIN); # Read the filehandle
The above could also be done in two steps if the developer so chooses.
$parser->filename($filename);
$parser->read(); # equiv: $parser->read($filename)
$parser->filehandle(\*STDIN);
$parser->read(); # equiv: $parser->read(\*STDIN)
The method returns once all records have been read, or if an exception is thrown, or if reading has been aborted with the abort_reading method.
Any close operation will be handled (even if an exception is thrown), as long as read is called with a file name parameter. This does not apply if you call it with a filehandle or GLOB parameter.
$parser->read('myfile.txt'); # Will close file automatically
open MYFH, "<myfile.txt" or die "Can't open file myfile.txt: $!";
$parser->read(\*MYFH); # Will not close MYFH
close MYFH;
Note: To extend the class to other text formats, override save_record.
filename
Takes an optional string argument containing the name of a file. Returns the name of the file that was last opened, if any. Returns undef if no file has been opened.
print "Last read ", $parser->filename, "\n";
The value stored is "persistent" - meaning that the method remembers the last file that was read.
$parser->read(shift @ARGV);
print $parser->filename(), ":\n",
"=" x (length($parser->filename())+1),
"\n",
$parser->get_records(),
"\n";
A read call with a filehandle will clear the last file name.
$parser->read(\*MYFH);
print "Last file name is lost\n" if not defined $parser->filename();
filehandle
Takes an optional argument that is a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. Returns the filehandle last saved, or undef if none was saved.
my $fh = $parser->filehandle();
Like filename, filehandle is also "persistent". Its old value is lost when either filename is set, or read is called with a filename.
$parser->read(\*STDIN);
my $lastfh = $parser->filehandle(); # Will return glob of STDIN
lines_parsed
Takes no arguments. Returns the number of lines last parsed. Every call to read causes the value to be auto-reset.
print $parser->lines_parsed, " lines were parsed\n";
has_aborted
Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.
print "Aborted\n" if $parser->has_aborted();
get_records
Takes no arguments. Returns an array containing all the records saved by the parser.
my $i = 0;
foreach my $record ( $parser->get_records ) {
    $i++;
    print "Record: $i: ", $record, "\n";
}
pop_record
Takes no arguments and pops the last saved record.
my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);
last_record
Takes no arguments and returns the last saved record. Leaves the saved records untouched.
my $last_rec = $parser->last_record;
USE ONLY IN RULES AND SUBCLASS
Do NOT override these methods. They are valid only within a subclass, inside the user-implementation of methods described under OVERRIDE IN SUBCLASS.
this_line
Takes no arguments, and returns the current line being parsed. For example:
sub save_record {
    my $self = shift;
    # ...
    do_something($self->this_line);
    # ...
}
abort_reading
Takes no arguments. Returns 1. To be used only in the derived class to abort read in the middle.
sub save_record {
    my $self = shift;
    # ...
    $self->abort_reading if some_condition($self->this_line);
    # ...
}
push_records
This is useful if one needs to implement an include-like command in some text format. The example below illustrates this.
package OneParser;
use Moose;
extends 'Text::Parser';
sub save_record {
    my $self = shift;
    # ...
    # Under some condition:
    my $parser = AnotherParser->new();
    $parser->read($some_file);
    # Push the other parser's records into this parser
    $self->push_records($parser->get_records);
    # ...
}
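The include idea itself can be tried out without any parser class. The sketch below is plain Perl over hypothetical in-memory "files"; the include syntax, file names, and the parse_with_includes sub are all made up for illustration, and there is no protection against files that include each other.

```perl
use strict;
use warnings;

# Hypothetical in-memory "files"; a line of the form "include NAME"
# pulls in the records of another file.
my %files = (
    'main.txt'  => [ 'alpha', 'include extra.txt', 'omega' ],
    'extra.txt' => [ 'beta', 'gamma' ],
);

sub parse_with_includes {
    my ($name) = @_;
    my @records;
    for my $line ( @{ $files{$name} } ) {
        if ( $line =~ /^include\s+(\S+)$/ ) {
            # Parse the other file separately, then push its records
            # into the records of the including file.
            push @records, parse_with_includes($1);
        }
        else {
            push @records, $line;
        }
    }
    return @records;
}

print join( ' ', parse_with_includes('main.txt') ), "\n";
```

The included records land at the position of the include line, which is the effect push_records is meant to give when called at that point in save_record.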
Other methods available on auto_split
When the auto_split attribute is on (or if it is turned on later), additional methods become available. They are documented in Text::Parser::AutoSplit.
OVERRIDE IN SUBCLASS
The following methods should never be called in the ::main program. They may be overridden (or re-defined) in a subclass.
save_record
This method may be re-defined in a subclass to parse the target text format. The default implementation takes a single argument and stores it as a record. If no arguments are passed, undef is stored as a record. Note that, unlike earlier versions of Text::Parser, it is not required to override this method in your derived class. You can simply use the rules instead.
For a developer re-defining save_record, in addition to this_line, six additional methods become available if the auto_split attribute is set. These methods are described in greater detail in Text::Parser::AutoSplit, and they are accessible only within save_record.
Note: Developers may store records in any form - string, array reference, hash reference, complex data structure, or an object of some class. The program that reads these records using get_records has to interpret them. So developers should document the records created by their own implementation of save_record.
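As an illustration of a non-string record, the sketch below turns "key = value" lines into hash references and then consumes them. It is plain Perl with made-up names (make_record, the key and value fields); in a real subclass the hash reference would be what save_record stores and what get_records later hands back.

```perl
use strict;
use warnings;

# Hypothetical record format: one hash reference per "key = value" line.
sub make_record {
    my ($line) = @_;
    my ( $key, $value ) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/
        or return;    # not a "key = value" line: no record
    return { key => $key, value => $value };
}

my @records = grep { defined } map { make_record($_) } ( 'name = demo', 'size=3' );

# The consumer has to know the record layout, hence the advice to document it.
printf "%s is %s\n", $_->{key}, $_->{value} for @records;
```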
PARSING LINE-WRAPPED FILES
These methods are useful when parsing line-wrapped files, i.e., if the target text format allows wrapping the content of one line into multiple lines. In such cases, you should extend the Text::Parser class and override the following methods.
is_line_continued
If the target text format supports line-wrapping, the developer must override and implement this method. Your method should take a string argument and return a boolean indicating if the line is continued or not.
There is a default implementation shipped with this class with return values as follows:
multiline_type | Return value
------------------+---------------------------------
undef | 0
join_last | 0 for first line, 1 otherwise
join_next | 1
join_last_line
Again, the developer should implement this method. This method should take two strings, join them while removing any continuation characters, and return the result. The default implementation just concatenates two strings and returns the result without removing anything (not even chomp). See Text::Parser::Multiline for more on this.
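A pair of overrides for a format that wraps lines with a trailing backslash might look like the sketch below. To keep it runnable on its own, the package does not actually extend Text::Parser; the package name is made up, and the argument order of join_last_line (earlier line first, current line second) is an assumption that should be checked against Text::Parser::Multiline.

```perl
use strict;
use warnings;

package My::BackslashWrapped;

# In a real parser this package would also have:
#     use Moose;
#     extends 'Text::Parser';
# and the parser would be created with multiline_type => 'join_next'.

# True if the line ends with a backslash (optionally followed by whitespace).
sub is_line_continued {
    my ( $self, $line ) = @_;
    return $line =~ /\\\s*$/ ? 1 : 0;
}

# Drop the continuation backslash from the earlier line, then append.
sub join_last_line {
    my ( $self, $last, $line ) = @_;
    $last =~ s/\\\s*$//;
    return $last . $line;
}

package main;

my $joined = My::BackslashWrapped->join_last_line( "total = 1 + \\\n", "2\n" );
print $joined;
```

Unlike the default implementation, this join_last_line removes the backslash and the newline that follows it, so the joined record reads as one logical line.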
EXAMPLES
You can find example code in Text::Parser::Manual::ComparingWithNativePerl.
THINGS TO BE DONE
This package is still a work in progress. Future versions are expected to include features to:
read and parse from a buffer
automatically uncompress input
suggestions welcome ...
Contributions and suggestions are welcome and properly acknowledged.
SEE ALSO
Text::Parser::Manual - Read this manual
The AWK Programming Language - by Aho, Kernighan, and Weinberger.
Text::Parser::Errors - documentation of the exceptions this class throws
Text::Parser::Multiline - how to read line-wrapped text input
BUGS
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
CONTRIBUTORS
H.Merijn Brand - Tux <h.m.brand@xs4all.nl>
Mohammad S Anwar <mohammad.anwar@yahoo.com>