NAME
Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.
VERSION
version 0.920
SYNOPSIS
use Text::Parser;
my $parser = Text::Parser->new();
$parser->read(shift);
print $parser->get_records, "\n";
The above code reads the first command-line argument as a string and, assuming it is the name of a text file, prints the content of the file to STDOUT. If the string is not the name of a text file, it throws an exception and exits.
package MyParser;
use Moose;
extends 'Text::Parser';
# use parent 'Text::Parser';
# This will also work, but the Moose based class may be easier to implement
sub save_record {
my $self = shift;
## ...
}
package main;
my $parser = MyParser->new(auto_split => 1, auto_chomp => 1, auto_trim => 'b');
$parser->read(shift);
foreach my $rec ($parser->get_records) {
## ...
}
The above example shows how Text::Parser can be easily extended to parse a specific text format.
RATIONALE
Text parsing is perhaps the single most common thing that almost every Perl program does. Yet we don't have a lean, flexible text parsing utility. Ideally, the developer should only have to specify the "grammar" of the text file she intends to parse. Everything else, like opening a file handle, closing the file handle, tracking line-count, joining continued lines into one, reporting any errors in line continuation, trimming white space, splitting each line into fields, etc., should be automatic.
Unfortunately, most file-parsing code looks like this:
open FH, "<$fname";
my $line_count = 0;
while (<FH>) {
$line_count++;
chomp;
$_ = trim $_; ## From String::Util
my (@fields) = split /\s+/;
# do something for each line ...
}
close FH;
Note that a developer may have to repeat all of the above if she has to read another file with different content or format. And if the target text format allows line-wrapping with a continuation character, it isn't easy to implement it well with this while loop.
With Text::Parser, developers can focus on specifying the grammar and simply use the read method. Just extend this class with extends (or inherit using the parent pragma), and override one method (save_record). Voila! You have a parser. The EXAMPLES section illustrates how easy this can be.
OVERVIEW
Text::Parser is a format-agnostic text parsing base class. Derived classes can specify the format-specific syntax they intend to parse.
CONSTRUCTOR
new
Takes optional attributes, as in the example below. See the ATTRIBUTES section for a list of the attributes and their descriptions.
my $parser = Text::Parser->new(
auto_chomp => 0,
multiline_type => 'join_last',
auto_trim => 'b',
auto_split => 1,
FS => qr/\s+/,
);
ATTRIBUTES
The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.
auto_chomp
Read-write attribute. Takes a boolean value as parameter. Defaults to 0.
print "Parser will chomp lines automatically\n" if $parser->auto_chomp;
auto_split
Read-write boolean attribute. Defaults to 0 (false). Indicates if the parser will automatically split every line into fields.
If it is set to a true value, each line will be split into fields, and a set of methods (see "Other methods available on auto_split") becomes accessible within the save_record method. These methods are documented in Text::Parser::AutoSplit.
auto_trim
Read-write attribute. The values it can take are also shown under the new constructor. Defaults to 'n' (white space is trimmed from neither side).
$parser->auto_trim('l'); # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)
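The four values can be pictured with plain Perl substitutions. The sketch below only illustrates what each value means; trim_line is a hypothetical helper that is not part of Text::Parser, and the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): mimics the four auto_trim values.
sub trim_line {
    my ( $line, $mode ) = @_;
    $line =~ s/^\s+// if $mode eq 'l' || $mode eq 'b';    # strip leading white space
    $line =~ s/\s+$// if $mode eq 'r' || $mode eq 'b';    # strip trailing white space
    return $line;                                         # 'n' leaves the line untouched
}

# Show the effect of each value on the same input, with brackets marking the ends.
for my $mode (qw(l r b n)) {
    printf "%s => [%s]\n", $mode, trim_line( '  some text  ', $mode );
}
```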
FS
Read-write attribute that can be used to specify the field separator along with the auto_split attribute. It must be a regular expression created with qr, like qr/\s+|[,]/ which will split across either spaces or commas. The default value for this attribute is qr/\s+/.
The name for this attribute comes from the built-in FS variable in the popular GNU Awk program.
$parser->FS( qr/\s+\(*|\s*\)/ );
FS can be changed in your implementation of save_record, but the change takes effect only from the next line.
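To get a feel for what a given separator does, it can be tried with Perl's built-in split before handing it to the parser. This is a minimal sketch, assuming the parser applies FS much as split does; split_fields is a hypothetical helper, not a Text::Parser method.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): split one line on a qr// separator.
sub split_fields {
    my ( $line, $fs ) = @_;
    return split $fs, $line;
}

# The separator from the text above: runs of white space, or a single comma.
my @fields = split_fields( 'alpha beta,gamma', qr/\s+|[,]/ );
print join( '|', @fields ), "\n";    # prints alpha|beta|gamma
```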
multiline_type
If the target text format allows line-wrapping with a continuation character, the multiline_type option tells the parser to join the wrapped lines into a single line. When setting this attribute, one must usually re-define two more methods (is_line_continued and join_last_line). See Example 4 under EXAMPLES.
By default, the read-write multiline_type attribute has a value of undef, i.e., the target text format will not have wrapped lines. It can be set to either 'join_next' or 'join_last'.
$parser->multiline_type(undef);
$parser->multiline_type('join_next');
my $mult = $parser->multiline_type;
print "Parser is a multi-line parser of type: $mult" if defined $mult;
If the target format allows line-wrapping to the next line, set multiline_type to join_next. The "Continue with character" example illustrates this case.
If the target format allows line-wrapping from the last line, set multiline_type to join_last. The "Simple SPICE line joiner" example illustrates this case.
To "slurp" a file into a single string, set multiline_type to join_last. In this special case, you don't need to re-define the is_line_continued and join_last_line methods. See the "Trivial line-joiner" example.
METHODS
These are meant to be called from the ::main program or within subclasses. In general, don't override them - just use them.
read
Takes a single optional argument that can be either a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class.
$parser->read($filename); # Read the file
$parser->read(\*STDIN); # Read the filehandle
The above could also be done in two steps if the developer so chooses.
$parser->filename($filename);
$parser->read(); # equiv: $parser->read($filename)
$parser->filehandle(\*STDIN);
$parser->read(); # equiv: $parser->read(\*STDIN)
The method returns once all records have been read, or if an exception is thrown, or if reading has been aborted with the abort_reading method.
Any close operation will be handled (even if an exception is thrown), as long as read is called with a file name parameter - not if you call it with a file handle or GLOB parameter.
$parser->read('myfile.txt'); # Will close file automatically
open MYFH, "<myfile.txt" or die "Can't open file myfile.txt: $!";
$parser->read(\*MYFH); # Will not close MYFH
close MYFH;
Note: To extend the class to other text formats, override save_record.
filename
Takes an optional string argument containing the name of a file. Returns the name of the file that was last opened, if any. Returns undef if no file has been opened.
print "Last read ", $parser->filename, "\n";
The value stored is "persistent" - meaning that the method remembers the last file that was read.
$parser->read(shift @ARGV);
print $parser->filename(), ":\n",
"=" x (length($parser->filename())+1),
"\n",
$parser->get_records(),
"\n";
A read call with a filehandle will clear the last file name.
$parser->read(\*MYFH);
print "Last file name is lost\n" if not defined $parser->filename();
filehandle
Takes an optional argument that is a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. Returns the filehandle last saved, or undef if none was saved.
my $fh = $parser->filehandle();
Like filename, filehandle is also "persistent". Its old value is lost when either filename is set, or read is called with a filename.
$parser->read(\*STDIN);
my $lastfh = $parser->filehandle(); # Will return glob of STDIN
lines_parsed
Takes no arguments. Returns the number of lines last parsed. Every call to read resets the value automatically.
print $parser->lines_parsed, " lines were parsed\n";
has_aborted
Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.
print "Aborted\n" if $parser->has_aborted();
get_records
Takes no arguments. Returns an array containing all the records saved by the parser.
my $i = 0;
foreach my $record ( $parser->get_records ) {
$i++;
print "Record: $i: ", $record, "\n";
}
pop_record
Takes no arguments and pops the last saved record.
my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);
last_record
Takes no arguments and returns the last saved record. Leaves the saved records untouched.
my $last_rec = $parser->last_record;
FOR USE IN SUBCLASS ONLY
Do NOT override these methods. They are valid only within a subclass, inside the user-implementation of methods described under OVERRIDE IN SUBCLASS.
this_line
Takes no arguments, and returns the current line being parsed. For example:
sub save_record {
my $self = shift;
# ...
do_something($self->this_line);
# ...
}
abort_reading
Takes no arguments. Returns 1. To be used only in the derived class to abort read in the middle. See "Example 3 : Aborting without errors".
sub save_record {
my $self = shift;
# ...
$self->abort_reading if some_condition($self->this_line);
# ...
}
push_records
This is useful if one needs to implement an include-like command in some text format. The example below illustrates this.
package OneParser;
use Moose;
extends 'Text::Parser';
sub save_record {
my $self = shift;
# ...
# Under some condition:
my $parser = AnotherParser->new();
$parser->read($some_file);
$self->push_records($parser->get_records);
# ...
}
Other methods available on auto_split
When the auto_split attribute is on (or if it is turned on later), additional methods become available. They are documented in Text::Parser::AutoSplit.
OVERRIDE IN SUBCLASS
The following methods should never be called in the ::main program. They are meant to be overridden (or re-defined) in a subclass.
save_record
This method should be re-defined in a subclass to parse the target text format. To save a record, the re-defined implementation in the derived class must call SUPER::save_record (or super if you're using Moose) with exactly one argument as a record. If no arguments are passed, undef is stored as a record.
For a developer re-defining save_record, in addition to this_line, six additional methods become available if the auto_split attribute is set. These methods are described in greater detail in Text::Parser::AutoSplit, and they are accessible only within save_record.
Note: Developers may store records in any form - string, array reference, hash reference, complex data structure, or an object of some class. The program that reads these records using get_records has to interpret them. So developers should document the records created by their own implementation of save_record.
FOR MULTI-LINE TEXT PARSING
These methods need to be re-defined only by multiline derived classes, i.e., if the target text format allows wrapping the content of one line into multiple lines. In most cases, you should re-define both methods. As usual, the this_line method may be used while re-defining them.
is_line_continued
This takes a string argument and returns a boolean indicating if the line is continued or not. See Text::Parser::Multiline for more on this.
The return values of the default method provided with this class are:
multiline_type | Return value
------------------+---------------------------------
undef | 0
join_last | 0 for first line, 1 otherwise
join_next | 1
join_last_line
This method takes two strings, joins them while removing any continuation characters, and returns the result. The default implementation just concatenates two strings and returns the result without removing anything (not even chomp). See Text::Parser::Multiline for more on this.
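How the two hooks cooperate for a join_next style format can be sketched in plain Perl. The code below is an illustration only and not the module's implementation: join_next_lines is a hypothetical helper that plays the role of both hooks, with the back-slash assumed to be the continuation character.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): joins every line that ends
# in a back-slash with the line that follows it.
sub join_next_lines {
    my @lines = @_;
    my ( @records, $pending );
    for my $line (@lines) {
        # Merge with any unfinished line carried over from earlier iterations.
        $line = $pending . $line if defined $pending;
        if ( $line =~ /\\\s*$/ ) {
            # Continued (the is_line_continued role): replace the back-slash
            # and trailing white space with one space, then wait for more.
            ( $pending = $line ) =~ s/\\\s*$/ /;
        }
        else {
            # Complete: this is what would reach save_record.
            push @records, $line;
            undef $pending;
        }
    }
    # A dangling continuation at end of input is kept here; a real parser
    # may prefer to report it as an error.
    push @records, $pending if defined $pending;
    return @records;
}

print join_next_lines( "Garbage In.\\\n", "Garbage Out!\n" );
```

Keeping the "is it continued?" test separate from the "how to join" step is what lets a subclass change the continuation character without touching the reading loop.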
EXAMPLES
Example 1 : A simple CSV Parser
We will write a parser for a simple CSV file that reads each line and stores the records as array references. This example is oversimplified, and does not handle embedded newlines.
package Text::Parser::CSV;
use Moose;
extends 'Text::Parser';
use Text::CSV;
my $csv;
sub save_record {
my ($self, $line) = @_;
$csv //= Text::CSV->new({ binary => 1, auto_diag => 1});
$csv->parse($line);
$self->SUPER::save_record([$csv->fields]);
}
That's it! Now in main:: you can write something like this:
use Text::Parser::CSV;
my $csvp = Text::Parser::CSV->new();
$csvp->read(shift @ARGV);
foreach my $aref ($csvp->get_records) {
my (@arr) = @{$aref};
print "@arr\n";
}
The above program reads the content of a given CSV file and prints the content out in space-separated form.
Example 2 : Error checking
Note: Read the documentation for Exceptions to learn about creating, throwing, and catching exceptions in Perl 5. All of the methods of creating, throwing, and catching exceptions described in Exceptions are supported.
You can throw exceptions from save_record in your subclass, for example, when you detect a syntax error. The read method will close all filehandles automatically as soon as an exception is thrown. The exception will pass through to ::main unless you catch and handle it in your derived class.
Here is an example showing the use of an exception to detect a syntax error in a file:
package My::Text::Parser;
use Exception::Class (
'My::Text::Parser::SyntaxError' => {
description => 'syntax error',
alias => 'throw_syntax_error',
},
);
use Moose;
extends 'Text::Parser';
sub save_record {
my ($self, $line) = @_;
throw_syntax_error(error => 'syntax error') if _syntax_error($line);
$self->SUPER::save_record($line);
}
Example 3 : Aborting without errors
We can also abort parsing a text file without throwing an exception. This is useful once we have the information we need. For example:
package SomeParser;
use Moose;
extends 'Text::Parser';
sub BUILDARGS {
my $pkg = shift;
return {auto_split => 1};
}
sub save_record {
my ($self, $line) = @_;
return $self->abort_reading() if $self->field(0) eq '**ABORT';
return $self->SUPER::save_record($line);
}
The SomeParser class above saves each line as a record, but aborts reading the rest of the file as soon as it reaches a line with **ABORT as the first word. Suppose this parser is given the following file as input:
somefile.txt:
Some text is here.
More text here.
**ABORT reading
This text is not read
This text is not read
This text is not read
This text is not read
You can now write a program as follows:
use SomeParser;
my $par = SomeParser->new();
$par->read('somefile.txt');
print $par->get_records(), "\n";
The output will be:
Some text is here.
More text here.
Example 4 : Multi-line parsing
Some text formats allow users to split a line into several lines with a line continuation character (usually at the end or the beginning of a line).
Trivial line-joiner
Below is a trivial example where all lines are joined into one:
use strict;
use warnings;
use Text::Parser;
my $join_all = Text::Parser->new(auto_chomp => 1, multiline_type => 'join_last');
$join_all->read('input.txt');
print $join_all->get_records(), "\n";
Continue with character
(Pun intended! ;-))
In the above example, all lines are joined (indiscriminately). But most often text formats have a continuation character that specifies that the line continues to the next line, or that the line is a continuation of the previous line. Here's an example parser that treats the back-slash (\) character as a line-continuation character:
package MyMultilineParser;
use Moose;
extends 'Text::Parser';
use strict;
use warnings;
sub new {
my $pkg = shift;
$pkg->SUPER::new(multiline_type => 'join_next');
}
sub is_line_continued {
my $self = shift;
my $line = shift;
chomp $line;
return $line =~ /\\\s*$/;
}
sub join_last_line {
my $self = shift;
my ($last, $line) = (shift, shift);
chomp $last;
$last =~ s/\\\s*$/ /g;
return $last . $line;
}
1;
In your main::
use MyMultilineParser;
use strict;
use warnings;
my $parser = MyMultilineParser->new();
$parser->read('multiline.txt');
print "Read:\n";
print $parser->get_records(), "\n";
Try with the following input multiline.txt:
Garbage In.\
Garbage Out!
When you run the above code with this file, you should get:
Read:
Garbage In. Garbage Out!
Simple SPICE line joiner
Some text formats allow a line to indicate that it is continuing from a previous line. For example, SPICE has a continuation character (+) on the next line, indicating that the text on that line should be joined with the previous line. Let's show how to build a simple SPICE line-joiner. To build a full-fledged parser you will have to specify the rich and complex grammar for SPICE circuit description.
package TrivialSpiceJoin;
use Moose;
extends 'Text::Parser';
use constant {
SPICE_LINE_CONTD => qr/^[+]\s*/,
SPICE_END_FILE => qr/^\.end/i,
};
sub new {
my $pkg = shift;
$pkg->SUPER::new(auto_chomp => 1, multiline_type => 'join_last');
}
sub is_line_continued {
my ( $self, $line ) = @_;
return 0 if not defined $line;
return $line =~ SPICE_LINE_CONTD;
}
sub join_last_line {
my ( $self, $last, $line ) = ( shift, shift, shift );
return $last if not defined $line;
$line =~ s/^[+]\s*/ /;
return $line if not defined $last;
return $last . $line;
}
sub save_record {
my ( $self, $line ) = @_;
return $self->abort_reading() if $line =~ SPICE_END_FILE;
$self->SUPER::save_record($line);
}
Try this parser with a SPICE deck with continuation characters and see what you get. Try having errors in the file. You may now write a more elaborate method for save_record above, which could be used to parse a full SPICE file.
THINGS TO BE DONE
Future versions are expected to include:
progress-bar support
parsing from a buffer
automatically uncompress input
suggestions welcome ...
Interested contributors welcome.
SEE ALSO
BUGS
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
CONTRIBUTORS
H.Merijn Brand - Tux <h.m.brand@xs4all.nl>
Mohammad S Anwar <mohammad.anwar@yahoo.com>