NAME
Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.
VERSION
version 0.917
SYNOPSIS
use Text::Parser;
my $parser = Text::Parser->new();
$parser->read(shift);
print $parser->get_records, "\n";
The above code reads the first command-line argument as a string, and assuming it is the name of a text file, it will print the content of the file to STDOUT. If the string is not the name of a text file it will throw an exception and exit.
package MyParser;
use parent 'Text::Parser';
## or use Moose; extends 'Text::Parser';
sub save_record {
my $self = shift;
## ...
}
package main;
my $parser = MyParser->new(auto_split => 1, auto_chomp => 1, auto_trim => 'b');
$parser->read(shift);
foreach my $rec ($parser->get_records) {
## ...
}
The above example shows how Text::Parser could be easily extended to parse a specific text format.
RATIONALE
Text parsing is perhaps the single most common thing that almost every Perl program does. Yet we don't have a lean, flexible, text parsing utility. Ideally, the developer should only have to specify the "grammar" of the text file she intends to parse. Everything else, like opening a file handle, closing the file handle, tracking line-count, joining continued lines into one, reporting any errors in line continuation, trimming white space, splitting each line into fields, etc., should be automatic.
Unfortunately, most file parsing code looks like this:
open FH, "<$fname";
my $line_count = 0;
while (<FH>) {
$line_count++;
chomp;
$_ = trim $_; ## From String::Util
my (@fields) = split /\s+/;
# do something for each line ...
}
close FH;
Note that a developer may have to repeat all of the above if she has to read another file with different content or format. And if the text has line-continuation characters, it isn't easy to implement it well with the while loop above.
With Text::Parser, developers can focus on specifying the grammar and simply use the read method. Just inherit the class and override one method (save_record). Voila! You have a parser. The EXAMPLES section illustrates how easy this can be.
DESCRIPTION
Text::Parser is a format-agnostic text parsing utility class. Derived classes can specify the format-specific syntax they intend to parse. Usually, derived classes only need to override methods to do this. But of course derived classes can create any additional attributes or methods needed to interpret the format and extract records.
Future versions are expected to include progress-bar support, parsing text from sockets, UTF support, or parsing from a chunk of memory. All these software features are text-format independent and should be re-used. Derived classes of Text::Parser will be able to take advantage of these features seamlessly, while the base class handles the "mundane" details.
CONSTRUCTOR
new
Takes optional attributes in the form of a hash. See section ATTRIBUTES for a list of the attributes and their description. Throws an exception if you use wrong inputs to create an object.
my $parser = Text::Parser->new(
auto_chomp => 0, # 0 (Default) or 1
# - automatically chomp lines
multiline_type => 'join_last', # 'join_last'|'join_next'|undef ; Default: undef
auto_trim => 'b', # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)
# - automatically trim leading and trailing whitespaces
auto_split => 1, # Auto-splits lines into fields
FS => qr/\s+/, # Used by auto_split feature above. Default: qr/\s+/
);
This $parser variable will be used in all examples below.
ATTRIBUTES
The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.
auto_chomp
Read-write attribute. Takes a boolean value as parameter. Defaults to 0.
print "Parser will chomp lines automatically\n" if $parser->auto_chomp;
auto_split
Read-only attribute that can be set only during object construction. Defaults to 0. This attribute indicates if the parser will automatically split every line into fields. If it is set to a true value, each line will be split into fields which can be accessed through special methods that become available. These methods are documented in Text::Parser::AutoSplit. The field separator can be set using another attribute named FS.
auto_trim
Read-write attribute. The values this can take are also shown under the new constructor. Defaults to 'n' (neither side spaces will be trimmed).
$parser->auto_trim('l'); # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)
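The four auto_trim values can be pictured with plain regular expressions. The following is only a sketch in core Perl: trim_like is a hypothetical helper, not part of Text::Parser, and the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): mimics the documented
# meaning of the auto_trim values 'l', 'r', 'b' and 'n'.
sub trim_like {
    my ( $mode, $line ) = @_;
    $line =~ s/^\s+// if $mode eq 'l' or $mode eq 'b';    # strip leading whitespace
    $line =~ s/\s+$// if $mode eq 'r' or $mode eq 'b';    # strip trailing whitespace
    return $line;                                         # 'n' leaves the line untouched
}

# Show the effect of each mode on the same input string.
for my $mode (qw(l r b n)) {
    printf "%s => [%s]\n", $mode, trim_like( $mode, '  some text  ' );
}
```

Note that in this sketch the trailing-whitespace pattern also removes a trailing newline, since a newline counts as whitespace.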
FS
Read-write attribute that can be used to specify the field separator along with the auto_split attribute. It must be a compiled regular expression created with the qr operator, like qr/\s+|[,]/ which will split across either spaces or commas. The default value for this argument is qr/\s+/.
The name for this attribute comes from the built-in FS variable in the popular GNU Awk program.
$parser->FS( qr/\s+\(*|\s*\)/ );
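To get a feel for what a separator like the one above does to a line, it can be tried with Perl's built-in split. This is a minimal sketch in core Perl; split_fields is a hypothetical helper, and whether auto_split divides lines in exactly this way internally is an assumption.

```perl
use strict;
use warnings;

# The separator from the example above: whitespace optionally followed by
# opening parentheses, or a closing parenthesis with optional whitespace.
my $fs = qr/\s+\(*|\s*\)/;

# Hypothetical helper: split one line on a separator regex.
sub split_fields {
    my ( $sep, $line ) = @_;
    return split $sep, $line;
}

print join( '|', split_fields( $fs, 'foo (bar baz)' ) ), "\n";    # prints foo|bar|baz
```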
You can change the field separator in the course of parsing a file. But the changes would take effect only on the next line. For example:
package MyParser;
use Moose;
extends 'Text::Parser';
sub BUILDARGS {
return {
auto_split => 1,
auto_chomp => 1,
auto_trim => 'b'
};
}
sub save_record {
my $self = shift;
$self->FS(qr/[,]/) if $self->field(0) eq 'CSV_BELOW';
$self->SUPER::save_record([$self->fields]);
}
package main;
use Data::Dumper 'Dumper';
my $parser = MyParser->new();
$parser->read('input.txt');
print Dumper([$parser->get_records]), "\n";
Now, let us say you have a file input.txt with the following content:
Some information in this file
CSV_BELOW
col1,col2,col3
data1,1,1
data2,2,4
data3,3,9
Then the output will be:
$VAR1 = [
[ 'Some', 'information', 'in', 'this', 'file' ],
[ 'CSV_BELOW' ],
[ 'col1', 'col2', 'col3' ],
[ 'data1', '1', '1' ],
[ 'data2', '2', '4' ],
[ 'data3', '3', '9' ]
];
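The "takes effect only on the next line" behaviour can be modelled without the parser at all. The sketch below uses only core Perl; split_with_switch is a hypothetical stand-in for the parsing loop, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser loop: split each line with the
# current separator, then switch separators after the marker line.
sub split_with_switch {
    my @lines = @_;
    my $fs    = qr/\s+/;    # documented default for FS
    my @records;
    for my $line (@lines) {
        my @fields = split $fs, $line;
        push @records, [@fields];

        # The new separator is first used on the NEXT line.
        $fs = qr/[,]/ if @fields and $fields[0] eq 'CSV_BELOW';
    }
    return @records;
}

my @records = split_with_switch(
    'Some information in this file',
    'CSV_BELOW',
    'col1,col2,col3',
    'data1,1,1',
);
print join( ' | ', @$_ ), "\n" for @records;
```

The marker line itself is still split on whitespace, because the separator only changes after that line has been processed.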
multiline_type
Read-write attribute. Takes a value that is either undef or one of the strings 'join_next' or 'join_last'.
my $mult = $parser->multiline_type;
print "Parser is a multi-line parser of type: $mult" if defined $mult;
$parser->multiline_type(undef);
# setting this to undef will throw an exception if it was previously set to a real value like
# 'join_next' or 'join_last'. In this case, since $parser was of 'join_last' type, there will
# be an exception
$parser->multiline_type('join_next');
# Changes the parser to a multiline parser of type 'join_next'
# This is okay.
What value should I choose?
If your text format allows users to break up what should be on a single line into another line using a continuation character, you need to use the multiline_type option. The option tells the parser to join lines back into a single line, so that your save_record method doesn't have to bother about joining the continued lines, stripping any continuation characters, line-feeds etc. There are two variations in this:
If your format allows something like a trailing back-slash or some other character to indicate that text on the next line is to be joined with this one, then choose join_next. See the example under Continue with character.
If your format allows some character to indicate that text on the current line is part of the last line, then choose join_last. See the Simple SPICE line joiner as an example. Note: If you have no continuation character, but you just want to join all the lines into one single line and then call save_record only once for the whole text block, then use join_last. See the Trivial line-joiner.
Remember that join_next multi-line parsers will blindly look for input to be continued on the next line, even if EOF has been reached. This means, if you want to "slurp" a file into a single large string, without any continuation characters, you must use the join_last multi-line type.
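The difference between the two types can be sketched in core Perl. The two helpers below are hypothetical and independent of Text::Parser; they assume a trailing back-slash as the join_next marker and a leading '+' as the join_last marker, and only model the idea rather than the module's internals.

```perl
use strict;
use warnings;

# Hypothetical model of 'join_next': a trailing back-slash means the text
# continues on the NEXT line.
sub join_next_style {
    my @records;
    my $pending;
    for my $line (@_) {
        my $text = ( defined $pending ? $pending : '' ) . $line;
        if ( $text =~ s/\\\s*$/ / ) {
            $pending = $text;    # marker replaced by a space; wait for more input
        }
        else {
            push @records, $text;
            undef $pending;
        }
    }

    # Input ended while a continuation was still expected.
    die "Unexpected end of input: the last line was continued\n" if defined $pending;
    return @records;
}

# Hypothetical model of 'join_last': a leading '+' means the text belongs
# to the PREVIOUS line.
sub join_last_style {
    my @records;
    for my $line (@_) {
        if ( @records and $line =~ s/^[+]\s*/ / ) {
            $records[-1] .= $line;
        }
        else {
            push @records, $line;
        }
    }
    return @records;
}

print "$_\n" for join_next_style( 'Garbage In.\\', 'Garbage Out!' );
print "$_\n" for join_last_style( 'R1 1 0', '+ 10k', 'C1 2 0' );
```

In this model, only the join_next helper can fail at end of input, because it is the only one that commits to more text before seeing it.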
METHODS
read
Takes zero or one argument which could be a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class. Throws an exception if the filename/GLOB provided is either non-existent or cannot be read for any reason.
Note: Normally if you provide the GLOB of a file opened for write, some operating systems allow reading from it too, and some don't. Read the documentation for filehandle for more on this.
$parser->read($filename);
# The above is equivalent to the following
$parser->filename($filename);
$parser->read();
# You can also read from a previously opened file handle directly
$parser->filehandle(\*STDIN);
$parser->read();
Returns once all records have been read or if an exception is thrown for any parsing errors, or if reading has been aborted with the abort_reading method.
If you provide a filename as input, the function will handle all open and close operations on files even if any exception is thrown, or if the reading has been aborted. But if you pass a file handle GLOB or FileHandle object instead, then the file handle won't be closed and it will be the responsibility of the calling program to close the filehandle.
$parser->read('myfile.txt');
# Will handle open, parsing, and closing of file automatically.
open MYFH, '<', 'myfile.txt' or die "Can't open file myfile.txt: $!";
$parser->read(\*MYFH);
# Will not close MYFH; it is the responsibility of the calling program to close it
close MYFH;
When you read a new file name or file handle with this method, you lose all the records stored from the previous read operation. So if you want to read a different file with the same parser object, you should first use the get_records method to retrieve all the records read so far (unless you don't care about the records from the last file you read). Each successive call to read in the examples above overwrote the records from the previous call.
$parser->read($file1);
my (@records) = $parser->get_records();
$parser->read(\*STDIN);
my (@stdin) = $parser->get_records();
Note: To extend the class to other file formats, override save_record instead of read.
Future Enhancement
At present the read method accepts only two types of argument: a file name or a file handle. In future this may be enhanced to read from sockets, subroutines, or even just a block of memory (a string reference). Suggestions for other forms of input are welcome.
filename
Takes zero or one string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.
print "Last read ", $parser->filename, "\n";
The file name is "persistent" in the object. That is, even after you have called read once, it still remembers the file name. So you can do this:
$parser->read(shift @ARGV);
print $parser->filename(), ":\n",
"=" x (length($parser->filename())+1),
"\n",
$parser->get_records(),
"\n";
But if you do a read with a filehandle as argument, you'll see that the last filename is lost, which makes sense.
$parser->read(\*MYFH);
print "Last file name is lost\n" if not defined $parser->filename();
filehandle
Takes zero or one argument that must be either a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. The method saves it for a future read call. Returns the filehandle last saved, or undef if none was saved. Remember that after a successful read call, filehandles are lost.
my $fh = $parser->filehandle();
As with the filename method, if you read with a filehandle and then call read again, this time with a file name, the last filehandle is lost.
my $lastfh = $parser->filehandle();
## Returns the filehandle last saved, e.g. \*STDIN
$parser->read('another.txt');
print "No filehandle saved any more\n" if
not defined $parser->filehandle();
lines_parsed
Takes no arguments. Returns the number of lines last parsed. A line is reckoned when the \n character is encountered.
print $parser->lines_parsed, " lines were parsed\n";
The value is auto-updated during the execution of read. See this example of how this can be used in derived classes.
Again the information in this is "persistent". But you can also be assured that every time you call read, the value will be auto-reset before parsing.
get_records
Takes no arguments. Returns an array containing all the records saved by the parser.
my $i = 0;
foreach my $record ( $parser->get_records ) {
$i++;
print "Record: $i: ", $record, "\n";
}
has_aborted
Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.
print "Aborted\n" if $parser->has_aborted();
pop_record
Takes no arguments and pops the last saved record.
my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);
last_record
Takes no arguments and returns the last saved record. Leaves the saved records untouched.
my $last_rec = $parser->last_record;
OVERRIDE IN SUBCLASS
save_record
Takes exactly one argument, which is saved as a record. Additional arguments are ignored. If no arguments are passed, then undef is stored as a record.
In an application that uses a text parser, you will most likely never call this method directly. It is automatically called within read for each line. In this base class Text::Parser, save_record is simply called with a string containing the raw line of text; i.e. the line of text will not be chomped or modified in any way (unless of course the auto_chomp attribute is turned on). See the EXAMPLES section for a basic example.
Derived classes can decide to store records in a different form. A derived class could, for example, store the records in the form of hash references (so that when you use get_records, you'd get an array of hashes), or maybe even another array reference (so when you use get_records you'd get an array of arrays). The CSV parser example does the latter.
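As a sketch of the hash-reference variant, the helper below turns one line into a record. It is core Perl only; line_to_hashref and the "key = value" line format are assumptions made for illustration and are not part of Text::Parser.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a "key = value" line into a hash reference.
# Returns undef (in scalar context) for lines that do not match this
# assumed format.
sub line_to_hashref {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/;
    return { key => $1, value => $2 };
}

# Inside a derived class, save_record could then store the hash reference:
#
#     sub save_record {
#         my ( $self, $line ) = @_;
#         my $rec = line_to_hashref($line);
#         $self->SUPER::save_record($rec) if defined $rec;
#     }

my $rec = line_to_hashref("color = dark red \n");
print "$rec->{key} => $rec->{value}\n";    # prints: color => dark red
```

With a save_record like the one in the comment, get_records would return a list of hash references, one per matching line.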
line_auto_manip
A method that can be overridden to manipulate each line before it gets to the save_record method. Because this is called before the save_record method, it is called even before the Text::Parser::Multiline role can be called. You will almost never call this method in a program directly but might use it in subclasses.
The default implementation chomps lines (if auto_chomp is true) and trims leading/trailing whitespace (if auto_trim is not 'n').
If you override this method, remember that it takes a string as input and returns a string.
is_line_continued
This method is to be defined by the derived class and is used only for multi-line parsers. Look under FOR MULTI-LINE TEXT PARSING for details.
DON'T OVERRIDE IN SUBCLASS
push_records
Don't override this method unless you know what you're doing. This method is useful if you have to copy the records from another parser. It is a general-purpose method for storing records that have been prepared beforehand. It is not supposed to be used to modify the arguments and make records (like save_record does).
$parser->push_records(
$another_parser->get_records
);
FOR USE IN SUBCLASS
abort_reading
Takes no arguments. Returns 1. You will probably never call this method in your main program.
This method is usually used only in the derived class. See Example 3 : Aborting without errors.
FOR MULTI-LINE TEXT PARSING
is_line_continued
Takes a string argument and returns a boolean indicating whether the line is continued or not. If the user defines a new text format with multi-line support, they should implement this method. An example implementation would look like this:
sub is_line_continued {
my ($self, $line) = @_;
chomp $line;
$line =~ /\\\s*$/;
}
The above example method checks if a line is being continued by using a back-slash character (\).
The default method provided in this class will return 0 if the parser is not a multi-line parser. If it is a multi-line parser, the return value depends on the type of multi-line parser. If it is of type 'join_last', then it returns 1 for all lines except the first line. This means all lines continue from the previous line (except the first line, because there is no line before that). But if it is of type 'join_next', then it returns 1 for all lines unconditionally. Note: This means the parser will expect further lines, even when the last line in the text input has been read. Thus you need to have a way to indicate that there is no further continuation. This is why if you are building a trivial line-joiner, you should use the 'join_last' type. See the Trivial line-joiner example.
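The default behaviour described above fits in a few lines. The following is a hypothetical model in core Perl, written from the description rather than from the module's source; default_is_continued and its explicit line-number argument are assumptions.

```perl
use strict;
use warnings;

# Hypothetical model of the documented default behaviour.
# $type is undef, 'join_last' or 'join_next'; $line_number starts at 1.
sub default_is_continued {
    my ( $type, $line_number ) = @_;
    return 0 if not defined $type;    # not a multi-line parser
    return $line_number > 1 ? 1 : 0 if $type eq 'join_last';
    return 1;                         # 'join_next': every line is continued
}

# Print the answers for the first three lines under each type.
for my $type ( undef, 'join_last', 'join_next' ) {
    my $label = defined $type ? $type : 'undef';
    print "$label: ", join( ' ', map { default_is_continued( $type, $_ ) } 1 .. 3 ), "\n";
}
```

The last row of the output makes the pitfall visible: under 'join_next' no line ever ends the continuation, so the end of the input always arrives while more text is expected.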
Most users will never need to call this method in their own programs, but anyone writing a parser for a specific format that supports multi-line extension will usually have to implement it.
join_last_line
This method can be overridden in multi-line text parsing. The method takes two string arguments and joins them in a way that removes the continuation character. The default implementation just concatenates two strings and returns the result without removing anything. You should redefine this method to strip any continuation characters and join the strings with any required spaces. Below is an example of a method which strips the ending back-slash continuation characters that were detected in the is_line_continued method above.
sub join_last_line {
my $self = shift;
my ($last, $line) = (shift, shift);
$last =~ s/\\\s*$//g;
return "$last $line";
}
DEPRECATED
setting
This method has been deprecated. Use multiline_type and auto_chomp instead.
(Note: This deprecated method cannot be used with the auto_trim attribute)
This method will disappear from version 1.0 onwards.
EXAMPLES
Basic principle
Derived classes simply need to override one method: save_record. With its help, any arbitrary file format can be read. save_record should interpret the format of the text and store it in some form by calling SUPER::save_record. The main:: program will then use the records and create an appropriate data structure with them.
Notice that the creation of a data structure is not the objective of a parser. It is simply concerned with collecting data and arranging it in a form that can be used. That's all. Data structures can be created by a different part of your program using the data collected by your parser.
Note: There is support for Moose. So you could use extends 'Text::Parser' instead of the use parent pragma in these examples. The examples in this documentation mostly show non-Moose classic Perl OO derived classes for ease of understanding. Those who know how to use class automators like Moo/Moose should be able to follow.
Example 1 : A simple CSV Parser
We will write a parser for a simple CSV file that reads each line and stores the records as array references. This example is oversimplified, and does not handle embedded newlines.
package Text::Parser::CSV;
use parent 'Text::Parser';
use Text::CSV;
my $csv;
sub save_record {
my ($self, $line) = @_;
$csv //= Text::CSV->new({ binary => 1, auto_diag => 1});
$csv->parse($line);
$self->SUPER::save_record([$csv->fields]);
}
That's it! Now in main:: you can write something like this:
use Text::Parser::CSV;
my $csvp = Text::Parser::CSV->new();
$csvp->read(shift @ARGV);
foreach my $aref ($csvp->get_records) {
my (@arr) = @{$aref};
print "@arr\n";
}
The above program reads the content of a given CSV file and prints the content out in space-separated form.
Example 2 : Error checking
This class encourages the use of exceptions for error checking. Read the documentation for Exceptions to learn about creating, throwing, and catching exceptions in Perl 5. All of the methods of creating, throwing, and catching exceptions described in Exceptions are supported.
You can throw exceptions from save_record, for example, when you detect a syntax error. The read method will close all filehandles automatically as soon as an exception is thrown. The exception will pass through to main:: unless you catch and handle it in your derived class.
Here is an example showing the use of an exception to detect a syntax error in a file:
package My::Text::Parser;
use Exception::Class (
'My::Text::Parser::SyntaxError' => {
description => 'syntax error',
alias => 'throw_syntax_error',
},
);
use parent 'Text::Parser';
sub save_record {
my ($self, $line) = @_;
throw_syntax_error(error => 'syntax error') if _syntax_error($line);
$self->SUPER::save_record($line);
}
Example 3 : Aborting without errors
We can also abort parsing a text file without throwing an exception. This can be useful once we have the information we need. For example:
package SomeParser;
use Moose;
extends 'Text::Parser';
sub BUILDARGS {
my $pkg = shift;
return {auto_split => 1};
}
sub save_record {
my ($self, $line) = @_;
return $self->abort_reading() if $self->field(0) eq '**ABORT';
return $self->SUPER::save_record($line);
}
Above is shown a parser SomeParser that would save each line as a record, but would abort reading the rest of the file as soon as it reaches a line with **ABORT as the first word. When this parser is given the following file as input:
somefile.txt:
Some text is here.
More text here.
**ABORT reading
This text is not read
This text is not read
This text is not read
This text is not read
You can now write a program as follows:
use SomeParser;
my $par = SomeParser->new();
$par->read('somefile.txt');
print $par->get_records(), "\n";
The output will be:
Some text is here.
More text here.
Example 4 : Multi-line parsing
Some text formats allow users to split a line into several lines with a line continuation character (usually at the end or the beginning of a line).
Trivial line-joiner
Below is a trivial example where all lines are joined into one:
use strict;
use warnings;
use Text::Parser;
my $join_all = Text::Parser->new(auto_chomp => 1, multiline_type => 'join_last');
$join_all->read('input.txt');
print $join_all->get_records(), "\n";
Another trivial example is here.
Continue with character
(Pun intended! ;-))
In the above example, all lines are joined (indiscriminately). But most often text formats have a continuation character that specifies that the line continues to the next line, or that the line is a continuation of the previous line. Here's an example parser that treats the back-slash (\) character as a line-continuation character:
package MyMultilineParser;
use parent 'Text::Parser';
use strict;
use warnings;
sub new {
my $pkg = shift;
$pkg->SUPER::new(multiline_type => 'join_next');
}
sub is_line_continued {
my $self = shift;
my $line = shift;
chomp $line;
return $line =~ /\\\s*$/;
}
sub join_last_line {
my $self = shift;
my ($last, $line) = (shift, shift);
chomp $last;
$last =~ s/\\\s*$/ /g;
return $last . $line;
}
1;
In your main::
use MyMultilineParser;
use strict;
use warnings;
my $parser = MyMultilineParser->new();
$parser->read('multiline.txt');
print "Read:\n";
print $parser->get_records(), "\n";
Try with the following input multiline.txt:
Garbage In.\
Garbage Out!
When you run the above code with this file, you should get:
Read:
Garbage In. Garbage Out!
Simple SPICE line joiner
Some text formats allow a line to indicate that it is continuing from a previous line. For example SPICE has a continuation character (+) on the next line, indicating that the text on that line should be joined with the previous line. Let's show how to build a simple SPICE line-joiner. To build a full-fledged parser you will have to specify the rich and complex grammar for SPICE circuit description.
package TrivialSpiceJoin;
use parent 'Text::Parser';
use constant {
SPICE_LINE_CONTD => qr/^[+]\s*/,
SPICE_END_FILE => qr/^\.end/i,
};
sub new {
my $pkg = shift;
$pkg->SUPER::new(auto_chomp => 1, multiline_type => 'join_last');
}
sub is_line_continued {
my ( $self, $line ) = @_;
return 0 if not defined $line;
return $line =~ SPICE_LINE_CONTD;
}
sub join_last_line {
my ( $self, $last, $line ) = ( shift, shift, shift );
return $last if not defined $line;
$line =~ s/^[+]\s*/ /;
return $line if not defined $last;
return $last . $line;
}
sub save_record {
my ( $self, $line ) = @_;
return $self->abort_reading() if $line =~ SPICE_END_FILE;
$self->SUPER::save_record($line);
}
Try this parser with a SPICE deck with continuation characters and see what you get. Try having errors in the file. You may now write a more elaborate method for save_record above and that could be used to parse a full SPICE file.
ERRORS AND EXCEPTIONS
This class adopts the principle that errors should be indicated with exceptions, and not a special return code or an error code. Several exceptions described in Text::Parser::Errors could be thrown when using Text::Parser. All these are derived from Text::Parser::Errors::GenericError. Since this is a Moose class, exceptions derived from Moose::Exception will be thrown when methods of this class are used improperly.
In addition to these two types of exceptions, the developer can make her own exceptions. Example 2 : Error checking shows how she could create her own exceptions. We recommend Syntax::Keyword::Try (or Try::Tiny if you can't use that) to catch exceptions.
Since the handling of exceptions depends on their type, one could optionally build a handler routine using Dispatch::Class.
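The idea of a type-based handler routine can be sketched with core Perl alone. Everything in the block below is hypothetical: My::SyntaxError stands in for an exception class such as the one built with Exception::Class in Example 2, and run_with_handlers is a hand-rolled substitute for what Dispatch::Class or Syntax::Keyword::Try would normally provide.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

# Hypothetical exception class, standing in for one created with
# Exception::Class.
package My::SyntaxError {
    sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub throw   { my $class = shift; die $class->new(@_) }
    sub message { return $_[0]{message} }
}

# Handlers keyed by exception class; 'DEFAULT' catches everything else.
my %handler = (
    'My::SyntaxError' => sub { return 'syntax error: ' . $_[0]->message },
    'DEFAULT'         => sub { return "unexpected error: $_[0]" },
);

# Run a code reference and dispatch on the class of whatever it throws.
sub run_with_handlers {
    my ($code) = @_;
    my $ok = eval { $code->(); 1 };
    return 'ok' if $ok;
    my $err = $@;
    for my $class ( grep { $_ ne 'DEFAULT' } keys %handler ) {
        return $handler{$class}->($err) if blessed($err) and $err->isa($class);
    }
    return $handler{DEFAULT}->($err);
}

print run_with_handlers( sub { My::SyntaxError->throw( message => 'bad token' ) } ), "\n";
print run_with_handlers( sub { die "disk full\n" } );
```

In a real program the code reference would wrap the call to read, so that exceptions thrown from save_record reach the matching handler.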
SEE ALSO
BUGS
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
CONTRIBUTORS
H.Merijn Brand - Tux <h.m.brand@xs4all.nl>
Mohammad S Anwar <mohammad.anwar@yahoo.com>