NAME
TextFileParser - an extensible Perl class to parse any text file by specifying grammar in derived classes.
VERSION
version 0.203
SYNOPSIS
use TextFileParser;
my $parser = new TextFileParser;
$parser->read(shift @ARGV);
print $parser->get_records, "\n";
The above code reads a text file and prints its content to STDOUT.
DESCRIPTION
This class can be used to parse any arbitrary text file format.
TextFileParser handles the common operations: opening and closing the file, counting lines, and storing, deleting, and retrieving records. Future versions are expected to include progress-bar support. All of these features are independent of the file format and can be reused when parsing any text file format. Derived classes of TextFileParser can therefore take advantage of them without rewriting the code.
The Examples section describes how one could use inheritance to build a parser.
EXAMPLES
The following examples should illustrate the use of inheritance to parse various types of text file formats.
Basic principle
Derived classes simply need to override one method: save_record. With that alone, any arbitrary file format can be read. save_record should interpret the format of the text and store it in some form by calling SUPER::save_record. The main:: program will then use the records and create an appropriate data structure from them.
Example 1 : A simple CSV Parser
We will write a parser for a simple CSV file that reads each line and stores the records as array references.
package CSVParser;
use parent 'TextFileParser';
sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my (@fields) = split /,/, $line;
    $self->SUPER::save_record(\@fields);
}
That's it! Now in main:: you can write the following.
use CSVParser;
my $csvp = new CSVParser;
$csvp->read(shift @ARGV);
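As a rough sketch of how main:: might turn these array-reference records into a more convenient data structure, the helper below converts them into hash references keyed by column name. The name records_to_hashes is hypothetical, and the sketch assumes that the first record is a header row and that the CSVParser class above is saved as CSVParser.pm (ending in a true value such as 1;).

```perl
use strict;
use warnings;

# Hypothetical helper: convert array-ref records into hash refs keyed by
# column name. Assumes the first record is the header row.
sub records_to_hashes {
    my ($header, @rows) = @_;
    my @hashes;
    for my $row (@rows) {
        my %rec;
        @rec{@$header} = @$row;    # hash slice: column name => field value
        push @hashes, \%rec;
    }
    return @hashes;
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    require CSVParser;    # the class defined above, saved as CSVParser.pm
    my $csvp = CSVParser->new;
    $csvp->read(shift @ARGV);
    for my $rec ( records_to_hashes( $csvp->get_records ) ) {
        print join( ', ', map { "$_=" . ( $rec->{$_} // '' ) } sort keys %$rec ), "\n";
    }
}
```

Keeping the conversion in main:: leaves the parser class responsible only for splitting lines, in line with the basic principle described earlier.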
Error checking
It is easy to add error checks using exceptions. One of the easiest ways to do this is to use Exception::Class.
package CSVParser;
use Exception::Class (
    'CSVParser::Error',
    'CSVParser::TooManyFields' => {
        isa => 'CSVParser::Error',
    },
);
use parent 'TextFileParser';
sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my (@fields) = split /,/, $line;
    # The first record read is treated as the header row
    $self->{__csv_header} = \@fields if not scalar($self->get_records);
    CSVParser::TooManyFields->throw(error => "Too many fields on line " . $self->lines_parsed)
        if scalar(@fields) > scalar(@{$self->{__csv_header}});
    $self->SUPER::save_record(\@fields);
}
The TextFileParser class will close all filehandles automatically as soon as an exception is thrown from save_record. You can then catch the exception in main:: using Try::Tiny.
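The following is a minimal sketch of catching such an exception in main::. To stay runnable with core Perl only, it uses eval rather than Try::Tiny; with Try::Tiny the same logic would move into try and catch blocks. The helper name parse_or_explain is hypothetical, and the sketch assumes the CSVParser::Error hierarchy declared above and the error accessor provided by Exception::Class.

```perl
use strict;
use warnings;
use Scalar::Util 'blessed';

# Hypothetical helper: run a parser on a file and return an error message,
# or nothing (undef in scalar context) on success.
sub parse_or_explain {
    my ($parser, $file) = @_;
    my $ok = eval { $parser->read($file); 1 };
    return if $ok;
    my $err = $@;
    # Exceptions derived from CSVParser::Error carry the parser's own message
    if ( blessed($err) && $err->isa('CSVParser::Error') ) {
        return "Bad CSV in $file: " . $err->error;
    }
    # Anything else, for example an unreadable file
    return "Could not parse $file: $err";
}

# Runs only when a file name is given on the command line.
if (@ARGV) {
    require CSVParser;    # assumes the class above is saved as CSVParser.pm
    my $file = shift @ARGV;
    my $msg  = parse_or_explain( CSVParser->new, $file );
    print STDERR "$msg\n" if defined $msg;
}
```

Checking the class with isa lets main:: distinguish format errors raised by save_record from other failures such as a missing file.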
Example 2 : Multi-line records
Many text file formats have some way to indicate line continuation. In Bash and many other shell languages, a line continuation is indicated with a trailing backslash (\). In SPICE syntax, a line that starts with a '+' character is to be treated as a continuation of the previous line.
To illustrate multi-line records we will write a derived class that simply joins the lines in a SPICE file and stores them as records.
package SPICELineJoiner;
use parent 'TextFileParser';
sub save_record {
    my ($self, $line) = @_;
    # A leading '+' marks a continuation of the previous line
    $line = ($line =~ /^[+]\s*/) ? $self->__combine_with_last_record($line) : $line;
    $self->SUPER::save_record( $line );
}
sub __combine_with_last_record {
    my ($self, $line) = @_;
    $line =~ s/^[+]\s*//;
    # Remove the previous record so the joined line replaces it
    my $last_rec = $self->pop_record;
    chomp $last_rec;
    return $last_rec . ' ' . $line;
}
Making roles instead
Line continuation is a classic feature common to many different formats. If each grammar gets its own class, the code for line continuation could end up being rewritten for each one. It would be better to reuse only the ability to join continued lines, and leave the actual syntax recognition to the class that understands the syntax.
But if we separate this functionality into a class of its own, like we did above with SPICELineJoiner, it gives the impression that one can usefully create an object of SPICELineJoiner. In reality, an object of this class would not have much functionality and would therefore be of limited use.
This is where roles are very useful.
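As an illustration only, the line-joining behaviour could be packaged as a role along the following lines. The source does not say which role system is intended; this sketch assumes Role::Tiny from CPAN, and the names LineContinuationRole and join_continuation are hypothetical.

```perl
package LineContinuationRole;    # hypothetical role name
use strict;
use warnings;
use Role::Tiny;

# The consuming class must provide pop_record (TextFileParser does).
requires 'pop_record';

# Return $line joined to the previously saved record when it starts
# with '+'; otherwise return $line unchanged.
sub join_continuation {
    my ($self, $line) = @_;
    return $line unless $line =~ s/^[+]\s*//;
    my $last_rec = $self->pop_record;
    return $line unless defined $last_rec;    # nothing to join onto
    chomp $last_rec;
    return $last_rec . ' ' . $line;
}

1;
```

A class that understands a particular syntax would inherit from TextFileParser, apply the role using with 'LineContinuationRole' from Role::Tiny::With, and call join_continuation from its own save_record before passing the result to SUPER::save_record. Because a role cannot be instantiated by itself, it avoids the impression of a standalone parser.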
METHODS
new
Takes no arguments. Returns a blessed reference of the object.
my $pars = new TextFileParser;
This $pars variable will be used in the examples below.
read
Takes zero or one string argument containing the name of the file. Throws an exception if the file named does not exist or cannot be read for any reason.
$pars->read($filename);
# The above is equivalent to the following
$pars->filename($anotherfile);
$pars->read();
Returns once all records have been read, or when an exception is thrown for a parsing error. This function handles all open and close operations on all files, even if an exception is thrown.
Recommendation: Don't override this subroutine. Override save_record instead.
filename
Takes zero or one string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.
print "Last read ", $pars->filename, "\n";
lines_parsed
Takes no arguments. Returns the number of lines last parsed.
print $pars->lines_parsed, " lines were parsed\n";
This is also very useful for error message generation.
save_record
Takes exactly one argument, which can be anything: SCALAR, ARRAYREF, HASHREF, or anything else meaningful. The important thing to remember is that exactly one record is saved per call. If more than one argument is passed, everything after the first argument is ignored. If no arguments are passed, undef is stored as a record.
In an application that uses a text parser, you will most likely never call this method directly. It is called automatically within read for each line. In the base class TextFileParser, save_record is simply called with a string containing the line text. Derived classes can decide to store records in a different form. See the EXAMPLES section for how save_record could be overridden for other text file formats.
get_records
Takes no arguments. Returns an array containing all the records that were read by the parser.
my $i = 0;
foreach my $record ( $pars->get_records ) {
    $i++;
    print "Record: $i: ", $record, "\n";
}
last_record
Takes no arguments and returns the last saved record. Leaves the saved records untouched.
my $last_rec = $pars->last_record;
pop_record
Takes no arguments. Removes the last saved record and returns it.
my $last_rec = $pars->pop_record;
my $uc_last = uc $last_rec;
$pars->save_record($uc_last);
BUGS
Please report any bugs or feature requests on the bugtracker website http://rt.cpan.org/Public/Dist/Display.html?Name=TextFileParser or by email to bug-textfileparser at rt.cpan.org.
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.