NAME

Text::Parser - Bundles common text file reading, extensible to read any arbitrary text grammar. Stop re-writing mundane code to parse your next text file. This module supersedes the older and now defunct TextFileParser.

VERSION

version 0.700

SYNOPSIS

use Text::Parser;

my $parser = Text::Parser->new();
$parser->read(shift @ARGV);
print $parser->get_records, "\n";

The above code treats the first command-line argument as a filename, and assuming it is a text file, it will print the content of the file to STDOUT.

RATIONALE

A simple text parser should only have to specify the "grammar" of the format it intends to read, in the form of a few routines. Everything else, like opening a file handle, reading line by line, or tracking how many lines have been read, should be "automatic". Unfortunately, that's not how most programs seem to work. Most programmers spend (waste?) time writing code that calls open, close, etc., and must keep track of things that should have been simple features of every text file parser. And if they have to read multiple files, the calls to open and close, along with the checks for readability and so on, are usually repeated for each one. This is an utter waste of time.

Text::Parser handles all the mundane operations, like opening and closing files, counting lines, and storing, deleting, and retrieving records. You don't have to bother with any of that when you write a parser for your favorite text file format. Instead you can usually override just one method (save_record) and voila! you have a parser. Look at the examples below to see how easy this can be.

DESCRIPTION

Text::Parser is a bare-bones text file parsing class. It is actually ignorant of the file format, and cannot recognize any grammars, but derived classes that inherit from it can specify this. They can do this by overriding some of the methods in this class.

Future versions are expected to include progress-bar support. Features like these are file-format independent and can be reused when parsing any text file format. Thus derived classes of Text::Parser will be able to take advantage of them without having to rewrite the code.

EXAMPLES

The following examples should illustrate the use of inheritance to parse various types of text file formats.

Basic principle

Derived classes simply need to override one method: save_record. With its help, any arbitrary file format can be read. save_record should interpret the format of the text and store it in some form by calling SUPER::save_record. The main:: program will then use the records to create an appropriate data structure.

Example 1 : A simple CSV Parser

We will write a parser for a simple CSV file that reads each line and stores the records as array references.

package Text::Parser::CSV;
use parent 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my (@fields) = split /,/, $line;
    $self->SUPER::save_record(\@fields);
}

That's it! Now in main:: you can write something like this:

use Text::Parser::CSV;

my $csvp = Text::Parser::CSV->new();
$csvp->read(shift @ARGV);
foreach my $aref ($csvp->get_records) {
    my (@arr) = @{$aref};
    print "@arr\n";
}

The above program reads the content of a given CSV file and prints the content out in space-separated form.

Error checking

It is easy to add any error checks using exceptions. One of the easiest ways to do this is to use Exception::Class. We'll demonstrate the use for the CSV parser.

package Text::Parser::CSV;
use Exception::Class (
    'Text::Parser::CSV::Error', 
    'Text::Parser::CSV::TooManyFields' => {
        isa => 'Text::Parser::CSV::Error',
    },
);

use parent 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my (@fields) = split /,/, $line;
    $self->{__csv_header} = \@fields if not scalar($self->get_records);
    Text::Parser::CSV::TooManyFields->throw(error => "Too many fields on line #" . $self->lines_parsed)
        if scalar(@fields) > scalar(@{$self->{__csv_header}});
    $self->SUPER::save_record(\@fields);
}

The Text::Parser class will close all filehandles automatically as soon as an exception is thrown from save_record. You can catch the exception in main:: as you normally would, by using Try::Tiny or a similar module.
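
As a minimal sketch of catching such an exception, the script below repeats the error-checking CSV parser from above so that it runs on its own, then wraps the read call with Try::Tiny. The sample file content and the $message variable are made up for illustration, and the script assumes Try::Tiny and Exception::Class are installed.

```perl
use strict;
use warnings;

# Error-checking CSV parser from above, repeated so this runs standalone.
package Text::Parser::CSV;
use Exception::Class (
    'Text::Parser::CSV::Error',
    'Text::Parser::CSV::TooManyFields' => { isa => 'Text::Parser::CSV::Error' },
);
use parent 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my @fields = split /,/, $line;
    $self->{__csv_header} = \@fields if not scalar($self->get_records);
    Text::Parser::CSV::TooManyFields->throw(
        error => "Too many fields on line #" . $self->lines_parsed)
        if scalar(@fields) > scalar(@{$self->{__csv_header}});
    $self->SUPER::save_record(\@fields);
}

package main;
use Try::Tiny;
use Scalar::Util qw(blessed);
use File::Temp qw(tempfile);

# Made-up sample input: the third line has one field too many.
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "a,b\n1,2\n3,4,5\n";
close $fh;

my $csvp    = Text::Parser::CSV->new();
my $message = '';
try {
    $csvp->read($file);
}
catch {
    # Rethrow anything that is not one of the parser's own exceptions.
    die $_ unless blessed($_) && $_->isa('Text::Parser::CSV::Error');
    $message = $_->error;
};
print "Parse failed: $message\n" if $message;
```

Catching the base class Text::Parser::CSV::Error rather than TooManyFields means that any further error classes derived from it would be handled by the same block.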

Example 2 : Multi-line records

Many text file formats have some way to indicate line-continuation. In BASH and many other interpreted shell languages, a line continuation is indicated with a trailing back-slash (\). In SPICE syntax if a line starts with a '+' character then it is to be treated as a continuation of the previous line.

To illustrate multi-line records we will write a derived class that simply joins the lines in a SPICE file and stores them as records.

package Text::Parser::LineContinuation::Spice;
use parent 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    $line = ($line =~ /^[+]\s*/) ? $self->__combine_with_last_record($line) : $line;
    $self->SUPER::save_record( $line );
}

sub __combine_with_last_record {
    my ($self, $line) = @_;
    $line =~ s/^[+]\s*//;
    my $last_rec = $self->pop_record;
    chomp $last_rec;
    return $last_rec . ' ' . $line;
}

Making roles instead

Line-continuation is a classic feature which is common to many different formats. If each syntax grammar generates a new class, one could potentially have to re-write code for line-continuation for each syntax or grammar. Instead it would be good to somehow re-use only the ability to join continued lines, but leave the actual syntax recognition to the actual class that understands the syntax.

But if we separate this functionality into a class of its own, like we did above with Text::Parser::LineContinuation::Spice, then it gives the impression that we can now create an object of Text::Parser::LineContinuation::Spice. In reality, an object of this class would not have much functionality and is therefore of limited use.

This is where roles are very useful.
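
One possible way to sketch this uses Role::Tiny. Nothing above names a particular role system, so the choice of Role::Tiny, the role name Text::Parser::Role::SpiceContinuation, the consuming class Text::Parser::Spice, and the method name join_continued_line are all assumptions made for illustration. The role only knows how to join continued lines; the consuming class decides what to store.

```perl
use strict;
use warnings;

# Hypothetical role: only knows how to join SPICE-style continuation lines.
package Text::Parser::Role::SpiceContinuation;
use Role::Tiny;

requires 'pop_record';    # supplied by Text::Parser in the consuming class

sub join_continued_line {
    my ($self, $line) = @_;
    # Lines that do not start with '+' are returned unchanged.
    return $line unless $line =~ s/^[+]\s*//;
    my $last_rec = $self->pop_record;
    return $line unless defined $last_rec;
    chomp $last_rec;
    return $last_rec . ' ' . $line;
}

# Hypothetical class that consumes the role and handles the actual syntax.
package Text::Parser::Spice;
use parent 'Text::Parser';
use Role::Tiny::With;
with 'Text::Parser::Role::SpiceContinuation';

sub save_record {
    my ($self, $line) = @_;
    $self->SUPER::save_record($self->join_continued_line($line));
}

package main;
use File::Temp qw(tempfile);

# Made-up sample input with one continued line.
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "R1 n1 n2\n+ 10k\nC1 n2 0 1p\n";
close $fh;

my $spice = Text::Parser::Spice->new();
$spice->read($file);
print $spice->get_records;
```

Because the role cannot be instantiated by itself, it avoids giving the impression that a line-joining object is useful on its own.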

METHODS

new

Takes no arguments. Returns a blessed reference to the object.

my $parser = Text::Parser->new();

This $parser variable will be used in examples below.

read

Takes zero or one argument, which could be a string containing the name of the file, a filehandle reference, or a GLOB (e.g. \*STDIN). Throws an exception if the filename provided does not exist or cannot be read for any reason. This method will also throw an exception if the argument supplied is a filehandle reference that happens to be opened for write instead of read.

$parser->read($filename);

# The above is equivalent to the following
$parser->filename($filename);
$parser->read();

# You can also read from a previously opened file handle directly
$parser->filehandle(\*STDIN);
$parser->read();

Returns once all records have been read or if an exception is thrown for any parsing errors, or if reading has been aborted with the abort_reading method. This function will handle all open and close operations on all files even if any exception is thrown, or if the reading has been aborted.

Once the method has successfully completed, you can parse another file. This means that your parser object is not tied to the file you parse. But when you read a new file or input stream with this read method, you will lose all the records stored from the previous read operation. So unless you don't care about the records from the last file you read, you should use the get_records method to retrieve them before parsing a new file. Each call to read in the example above was a separate parse, and each successive call overwrote the records from the previous call.

$parser->read($file1);
my (@records) = $parser->get_records();

$parser->read(\*STDIN);
my (@stdin) = $parser->get_records();

Inheritance Recommendation: When inheriting this class (which is what you should do if you want to write a parser for your favorite text file format), don't override this method. Override save_record instead.

Note: Reading from output file handles is a weird thing. Some Operating Systems allow this and some don't. Read the documentation for filehandle for more on this.

filename

Takes zero or one string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.

print "Last read ", $parser->filename, "\n";

filehandle

Takes zero or one GLOB argument and saves it for a future read call. Returns the filehandle last saved, or undef if none was saved. Remember that after a successful read call, filehandles are lost.

my $fh = $parser->filehandle();

Note: There is a check to ensure one is not supplying a write-only filehandle. For example, if you specify the filehandle of a write-only file, or if the file is opened for write and you cannot read from it, then this method will throw an exception. This is where Operating Systems can differ: on some of them you can actually read from a file that was opened for writing. Because the behavior of this check is heavily dependent on the Operating System, don't rely on it to catch every issue.
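
As a small sketch of the setter form, the script below opens a file for reading itself, hands the lexical filehandle to the parser, and then calls read with no arguments. It assumes that a lexical filehandle from open (which is a GLOB reference) is accepted in the same way as \*STDIN; the temporary file and its content are made up for illustration.

```perl
use strict;
use warnings;
use Text::Parser;
use File::Temp qw(tempfile);

# Made-up sample input written to a temporary file.
my ($out, $file) = tempfile(UNLINK => 1);
print {$out} "first line\nsecond line\n";
close $out;

# Open the file for reading ourselves and pass the handle to the parser.
open my $in, '<', $file or die "Cannot open $file: $!";

my $parser = Text::Parser->new();
$parser->filehandle($in);
$parser->read();
print $parser->lines_parsed, " lines were parsed\n";
```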

lines_parsed

Takes no arguments. Returns the number of lines last parsed.

print $parser->lines_parsed, " lines were parsed\n";

This is also very useful for error message generation.

save_record

Takes exactly one argument, which is saved as a record. If more than one argument is passed, everything after the first argument is ignored. If no arguments are passed, then undef is stored as a record.

In an application that uses a text parser, you will most likely never call this method directly. It is automatically called within read for each line. In this base class Text::Parser, save_record is simply called with a string containing the raw line of text; the line of text will not be chomped or modified in any way. Derived classes can decide to store records in a different form. A derived class could, for example, store the records in the form of hash references (so that when you use get_records, you'd get an array of hashes), or maybe even array references (so that when you use get_records, you'd get an array of arrays). See the EXAMPLES section for how save_record can be overridden by derived classes.
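
To illustrate the hash-reference case, here is a minimal sketch of a derived class that stores each "key = value" line as a hash. The class name Text::Parser::KeyValue, the key names key and value, and the sample input are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical parser for "key = value" lines; each record is a hash reference.
package Text::Parser::KeyValue;
use parent 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    # Split on the first '=' only, so values may themselves contain '='.
    my ($key, $value) = split /\s*=\s*/, $line, 2;
    $self->SUPER::save_record({ key => $key, value => $value });
}

package main;
use File::Temp qw(tempfile);

# Made-up sample input.
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "name = Text::Parser\nversion = 0.700\n";
close $fh;

my $kv = Text::Parser::KeyValue->new();
$kv->read($file);
foreach my $rec ($kv->get_records) {
    print "$rec->{key} => $rec->{value}\n";
}
```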

abort_reading

This method will be useful if a derived class wants to stop reading a file after it has read all the desired information. For example:

package Text::Parser::SomeFile;
use parent 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    my ($leading, $rest) = split /\s+/, $line, 2;
    return $self->abort_reading() if $leading eq '**ABORT';
    return $self->SUPER::save_record($line);
}

In this derived class, we have a parser Text::Parser::SomeFile that saves each line as a record, but aborts reading the rest of the file as soon as it reaches a line with **ABORT as the first word. Suppose this parser is given the following file as input:

somefile.txt:

Some text is here.
More text here.
**ABORT reading
This text is not read
This text is not read
This text is not read
This text is not read

You can now write a program as follows:

use Text::Parser::SomeFile;

my $par = Text::Parser::SomeFile->new();
$par->read('somefile.txt');
print $par->get_records(), "\n";

The output will be:

Some text is here.
More text here.

get_records

Takes no arguments. Returns an array containing all the records that were read by the parser.

my $i = 0;    # record counter
foreach my $record ( $parser->get_records ) {
    $i++;
    print "Record: $i: ", $record, "\n";
}

last_record

Takes no arguments and returns the last saved record. Leaves the saved records untouched.

my $last_rec = $parser->last_record;

pop_record

Takes no arguments and pops the last saved record.

my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);

BUGS

Please report any bugs or feature requests on the bugtracker website http://rt.cpan.org/Public/Dist/Display.html?Name=Text-Parser or by email to bug-text-parser at rt.cpan.org.

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.