NAME
Text::Parser - Bundles common text reading tasks. Stop re-writing mundane code to parse your next text file. This module supersedes the older and now defunct TextFileParser.
VERSION
version 0.751
SYNOPSIS
use Text::Parser;
my $parser = Text::Parser->new();
$parser->read(shift @ARGV);
print $parser->get_records, "\n";
The above code reads the first command-line argument as a string and, assuming it is the name of a text file, prints the content of the file to STDOUT. If the string is not the name of a text file, it throws an exception and exits.
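If you would rather report the failure than let the program exit, the exception can be trapped with a plain eval. This is only a minimal sketch: the file name is hypothetical and assumed not to exist, and the exact wording of the exception message is not specified here.

```perl
use strict;
use warnings;
use Text::Parser;

my $parser = Text::Parser->new();
my $file   = 'no-such-file.txt';    # hypothetical name, assumed not to exist

# read() throws if the file cannot be read, so trap the exception with eval.
my $ok = eval { $parser->read($file); 1 };
if ($ok) {
    print $parser->get_records, "\n";
}
else {
    warn "Could not read '$file': $@\n";
}
```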
RATIONALE
A simple text file parser should only have to specify the "grammar" of the format it intends to read, in the form of a few routines. Everything else, like opening a file handle, reading line by line, or tracking how many lines have been read, should be "automatic". Unfortunately, that's not how most programs seem to work. Most programmers spend (waste?) time writing code that calls open, close, etc., and must keep track of things that should have been simple features of every text file parser. And if they have to read multiple files, the calls to open, close, and so on are usually repeated, along with the checks for readability. This is an utter waste of time.
Text::Parser does all mundane operations like open file, close file, line-count, and storage/deletion/retrieval of records, etc. You don't have to bother with all that when you write a parser for your favorite text file format. Instead you can usually override just one method (save_record) and voila! you have a parser. Look at these examples to see how easy this can be.
Once you have used read to successfully parse one input, you can parse another. This means that your parser object is not "married" to what you parse; it just parses and collects records, which the rest of your program must then process as it needs. This allows you the flexibility to potentially write multi-threaded code: you can parse your input in one thread and create data structures in another thread.
DESCRIPTION
Text::Parser is a bare-bones text parsing class. It is actually ignorant of the text format, and cannot recognize any grammars, but derived classes that inherit from it can specify this. They can do this by overriding some of the methods in this class.
Future versions are also expected to include progress-bar support. All these software features are text-format independent and can be re-used in parsing any text format. Thus derived classes of Text::Parser will be able to take advantage of these features without having to rewrite that code.
At present this class handles files as input. You can give either filenames or filehandles (GLOBs) as input to the parser. In the future the class may include the ability to read from other input sources. This will be especially useful if you have a series of files/sockets to read from.
METHODS
new
Takes no arguments. Returns a blessed reference of the object.
my $parser = Text::Parser->new();
This $parser variable will be used in examples below.
read
Takes zero or one argument, which can be a string containing the name of a file, or a filehandle reference (a GLOB, e.g. \*STDIN). Throws an exception if the file name or GLOB provided does not exist or cannot be read for any reason.
Note: If you provide the GLOB of a file opened for writing, some operating systems allow reading from it too, and some don't. Read the documentation for filehandle for more on this.
$parser->read($filename);
# The above is equivalent to the following
$parser->filename($filename);
$parser->read();
# You can also read from a previously opened file handle directly
$parser->filehandle(\*STDIN);
$parser->read();
Returns once all records have been read, or if an exception is thrown for any parsing errors, or if reading has been aborted with the abort_reading method.
If you provide a string file name as input, the function will handle all open and close operations on files, even if an exception is thrown or the reading has been aborted. But if you pass a file handle GLOB instead, then the file handle won't be closed, and it will be the responsibility of the calling program to close the filehandle.
$parser->read('myfile.txt'); # Will handle open, parsing, and closing of file automatically.
open MYFH, "<myfile.txt" or die "Can't open file myfile.txt: $!";
$parser->read(\*MYFH); # Will not close MYFH; closing it is the responsibility of the caller.
close MYFH;
When you read a new file or input stream with this method, you lose all the records stored from the previous read operation. So if you want to read a different file with the same parser object, you should first use the get_records method to retrieve the records already read (unless you don't care about the records from the last file you read). Each call to read in the examples above parsed a different input, and each successive call overwrote the records from the previous call.
$parser->read($file1);
my (@records) = $parser->get_records();
$parser->read(\*STDIN);
my (@stdin) = $parser->get_records();
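When several inputs are parsed with one parser object, one way to keep everything is to copy the records out after each read. The following is a sketch under stated assumptions: the two temporary files and their contents are made up for illustration, and %records_for is a hypothetical name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Create two small throwaway input files (contents are made up).
my %content = ( first => "a\nb\n", second => "c\n" );
my %path;
for my $name ( sort keys %content ) {
    my ( $fh, $file ) = tempfile( UNLINK => 1 );
    print {$fh} $content{$name};
    close $fh or die "Cannot close $file: $!";
    $path{$name} = $file;
}

my $parser = Text::Parser->new();
my %records_for;
for my $name ( sort keys %path ) {
    $parser->read( $path{$name} );

    # Copy the records out now; the next read() discards them.
    $records_for{$name} = [ $parser->get_records ];
}

printf "%s: %d record(s)\n", $_, scalar @{ $records_for{$_} }
    for sort keys %records_for;
```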
Inheritance Recommendation: When inheriting this class (which is what you should do if you want to write a parser for your favorite text file format), don't override this method. Override save_record instead.
Future Enhancement
At present the read method accepts only two argument types: a file name or a file handle. In the near future this may be enhanced to read from sockets, subroutines, or even just a block of memory (a string reference). Suggestions for other forms of input are welcome.
filename
Takes zero or one string argument containing the name of a file. Returns the name of the file that was last opened, if any. Returns undef if no file has been opened.
print "Last read ", $parser->filename, "\n";
The file name is "persistent" in the object. That is, even after you have called read once, it still remembers the file name. So you can do this:
$parser->read(shift @ARGV);
print $parser->filename(), ":\n", "=" x (length($parser->filename())+1), "\n", $parser->get_records(), "\n";
But if you do a read with a filehandle as argument, you'll see that the last filename is lost, which makes sense.
$parser->read(\*MYFH);
print "Last file name is lost\n" if not defined $parser->filename();
filehandle
Takes zero or one GLOB argument and saves it for a future read call. Returns the filehandle last saved, or undef if none was saved. Remember that after a successful read call, filehandles are lost.
my $fh = $parser->filehandle();
Note: There is a check to ensure that you do not supply a write-only filehandle, for example the filehandle of a file that is opened for writing and cannot be read from. The odd thing is that some of the standard filehandles like STDOUT don't behave uniformly across all platforms. On most POSIX platforms, STDOUT is readable. On such platforms you will not get any exception if you try to do this:
$parser->filehandle(\*STDOUT); ## Works on many POSIX platforms
## Throws exception on others
As with the filename method, if you call read with a filehandle and then call read again, this time with a file name, the last filehandle is lost.
my $lastfh = $parser->filehandle();
## Will return \*STDOUT
$parser->read('another.txt');
print "No filehandle saved any more\n" if not defined $parser->filehandle();
lines_parsed
Takes no arguments. Returns the number of lines last parsed. A line is counted when the \n character is encountered.
print $parser->lines_parsed, " lines were parsed\n";
This is also very useful for error message generation.
Again, this information is "persistent": after a read operation, you can call this method to get the number of lines parsed. You can also be assured that every time you call read, the line count will always be reset first.
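The reset behaviour can be observed by reading two inputs of different lengths with the same parser. This is a minimal sketch: temp_file_with is a hypothetical helper, and the file contents are made up. If the count is reset on each read as described, it should report three lines for the first input and one for the second.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical helper: write the given text to a throwaway file, return its path.
sub temp_file_with {
    my ($text) = @_;
    my ( $fh, $file ) = tempfile( UNLINK => 1 );
    print {$fh} $text;
    close $fh or die "Cannot close $file: $!";
    return $file;
}

my $parser = Text::Parser->new();
my @counts;
for my $text ( "one\ntwo\nthree\n", "only line\n" ) {
    $parser->read( temp_file_with($text) );
    push @counts, $parser->lines_parsed;    # count for this read only
}
print "Line counts per read: @counts\n";
```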
save_record
Takes exactly one argument, which is saved as a record. Additional arguments are ignored. If no arguments are passed, then undef is stored as a record.
In an application that uses a text parser, you will most likely never call this method directly. It is automatically called within read for each line. In this base class Text::Parser, save_record is simply called with a string containing the raw line of text; i.e. the line of text will not be chomped or modified in any way. The EXAMPLES section has a basic example.
Derived classes can decide to store records in a different form. A derived class could, for example, store the records in the form of hash references (so that when you use get_records, you'd get an array of hashes), or maybe even array references (so that when you use get_records, you'd get an array of arrays). The CSV parser example does the latter.
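For the hash-reference case, a derived class might look like the sketch below. My::ConfigParser, the key=value input format, and the hash keys are all assumptions made for illustration; they are not part of this distribution.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

package My::ConfigParser {    # hypothetical class name
    use parent 'Text::Parser';

    # Store each "key=value" line as a hash reference instead of raw text.
    sub save_record {
        my ( $self, $line ) = @_;
        chomp $line;
        my ( $key, $value ) = split /=/, $line, 2;
        return $self->SUPER::save_record( { key => $key, value => $value } );
    }
}

# Made-up input written to a throwaway file.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} "host=localhost\nport=8080\n";
close $fh or die "Cannot close $file: $!";

my $parser = My::ConfigParser->new();
$parser->read($file);
my @records = $parser->get_records;
print "$_->{key} -> $_->{value}\n" for @records;
```

The calling code then works with structured records and never has to split lines itself.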
abort_reading
Takes no arguments. Returns 1. You will probably never call this method in your main program.
This method is usually used only in the derived class. See Example 3 in the EXAMPLES section.
get_records
Takes no arguments. Returns an array containing all the records that were read by the parser.
my $i = 0;
foreach my $record ( $parser->get_records ) {
    $i++;
    print "Record: $i: ", $record, "\n";
}
last_record
Takes no arguments and returns the last saved record. Leaves the saved records untouched.
my $last_rec = $parser->last_record;
pop_record
Takes no arguments and pops the last saved record.
my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);
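The difference between last_record and pop_record shows up in the number of records left afterwards. A small sketch, assuming a made-up three-line input file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Made-up three-line input written to a throwaway file.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} "first\nsecond\nthird\n";
close $fh or die "Cannot close $file: $!";

my $parser = Text::Parser->new();
$parser->read($file);

my $peeked     = $parser->last_record;    # looks at the last record only
my @after_peek = $parser->get_records;

my $popped    = $parser->pop_record;      # removes the last record
my @after_pop = $parser->get_records;

printf "%d record(s) after last_record, %d after pop_record\n",
    scalar @after_peek, scalar @after_pop;
```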
EXAMPLES
The following examples should illustrate the use of inheritance to parse various types of text file formats.
Basic principle
Derived classes simply need to override one method: save_record. With the help of that, any arbitrary file format can be read. save_record should interpret the format of the text and store it in some form by calling SUPER::save_record. The main:: program will then use the records and create an appropriate data structure with them.
Notice that the creation of a data structure is not the objective of a parser. It is simply concerned with collecting data and arranging it in a form that can be used. That's all. Data structures can be created by a different part of your program using the data collected by your parser.
Example 1 : A simple CSV Parser
We will write a parser for a simple CSV file that reads each line and stores the records as array references.
package Text::Parser::CSV;
use parent 'Text::Parser';
sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my (@fields) = split /,/, $line;
    $self->SUPER::save_record(\@fields);
}
That's it! Now in main:: you can write something like this:
use Text::Parser::CSV;
my $csvp = Text::Parser::CSV->new();
$csvp->read(shift @ARGV);
foreach my $aref ($csvp->get_records) {
    my (@arr) = @{$aref};
    print "@arr\n";
}
The above program reads the content of a given CSV file and prints the content out in space-separated form.
Example 2 : Error checking
It is easy to add error checks using exceptions. One of the easiest ways to do this is to use Exception::Class. We'll modify the CSV parser above to demonstrate that.
package Text::Parser::CSV;
use Exception::Class (
    'Text::Parser::CSV::Error',
    'Text::Parser::CSV::TooManyFields' => {
        isa => 'Text::Parser::CSV::Error',
    },
);
use parent 'Text::Parser';
sub save_record {
    my ($self, $line) = @_;
    chomp $line;
    my (@fields) = split /,/, $line;
    $self->{__csv_header} = \@fields if not scalar($self->get_records);
    Text::Parser::CSV::TooManyFields->throw(error => "Too many fields on line #" . $self->lines_parsed)
        if scalar(@fields) > scalar(@{$self->{__csv_header}});
    $self->SUPER::save_record(\@fields);
}
The Text::Parser class will close all filehandles automatically as soon as an exception is thrown from save_record. You can catch the exception in main:: as you would normally, using Try::Tiny or a similar module.
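Catching such an exception in main:: could look like the sketch below. To keep it self-contained, the parser is re-declared inline under the hypothetical name My::CSV, with the same logic as the class above. The input data is made up, and the sketch assumes that the exception thrown from save_record reaches the caller of read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Scalar::Util qw(blessed);
use Try::Tiny;

package My::CSV {    # hypothetical name; same logic as the CSV class above
    use Exception::Class (
        'My::CSV::Error',
        'My::CSV::TooManyFields' => { isa => 'My::CSV::Error' },
    );
    use parent 'Text::Parser';

    sub save_record {
        my ( $self, $line ) = @_;
        chomp $line;
        my @fields = split /,/, $line;
        $self->{__csv_header} = \@fields if not scalar( $self->get_records );
        My::CSV::TooManyFields->throw(
            error => "Too many fields on line #" . $self->lines_parsed )
            if @fields > @{ $self->{__csv_header} };
        return $self->SUPER::save_record( \@fields );
    }
}

# Made-up input: the third line has one field more than the header.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} "id,name\n1,alice\n2,bob,extra\n";
close $fh or die "Cannot close $file: $!";

my $csvp    = My::CSV->new();
my $message = try {
    $csvp->read($file);
    'parsed without errors';
}
catch {
    my $err = $_;
    blessed($err) && $err->isa('My::CSV::TooManyFields')
        ? 'bad CSV input: ' . $err->error
        : "unexpected failure: $err";
};
print "$message\n";
```

Checking the exception class before using it lets the caller tell format errors apart from other failures, such as an unreadable file.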
Example 3 : Aborting without errors
We can also abort parsing a text file without throwing an exception. This could be if we got the information we needed. For example:
package Text::Parser::SomeFile;
use parent 'Text::Parser';
sub save_record {
    my ($self, $line) = @_;
    my ($leading, $rest) = split /\s+/, $line, 2;
    return $self->abort_reading() if $leading eq '**ABORT';
    return $self->SUPER::save_record($line);
}
In this derived class, we have a parser Text::Parser::SomeFile that saves each line as a record, but aborts reading the rest of the file as soon as it reaches a line with **ABORT as the first word. Suppose this parser is given the following file as input:
somefile.txt:
Some text is here.
More text here.
**ABORT reading
This text is not read
This text is not read
This text is not read
This text is not read
You can now write a program as follows:
use Text::Parser::SomeFile;
my $par = Text::Parser::SomeFile->new();
$par->read('somefile.txt');
print $par->get_records(), "\n";
The output will be:
Some text is here.
More text here.
BUGS
Please report any bugs or feature requests on the bugtracker website http://rt.cpan.org/Public/Dist/Display.html?Name=Text-Parser or by email to bug-text-parser at rt.cpan.org.
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.