NAME

Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.

VERSION

version 0.920

SYNOPSIS

use Text::Parser;

my $parser = Text::Parser->new();
$parser->read(shift);
print $parser->get_records, "\n";

The above code reads the first command-line argument as a string, and assuming it is the name of a text file, it will print the content of the file to STDOUT. If the string is not the name of a text file it will throw an exception and exit.

package MyParser;

use Moose;
extends 'Text::Parser';
# use parent 'Text::Parser'; 
# This will also work, but the Moose based class may be easier to implement

sub save_record {
    my $self = shift;
    ## ...
}

package main;

my $parser = MyParser->new(auto_split => 1, auto_chomp => 1, auto_trim => 'b');
$parser->read(shift);
foreach my $rec ($parser->get_records) {
    ## ...
}

The above example shows how Text::Parser could be easily extended to parse a specific text format.

RATIONALE

Text parsing is perhaps the single most common thing that almost every Perl program does. Yet we don't have a lean, flexible, text parsing utility. Ideally, the developer should only have to specify the "grammar" of the text file she intends to parse. Everything else, like opening a file handle, closing the file handle, tracking line-count, joining continued lines into one, reporting any errors in line continuation, trimming white space, splitting each line into fields, etc., should be automatic.

Unfortunately, most file-parsing code looks like this:

open FH, "<$fname";
my $line_count = 0;
while (<FH>) {
    $line_count++;
    chomp;
    $_ = trim $_;  ## From String::Util
    my (@fields) = split /\s+/;
    # do something for each line ...
}
close FH;

Note that a developer may have to repeat all of the above if she has to read another file with different content or format. And if the target text format allows line-wrapping with a continuation character, it isn't easy to implement it well with this while loop.

With Text::Parser, developers can focus on specifying the grammar and simply use the read method. Just extend this class with extends (or inherit using the parent pragma), and override one method (save_record). Voilà! You have a parser. These examples illustrate how easy this can be.

OVERVIEW

Text::Parser is a format-agnostic text parsing base class. Derived classes can specify the format-specific syntax they intend to parse.

CONSTRUCTOR

new

Takes optional attributes, as in the example below. See the ATTRIBUTES section for a list of the attributes and their descriptions.

my $parser = Text::Parser->new(
    auto_chomp      => 0,
    multiline_type  => 'join_last',
    auto_trim       => 'b',
    auto_split      => 1,
    FS              => qr/\s+/,
);

ATTRIBUTES

The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.

auto_chomp

Read-write attribute. Takes a boolean value as parameter. Defaults to 0.

print "Parser will chomp lines automatically\n" if $parser->auto_chomp;

auto_split

Read-write boolean attribute. Defaults to 0 (false). Indicates if the parser will automatically split every line into fields.

If it is set to a true value, each line will be split into fields, and a set of methods (a quick list here) become accessible within the save_record method. These methods are documented in Text::Parser::AutoSplit.

auto_trim

Read-write attribute. It takes one of four values: 'l' (trim on the left), 'r' (trim on the right), 'b' (trim on both sides), or 'n' (trim on neither side). Defaults to 'n'.

$parser->auto_trim('l');       # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)
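
The four modes can be pictured with a few lines of plain Perl. This is only an illustrative sketch: trim_by_mode is a hypothetical helper written for this example, not a function of Text::Parser, and it assumes that "spaces" means any leading or trailing whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): apply one auto_trim mode.
sub trim_by_mode {
    my ( $mode, $str ) = @_;
    $str =~ s/^\s+// if $mode eq 'l' || $mode eq 'b';    # left or both
    $str =~ s/\s+$// if $mode eq 'r' || $mode eq 'b';    # right or both
    return $str;                                         # 'n' changes nothing
}

# Show the effect of each mode on the same made-up input.
for my $mode (qw(l r b n)) {
    printf "%s => [%s]\n", $mode, trim_by_mode( $mode, '  some text  ' );
}
```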

FS

Read-write attribute that specifies the field separator used when the auto_split attribute is on. It must be a compiled regular expression created with the qr operator, like qr/\s+|[,]/, which splits on either whitespace or commas. The default value is qr/\s+/.

The name for this attribute comes from the built-in FS variable in the popular GNU Awk program.

$parser->FS( qr/\s+\(*|\s*\)/ );

FS can be changed in your implementation of save_record. But the changes would take effect only on the next line.
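
To get a feel for what a given separator does, it can be tried with Perl's built-in split outside the parser. The sketch below assumes the parser's splitting behaves like a plain split on the FS pattern; the sample strings are made up.

```perl
use strict;
use warnings;

my $fs   = qr/\s+|[,]/;           # the separator shown in the text above
my $line = 'alpha,beta gamma';    # made-up sample input

my @fields = split $fs, $line;
print scalar(@fields), " fields: @fields\n";

# A comma followed by a space matches the pattern twice (once for the
# comma, once for the space), which leaves an empty field in between.
my @gap = split $fs, 'a, b';
print scalar(@gap), " fields, the middle one is empty\n";
```

If comma-plus-space should count as a single separator, a pattern such as qr/\s*,\s*|\s+/ avoids the empty field.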

multiline_type

If the target text format allows line-wrapping with a continuation character, the multiline_type option tells the parser to join the wrapped lines into a single line. When setting this attribute, one must re-define two more methods. See these examples.

By default, the read-write multiline_type attribute has a value of undef, i.e., the target text format will not have wrapped lines. It can be set to either 'join_next' or 'join_last'.

$parser->multiline_type(undef);
$parser->multiline_type('join_next');

my $mult = $parser->multiline_type;
print "Parser is a multi-line parser of type: $mult" if defined $mult;

  • If the target format allows line-wrapping to the next line, set multiline_type to join_next. This example illustrates this case.

  • If the target format allows line-wrapping from the last line, set multiline_type to join_last. This simple SPICE line-joiner illustrates this case.

  • To "slurp" a file into a single string, set multiline_type to join_last. In this special case, you don't need to re-define the is_line_continued and join_last_line methods. See this trivial line-joiner example.
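
The difference between the two types can be sketched in plain Perl, independent of the module. Both helpers below are hypothetical and written only for illustration; they assume a trailing backslash as the join_next marker and a leading + as the join_last marker, and they are not how Text::Parser implements joining.

```perl
use strict;
use warnings;

# Hypothetical helper: 'join_next' style.
# A line ending in a backslash continues on the line after it.
sub join_next_lines {
    my @lines = @_;
    my ( @records, $pending );
    for my $line (@lines) {
        $line = $pending . $line if defined $pending;
        if ( $line =~ s/\\\s*$/ / ) {    # strip the marker and keep waiting
            $pending = $line;
            next;
        }
        undef $pending;
        push @records, $line;
    }
    die "Last line is continued, but input ended\n" if defined $pending;
    return @records;
}

# Hypothetical helper: 'join_last' style.
# A line starting with '+' continues the line before it.
sub join_last_lines {
    my @records;
    for my $line (@_) {
        if ( @records && $line =~ /^[+]\s*(.*)$/s ) {
            $records[-1] .= " $1";
        }
        else {
            push @records, $line;
        }
    }
    return @records;
}

my @next = join_next_lines( 'Garbage In.\\', 'Garbage Out!' );
my @last = join_last_lines( 'R1 a b', '+ 10k', 'C1 c d' );
print "$_\n" for @next, @last;
```

With join_next the decision is made on the current line (does it continue?), whereas with join_last it is made on the following line (does it attach to the previous one?).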

METHODS

These are meant to be called from the main:: program or within subclasses. In general, don't override them; just use them.

read

Takes a single optional argument that can be either a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class.

$parser->read($filename);         # Read the file
$parser->read(\*STDIN);           # Read the filehandle

The above could also be done in two steps if the developer so chooses.

$parser->filename($filename);
$parser->read();                  # equiv: $parser->read($filename)

$parser->filehandle(\*STDIN);
$parser->read();                  # equiv: $parser->read(\*STDIN)

The method returns once all records have been read, or if an exception is thrown, or if reading has been aborted with the abort_reading method.

When read is called with a file name, the parser closes the file automatically, even if an exception is thrown. When read is called with a filehandle or GLOB, the parser does not close it; the caller must do so.

$parser->read('myfile.txt');      # Will close file automatically

open MYFH, "<myfile.txt" or die "Can't open file myfile.txt: $!";
$parser->read(\*MYFH);            # Will not close MYFH
close MYFH;

Note: To extend the class to other text formats, override save_record.

filename

Takes an optional string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.

print "Last read ", $parser->filename, "\n";

The value stored is "persistent" - meaning that the method remembers the last file that was read.

$parser->read(shift @ARGV);
print $parser->filename(), ":\n",
      "=" x (length($parser->filename())+1),
      "\n",
      $parser->get_records(),
      "\n";

A read call with a filehandle will clear the last file name.

$parser->read(\*MYFH);
print "Last file name is lost\n" if not defined $parser->filename();

filehandle

Takes an optional argument, that is a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. Returns the filehandle last saved, or undef if none was saved.

my $fh = $parser->filehandle();

Like filename, filehandle is also "persistent". Its old value is lost when either filename is set, or read is called with a filename.

$parser->read(\*STDIN);
my $lastfh = $parser->filehandle();          # Will return glob of STDIN

lines_parsed

Takes no arguments. Returns the number of lines last parsed. Every call to read resets the value automatically.

print $parser->lines_parsed, " lines were parsed\n";

has_aborted

Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.

print "Aborted\n" if $parser->has_aborted();

get_records

Takes no arguments. Returns an array containing all the records saved by the parser.

my $i = 0;
foreach my $record ( $parser->get_records ) {
    $i++;
    print "Record: $i: ", $record, "\n";
}

pop_record

Takes no arguments and pops the last saved record.

my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);

last_record

Takes no arguments and returns the last saved record. Leaves the saved records untouched.

my $last_rec = $parser->last_record;

FOR USE IN SUBCLASS ONLY

Do NOT override these methods. They are valid only within a subclass, inside the user-implementation of methods described under OVERRIDE IN SUBCLASS.

this_line

Takes no arguments, and returns the current line being parsed. For example:

sub save_record {
    my $self = shift;
    # ...
    do_something($self->this_line);
    # ...
}

abort_reading

Takes no arguments. Returns 1. To be used only in the derived class to abort read in the middle. See this example.

sub save_record {
    my $self = shift;
    # ...
    $self->abort_reading if some_condition($self->this_line);
    # ...
}

push_records

This is useful if one needs to implement an include-like command in some text format. The example below illustrates this.

package OneParser;
use Moose;
extends 'Text::Parser';

sub save_record {
    my $self = shift;
    # ...
    # Under some condition:
    my $parser = AnotherParser->new();
    $parser->read($some_file);
    # Add the records read by the other parser to this parser's records
    $self->push_records($parser->get_records);
    # ...
}

Other methods available on auto_split

When the auto_split attribute is on (or is turned on later), additional methods become available. They are documented in Text::Parser::AutoSplit.

OVERRIDE IN SUBCLASS

The following methods should never be called in the main:: program. They are meant to be overridden (or re-defined) in a subclass.

save_record

This method should be re-defined in a subclass to parse the target text format. To save a record, the re-defined implementation in the derived class must call SUPER::save_record (or super if you're using Moose) with exactly one argument as a record. If no arguments are passed, undef is stored as a record.

For a developer re-defining save_record, in addition to this_line, six additional methods become available if the auto_split attribute is set. These methods are described in greater detail in Text::Parser::AutoSplit, and they are accessible only within save_record.

Note: Developers may store records in any form - string, array reference, hash reference, complex data structure, or an object of some class. The program that reads these records using get_records has to interpret them. So developers should document the records created by their own implementation of save_record.

FOR MULTI-LINE TEXT PARSING

These methods need to be re-defined only by multiline derived classes, i.e., if the target text format allows wrapping the content of one line into multiple lines. In most cases, you should re-define both methods. As usual, the this_line method may be used while re-defining them.

is_line_continued

This takes a string argument and returns a boolean indicating if the line is continued or not. See Text::Parser::Multiline for more on this.

The return values of the default method provided with this class are:

multiline_type    |    Return value
------------------+---------------------------------
undef             |         0
join_last         |    0 for first line, 1 otherwise
join_next         |         1

join_last_line

This method takes two strings, joins them while removing any continuation characters, and returns the result. The default implementation just concatenates two strings and returns the result without removing anything (not even chomp). See Text::Parser::Multiline for more on this.

EXAMPLES

Example 1 : A simple CSV Parser

We will write a parser for a simple CSV file that reads each line and stores the records as array references. This example is oversimplified, and does not handle embedded newlines.

package Text::Parser::CSV;
use Moose;
extends 'Text::Parser';
use Text::CSV;

my $csv;
sub save_record {
    my ($self, $line) = @_;
    $csv //= Text::CSV->new({ binary => 1, auto_diag => 1});
    $csv->parse($line);
    $self->SUPER::save_record([$csv->fields]);
}

That's it! Now in main:: you can write something like this:

use Text::Parser::CSV;

my $csvp = Text::Parser::CSV->new();
$csvp->read(shift @ARGV);
foreach my $aref ($csvp->get_records) {
    my (@arr) = @{$aref};
    print "@arr\n";
}

The above program reads the content of a given CSV file and prints the content out in space-separated form.

Example 2 : Error checking

Note: Read the documentation for Exceptions to learn about creating, throwing, and catching exceptions in Perl 5. All of the methods of creating, throwing, and catching exceptions described in Exceptions are supported.

You can throw exceptions from save_record in your subclass, for example, when you detect a syntax error. The read method will close all filehandles automatically as soon as an exception is thrown. The exception will pass through to main:: unless you catch and handle it in your derived class.

Here is an example showing the use of an exception to detect a syntax error in a file:

package My::Text::Parser;
use Exception::Class (
    'My::Text::Parser::SyntaxError' => {
        description => 'syntax error',
        alias => 'throw_syntax_error', 
    },
);

use Moose;
extends 'Text::Parser';

sub save_record {
    my ($self, $line) = @_;
    throw_syntax_error(error => 'syntax error') if _syntax_error($line);
    $self->SUPER::save_record($line);
}

Example 3 : Aborting without errors

We can also abort parsing a text file without throwing an exception, for instance once we have the information we need. For example:

package SomeParser;
use Moose;
extends 'Text::Parser';

sub BUILDARGS {
    my $pkg = shift;
    return {auto_split => 1};
}

sub save_record {
    my ($self, $line) = @_;
    return $self->abort_reading() if $self->field(0) eq '**ABORT';
    return $self->SUPER::save_record($line);
}

The SomeParser class above saves each line as a record, but aborts reading the rest of the file as soon as it reaches a line with **ABORT as the first word. Suppose this parser is given the following file as input:

somefile.txt:

Some text is here.
More text here.
**ABORT reading
This text is not read
This text is not read
This text is not read
This text is not read

You can now write a program as follows:

use SomeParser;

my $par = SomeParser->new();
$par->read('somefile.txt');
print $par->get_records(), "\n";

The output will be:

Some text is here.
More text here.

Example 4 : Multi-line parsing

Some text formats allow users to split a line into several lines with a line continuation character (usually at the end or the beginning of a line).

Trivial line-joiner

Below is a trivial example where all lines are joined into one:

use strict;
use warnings;
use Text::Parser;

my $join_all = Text::Parser->new(auto_chomp => 1, multiline_type => 'join_last');
$join_all->read('input.txt');
print $join_all->get_records(), "\n";

Another trivial example is here.

Continue with character

(Pun intended! ;-))

In the above example, all lines are joined (indiscriminately). But most often text formats have a continuation character that specifies that the line continues to the next line, or that the line is a continuation of the previous line. Here's an example parser that treats the back-slash (\) character as a line-continuation character:

package MyMultilineParser;
use Moose;
extends 'Text::Parser';
use strict;
use warnings;

sub new {
    my $pkg = shift;
    $pkg->SUPER::new(multiline_type => 'join_next');
}

sub is_line_continued {
    my $self = shift;
    my $line = shift;
    chomp $line;
    return $line =~ /\\\s*$/;
}

sub join_last_line {
    my $self = shift;
    my ($last, $line) = (shift, shift);
    chomp $last;
    $last =~ s/\\\s*$/ /g;
    return $last . $line;
}

1;

In your main::

use MyMultilineParser;
use strict;
use warnings;

my $parser = MyMultilineParser->new();
$parser->read('multiline.txt');
print "Read:\n";
print $parser->get_records(), "\n";

Try with the following input multiline.txt:

Garbage In.\
Garbage Out!

When you run the above code with this file, you should get:

Read:
Garbage In. Garbage Out!

Simple SPICE line joiner

Some text formats allow a line to indicate that it is continuing from a previous line. For example SPICE has a continuation character (+) on the next line, indicating that the text on that line should be joined with the previous line. Let's show how to build a simple SPICE line-joiner. To build a full-fledged parser you will have to specify the rich and complex grammar for SPICE circuit description.

package TrivialSpiceJoin;
use Moose;
extends 'Text::Parser';

use constant {
    SPICE_LINE_CONTD => qr/^[+]\s*/,
    SPICE_END_FILE   => qr/^\.end/i,
};

sub new {
    my $pkg = shift;
    $pkg->SUPER::new(auto_chomp => 1, multiline_type => 'join_last');
}

sub is_line_continued {
    my ( $self, $line ) = @_;
    return 0 if not defined $line;
    return $line =~ SPICE_LINE_CONTD;
}

sub join_last_line {
    my ( $self, $last, $line ) = ( shift, shift, shift );
    return $last if not defined $line;
    $line =~ s/^[+]\s*/ /;
    return $line if not defined $last;
    return $last . $line;
}

sub save_record {
    my ( $self, $line ) = @_;
    return $self->abort_reading() if $line =~ SPICE_END_FILE;
    $self->SUPER::save_record($line);
}

Try this parser with a SPICE deck with continuation characters and see what you get. Try having errors in the file. You may now write a more elaborate method for save_record above and that could be used to parse a full SPICE file.

THINGS TO BE DONE

Future versions are expected to include:

  • progress-bar support

  • parsing from a buffer

  • automatically uncompress input

  • suggestions welcome ...

Interested contributors welcome.

SEE ALSO

BUGS

Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

CONTRIBUTORS

  • H.Merijn Brand - Tux <h.m.brand@xs4all.nl>

  • Mohammad S Anwar <mohammad.anwar@yahoo.com>