NAME

Text::Parser - Simplifies text parsing. Easily extensible to parse any text format.

VERSION

version 0.927

SYNOPSIS

use Text::Parser;

my $parser = Text::Parser->new();
$parser->read(shift);
print $parser->get_records, "\n";

The above code prints the content of the file (named in the first argument) to STDOUT.

my $parser = Text::Parser->new();
$parser->add_rule(do => 'print');
$parser->read(shift);

This example does the same as the earlier one. For more complex examples, see the manual.

OVERVIEW

The need for this class stems from the fact that text parsing is one of the most common things programmers do, and yet there is no lean, simple way to do it efficiently. Most programmers still write boilerplate code with a while loop.

Instead Text::Parser allows programmers to parse text with terse, self-explanatory rules, whose structure is very similar to AWK, but extends beyond the capability of AWK. Incidentally, AWK is one of the ancestors of Perl! One would have expected Perl to extend the capabilities of AWK, although that's not really the case. Command-line perl -lane or even perl -lan script.pl are very limited in what they can do. Programmers cannot use them for serious projects. And parsing text files in regular Perl involves writing the same while loop again. This website summarizes the options available in Perl so far.

With Text::Parser, a developer can focus on specifying a grammar and then simply read the file. The read method automatically runs each rule collecting records from the text input into an array internally. And finally get_records can retrieve the records. Thus the programmer now has the power of Perl to create complex data structures, along with the elegance of AWK to parse text files. The manuals illustrate this with examples.

CONSTRUCTOR

new

Takes optional attributes as in the example below. See section ATTRIBUTES for a list of the attributes and their descriptions.

my $parser = Text::Parser->new(
    auto_chomp      => 0,
    multiline_type  => 'join_last',
    auto_trim       => 'b',
    auto_split      => 1,
    FS              => qr/\s+/,
);

ATTRIBUTES

The attributes below can be used as options to the new constructor. Each attribute has an accessor with the same name.

auto_chomp

Read-write attribute. Takes a boolean value as parameter. Defaults to 0.

print "Parser will chomp lines automatically\n" if $parser->auto_chomp;

auto_split

Read-write boolean attribute. Defaults to 0 (false). Indicates if the parser will automatically split every line into fields.

If it is set to a true value, each line will be split into fields, and a set of methods becomes accessible within the save_record method. These methods are documented in Text::Parser::AutoSplit.

auto_trim

Read-write attribute. The values this can take are shown under the new constructor also. Defaults to 'n' (neither side spaces will be trimmed).

$parser->auto_trim('l');       # 'l' (left), 'r' (right), 'b' (both), 'n' (neither) (Default)
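
As a rough illustration of what the four modes mean, the plain-Perl sketch below applies each one to the same string. The helper trim_line is hypothetical and is not part of Text::Parser; it also assumes that trimming on the right removes a trailing newline along with other whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): applies one trim mode.
sub trim_line {
    my ($line, $mode) = @_;
    $line =~ s/^\s+// if $mode eq 'l' || $mode eq 'b';   # left or both
    $line =~ s/\s+$// if $mode eq 'r' || $mode eq 'b';   # right or both
    return $line;                                        # 'n' leaves it untouched
}

# Show the effect of every mode on the same input.
for my $mode (qw(n l r b)) {
    printf "%s => [%s]\n", $mode, trim_line('  some text  ', $mode);
}
```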

FS

Read-write attribute that can be used to specify the field separator to be used by the auto_split feature. It must be a regular expression reference enclosed in the qr function, like qr/\s+|[,]/ which will split across either spaces or commas. The default value for this argument is qr/\s+/.

The name for this attribute comes from the built-in FS variable in the popular GNU Awk program.

$parser->FS( qr/\s+\(*|\s*\)/ );

FS can be changed in your implementation of save_record. But the changes would take effect only on the next line.
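
To get a feel for the two separators quoted above, the sketch below passes them to Perl's built-in split. This assumes the auto_split feature divides a line the same way split does with that pattern; the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Split on runs of spaces or on a single comma.
my @by_space_or_comma = split qr/\s+|[,]/, 'alpha beta,gamma';

# Split on spaces (optionally followed by opening parentheses)
# or on a closing parenthesis.
my @by_parens = split qr/\s+\(*|\s*\)/, 'func (arg1 arg2)';

print join('|', @by_space_or_comma), "\n";   # alpha|beta|gamma
print join('|', @by_parens), "\n";           # func|arg1|arg2
```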

multiline_type

If the target text format allows line-wrapping with a continuation character, the multiline_type option tells the parser to join them into a single line. When setting this attribute, one must re-define two more methods.

By default, the read-write multiline_type attribute has a value of undef, i.e., the target text format will not have wrapped lines. It can be set to either 'join_next' or 'join_last'.

$parser->multiline_type(undef);
$parser->multiline_type('join_next');

my $mult = $parser->multiline_type;
print "Parser is a multi-line parser of type: $mult" if defined $mult;

  • If the target format allows line-wrapping to the next line, set multiline_type to join_next.

  • If the target format allows line-wrapping from the last line, set multiline_type to join_last.

  • To "slurp" a file into a single string, set multiline_type to join_last. In this special case, you don't need to re-define the is_line_continued and join_last_line methods.
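
The difference between the two styles can be sketched in plain Perl without the parser. The continuation markers used here (a trailing backslash for join_next, a leading '+' for join_last) and both helper functions are assumptions chosen for illustration; a real format defines its own markers.

```perl
use strict;
use warnings;

# join_next style: a trailing backslash (assumed marker) says
# "this line continues on the next one".
sub unwrap_join_next {
    my (@out, $pending);
    $pending = '';
    for my $line (@_) {
        if ($line =~ s/\\\s*$//) {          # marker found: keep accumulating
            $pending .= $line;
        } else {                            # no marker: the logical line ends
            push @out, $pending . $line;
            $pending = '';
        }
    }
    push @out, $pending if length $pending;
    return @out;
}

# join_last style: a leading '+' (assumed marker) says
# "this line continues the previous one".
sub unwrap_join_last {
    my @out;
    for my $line (@_) {
        if (@out and $line =~ /^\+\s*/) {
            (my $rest = $line) =~ s/^\+\s*/ /;
            $out[-1] .= $rest;              # append to the previous logical line
        } else {
            push @out, $line;
        }
    }
    return @out;
}

print "$_\n" for unwrap_join_next('a = 1 + \\', '2', 'b = 3');
print "$_\n" for unwrap_join_last('R1 n1 n2', '+ 10k', 'C1 n2 0');
```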

METHODS

These are meant to be called from the ::main program or within subclasses. In general, don't override them - just use them.

add_rule

Takes a hash as input. The keys of this hash must be the attributes of the Text::Parser::Rule class constructor and the values should also meet the requirements of that constructor.

$parser->add_rule(do => '', dont_record => 1);                        # Empty rule: does nothing
$parser->add_rule(if => 'm/li/', do => 'print', dont_record => 1);    # Prints lines with 'li'
$parser->add_rule(do => 'uc($3)');                                    # Saves records of upper-cased third elements

Calling this method without any arguments will throw an exception. The method internally sets the auto_split attribute.
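
For comparison, here is roughly what the last two rules above amount to when written by hand in plain Perl. This is only a sketch of the intent: it assumes whitespace-separated fields numbered from 1 as in AWK, and the sample lines are invented.

```perl
use strict;
use warnings;

my @lines = ("alice likes lime\n", "bob eats pie\n");

# Equivalent of: if => 'm/li/', do => 'print', dont_record => 1
my @with_li = grep { /li/ } @lines;
print @with_li;

# Equivalent of: do => 'uc($3)'
my @records;
for my $line (@lines) {
    chomp(my $copy = $line);
    my @field = split ' ', $copy;     # whitespace split, like the default FS
    push @records, uc $field[2];      # $3 in the rule is the third field
}
print "$_\n" for @records;
```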

clear_rules

Takes no arguments, returns nothing. Clears the rules that were added to the object.

$parser->clear_rules;

This is useful for re-using the parser after a read call, to parse another text with another set of rules. The clear_rules method also clears the rules set up by BEGIN_rule and END_rule.

BEGIN_rule

Takes a hash input like add_rule, but the if and continue_to_next keys will be ignored.

$parser->BEGIN_rule(do => '~count = 0;');

  • Since any if key is ignored, the do key is always evaluated. Multiple calls to BEGIN_rule will append to the previous calls; meaning, the actions of previous calls will be included.

  • The BEGIN block is mainly used to initialize some variables. So by default dont_record is set true. The user can override this by setting dont_record to false, which forces a record to be saved.

END_rule

Takes a hash input like add_rule, but the if and continue_to_next keys will be ignored. Similar to BEGIN_rule, but the actions in the END_rule will be executed at the end of the read method.

$parser->END_rule(do => 'print ~count, "\n";');

  • Since any if key is ignored, the do key is always evaluated. Multiple calls to END_rule will append to the previous calls; meaning, the actions of previous calls will be included.

  • The END block is mainly used to do final processing of collected records. So by default dont_record is set true. The user can override this by setting dont_record to false, which forces a record to be saved.

read

Takes a single optional argument that can be either a string containing the name of the file, or a filehandle reference (a GLOB) like \*STDIN or an object of the FileHandle class.

$parser->read($filename);         # Read the file
$parser->read(\*STDIN);           # Read the filehandle

The above could also be done in two steps if the developer so chooses.

$parser->filename($filename);
$parser->read();                  # equiv: $parser->read($filename)

$parser->filehandle(\*STDIN);
$parser->read();                  # equiv: $parser->read(\*STDIN)

The method returns once all records have been read, or if an exception is thrown, or if reading has been aborted with the abort_reading method.

The file is closed automatically (even if an exception is thrown) when read is called with a file name. When read is called with a filehandle or GLOB, the caller remains responsible for closing it.

$parser->read('myfile.txt');      # Will close file automatically

open MYFH, '<', 'myfile.txt' or die "Can't open file myfile.txt: $!";
$parser->read(\*MYFH);            # Will not close MYFH
close MYFH;

Note: To extend the class to other text formats, override save_record.

filename

Takes an optional string argument containing the name of a file. Returns the name of the file that was last opened if any. Returns undef if no file has been opened.

print "Last read ", $parser->filename, "\n";

The value stored is "persistent" - meaning that the method remembers the last file that was read.

$parser->read(shift @ARGV);
print $parser->filename(), ":\n",
      "=" x (length($parser->filename())+1),
      "\n",
      $parser->get_records(),
      "\n";

A read call with a filehandle will clear the last file name.

$parser->read(\*MYFH);
print "Last file name is lost\n" if not defined $parser->filename();

filehandle

Takes an optional argument that is a filehandle GLOB (such as \*STDIN) or an object of the FileHandle class. Returns the filehandle last saved, or undef if none was saved.

my $fh = $parser->filehandle();

Like filename, filehandle is also "persistent". Its old value is lost when either filename is set, or read is called with a filename.

$parser->read(\*STDIN);
my $lastfh = $parser->filehandle();          # Will return glob of STDIN

lines_parsed

Takes no arguments. Returns the number of lines last parsed. Every call to read causes the value to be auto-reset.

print $parser->lines_parsed, " lines were parsed\n";

has_aborted

Takes no arguments, returns a boolean to indicate if text reading was aborted in the middle.

print "Aborted\n" if $parser->has_aborted();

get_records

Takes no arguments. Returns an array containing all the records saved by the parser.

my $i = 0;
foreach my $record ( $parser->get_records ) {
    $i++;
    print "Record: $i: ", $record, "\n";
}

pop_record

Takes no arguments and pops the last saved record.

my $last_rec = $parser->pop_record;
my $uc_last = uc $last_rec;
$parser->save_record($uc_last);

last_record

Takes no arguments and returns the last saved record. Leaves the saved records untouched.

my $last_rec = $parser->last_record;

USE ONLY IN RULES AND SUBCLASS

Do NOT override these methods. They are valid only within a subclass, inside the user-implementation of methods described under OVERRIDE IN SUBCLASS.

this_line

Takes no arguments, and returns the current line being parsed. For example:

sub save_record {
    my $self = shift;
    # ...
    do_something($self->this_line);
    # ...
}

abort_reading

Takes no arguments. Returns 1. To be used only in the derived class to abort read in the middle.

sub save_record {
    my $self = shift;
    # ...
    $self->abort_reading if some_condition($self->this_line);
    # ...
}

push_records

This is useful if one needs to implement an include-like command in some text format. The example below illustrates this.

package OneParser;
use Moose;
extends 'Text::Parser';

sub save_record {
    my $self = shift;
    # ...
    # Under some condition:
    my $parser = AnotherParser->new();
    $parser->read($some_file);
    # Add the records of the included file to this parser's records
    $self->push_records($parser->get_records);
    # ...
}

Other methods available on auto_split

When the auto_split attribute is on (or if it is turned on later), additional methods become available. They are documented in Text::Parser::AutoSplit.

OVERRIDE IN SUBCLASS

The following methods should never be called in the ::main program. They may be overridden (or re-defined) in a subclass.

save_record

This method may be re-defined in a subclass to parse the target text format. The default implementation takes a single argument and stores it as a record. If no arguments are passed, undef is stored as a record. Note that, unlike earlier versions of Text::Parser, it is not required to override this method in your derived class. You can simply use the rules instead.

For a developer re-defining save_record, in addition to this_line, six additional methods become available if the auto_split attribute is set. These methods are described in greater detail in Text::Parser::AutoSplit, and they are accessible only within save_record.

Note: Developers may store records in any form - string, array reference, hash reference, complex data structure, or an object of some class. The program that reads these records using get_records has to interpret them. So developers should document the records created by their own implementation of save_record.

PARSING LINE-WRAPPED FILES

These methods are useful when parsing line-wrapped files, i.e., if the target text format allows wrapping the content of one line into multiple lines. In such cases, you should extend the Text::Parser class and override the following methods.

is_line_continued

If the target text format supports line-wrapping, the developer must override and implement this method. Your method should take a string argument and return a boolean indicating if the line is continued or not.

There is a default implementation shipped with this class with return values as follows:

multiline_type    |    Return value
------------------+---------------------------------
undef             |         0
join_last         |    0 for first line, 1 otherwise
join_next         |         1
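
The table can be read as a small decision function. The stand-alone sketch below mirrors it; the function name and the explicit line-number argument are assumptions made for illustration, not the signature of the shipped method. It also shows why join_last with the defaults ends up slurping the input: every line after the first is treated as a continuation of the previous one.

```perl
use strict;
use warnings;

# Hypothetical stand-alone function mirroring the table above.
# $line_number counts from 1.
sub default_is_line_continued {
    my ($multiline_type, $line_number) = @_;
    return 0 if not defined $multiline_type;       # not a multi-line parser
    return 1 if $multiline_type eq 'join_next';    # always continued
    return $line_number == 1 ? 0 : 1;              # join_last
}

# With join_last and plain concatenation, all lines collapse into one string.
my ($slurped, $n) = ('', 0);
for my $line ("a\n", "b\n", "c\n") {
    $n++;
    if (default_is_line_continued('join_last', $n)) {
        $slurped .= $line;      # continuation of the previous line
    } else {
        $slurped = $line;       # first line starts the record
    }
}
print $slurped;
```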

join_last_line

Again, the developer should implement this method. This method should take two strings, join them while removing any continuation characters, and return the result. The default implementation just concatenates two strings and returns the result without removing anything (not even chomp). See Text::Parser::Multiline for more on this.
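
A minimal sketch of such a pair of methods follows, assuming a join_next style format in which a trailing backslash marks a wrapped line. The package name ContinuationSketch is hypothetical, and the package deliberately does not extend Text::Parser so that the two method bodies can be tried on their own; in a real parser they would live in the subclass.

```perl
use strict;
use warnings;

package ContinuationSketch;    # hypothetical; a real parser would extend Text::Parser

sub new { return bless {}, shift }

# True if the line ends with a backslash (ignoring trailing whitespace).
sub is_line_continued {
    my ($self, $line) = @_;
    return 0 if not defined $line;
    return $line =~ /\\\s*$/ ? 1 : 0;
}

# Remove the continuation marker from the earlier line, then join.
sub join_last_line {
    my ($self, $last, $line) = @_;
    $last =~ s/\\\s*$//;
    return $last . $line;
}

package main;

my $sketch = ContinuationSketch->new;
print $sketch->join_last_line("total = 1 + \\\n", "2\n");
```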

EXAMPLES

You can find example code in Text::Parser::Manual::ComparingWithNativePerl.

THINGS TO BE DONE

This package is still a work in progress. Future versions are expected to include features to:

  • read and parse from a buffer

  • automatically uncompress input

  • suggestions welcome ...

Contributions and suggestions are welcome and properly acknowledged.

SEE ALSO

BUGS

Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

CONTRIBUTORS

  • H.Merijn Brand - Tux <h.m.brand@xs4all.nl>

  • Mohammad S Anwar <mohammad.anwar@yahoo.com>