NAME

Text::Parser::Manual::ComparingWithNativePerl - A comparison of text parsing with native Perl and Text::Parser

VERSION

version 1.000

LIMITATIONS OF THE PERL ONE-LINER

When people compare Perl against AWK, the usual answer is this:

$ perl -lane 'print;' file.txt

But this approach is not useful for anything beyond one-liners. It cannot be reused inside a larger program. And even if you move the code into a separate file, it is hard to follow good programming practices like use strict.

The Perl one-liner is surely not a useful solution for serious programs that have to parse the content of complex file formats. But if you're not convinced, we'll go through some examples here.

A SIMPLE EXAMPLE

To understand how Text::Parser compares to the native Perl way of doing things, let's take a simple example and see how we would write code. Let's say we have a simple text file (info.txt) with lines of information like this:

NAME: Brian
EMAIL: brian@webhost.net
ADDRESS: 401 Burnswick Ave, Cool City, UT 12345
NAME: Darin Cruz
ADDRESS: 209 Random St, Forest City, CA 92710
EMAIL: darin123@yahoo.co.uk
NAME: Elizabeth Andrews
ADDRESS: 0 Muutama Lane, Inaccessible Forest area, AK 88170
NAME: Audrey C. Miller
ADDRESS: 9 New St, Smart City, PA 12933
EMAIL: aud@audrey.io

You have to write code that parses this into a data structure holding each name along with the corresponding email and postal address.

{ name => "Brian", email => "brian@webhost.net", address => "401 Burnswick Ave, Cool City, UT 12345"}, 
.
.
.

The important thing to note is that the NAME and ADDRESS fields can be long strings containing spaces.

Perl one-liner

Could we do this using a Perl one-liner?

perl -lane 'BEGIN {
    @data = ();
    }
    if($F[0] eq "NAME:") {
        shift @F;
        push @data, {name => join(" ", @F)};
    } elsif($F[0] eq "EMAIL:") {
        $data[-1]{email} = $F[1];
    } elsif($F[0] eq "ADDRESS:") {
        shift @F;
        $data[-1]{address} = join " ", @F;
    }' info.txt

So much for a one-liner! But you can't make it shorter, can you?

Native Perl script

Here's an implementation as a native Perl script:

open IN, "<info.txt";
my @data = ();
while(<IN>) {
    chomp;
    my (@field) = split /\s+/;
    if ($field[0] eq 'NAME:') {
        shift @field;
        push @data, { name => join(' ', @field) };
    } elsif($field[0] eq 'EMAIL:') {
        $data[-1]->{email} = $field[1];
    } elsif($field[0] eq 'ADDRESS:') {
        shift @field;
        $data[-1]->{email} = join ' ', @field;
    }
}
close IN;

With Text::Parser

Here's how you'd write the same thing with Text::Parser.

use Text::Parser;

my $parser = Text::Parser->new();
$parser->add_rule( if => '$1 eq "NAME:"', do => 'return { name => ${2+} };' );
$parser->add_rule( if => '$1 eq "EMAIL:"',
    do => 'my $rec = $this->pop_record; $rec->{email} = $2; return $rec;' );
$parser->add_rule( if => '$1 eq "ADDRESS:"',
    do => 'my $rec = $this->pop_record; $rec->{address} = ${2+}; return $rec;' );
$parser->read('info.txt');

Quick observations

The programmer still has to specify how to extract data, but:

  • she can focus on the content rather than the mechanics of file handling

  • another programmer can instantly understand what is going on

  • the results can be used in a more complex program - not just a one-liner

  • parsing files has never been this intuitive, especially with shortcuts like ${2+}

Besides, did you notice the bug in the while loop of the native Perl script above? It is hard to notice.
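
The bug is in the ADDRESS branch, which stores the address under the email key. As a rough sketch of what a corrected native loop could look like under use strict, here is a variant that reads a shortened copy of the sample from an in-memory filehandle, so it runs without info.txt on disk. The names $text, $in, $key and @rest are choices made for this illustration only; they are not part of the original script.

```perl
use strict;
use warnings;

# Shortened copy of the info.txt sample shown earlier (assumption: same layout).
my $text = <<'END';
NAME: Brian
EMAIL: brian@webhost.net
ADDRESS: 401 Burnswick Ave, Cool City, UT 12345
NAME: Darin Cruz
ADDRESS: 209 Random St, Forest City, CA 92710
EMAIL: darin123@yahoo.co.uk
NAME: Elizabeth Andrews
ADDRESS: 0 Muutama Lane, Inaccessible Forest area, AK 88170
END

# Read from an in-memory filehandle instead of a file on disk.
open my $in, '<', \$text or die "Cannot open in-memory file: $!";

my @data;
while ( my $line = <$in> ) {
    chomp $line;
    my ( $key, @rest ) = split ' ', $line;    # first word is the field label
    next unless defined $key;                 # skip blank lines
    if ( $key eq 'NAME:' ) {
        push @data, { name => join( ' ', @rest ) };
    }
    elsif ( $key eq 'EMAIL:' ) {
        $data[-1]{email} = $rest[0];
    }
    elsif ( $key eq 'ADDRESS:' ) {
        $data[-1]{address} = join ' ', @rest;    # address key, not email
    }
}
close $in;
```

Like the original loop, this sketch assumes every EMAIL and ADDRESS line follows a NAME line; it does not guard against a file that starts with one of them.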

ANOTHER SIMPLE EXAMPLE

Take another simple example. Here we have new stuff in info.txt:

State: California
County: Santa Clara, 1304, San Jose, 2/18/1850
County: Alameda, 821, Oakland, 3/25/1853
County: San Mateo, 774, Redwood City, 4/19/1856
.
.
.

State: Arkansas
.
.
.

Let's say you have to parse this and form a data structure like this:

[
    {
        state           => 'California', 
        'Santa Clara'   => {area => 1304, county_seat => 'San Jose', date_inc => '2/18/1850'}, 
        'Alameda'       => {area => 821, county_seat => 'Oakland', date_inc => '3/25/1853'}, 
        'San Mateo'     => {area => 774, county_seat => 'Redwood City', date_inc => '4/19/1856'}, 
    }, 
    {
        state           => 'Arkansas', 
        ...
    }
]

Perl one-liner

It is clear that the one-liner is no longer really a one-liner. And you cannot use strict. But go ahead and give it a try if you want.

Native Perl code

use String::Util 'trim';

open IN, "<info.txt";
my @data = ();
while(<IN>) {
    chomp;
    $_ = trim($_);
    my (@field) = split /[:,]\s+/;
    if ($field[0] eq 'State') {
        push @data, { state => $field[1] };
    } elsif($field[0] eq 'County') {
        my $data = pop @data;
        $data->{$field[1]} = {area => $field[2], county_seat => $field[3], date_inc => $field[4]};
        push @data, $data;
    }
}
close IN;

With Text::Parser

use Text::Parser;

my $parser = Text::Parser->new(auto_split => 1, FS => qr/[:,]\s+/);
$parser->add_rule(if => '$1 eq "State"', do => 'return {state => $2}');
$parser->add_rule(if => '$1 eq "County"',
    do => 'my $data = $this->pop_record;
    $data->{$2} = { area => $3, county_seat => $4, date_inc => $5, };
    return $data;'
);
$parser->read('info.txt');
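
To see what the field separator does to a County line, the following sketch applies the same regular expression with a plain split. It does not use Text::Parser itself; it only illustrates how the fields line up with the positional variables $1 to $5 used in the rule strings above. The variable names $fs, $line and @field are chosen for this illustration.

```perl
use strict;
use warnings;

my $fs   = qr/[:,]\s+/;    # same pattern as the FS argument above
my $line = 'County: Santa Clara, 1304, San Jose, 2/18/1850';

# $field[0] .. $field[4] play the role of $1 .. $5 in the rule strings.
my @field = split $fs, $line;

# Print each field with its 1-based position.
printf "%d: %s\n", $_ + 1, $field[$_] for 0 .. $#field;
```

Because the separator requires whitespace after the colon or comma, a value such as 2/18/1850 stays in one piece, and multi-word values such as "Santa Clara" are kept whole.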

SOMETHING MORE FUN

Let's take something more fun. A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.

School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0 

You want to parse this into a data structure like this:

# Entries data-structure hierarchy is:
#   school/grade/student number/Name
#   school/grade/student number/Score
{
    "Riverdale High" => {
        "1" => {
            0 => {Name => "Phoebe", Score => 3}, 
            1 => {Name => "Rachel", Score => 7}
        }, 
        "2" => {
            0 => {Name => "Angela", Score => 6}, 
            1 => {Name => "Tristan", Score => 3}, 
            2 => {Name => "Aurora", Score => 9}, 
        }, 
    }, 
}, 
{
    "Hogwarts" => {
        "1" => {
            0 => {Name => "Ginny", Score => 8}, 
            1 => {Name => "Luna", Score => 7}, 
        }, 
        "2" => {
            0 => {Name => "Harry", Score => 5}, 
            1 => {Name => "Hermione", Score => 10}, 
        }, 
        "3" => {
            0 => {Name => "Fred", Score => 0}, 
            1 => {Name => "George", Score => 0 }, 
        },
    }, 
}

This problem comes from a source where the solution was implemented in Python using a PEG parser.

Perl one-liner or native Perl

Do I really have to do this? Why don't I let you try this yourself?
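
If you do try it, one possible starting point in native Perl might look like the sketch below. It is only an illustration under a few assumptions: it reads a shortened copy of the listing from an in-memory filehandle, it builds a single nested hash (%quiz) keyed by school rather than the separate hashes listed above, and the state variables $school, $grade and $info are names invented for this example.

```perl
use strict;
use warnings;

# Shortened copy of the quiz listing above (one school, one grade).
my $text = <<'END';
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7
END

open my $in, '<', \$text or die "Cannot open in-memory file: $!";

my %quiz;
my ( $school, $grade, $info );    # parser state carried between lines
while ( my $line = <$in> ) {
    $line =~ s/^\s+|\s+$//g;      # trim both ends, which also drops the newline
    next if $line eq '';          # skip blank separator lines
    my ( $first, $second ) = split /\s+=\s+|,\s+/, $line;
    if    ( $first eq 'School' )         { $school = $second }
    elsif ( $first eq 'Grade' )          { $grade  = $second }
    elsif ( $first eq 'Student number' ) { $info   = $second }
    else  { $quiz{$school}{$grade}{$first}{$info} = $second }
}
close $in;
```

The three state variables have to be declared, updated and kept in scope by hand, and the sketch does no checking for data lines that appear before a School, Grade or header line.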

With Text::Parser

use Text::Parser;

my $parser = Text::Parser->new(FS => qr/\s+\=\s+|,\s+/);
$parser->add_rule(
    if          => '$1 eq "School"',
    do          => '~school = $2;', 
    dont_record => 1, 
);
$parser->add_rule(
    if          => '$1 eq "Grade"',
    do          => '~grade = $2;', 
    dont_record => 1, 
);
$parser->add_rule(
    if          => '$1 eq "Student number"',
    do          => '~info = $2;',
    dont_record => 1
);
$parser->add_rule(
    do => 'my $p = $this->pop_record;
    $p->{~school}{~grade}{$1}{~info} = $2;
    return $p;'
);
$parser->read('info.txt');

That's it! Just notice how elegant it looks.

By now, you should have concluded that the Text::Parser way is much better. If not, you must know a better solution and perhaps you should make a Perl module (or feel free to contact me and contribute if you like this project).

PERFORMANCE

There will be a compile-time penalty in using Text::Parser. If compile-time performance is important for you, this package is not for you.

Run-time performance is currently slower than native Perl, and I am committed to improving it. I know where there is scope to improve, and will come up with some statistics on that.

For now, you should assume that Text::Parser takes roughly 2x the run-time of native Perl. Earlier versions were about 5x slower than native Perl.


BUGS

Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.