NAME
Text::Parser::Manual::ComparingWithNativePerl - A comparison of text parsing with native Perl and Text::Parser
VERSION
version 1.000
LIMITATIONS OF THE PERL ONE-LINER
When people compare Perl against AWK, the usual answer is this:
$ > perl -lane 'print;' file.txt
But the problem is that it isn't useful for anything more than one-liners. Secondly, it cannot be used inside a complex program. And even if you could write some code in a separate file, you cannot follow good programming practices like use strict.
The Perl one-liner is surely not a useful solution for serious programs that have to parse the content of complex file formats. But if you're not convinced, we'll go through some examples here.
A SIMPLE EXAMPLE
To understand how Text::Parser compares to the native Perl way of doing things, let's take a simple example and see how we would write code. Let's say we have a simple text file (info.txt) with lines of information like this:
NAME: Brian
EMAIL: brian@webhost.net
ADDRESS: 401 Burnswick Ave, Cool City, UT 12345
NAME: Darin Cruz
ADDRESS: 209 Random St, Forest City, CA 92710
EMAIL: darin123@yahoo.co.uk
NAME: Elizabeth Andrews
ADDRESS: 0 Muutama Lane, Inaccessible Forest area, AK 88170
NAME: Audrey C. Miller
ADDRESS: 9 New St, Smart City, PA 12933
EMAIL: aud@audrey.io
You have to write code that parses this into a list of records, one per person, holding the name and the corresponding email and postal address:
{ name => "Brian", email => "brian@webhost.net", address => "401 Burnswick Ave, Cool City, UT 12345"},
.
.
.
The important thing to note is that the NAME and ADDRESS fields can be long strings.
Perl one-liner
Could we do this using a Perl one-liner?
perl -lane 'BEGIN {
    @data = ();
}
if($F[0] eq "NAME:") {
    shift @F;
    push @data, {name => join(" ", @F)};
} elsif($F[0] eq "EMAIL:") {
    $data[-1]->{email} = $F[1];
} elsif($F[0] eq "ADDRESS:") {
    shift @F;
    $data[-1]->{address} = join " ", @F;
}' info.txt
So much for a one-liner! But you can't make it shorter, can you?
Native Perl script
Here's an implementation as a native Perl script:
open IN, "<info.txt";
my @data = ();
while(<IN>) {
chomp;
my (@field) = split /\s+/;
if ($field[0] eq 'NAME:') {
shift @field;
push @data, { name => join(' ', @field) };
} elsif($field[0] eq 'EMAIL:') {
$data[-1]->{email} = $field[1];
} elsif($field[0] eq 'ADDRESS:') {
shift @field;
$data[-1]->{email} = join ' ', @field;
}
}
close IN;
With Text::Parser
Here's how you'd write the same thing with Text::Parser.
use Text::Parser;
my $parser = Text::Parser->new();
$parser->add_rule( if => '$1 eq "NAME:"', do => 'return { name => ${2+} };' );
$parser->add_rule( if => '$1 eq "EMAIL:"',
do => 'my $rec = $this->pop_record; $rec->{email} = $2; return $rec;' );
$parser->add_rule( if => '$1 eq "ADDRESS:"',
do => 'my $rec = $this->pop_record; $rec->{address} = ${2+}; return $rec;' );
$parser->read('info.txt');
Quick observations
The programmer still has to specify how to extract data, but:
she can focus on the content rather than the mechanics of file handling
another programmer can instantly understand what is going on
the results can be used in a more complex program - not just a one-liner
parsing files has never been this intuitive, especially with shortcuts like ${2+}
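Besides, did you notice the bug in the while loop of the native Perl script above? It is hard to notice: the ADDRESS branch assigns to $data[-1]->{email}, so each address overwrites the email slot instead of being stored under address.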
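For comparison, here is a sketch of how the native loop could look with the ADDRESS branch storing its value under the address key. It is not taken from the Text::Parser distribution: the parse_info name and the in-memory sample are assumptions made for illustration, and the key/value split is a different technique from the whitespace split used above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Parser): parse NAME/EMAIL/ADDRESS
# lines from an open filehandle into a list of hash references.
sub parse_info {
    my ($fh) = @_;
    my @data;
    while ( my $line = <$fh> ) {
        chomp $line;
        # Split once at the first colon: the label, then the rest of the line.
        my ( $key, $value ) = split /:\s*/, $line, 2;
        next unless defined $value;    # skip blank or malformed lines
        if ( $key eq 'NAME' ) {
            push @data, { name => $value };
        }
        elsif ( @data and ( $key eq 'EMAIL' or $key eq 'ADDRESS' ) ) {
            # lc() maps EMAIL => email and ADDRESS => address, so the two
            # fields cannot be stored under each other's key by mistake.
            $data[-1]{ lc $key } = $value;
        }
    }
    return @data;
}

# Sample input held in memory so the sketch runs without info.txt.
my $sample = <<'END';
NAME: Darin Cruz
ADDRESS: 209 Random St, Forest City, CA 92710
EMAIL: darin123@yahoo.co.uk
NAME: Elizabeth Andrews
ADDRESS: 0 Muutama Lane, Inaccessible Forest area, AK 88170
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @records = parse_info($fh);
close $fh;

printf "%s <%s>\n", $_->{name}, $_->{email} // 'no email' for @records;
```

Deriving the hash key from the label removes the chance of a copy-and-paste slip between branches, at the cost of accepting only labels that are known in advance.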
ANOTHER SIMPLE EXAMPLE
Take another simple example. Here we have new stuff in info.txt:
State: California
County: Santa Clara, 1304, San Jose, 2/18/1850
County: Alameda, 821, Oakland, 3/25/1853
County: San Mateo, 774, Redwood City, 4/19/1856
.
.
.
State: Arkansas
.
.
.
Let's say you have to parse this and form a data structure like this:
[
{
state => 'California',
'Santa Clara' => {area => 1304, county_seat => 'San Jose', date_inc => '2/18/1850'},
'Alameda' => {area => 821, county_seat => 'Oakland', date_inc => '3/25/1853'},
'San Mateo' => {area => 774, county_seat => 'Redwood City', date_inc => '4/19/1856'},
},
{
state => 'Arkansas',
...
}
]
Perl one-liner
It is clear that the one-liner is no longer really a one-liner. And you cannot use strict. But go ahead and give it a try if you want.
Native Perl code
use String::Util 'trim';
open IN, "<info.txt";
my @data = ();
while(<IN>) {
chomp;
$_ = trim($_);
my (@field) = split /[:,]\s+/;
if ($field[0] eq 'State') {
push @data, { state => $field[1] };
} elsif($field[0] eq 'County') {
my $data = pop @data;
$data->{$field[1]} = {area => $field[2], county_seat => $field[3], date_inc => $field[4]};
push @data, $data;
}
}
close IN;
With Text::Parser
use Text::Parser;
my $parser = Text::Parser->new(auto_split => 1, FS => qr/[:,]\s+/);
$parser->add_rule(if => '$1 eq "State"', do => 'return {state => $2}');
$parser->add_rule(if => '$1 eq "County"',
do => 'my $data = $this->pop_record;
$data->{$2} = { area => $3, county_seat => $4, date_inc => $5, };
return $data;'
);
$parser->read('info.txt');
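Both versions depend on the same field separator, so it may help to see what it produces for a single line. The snippet below is a small plain-Perl illustration rather than anything from the module's documentation. In the rules above, $1 refers to the first field, which is $field[0] here.

```perl
use strict;
use warnings;

# One line from the sample input above.
my $line = 'County: Santa Clara, 1304, San Jose, 2/18/1850';

# Split on a colon or comma followed by whitespace. The space inside
# "Santa Clara" is preserved because it is not preceded by ':' or ','.
my @field = split /[:,]\s+/, $line;

# Print each field with its 1-based position.
printf "%d: %s\n", $_ + 1, $field[$_] for 0 .. $#field;
```

Splitting on the punctuation instead of plain whitespace is what keeps multi-word values such as the county name and the county seat in one field.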
SOMETHING MORE FUN
Let's take something more fun. A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel
Student number, Score
0, 3
1, 7
Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora
Student number, Score
0, 6
1, 3
2, 9
School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna
Student number, Score
0, 8
1, 7
Grade = 2
Student number, Name
0, Harry
1, Hermione
Student number, Score
0, 5
1, 10
Grade = 3
Student number, Name
0, Fred
1, George
Student number, Score
0, 0
1, 0
You want to parse this into a data structure like this:
# Entries data-structure hierarchy is:
# school/grade/student number/Name
# school/grade/student number/Score
{
"Riverdale High" => {
"1" => {
0 => {Name => "Phoebe", Score => 3},
1 => {Name => "Rachel", Score => 7}
},
"2" => {
0 => {Name => "Angela", Score => 6},
1 => {Name => "Tristan", Score => 3},
2 => {Name => "Aurora", Score => 9},
},
},
"Hogwarts" => {
"1" => {
0 => {Name => "Ginny", Score => 8},
1 => {Name => "Luna", Score => 7},
},
"2" => {
0 => {Name => "Harry", Score => 5},
1 => {Name => "Hermione", Score => 10},
},
"3" => {
0 => {Name => "Fred", Score => 0},
1 => {Name => "George", Score => 0 },
},
},
}
This problem comes from a source where the solution was implemented in Python using a PEG parser.
Perl one-liner or native Perl
Do I really have to do this? I'll let you try it yourself.
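As a starting point, one possible native Perl approach is sketched here. It is only an illustration under assumptions: the parse_scores name is hypothetical, the input is assumed to be well formed (every data row follows a School, Grade and Student number line), and a shortened sample is held in memory so the code runs without info.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: build a nested hash of the form
# school => grade => student number => { Name => ..., Score => ... }
sub parse_scores {
    my ($fh) = @_;
    my ( %result, $school, $grade, $info );
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $line eq '';
        my @field = split /\s+=\s+|,\s+/, $line;
        if    ( $field[0] eq 'School' )         { $school = $field[1] }
        elsif ( $field[0] eq 'Grade' )          { $grade  = $field[1] }
        elsif ( $field[0] eq 'Student number' ) { $info   = $field[1] }
        else {
            # A data row: student number followed by a name or a score,
            # depending on the most recent "Student number" header.
            $result{$school}{$grade}{ $field[0] }{$info} = $field[1];
        }
    }
    return \%result;
}

# Shortened sample held in memory.
my $sample = <<'END';
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel
Student number, Score
0, 3
1, 7
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $scores = parse_scores($fh);
close $fh;

my $student = $scores->{'Riverdale High'}{1}{0};
printf "%s: %s\n", $student->{Name}, $student->{Score};
```

The three lexical variables $school, $grade and $info carry the parsing state from one line to the next, so all of that bookkeeping has to be written and maintained by hand.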
With Text::Parser
use Text::Parser;
my $parser = Text::Parser->new(FS => qr/\s+\=\s+|,\s+/);
$parser->add_rule(
if => '$1 eq "School"',
do => '~school = $2;',
dont_record => 1,
);
$parser->add_rule(
if => '$1 eq "Grade"',
do => '~grade = $2;',
dont_record => 1,
);
$parser->add_rule(
if => '$1 eq "Student number"',
do => '~info = $2;',
dont_record => 1
);
$parser->add_rule(
do => 'my $p = $this->pop_record;
$p->{~school}{~grade}{$1}{~info} = $2;
return $p;'
);
$parser->read('info.txt');
That's it! Just notice how elegant it looks.
By now, you should have concluded that the Text::Parser way is much better. If not, you must know a better solution and perhaps you should make a Perl module (or feel free to contact me and contribute if you like this project).
PERFORMANCE
There will be a compile-time penalty in using Text::Parser. If compile-time performance is important for you, this package is not for you.
Run-time performance is a different matter. I have to admit that Text::Parser is currently slower than native Perl at run time, but I am committed to improving this. I know where there is scope to improve, and will come up with some statistics on that.
For now, you should assume that Text::Parser takes roughly twice the run time of native Perl. Earlier versions were about 5x slower than native Perl.
BUGS
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.