NAME
Text::Parser::Manual::ExtendedAWKSyntax - A manual for ExAWK (extended AWK) syntax
VERSION
version 0.927
THE EXTENDED AWK LANGUAGE
So you saw the power of Text::Parser and want to write your own parser. First you need to learn something about the rules.
Why extend?
The AWK programming language does give us the flexibility to do a number of things. But it is limited in many respects. Below is a list of things that come to my mind:
AWK's regular expressions are limited. Perl is superior here and we want to leverage that.
You can't create deep data-structures (multi-dimensional arrays/hashes) in AWK. You can't create objects and classes.
Every rule will be tested and executed. It would be nice to control whether the next rule would be executed.
The UNIX version of AWK has only nine field identifiers, $1 through $9. GAWK and other implementations remove this limitation.
AWK has a limited set of built-in functions.
AWK itself cannot be used for much more than reading text files and processing them. It is not really useful for a more complex program.
Why AWK?
Despite its limitations, AWK is excellent for parsing and processing text input. And despite the fact that Perl is supposed to allow us to do something more advanced, parsing text files should be as easy as it is with AWK. So instead of re-inventing the wheel... (you get the point?).
BASIC SYNTAX
The basic syntax of an AWK rule is:
condition { task; }
If condition is specified, then the task block is optional, and if the task block is specified, then the condition is optional.
The basic form of the ExAWK rule is like this:
if => 'condition', do => 'task'
## options:
## dont_record => 0|1
## continue_to_next => 0|1
These are normally supplied as arguments to add_rule, BEGIN_rule, and END_rule.
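As a minimal sketch of how such a rule might be handed to a parser, assuming Text::Parser is installed from CPAN. The temporary file, its contents, and the NAME: marker are made up for illustration; add_rule, read, and get_records are used here as the Text::Parser documentation describes them.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical input: a small text file written to a temporary location.
my ($fh, $filename) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print {$fh} "NAME: Alice\nEMAIL: alice\@example.com\nNAME: Bob\n";
close $fh;

my $parser = Text::Parser->new();

# Record the second field of every line whose first field is "NAME:".
$parser->add_rule(if => '$1 eq "NAME:"', do => 'return $2;');

# Run the rules over the file, then collect whatever the rules returned.
$parser->read($filename);
my @names = $parser->get_records;
print join(', ', @names), "\n";
```

Both the condition and the task are ordinary Perl inside single-quoted strings, so the string comparison uses eq.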
Similar to AWK, if condition is specified, then the task is optional, and if the task is specified, then the condition is optional. The language for the condition and task is Perl (not AWK). So, for example, to compare strings you should use eq and not == as you would in AWK. The 'if' and 'do' strings are transformed into regular Perl and compiled.
Simplicity
Just as in AWK, the condition can be as simple as a regular expression, or a complex boolean expression. In AWK you could do:
$ awk '/EMAIL:/ {print $2}' file.txt
In ExAWK you could write something like this to get something equivalent (note the need for "\n"):
if => 'm/EMAIL:/', do => 'print $2, "\n"'
The default condition in ExAWK is just like in AWK: true for each input line. The following will simply print every line in a file:
do => 'print'
The default task in AWK is print. Thus:
$ awk '/li/' file.txt
will print all the lines with 'li' somewhere in them. But ExAWK integrates with the Text::Parser class, so its default task is return $0; instead. This means that if you provide a condition but not a task, ExAWK returns the line as it is.
if => 'm/li/' # returns each line that contains 'li' in it.
If you want to print instead and not record anything, you need to specify that:
if => 'm/li/', do => 'print', dont_record => 1
In ExAWK, the Perl built-in variable $_ is set to the current line. So any built-in function that falls back to $_ when its argument is omitted will behave accordingly (this is how if => 'm/li/' and do => 'print' happen to work).
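The same default can be seen in plain Perl, independent of Text::Parser. The sketch below uses a hypothetical helper, lines_with_li, whose loop sets $_ to each line, so that m/li/ and print need no explicit argument:

```perl
use strict;
use warnings;

# Hypothetical helper: keep the lines that contain 'li',
# relying on $_ as the implicit operand.
sub lines_with_li {
    my @lines = @_;
    my @matched;
    for (@lines) {                     # each element is aliased to $_
        push @matched, $_ if m/li/;    # m// with no target tests $_
    }
    return @matched;
}

# print with no argument prints $_.
print for lines_with_li("Alice\n", "Bob\n", "Charlie\n");
```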
Field identifiers
AWK is very popular for its intuitive field identifiers $1, $2, $3 etc. ExAWK provides the same and much more.
It is important to note that $1, $2, etc., are not ordinary variables, even in AWK. They are positional field identifiers. AWK does let you assign to a field, but that only changes the record held in memory; nothing is printed and the input is untouched. So for example
$ awk '// {$1 = "something";}' file.txt
will not accomplish anything visible. The first field of each line in file.txt remains what it is.
In ExAWK the identifiers $1, $2 etc. are also not variables. They represent an rvalue and cannot be assigned to. In particular they are not the same as the native Perl regular expression capture variables $1, $2 etc., which are used in regexp substitutions.
The positional field identifiers $1, $2 etc. have special meaning inside the string expressions of ExAWK. Like AWK, $1 represents the first field, $2 represents the second field, and so on. Like AWK, $0 identifies the whole line.
Reverse field identifiers
Now we add new features that really go beyond AWK. To access fields from the end of the line, use identifiers ${-1}, ${-2}, etc. ${-1} is the last field, ${-2} is the penultimate field, and so forth.
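In plain Perl terms, the reverse identifiers behave much like negative array indices on the list of fields. The sketch below illustrates that analogy only; field_at is a hypothetical helper and is not how Text::Parser implements the identifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: return the field at a 1-based position, where
# negative positions count from the end of the line (like ${-1}, ${-2}).
sub field_at {
    my ($line, $pos) = @_;
    my @fields = split ' ', $line;    # split on whitespace, as AWK does by default
    return if $pos == 0 or abs($pos) > @fields;    # no such field
    my $index = $pos > 0 ? $pos - 1 : $pos;
    return $fields[$index];
}

print field_at('alpha beta gamma delta', -1), "\n";    # last field
print field_at('alpha beta gamma delta', -2), "\n";    # penultimate field
```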
Automatic checks for NF
You don't need to bother about the existence of a field when you write these expressions. For example, in AWK if you write:
$ awk '$4 == ""' text.txt
then all lines with 3 or fewer fields will automatically be printed to the screen because $4 evaluates to the empty string when there are fewer than 4 fields on a line. But in ExAWK:
if => '$4 eq ""'
would never be true. (Why?) Now, if you had written a rule like this in AWK:
$ awk '$1 == "MIDDLE" && $2 == "NAME:" {print toupper($3)}' file.txt
you might get a lot of empty lines for each person that has no middle name.
Instead, in ExAWK, the following rule:
if => '$1 eq "MIDDLE" and $2 eq "NAME:"', do => 'return uc($3)'
would automatically check that there are at least three fields on the line. This means it will never return anything for people with no middle names. This ensures you don't run into undef.
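The effect of that automatic check can be sketched in plain Perl by writing the field-count guard out by hand. The helper middle_name is hypothetical and only mimics the rule above; ExAWK generates the equivalent guard for you.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the rule
#   if => '$1 eq "MIDDLE" and $2 eq "NAME:"', do => 'return uc($3)'
# with the field-count guard made explicit.
sub middle_name {
    my ($line) = @_;
    my @f = split ' ', $line;
    return unless @f >= 3;    # guard: the highest field used is $3
    return unless $f[0] eq 'MIDDLE' and $f[1] eq 'NAME:';
    return uc $f[2];
}

for my $line ('MIDDLE NAME: lee', 'MIDDLE NAME:') {
    my $name = middle_name($line);
    print defined $name ? "$name\n" : "(no record)\n";
}
```

Because the guard runs first, the comparison and uc never see a missing field.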
The $this variable
Sometimes you want to access specific attributes of your parser class, or maybe you want to call a method. The $this variable is accessible in both the condition and the task strings.
Important Note: This is a real variable. If you assign to $this, its value will change, so don't. If you save $this to another variable in the hope that you can restore it later, remember that all positional field identifiers and range shortcuts are entirely dependent on the $this variable. If this variable is tampered with, you could get garbage results. You have been forewarned.
Local variables
You can use any Perl local variables you want. For example:
do => 'my (@numbers) = ${3+}; # do something with @numbers'
Note that @numbers above is accessible only within that rule task. It is not accessible outside of that do string.
Use any variable other than $this.
Shared variables
If you want to create variables that are initialized or assigned in one rule, but accessed in another rule, you need to use a "shared variable". A shared variable can be a scalar, or a hash reference, or an array reference. It cannot be a hash or an array itself. All shared variables must begin with the tilde (~) character, whether scalar, arrayref, or hashref. The name that follows the tilde must begin with a letter or an underscore (_).
if => '$1 eq "MARKER:"', do => '~info = $2;'
In the above rule, ~info is a shared variable, and will be accessible in other rules.
All shared variables created during the parsing of a text input exist only for the duration of the read method call. They are not accessible outside the Text::Parser class.
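A sketch of one rule setting a shared variable and another rule reading it, assuming Text::Parser is installed. The input file, the SECTION: and ITEM: markers, and the name ~section are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical input: items grouped under section headings.
my ($fh, $filename) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print {$fh} "SECTION: fruits\nITEM: apple\nITEM: pear\n",
            "SECTION: tools\nITEM: hammer\n";
close $fh;

my $parser = Text::Parser->new();

# First rule: remember the current section name, but record nothing.
$parser->add_rule(
    if          => '$1 eq "SECTION:"',
    do          => '~section = $2;',
    dont_record => 1,
);

# Second rule: use the shared variable set by the first rule.
$parser->add_rule(
    if => '$1 eq "ITEM:"',
    do => 'return join("/", ~section, $2);',
);

$parser->read($filename);
my @records = $parser->get_records;
print "$_\n" for @records;
```

The input deliberately starts with a SECTION: line so that ~section is assigned before the second rule reads it.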
Suite of string and array utility functions
Perl already has more built-in functions, and they are very useful and better than their AWK counterparts. In addition, CPAN has a lot of great modules with utility functions. ExAWK adds a few good utility functions, and also makes it very easy to add any others:
Utility functions added
Scalar::Util : blessed, looks_like_number
String::Util : All functions here
I have kept this list small to minimize Text::Parser dependencies. The user can import whatever functions they want from the package of their choice.
How to add other utility functions
Suppose you know of a very useful package (fictitiously named) Useful::Package. And let's say it has functions foo and bar that are very useful and operate on strings. And you wish to use these in your rules. Then do the following in your code:
use Import::Into;
Useful::Package->import::into('Text::Parser::Rule', qw(foo bar));
use Text::Parser;
my $parser = Text::Parser->new();
$parser->add_rule(if => 'bar($1)', do => 'return foo($2);');
This means that the power of any new package on CPAN can be harnessed very easily.
COMPLEX CONDITIONAL TREES
If you wanted to build a complex if-elsif-else tree of conditions in AWK, you would need to write them inside one rule like this:
// {
    if (condition1) {
        task1;
    } else if (condition2) {
        task2;
    } else {
        task3;
    }
    if (condition3) {
        task4;
    }
    if (condition4) {
        task5;
    } else {
        task6;
    }
}
With the pair of options dont_record and continue_to_next, one can build rules that replace any number of cascaded if-elsif-else blocks while still retaining most of them in an elegant single-line form.
if => 'condition1', do => 'task1';
if => 'condition2', do => 'task2';
if => 1, do => 'task3', dont_record => 1, continue_to_next => 1;
if => 'condition3', do => 'task4', dont_record => 1, continue_to_next => 1;
if => 'condition4', do => 'task5';
if => 1, do => 'task6';
Not only are these rules compact, the execution flow is also easy to follow.
SUMMARY
AWK cannot store very complex data structures. ExAWK can.
In the UNIX implementation of AWK, the positional variables are limited to $9. In POSIX implementations this limitation has already been removed. We also remove this limit.
In AWK there are no positional variables for positions counted from the end. In ExAWK, you have ${-1}, ${-2}, ${-3} etc.
In AWK if you use a positional variable like $8 when there are only 7 fields on a line, it evaluates to empty string. In ExAWK, if you use $8 in any of the strings, an automatic pre-condition is generated to check that there must be at least 8 fields on the input line.
BUGS
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.