NAME
Data::Tubes::Plugin::Parser
DESCRIPTION
This module contains factory functions to generate tubes that ease parsing of input records.
Each of the generated tubes has the following contract:
the input record MUST be a hash reference;
one field in the hash (according to factory argument input, set to raw by default) points to the input text that has to be parsed;
one field in the hash (according to factory argument output, set to structured by default) is set to the output of the parsing operation.
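The contract above can be sketched in plain Perl. The helper make_parser_tube and its parse argument are hypothetical names introduced only for this illustration and are not part of the module; returning the record from the sub is also an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Data::Tubes: wraps a parsing
# callback into a sub honoring the input/output contract.
sub make_parser_tube {
    my (%args) = @_;
    my $input  = defined $args{input}  ? $args{input}  : 'raw';
    my $output = defined $args{output} ? $args{output} : 'structured';
    my $parse  = $args{parse};    # assumed: code ref, text in, data out

    return sub {
        my ($record) = @_;
        die "record MUST be a hash reference\n"
          unless ref($record) eq 'HASH';
        $record->{$output} = $parse->($record->{$input});
        return $record;
    };
}

my $tube   = make_parser_tube(parse => sub { [split /;/, $_[0]] });
my $record = $tube->({raw => 'foo;bar;baz'});

# $record->{structured} now holds ['foo', 'bar', 'baz'],
# while $record->{raw} is left untouched.
```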
The factory functions below have two names, one starting with parse_ and the other without this prefix. They are perfectly equivalent to each other; the short version can be handier e.g. when using tube or pipeline from Data::Tubes.
FUNCTIONS
by_format
my $tube = by_format($format, %args); # OR
my $tube = by_format(%args); # OR
my $tube = by_format(\%args);
parse the input text according to a template format string (passed via factory argument format or through the first unnamed parameter $format). This string is supposed to be composed of word and non-word sequences, where each word sequence is assumed to be the name of a field, and each non-word sequence is a separator. Example:
$format = 'foo;bar;baz';
is interpreted as follows:
@field_names = ('foo', 'bar', 'baz');
@separators = (';', ';');
Example:
$format = 'foo;bar~~~baz';
is interpreted as follows:
@field_names = ('foo', 'bar', 'baz');
@separators = (';', '~~~');
In the first case, i.e. when all separators are equal to each other, "by_split" will be called, as it is (arguably) slightly more efficient. Otherwise, "by_separators" will be called. Whatever the chosen factory returns is handed back to the caller unchanged.
All @field_names MUST be different from one another.
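As a rough sketch of the interpretation described above (plain Perl, not the module's actual code), a format string can be broken into field names and separators with two global matches. The variable $strategy is a hypothetical name used only to show which factory would be selected.

```perl
use strict;
use warnings;

my $format = 'foo;bar~~~baz';

# Word sequences become field names, non-word sequences separators.
my @field_names = $format =~ m{(\w+)}g;
my @separators  = $format =~ m{(\W+)}g;

# Field names MUST be unique.
my %seen;
my @duplicates = grep { $seen{$_}++ } @field_names;
die "duplicate field names: @duplicates\n" if @duplicates;

# A single distinct separator allows the split-based parser;
# otherwise the separators have to be applied in sequence.
my %distinct = map { $_ => 1 } @separators;
my $strategy = keys(%distinct) == 1 ? 'by_split' : 'by_separators';

print "fields: @field_names; strategy: $strategy\n";
```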
The following arguments are supported:
allow_missing
-
set to the number of missing trailing elements that you are fine to lose, in case the format is composed of a single separator and "by_split" is used behind the scenes. This allows you to set an optional catchall trailing parameter to collect whatever you are not really interested in, also allowing for its absence.
As an example, consider the following input lines:
FOO0,BAR0,BAZ0,WHATEVER
FOO1,BAR1,BAZ1
FOO2,BAR2,BAZ2,WHAT2,EVER2,
Assuming that you're really interested in the first three parameters, disregarding whatever comes after, you can set the following format:
foo,bar,baz,rest
and also set allow_missing to 1, indicating that you can sustain the lack of rest (which you really don't care about);
format
-
the format to use for splitting the inputs. This parameter is the main one, so it can also be passed as the first, unnamed parameter (see the first calling convention);
input
-
name of the input field, defaults to raw;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
trim
-
remove leading and trailing whitespaces from the extracted values;
value
-
set how you are going to accept input values, e.g. escaped or quoted. See "by_separators" for details.
by_regex
my $tube = by_regex($regex, %args); # OR
my $tube = by_regex(%args); # OR
my $tube = by_regex(\%args);
parse the input text based on a regular expression, passed as argument regex or as the unnamed first parameter $regex. The regular expression is supposed to have named captures, which will eventually be used to populate the output.
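To illustrate how named captures can populate a hash, here is a minimal core-Perl sketch. The regular expression and the input text are made up for the example, and the snippet only shows the underlying mechanism (the %+ hash), not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical regex and input, chosen only for illustration.
my $regex = qr{\A(?<key>\w+)\s*=\s*(?<value>.*)\z}s;
my $text  = 'answer = 42';

my %structured;
if ($text =~ $regex) {
    %structured = %+;    # copy named captures into a plain hash
}

# %structured is now (key => 'answer', value => '42')
```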
The following arguments are supported:
input
-
name of the input field, defaults to raw;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
regex
-
the regular expression to use for splitting the inputs. This is the main argument, and can be passed also as the first unnamed one in the argument list.
by_separators
my $tube = by_separators($separators, %args); # OR
my $tube = by_separators(%args); # OR
my $tube = by_separators(\%args);
parse the input according to a series of separators, that will be applied in sequence. For example, if the list of separators is the following:
@separators = (';', '~~');
the following input:
$text = 'foo;bar~~/baz/';
will be split as:
@split = ('foo', 'bar', '/baz/');
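One possible way to apply separators in sequence is to build a single regular expression out of them. This is only a sketch under the assumption that all separators are plain strings; the module may do it differently.

```perl
use strict;
use warnings;

my @separators = (';', '~~');
my $text       = 'foo;bar~~/baz/';

# One non-greedy capture before each separator (matched verbatim
# thanks to quotemeta), plus a final capture for the remainder.
my $pattern = join '', map { '(.*?)' . quotemeta($_) } @separators;
my $regex   = qr{\A$pattern(.*)\z}s;

my @split = $text =~ $regex;

# @split is ('foo', 'bar', '/baz/')
```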
The following arguments are supported:
input
-
name of the input field, defaults to raw;
keys
-
a reference to an array containing the list of keys to be associated to the values from the split;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
separators
-
a reference to an array containing the list of separators to be used for splitting the input. This parameter can also be passed as the first, unnamed argument.
Each separator can be:
a sub reference, that is invoked once with a reference to the arguments, and must return either of the following forms;
a regular expression reference, that will be used as-is at the right place;
a plain string, that will be matched verbatim (through a regular expression matching the string after passing it through CORE::quotemeta);
trim
-
remove leading and trailing whitespaces from the extracted values. Example:
@seps = qw< : ; , >;
$input = ' what : ever ;you,do ';
@elements = ('what', 'ever', 'you', 'do');
value
-
this is how you provide a description of what you consider a valid value. It can be multiple things:
a sub reference, that is called and MUST provide back one of the following alternatives;
a regular expression reference, that is used directly;
a plain string, that is turned into an array reference by creating an anonymous array with the string as its only element, then processed as in the following bullet;
an array reference with elements inside, that will be described in the following list.
If you end up with an array reference, each element will be put in a big regular expression that is the OR of all elements. Each can be:
a regular expression reference, that is fit as-is in the big regular expression;
the string specials, that is the same as having put the three strings escaped, single-quoted and double-quoted;
the string quoted, that is the same as having put the two strings single-quoted and double-quoted;
the string single-quoted (or single_quoted), that allows you to match a string that is delimited by single quotes, with no escaping inside. This is always put at the beginning of the big regular expression (although double-quoted strings can actually be fit before it);
the string double-quoted (or double_quoted), that allows you to match a string that is delimited by double quotes, also allowing escaped elements inside (via backslashes). This is always put at the beginning of the big regular expression;
the string escaped, that allows you to match a non-greedy sequence of escaped characters (via backslash). If single-quoted is also specified, single quotes need to be escaped too. If double-quoted is also specified, double quotes need to be escaped too. This is always set at the end of the big regular expression (except for whatever, that might appear after it);
the string whatever, that allows you to match a non-greedy sequence of characters, i.e. it is a synonym of regular expression (?ms:.*?). If present, it is always set at the end of the big regular expression.
For example, if you want to accept single quoted, double quoted and unquoted strings, you might provide the following:
[qw< single-quoted double-quoted whatever >]
by_split
my $tube = by_split(%args); # OR
my $tube = by_split(\%args); # OR
my $tube = by_split($separator, %args);
split the input according to a separator string, passed either as the first unnamed parameter $separator or as hash option separator.
The following arguments are supported:
allow_missing
-
set to the number of missing trailing elements that you are fine to lose, in case you also provide keys (see below). This is particularly important when this function is called behind the scenes by "parse_by_format", because that sets keys.
In practice, suppose that you set the following keys:
[qw< foo bar baz whatever >]
A normal parsing will expect to find at least four elements, so the following input would fail:
FOO,BAR,BAZ
On the other hand, if you set allow_missing to 1, you are accepting that there might be a missing value for whatever, that will be filled with the undefined value;
input
-
name of the input field, defaults to raw;
keys
-
optional reference to an array containing a list of keys to be associated to the split data. If present, it will be used as such; if absent, a reference to an array will be set as output.
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
separator
-
the separator to be used for CORE::split. If it is a code reference, it is invoked once with the provided arguments to get the separator back. After this, it can be either a regular expression, used as-is, or a string that is passed through CORE::quotemeta before being used;
trim
-
remove leading and trailing whitespaces from the extracted values. As you might expect, if the separator is a colon, the following input:
$input = ' what : ever :you:do ';
would be split into the following elements:
@elements = ('what', 'ever', 'you', 'do');
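The interplay of separator, trim, keys and allow_missing can be sketched as follows. The helper split_text is hypothetical and only mirrors the documented options for a plain-string separator; it is not the module's code, and what happens with surplus values is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented options.
sub split_text {
    my ($text, %opt) = @_;
    my @values = split /\Q$opt{separator}\E/, $text, -1;
    if ($opt{trim}) {
        s{\A\s+|\s+\z}{}g for @values;    # strip both ends in place
    }
    my $keys = $opt{keys} or return \@values;

    my $missing = @$keys - @values;
    die "too few values\n" if $missing > ($opt{allow_missing} || 0);

    my %structured;
    @structured{@$keys} = @values;    # missing trailing values are undef
    return \%structured;
}

my $elements = split_text(' what : ever :you:do ',
    separator => ':', trim => 1);
# $elements is ['what', 'ever', 'you', 'do']

my $hash = split_text('FOO,BAR,BAZ',
    separator     => ',',
    keys          => [qw< foo bar baz whatever >],
    allow_missing => 1,
);
# $hash->{baz} is 'BAZ', $hash->{whatever} is undef
```

Without allow_missing, the second call in this sketch dies, matching the failure described for the FOO,BAR,BAZ input.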
by_value_separator
my $tube = by_value_separator($separator, %args); # OR
my $tube = by_value_separator(%args); # OR
my $tube = by_value_separator(\%args);
parse a sequence of value-and-separator. This is a generalization of "by_split", where you can provide a way to specify what you consider valid values, e.g. to allow for escaping or quoting (hence also allowing having the separator inside your values).
CAVEAT: this function uses the regular expression construct (?{...}) internally. While it is supported as of perl 5.10, this has evolved in time, up to perl 5.18 where it was stabilized. In particular, before perl 5.18 it was not possible to use lexical variables in the construct, so for older perls by_value_separator uses a package variable for collecting values. This should not be a problem, but might be.
Just to make an example, suppose that you are using semicolons as separators. by_value_separator would allow you to take this:
'some;thing'; what\;ever ; "this;\"goes\";fine"
and turn it into this:
['some;thing', 'what;ever', 'this;"goes";fine']
As noted, it is similar to "by_split"; as a matter of fact, "by_split" might be re-implemented (less efficiently) through by_value_separator. Unless there are bugs, of course. Like "by_split", you can provide a separator parameter (also via the first, unnamed parameter) that can be either a sub reference, a string or a regular expression.
Additionally, you can provide a value parameter that tells what is considered an acceptable input value. A value can be different things (see below), but it boils down to providing regular expressions, indication of pre-canned matching expressions, or a combination.
When you match values, you can then decode them. For example, if you specify that you want to accept double-quoted strings, it makes sense to remove the quotes and un-escape the remaining sequence before using it. Depending on what you pass as a definition for a valid value, your decoding approach might vary. Decoding can happen in two ways: either you provide a decode function that will be applied to each value, or a decode_values that is applied to the whole values array. You might want to choose the latter for improving performance (1 sub call against N).
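The matching and decoding steps can be illustrated with a small hand-written scanner. This is a sketch only: split_values is a hypothetical helper, the semicolon separator is hard-coded, and it uses \G matching instead of the (?{...}) construct mentioned in the caveat, so it should not be read as the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical scanner: values are single-quoted, double-quoted or
# escaped; the separator is a semicolon.
sub split_values {
    my ($text) = @_;
    my $single  = qr{'([^']*)'};
    my $double  = qr{"((?:[^\\"]|\\.)*)"}s;
    my $escaped = qr{((?:[^\\;'"]|\\.)*?)}s;
    my $end     = qr{\s*(?=;|\z)};    # optional spaces, then separator or end

    my @values;
    pos($text) = 0;
    while (1) {
        $text =~ m{\G\s+}gc;          # skip leading whitespace, if any
        my $value;
        if ($text =~ m{\G$single$end}gc) {
            $value = $1;              # no escaping inside single quotes
        }
        elsif ($text =~ m{\G$double$end}gc) {
            $value = $1;
            $value =~ s{\\(.)}{$1}gs; # decode: drop escaping backslashes
        }
        elsif ($text =~ m{\G$escaped$end}gc) {
            $value = $1;
            $value =~ s{\\(.)}{$1}gs;
        }
        else {
            die "cannot parse input\n";
        }
        push @values, $value;
        last unless $text =~ m{\G;}gc;    # consume separator or stop
    }
    return \@values;
}

my $values = split_values(
    q{'some;thing'; what\;ever ; "this;\"goes\";fine"});

# $values is ['some;thing', 'what;ever', 'this;"goes";fine']
```

Here decoding is done inline for each value, which corresponds to the per-value approach; a decode_values-style alternative would post-process the whole array in one call.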
Normally, an input would be split and an array reference would populate the output field (that is, the field indicated by the output argument). If you would rather get a hash, you can pass keys to use, in order. If this is the case, you can also accept getting more values than you have keys for with allow_surplus, or fewer of them with allow_missing.
Last, you might want to take advantage of trim if your values shouldn't have leading/trailing spaces. Be sure to read the fine print about trimming quoted strings, though.
Accepted arguments are:
allow_missing
allow_surplus
-
these are integer values that set how many fewer/more values you are willing to admit with respect to the provided keys (see below). Hence, they only work when keys is set.
By default they are set to 0, meaning that you expect to have exactly the same number of values as there are keys. Allowing missing means that you accept getting fewer values than there are keys; the keys left without a value will be associated to undef. Allowing surplus means that you're willing to ditch that number of exceeding values;
input
-
name of the input field, defaults to raw;
keys
-
an array reference with the keys to be associated (one-by-one, in order) to the extracted values;
name
-
name of the tube, useful for debugging. Defaults to parse by value and separator;
output
-
name of the output field, defaults to structured;
separator
-
the separator to be used between two consecutive valid values. It can be one of the following:
a sub reference, that is called with whatever arguments provided (as a hash reference) and MUST return one of the following two alternatives;
a regular expression reference, that will be matched for the separator;
a plain string, that will be matched verbatim.
There is no default, you MUST provide one either as the first, unnamed parameter or as argument separator;
trim
-
remove leading and trailing whitespaces from the extracted values. This is applied before decoding is applied, which means that leading/trailing whitespaces inside quoted strings will be kept. Defaults to a false value, meaning that no trimming is performed;
value
-
this is how you provide a description of what you consider a valid value. It can be multiple things:
a sub reference, that is called and MUST provide back one of the following alternatives;
a regular expression reference, that is used directly;
a plain string, that is turned into an array reference by creating an anonymous array with the string as its only element, then processed as in the following bullet;
an array reference with elements inside, that will be described in the following list.
If you end up with an array reference, each element will be put in a big regular expression that is the OR of all elements. Each can be:
a regular expression reference, that is fit as-is in the big regular expression;
the string specials, that is the same as having put the three strings escaped, single-quoted and double-quoted;
the string quoted, that is the same as having put the two strings single-quoted and double-quoted;
the string single-quoted (or single_quoted), that allows you to match a string that is delimited by single quotes, with no escaping inside. This is always put at the beginning of the big regular expression (although double-quoted strings can actually be fit before it);
the string double-quoted (or double_quoted), that allows you to match a string that is delimited by double quotes, also allowing escaped elements inside (via backslashes). This is always put at the beginning of the big regular expression;
the string escaped, that allows you to match a non-greedy sequence of escaped characters (via backslash). If single-quoted is also specified, single quotes need to be escaped too. If double-quoted is also specified, double quotes need to be escaped too. This is always set at the end of the big regular expression (except for whatever, that might appear after it);
the string whatever, that allows you to match a non-greedy sequence of characters, i.e. it is a synonym of regular expression (?ms:.*?). If present, it is always set at the end of the big regular expression.
For example, if you want to accept single quoted, double quoted and unquoted strings, you might provide the following:
[qw< single-quoted double-quoted whatever >]
ghashy
my $tube = ghashy(%args); # OR
my $tube = ghashy(\%args);
parse the input text as a hash, generalized. The algorithm used is the same as "generalized_hashy" in Data::Tubes::Util. It is a generalization of "hashy" below.
Accepts all arguments as "generalized_hashy" in Data::Tubes::Util, with the same default values except for default_key that is set to the empty string (as opposed to not being defined). This means that stand-alone values will always be accepted. This setting is in line with "hashy" and has been set for backwards/mutual compatibility.
The following arguments are recognised too:
defaults
-
a hash reference with default values for the output;
input
-
name of the input field, defaults to raw;
name
-
name of the tube, useful for debugging. Defaults to parse ghashy;
output
-
name of the output field, defaults to structured.
hashy
my $tube = hashy(%args); # OR
my $tube = hashy(\%args);
parse the input text as a hash. The algorithm used is the same as "metadata" in Data::Tubes::Util.
chunks_separator
-
character used to divide chunks in the input, defaults to a space character (ASCII 0x20);
default_key
-
the default key to be used when a key is not present in a chunk, defaults to the empty string;
defaults
-
a hash reference with default values for the output;
input
-
name of the input field, defaults to raw;
key_value_separator
-
character used to divide the key from the value in a chunk, defaults to the equal sign =;
name
-
name of the tube, useful for debugging. Defaults to parse hashy;
output
-
name of the output field, defaults to structured.
This tube factory is strict in what it accepts as input, in that the separators MUST be single characters and there is no escaping mechanism. If you need something more flexible, see "ghashy" above.
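A rough idea of this strict parsing is sketched below in core Perl. The function parse_hashy_text is a hypothetical stand-in written for this example; the real algorithm lives in Data::Tubes::Util and may differ in details, for instance in how defaults are merged, which here is assumed to mean that parsed values override them.

```perl
use strict;
use warnings;

# Hypothetical re-implementation, for illustration only.
sub parse_hashy_text {
    my ($text, %opt) = @_;
    my $chunks_separator    = $opt{chunks_separator}    // ' ';
    my $key_value_separator = $opt{key_value_separator} // '=';
    my $default_key         = $opt{default_key}         // '';

    # Assumption: defaults are the starting point, parsed pairs win.
    my %structured = %{$opt{defaults} || {}};

    for my $chunk (split /\Q$chunks_separator\E/, $text) {
        next unless length $chunk;
        my ($key, $value) =
          split /\Q$key_value_separator\E/, $chunk, 2;

        # A chunk without separator is a stand-alone value that
        # goes under the default key.
        ($key, $value) = ($default_key, $key) unless defined $value;
        $structured{$key} = $value;
    }
    return \%structured;
}

my $parsed = parse_hashy_text(
    'name=Flavio lang=perl standalone',
    defaults => {lang => 'none', level => 1},
);

# $parsed is:
# {name => 'Flavio', lang => 'perl', '' => 'standalone', level => 1}
```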
parse_by_format
Alias for "by_format".
parse_by_regex
Alias for "by_regex".
parse_by_separators
Alias for "by_separators".
parse_by_split
Alias for "by_split".
parse_by_value_separator
Alias for "by_value_separator".
parse_ghashy
Alias for "ghashy".
parse_hashy
Alias for "hashy".
parse_single
Alias for "single".
single
my $tube = single(%args); # OR
my $tube = single(\%args);
consider the input text as already parsed, and generate as output a hash reference where the text is associated to a key.
input
-
name of the input field, defaults to raw;
key
-
key to use for associating the input text;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured.
BUGS AND LIMITATIONS
Report bugs either through RT or GitHub (patches welcome).
AUTHOR
Flavio Poletti <polettix@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2016 by Flavio Poletti <polettix@cpan.org>
This module is free software. You can redistribute it and/or modify it under the terms of the Artistic License 2.0.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.