NAME
Data::Tubes::Plugin::Parser
DESCRIPTION
This module contains factory functions to generate tubes that ease parsing of input records.
Each of the generated tubes has the following contract:
the input record MUST be a hash reference;
one field in the hash (according to factory argument input, set to raw by default) points to the input text that has to be parsed;
one field in the hash (according to factory argument output, set to structured by default) is set to the output of the parsing operation.
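The contract above can be pictured with a minimal pure-Perl sketch. It does not use this module: make_parser_tube is a hypothetical helper written only for illustration, and the comma-splitting parser passed to it is a placeholder.

```perl
use strict;
use warnings;

# Hypothetical factory, NOT the module's code: it only mimics the contract.
sub make_parser_tube {
   my ($parser, %args) = @_;
   my $input  = defined $args{input}  ? $args{input}  : 'raw';
   my $output = defined $args{output} ? $args{output} : 'structured';
   return sub {
      my ($record) = @_;
      die "input record MUST be a hash reference\n"
        unless ref($record) eq 'HASH';

      # read the text from the input field, store the result in the output field
      $record->{$output} = $parser->($record->{$input});
      return $record;
   };
}

my $tube   = make_parser_tube(sub { [split /,/, $_[0]] });
my $record = $tube->({raw => 'a,b,c'});
# $record->{structured} now holds the parsed data, $record->{raw} is untouched
```

The same record flows in and out, which is what allows further processing steps to find the parsed data under the output field.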
The factory functions below have two names, one starting with parse_ and the other without this prefix. The two are perfectly equivalent; the short version can be handier, e.g. when using tube or pipeline from Data::Tubes.
FUNCTIONS
- by_format
- parse_by_format
-
my $tube = by_format(%args); # OR
my $tube = by_format(\%args); # OR
my $tube = by_format($format, %args);
parse the input text according to a template format string (passed via factory argument format or through the first unnamed parameter $format). This string is supposed to be composed of word and non-word sequences, where each word sequence is assumed to be the name of a field, and each non-word sequence is a separator. Example:
$format = 'foo;bar;baz';
is interpreted as follows:
@field_names = ('foo', 'bar', 'baz');
@separators = (';', ';');
Example:
$format = 'foo;bar~~~baz';
is interpreted as follows:
@field_names = ('foo', 'bar', 'baz');
@separators = (';', '~~~');
In the first case, i.e. when all separators are equal to each other, "parse_by_split" is called, as it is (arguably) slightly more efficient. Otherwise, "parse_by_regexes" is called. The tube generated by whichever of the two factories is selected is returned unchanged.
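The interpretation of a format string can be sketched in plain Perl as follows. This is only an illustration under stated assumptions: analyze_format is a hypothetical helper, not a function of this module, and it assumes the format starts with a field name.

```perl
use strict;
use warnings;

# Hypothetical helper: breaks a format into field names and separators,
# and tells whether all separators are equal to each other.
sub analyze_format {
   my ($format) = @_;

   # the capturing group makes split return the separators too, so
   # names sit at even positions and separators at odd ones
   my @items = split /(\W+)/, $format;
   my (@names, @separators);
   for my $i (0 .. $#items) {
      if ($i % 2) { push @separators, $items[$i] }
      else        { push @names,      $items[$i] }
   }

   my %distinct;
   $distinct{$_} = 1 for @separators;
   my $all_equal = keys(%distinct) <= 1 ? 1 : 0;

   return (\@names, \@separators, $all_equal);
}

my ($names, $separators, $all_equal) = analyze_format('foo;bar~~~baz');
# here the separators differ, so plain splitting would not be enough
```

The flag returned as third element corresponds to the choice described above between a simple split and a more general strategy.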
All @field_names MUST be different from one another.
The following arguments are supported:
allow_missing
-
set to the number of missing trailing elements that you are willing to tolerate, in case the format uses one single separator throughout and "parse_by_split" is used behind the scenes. This allows you to set an optional catch-all trailing field to collect whatever you are not really interested in, while also allowing for its absence.
As an example, consider the following input lines:
FOO0,BAR0,BAZ0,WHATEVER
FOO1,BAR1,BAZ1
FOO2,BAR2,BAZ2,WHAT2,EVER2,
Assuming that you're really interested in the first three fields only, disregarding whatever comes after, you can set the following format:
foo,bar,baz,rest
and also set allow_missing to 1, indicating that you can tolerate the lack of rest (which you really don't care about);
format
-
the format to use for splitting the inputs. This parameter is the main one, so it can also be passed as the first, unnamed parameter (see third calling convention);
input
-
name of the input field, defaults to raw;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
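The effect of allow_missing on the three example lines above can be sketched in plain Perl. This is not the module's implementation: split_with_keys is a hypothetical helper, and limiting the split to the number of keys (so that the last key collects everything left) is an assumption made for the sake of the example.

```perl
use strict;
use warnings;

# Hypothetical helper mimicking the described behaviour, not the module's code.
sub split_with_keys {
   my ($text, $separator, $keys, $allow_missing) = @_;
   $allow_missing ||= 0;
   my $n_keys = scalar @$keys;

   # Assumption: a limit of $n_keys lets the last key collect the remainder
   my @values = split /\Q$separator\E/, $text, $n_keys;

   my $missing = $n_keys - @values;
   die "too many missing fields in '$text'\n" if $missing > $allow_missing;

   my %structured;
   @structured{@$keys} = @values;    # absent trailing values become undef
   return \%structured;
}

my @keys = split /\W+/, 'foo,bar,baz,rest';
for my $line (
   'FOO0,BAR0,BAZ0,WHATEVER',
   'FOO1,BAR1,BAZ1',
   'FOO2,BAR2,BAZ2,WHAT2,EVER2,',
  )
{
   my $parsed = split_with_keys($line, ',', \@keys, 1);
   my $rest = defined $parsed->{rest} ? $parsed->{rest} : '(undef)';
   print "$parsed->{foo} $parsed->{bar} $parsed->{baz} rest=$rest\n";
}
```

In this sketch the second line is accepted only because one missing trailing element is allowed; with a tolerance of 0 it would be rejected.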
- by_regex
- parse_by_regex
-
my $tube = by_regex(%args); # OR
my $tube = by_regex(\%args); # OR
my $tube = by_regex($regex, %args);
parse the input text based on a regular expression, passed as argument regex or as the first unnamed parameter $regex. The regular expression is supposed to have named captures, which will eventually be used to populate the output.
The following arguments are supported:
input
-
name of the input field, defaults to raw;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
regex
-
the regular expression to use for parsing the inputs. This is the main argument, and can also be passed as the first unnamed one in the argument list.
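How named captures turn into output fields can be sketched in plain Perl. The helper parse_with_regex and the sample regular expression are hypothetical, and returning undef on a failed match is an assumption of this sketch, not a statement about what the module does.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
sub parse_with_regex {
   my ($text, $regex) = @_;
   return undef unless $text =~ $regex;

   # %+ holds the named captures of the last successful match; copy it
   # right away, because later matches would overwrite it
   my %structured = %+;
   return \%structured;
}

my $regex  = qr{\A(?<key>\w+)=(?<value>.*)\z};
my $parsed = parse_with_regex('color=blue', $regex);
# $parsed has one entry per named capture, i.e. "key" and "value"
```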
- by_separators
- parse_by_separators
-
my $tube = by_separators(%args); # OR
my $tube = by_separators(\%args);
parse the input according to a series of separators, that will be applied in sequence. For example, if the list of separators is the following:
@separators = (';', '~~');
the following input:
$text = 'foo;bar~~/baz/';
will be split as:
@split = ('foo', 'bar', '/baz/');
The following arguments are supported:
input
-
name of the input field, defaults to raw;
keys
-
a reference to an array containing the list of keys to be associated to the values from the split;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
separators
-
a reference to an array containing the list of separators to be used for splitting the input.
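Applying separators in sequence can be sketched in plain Perl by building one regular expression out of them. This is an illustration only: split_by_separators is a hypothetical helper, and both the regex-based approach and the treatment of keys as optional are assumptions of the sketch rather than facts about the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
sub split_by_separators {
   my ($text, $separators, $keys) = @_;

   # one non-greedy capture before each separator, a greedy one at the end
   my $pattern = join '', map { '(.*?)' . quotemeta($_) } @$separators;
   my $regex = qr/\A${pattern}(.*)\z/s;

   my @values = $text =~ $regex
     or die "input '$text' does not match the separators\n";
   return \@values unless $keys;

   my %structured;
   @structured{@$keys} = @values;
   return \%structured;
}

my $split = split_by_separators('foo;bar~~/baz/', [';', '~~']);
# @$split holds the three pieces found around the two separators
```

With a list of keys as third argument the same pieces are returned as a hash reference instead.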
- by_split
- parse_by_split
-
my $tube = by_split(%args); # OR
my $tube = by_split(\%args); # OR
my $tube = by_split($separator, %args);
split the input according to a separator string, passed either as the first unnamed parameter $separator or as hash option separator.
The following arguments are supported:
allow_missing
-
set to the number of missing trailing elements that you are willing to tolerate, in case you also provide keys (see below). This is particularly important when this function is called behind the scenes by "parse_by_format", because that sets keys.
In practice, suppose that you set the following keys:
[qw< foo bar baz whatever >]
A normal parsing will expect to find at least four elements, so the following input would fail:
FOO,BAR,BAZ
On the other hand, if you set allow_missing to 1, you are accepting that there might be a missing value for whatever, which will be filled with the undefined value.
input
-
name of the input field, defaults to raw;
keys
-
optional reference to an array containing a list of keys to be associated to the split data. If present, the keys are associated to the split values in the output; if absent, the output is set to a reference to an array holding the split values.
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
separator
-
the separator string to be used for splitting the input. This is the main argument, and can also be passed as the first unnamed one in the argument list.
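The two output shapes, with and without keys, can be sketched in plain Perl as follows. The helper split_record is hypothetical and is not the module's code; it leaves out allow_missing handling entirely and assumes that the separator is a literal string.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code.
sub split_record {
   my ($text, $separator, $keys) = @_;
   my @values = split /\Q$separator\E/, $text;

   # no keys: the output is a reference to an array
   return \@values unless $keys;

   # keys: each key is associated to the value in the same position
   my %structured;
   @structured{@$keys} = @values;
   return \%structured;
}

my $as_array = split_record('FOO,BAR,BAZ', ',');
my $as_hash  = split_record('FOO,BAR,BAZ', ',', [qw< foo bar baz >]);
# $as_array->[1] and $as_hash->{bar} both refer to the second piece
```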
- hashy
- parse_hashy
-
my $tube = hashy(%args); # OR
my $tube = hashy(\%args);
parse the input text as a hash. The algorithm used is the same as metadata in Data::Tubes::Util.
chunks_separator
-
character used to divide chunks in the input;
default_key
-
the default key to be used when a key is not present in a chunk;
input
-
name of the input field, defaults to raw;
key_value_separator
-
character used to divide the key from the value in a chunk;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
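The role of the three hashy-specific arguments can be sketched in plain Perl. This is a simplified illustration and not the algorithm of metadata in Data::Tubes::Util, which may differ in details such as defaults or escaping; parse_hashy_text is a hypothetical helper, and the separators and default key in the usage are arbitrary sample values.

```perl
use strict;
use warnings;

# Hypothetical helper, a simplification that is NOT the module's algorithm.
sub parse_hashy_text {
   my ($text, %args) = @_;
   my %structured;
   for my $chunk (split /\Q$args{chunks_separator}\E/, $text) {

      # split each chunk in two pieces at most: key and value
      my ($key, $value) =
        split /\Q$args{key_value_separator}\E/, $chunk, 2;

      # a chunk without the key/value separator carries a value only,
      # so it goes under the default key
      ($key, $value) = ($args{default_key}, $key) unless defined $value;

      $structured{$key} = $value;
   }
   return \%structured;
}

my $parsed = parse_hashy_text(
   'name=foo|lang=perl|whatever',
   chunks_separator    => '|',
   key_value_separator => '=',
   default_key         => 'extra',
);
# "whatever" has no key of its own, so it is stored under "extra"
```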
- single
- parse_single
-
my $tube = single(%args); # OR
my $tube = single(\%args);
consider the input text as already parsed, and generate as output a hash reference where the text is associated to a key.
input
-
name of the input field, defaults to raw;
key
-
key to use for associating the input text;
name
-
name of the tube, useful for debugging;
output
-
name of the output field, defaults to structured;
BUGS AND LIMITATIONS
Report bugs either through RT or GitHub (patches welcome).
AUTHOR
Flavio Poletti <polettix@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2016 by Flavio Poletti <polettix@cpan.org>
This module is free software. You can redistribute it and/or modify it under the terms of the Artistic License 2.0.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.