NAME
DataExtract::FixedWidth - The one stop shop for parsing static column width text tables!
SYNOPSIS
## We assume the columns have no spaces in the header.
my $de = DataExtract::FixedWidth->new({ header_row => $header_row });
## We explicitly tell what column names to pick out of the header.
my $de = DataExtract::FixedWidth->new({
header_row => $header_row,
cols => [qw/COL1NAME COL2NAME COL3NAME/, 'COL WITH SPACE IN NAME']
});
## We supply data to heuristically determine the header. Here we assume the first
## row is the header (if the first row is data, avoid this assumption by setting
## header_row to undef). The result of the heuristic applied to the first row
## gives the columns.
my $de = DataExtract::FixedWidth->new({ heuristic => \@datarows });
$de->parse( $data_row );
$de->parse_hash( $data_row );
DESCRIPTION
This module parses any type of fixed width table -- these types of tables are often output by ghostscript, printf() displays with string padding (i.e. %-20s %20s etc), and most screen capture mechanisms. This module uses Moose, so all methods can also be specified as options in the constructor.
In the below example, this module can discern the column names from the header. Or, you can supply them explicitly in the constructor; or, you can supply the rows in an ArrayRef to heuristic and hope for the best.
SAMPLE FILE
HEADER: 'COL1NAME  COL2NAME       COL3NAMEEEEE'
DATA1:  'FOOBARBAZ THIS IS TEXT   ANOTHER COL '
DATA2:  'FOOBAR    FOOBAR IS TEXT ANOTHER COL '
After you have constructed the object, you can call ->parse, which returns an ArrayRef:
$de->parse('FOOBARBAZ THIS IS TEXT   ANOTHER COL');
Or, you can use ->parse_hash(), which returns a HashRef of the data indexed by the column header.
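As a rough illustration of the shape of these two return values, the sketch below parses the sample rows with Perl's core unpack. It does not use the module itself: the template 'A10 A15 A*' is hand-written for the sample layout and is an assumption, not necessarily what the module generates, and the helper names parse_line and parse_line_hash are hypothetical.

```perl
use strict;
use warnings;

# Hand-written template for the sample layout (an assumption):
# 10 characters, 15 characters, then the rest of the line.
# The 'A' format strips trailing spaces from each field.
my $template = 'A10 A15 A*';
my @cols     = qw/COL1NAME COL2NAME COL3NAMEEEEE/;

# Hypothetical helper: one row in, ArrayRef of fields out.
sub parse_line {
    my ($line) = @_;
    return [ unpack $template, $line ];
}

# Hypothetical helper: one row in, HashRef keyed by column name out.
sub parse_line_hash {
    my ($line) = @_;
    my %row;
    @row{@cols} = @{ parse_line($line) };
    return \%row;
}

my $fields = parse_line('FOOBARBAZ THIS IS TEXT   ANOTHER COL ');
my $row    = parse_line_hash('FOOBAR    FOOBAR IS TEXT ANOTHER COL ');

print join( '|', @$fields ), "\n";    # FOOBARBAZ|THIS IS TEXT|ANOTHER COL
print $row->{COL2NAME}, "\n";         # FOOBAR IS TEXT
```

The point of the module is that you should not have to count characters and write such a template by hand.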
Constructor
The class constructor, ->new, provides numerous features. Some of its options are:
- heuristic => \@lines
-
This will deduce the unpack format string from data. If you opt to use this method and need parse_hash, the first row of the heuristic is assumed to be the header_row. The unpack_string that results from the heuristic is applied to the header_row to determine the columns.
- cols => \@cols
-
This will permit you to explicitly list the columns in the header row. This is especially handy if you have spaces in the column header. This option will make the header_row mandatory.
- header_row => $string
-
If a cols option is not provided, the assumption is that there are no spaces in the column header. The module can take care of the rest. The only way this option can be avoided is if we deduce the header from heuristics, or if you explicitly supply the unpack string and only use ->parse($line). If you are not going to supply a header, and you do not want to waste the first line on a header assumption, set header_row => undef in the constructor.
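The two deduction strategies described above (from a header with no spaces in its names, and from the data rows themselves) can be sketched in core Perl as follows. This is a minimal sketch under stated assumptions, not the module's actual implementation: all three helper names are hypothetical, and the heuristic shown (a column starts wherever a non-blank position follows a position that is blank in every row) is only one plausible approach.

```perl
use strict;
use warnings;

# Hypothetical helper: turn column start offsets into an unpack template.
# The first column is forced to start at offset 0, mirroring what
# first_col_zero is documented to do; the last column takes the rest.
sub template_from_starts {
    my @starts = @_;
    $starts[0] = 0 if @starts;
    my @parts;
    for my $i ( 0 .. $#starts ) {
        push @parts, $i < $#starts
            ? 'A' . ( $starts[ $i + 1 ] - $starts[$i] )
            : 'A*';
    }
    return join ' ', @parts;
}

# Header without spaces in the names: every run of non-whitespace
# marks the start of a column.
sub starts_from_header {
    my ($header) = @_;
    my @starts;
    while ( $header =~ /\S+/g ) {
        push @starts, $-[0];    # offset where this match began
    }
    return @starts;
}

# Assumed heuristic: a "gutter" is a position that is blank (or past the
# end of the line) in every row; a column starts right after a gutter.
sub starts_from_rows {
    my @rows = @_;
    my $max  = 0;
    for my $r (@rows) {
        $max = length($r) if length($r) > $max;
    }
    my @starts;
    my $prev_gutter = 1;    # treat the position before offset 0 as a gutter
    for my $pos ( 0 .. $max - 1 ) {
        my $gutter = 1;
        for my $r (@rows) {
            if ( $pos < length($r) and substr( $r, $pos, 1 ) =~ /\S/ ) {
                $gutter = 0;
                last;
            }
        }
        push @starts, $pos if $prev_gutter and not $gutter;
        $prev_gutter = $gutter;
    }
    return @starts;
}

my @rows = (
    'COL1NAME  COL2NAME       COL3NAMEEEEE',
    'FOOBARBAZ THIS IS TEXT   ANOTHER COL ',
    'FOOBAR    FOOBAR IS TEXT ANOTHER COL ',
);
my $from_header = template_from_starts( starts_from_header( $rows[0] ) );
my $from_rows   = template_from_starts( starts_from_rows(@rows) );
print "$from_header\n";    # A10 A15 A*
print "$from_rows\n";      # A10 A15 A*
```

Note that the header-based approach breaks as soon as a column name contains a space, which is the case the cols option exists to handle.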
Methods
An asterisk (*) next to a value means that value is the default.
- ->parse( $data_line )
-
Parses the data and returns an ArrayRef
- ->parse_hash( $data_line )
-
Parses the data and returns a HashRef, indexed by the cols (headers)
- ->first_col_zero(1*|0)
-
This option forces the unpack string to make the first column absorb any characters to the left of the first header column. So, in the example below, the first column also includes the first character of the row, even though the word STOCK begins at the second character.
CHAR NUMBERS: |1|2|3|4|5|6|7|8|9|10
HEADER ROW  : | |S|T|O|C|K| |V|I|N
- ->trim_whitespace(1*|0)
-
Trims leading and trailing whitespace from the elements that ->parse() returns.
- ->fix_overlay(1|0*)
-
Fixes columns that bleed into other columns by moving all non-whitespace characters that precede the first whitespace of the next column back into the column they belong to.
So if ColumnA is 'foob' and ColumnB is 'ar Hello world', ColumnA becomes 'foobar' and ColumnB becomes 'Hello world'.
- ->null_as_undef(1|0*)
-
Simply undefs all elements for which length(element) == 0. Requires ->trim_whitespace.
- ->skip_header_data(1*|0)
-
Skips duplicate copies of the header_row if found in the data
- ->colchar_map
-
Returns a HashRef that shows each column header and the character position at which the column starts.
- ->unpack_string
-
Returns the CORE::unpack() template string that will be used internally by ->parse()
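To make the interaction of fix_overlay, trim_whitespace and null_as_undef concrete, here is a minimal core-Perl sketch of the behaviour described above. It is not the module's code: post_process is a hypothetical helper, and the rule used to detect a bleeding column (a field that ends in a non-space character directly followed by a field that starts with one) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper sketching the documented options on a list of
# raw, still padded, fields.
sub post_process {
    my ( $fields, %opt ) = @_;
    my @out = @$fields;

    if ( $opt{fix_overlay} ) {
        for my $i ( 0 .. $#out - 1 ) {
            # Assumed rule: the field bleeds if it ends in a non-space
            # and the next field begins with a non-space.
            next unless $out[$i] =~ /\S\z/ and $out[ $i + 1 ] =~ /\A\S/;
            # Move the leading run of non-space characters back.
            $out[ $i + 1 ] =~ s/\A(\S+)\s*//;
            $out[$i] .= $1;
        }
    }
    if ( $opt{trim_whitespace} ) {
        s/\A\s+|\s+\z//g for @out;
    }
    if ( $opt{null_as_undef} ) {
        # Only meaningful after trimming, hence the documented dependency.
        @out = map { length($_) ? $_ : undef } @out;
    }
    return \@out;
}

# 'a' (unlike 'A') keeps the padding, so the overlay check can see it.
my @raw   = unpack 'a4 a14 a5', 'foobar Hello world     ';
my $clean = post_process(
    \@raw,
    fix_overlay     => 1,
    trim_whitespace => 1,
    null_as_undef   => 1,
);

print defined $_ ? "'$_'" : 'undef', "\n" for @$clean;
# 'foobar'
# 'Hello world'
# undef
```

The order matters in this sketch: the overlay fix runs on untrimmed fields, and empty strings only appear once whitespace has been trimmed.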
AVAILABILITY
CPAN.org
COPYRIGHT & LICENSE
Copyright 2008 Evan, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
AUTHOR
Evan Carroll <me at evancarroll.com>
System Lord of the Internets
BUGS
Please report any bugs or feature requests to bug-dataextract-fixedwidth at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=DataExtract-FixedWidth. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.