NAME

Data::TableReader - Extract records from "dirty" tabular data sources

VERSION

version 0.001

SYNOPSIS

my $ex= Data::TableReader->new(
  # path or file handle
  # let it auto-detect the format (but can override that if we need)
  input => 'path/to/file.csv',
  
  # We want these fields to exist in the file (identified by headers)
  fields => [
    { name => 'address', header => qw/street|address/i },
    'city',
    'state',
    # can validate with Type::Tiny classes
    { name => 'zip', header => qw/zip\b|postal/i, type => US_Zipcode },
  ],
  
  # could do this after extraction, but this fixes it before the type validation
  filters => [
     # they keep losing the leading zeroes on our zip codes. grr.
     sub {
       $_[0]{zip} =~ s/^(\d+)(-(\d+))?$/sprintf("%05d-%04d", $1, $3||0)/e;
       return $_[0];
     },
  ],
  
  # Our data provider is horrible; just ignore any nonsense we encounter
  on_blank_row => 'next',
  on_validation_fail => 'next',
  
  # Capture warnings and show to user who uploaded file
  log => \(my @messages)
);

my $records= $ex->iterator->all;
...
$http_response->body( encode_json({ messages => \@messages }) );

DESCRIPTION

This module is designed to be a useful for anyone who needs to take "loose" or "dirty" tabular data sources (such as Excel, CSV, TSV, or HTML) which may have been edited by non-technical humans and extract the data into sanitized records, while also verifying that the data file contains roughly the schema you were expecting. It is primarily intended for making automated imports of data from non-automated or unstable sources, and providing human-readable feedback about the validity of the data file.

ATTRIBUTES

input

This can be a file name or Path::Class instance or file handle. If a file handle, it must be seekable in order to auto-detect the file format, or you may specify the decoder directly to avoid auto-detection.

decoder

This is either an instance of Data::TableReader::Decoder, or a class name, or a partial class name to be appended as "Data::TableReader::Decoder::$name" or an arrayref or hashref of arguments to build the decoder.

Examples:

'CSV'
# becomes Data::TableReader::Decoder::CSV->new()

[ 'CSV', sep_char => "|" ]
# becomes Data::TableReader::Decoder::CSV->new(sep_char => "|")

{ CLASS => 'CSV', sep_char => "|" }
# becomes Data::TableReader::Decoder::CSV->new({ sep_char => "|" })

fields

An arrayref of Data::TableReader::Field objects (or hashrefs to construct them with) which this module should search for within the tables (worksheets etc.) of "input".

record_class

Default is the special value 'HASH' for un-blessed hashref records. The special value 'ARRAY' will result in arrayrefs with fields in the same order they were specified in the "fields" specification. Setting it to anything else will return records created with $record_class->new(\%fields);

filters

List of filters which should be applied to the data records after they have been pulled from the decoder but before they have been type-checked or passed to the record constructor (if any). Each element of the list should be a coderef which receives a hashref and returns it, possibly modified.

static_field_order

Boolean, whether the "fields" must be found in columns in the exact order that they were specified. Default is false.

header_row_at

Row number, or range of row numbers where the header must be found. (All row numbers in this module are 1-based, to match end-user expectations.) The default is [1,10] to limit header scanning to the first 10 rows. As a special case, if you are reading a source which lacks headers and you trust the source to deliver the columns in the right order, you can set this to undef if you also set static_field_order => 1.

on_unknown_columns

on_unknown_columns => 'use'  # warn, and then use the table
on_unknown_columns => 'next' # warn, and then look for another table which matches
on_unknown_columns => 'die'  # fatal error
on_unknown_columns => sub {
  my ($reader, $col_headers)= @_;
  ...;
  return $opt; # one of the above values
}

This determines handling for columns that aren't associated with any field. The "required" columns must all be found before it considers this setting, but once it has found everything it needs to make this a candidate, you might or might not care about the leftover columns.

'use' (default)

You don't care if there are extra columns, just log warnings about them and proceed extracting from this table.

'next'

Extra columns mean that you didn't find the table you wanted. Log the near-miss, and keep searching additional rows or additional tables.

'die'

This header is probably what you want, but you consider extra columns to be an error condition. Logs the details and calls croak.

sub {}

You can add your own logic to handle this. Inspect the headers however you like, and then return one of the above values.

If you want to get cleaner default log messages, i.e. to show to users, see "log".

on_blank_rows

on_blank_rows => 'next' # warn, and then skip the row(s)
on_blank_rows => 'last' # warn, and stop iterating the table
on_blank_rows => 'die'  # fatal error
on_blank_rows => 'use'  # actually try to return the blank rows as records
on_blank_rows => sub {
  my ($reader, $first_blank_rownum, $last_blank_rownum)= @_;
  ...;
  return $opt; # one of the above values
}

This determines what happens when you've found the table, are extracting records, and encounter a series of blank rows (defined as a row with no printable characters in any field) followed by non-blank rows. If you use the callback, it suppresses the default warning, since you can generate your own. If you want to get cleaner log messages, i.e. to show to users, see "log".

The default is 'next'.

on_validation_fail

on_validation_fail => 'next'  # warn, and then skip the record
on_validation_fail => 'use'   # warn, and then use the record anyway
on_validation_fail => 'die'   # fatal error
on_validation_fail => sub {
  my ($reader, $failures, $values, $context)= @_;
  for (@$failures) {
    my ($field, $value_index, $message)= @$_;
    ...
    # $field is a Data::TableReader::Field
    # $values->[$value_index] is the string that failed validation
    # $message is the error returned from the validation function
    # $context is a string describing the source of the row, like "Row 5"
    # You may modify $values to alter the record that is about to be created
  }
  # Clear the failures array to suppress warnings, if you actually corrected
  # the validation problems.
  @$failures= () if $opt eq 'use';
  # return one of the above constants to tell the iterator what to do next
  return $opt;
}

This determines what happens when you've found the table, are extracting records, and one row fails its validation. In addition to deciding an option, the callback gives you a chance to alter the record before 'use'ing it. If you use the callback, it suppresses the default warning, since you can generate your own. If you want to get cleaner log messages, i.e. to show to users, see "log".

The default is 'die'.

log

If undefined (the default) all log messages above 'info' will be emitted with warn "$message\n". If set to an object, it should support an API of:

trace,  is_trace
debug,  is_debug
info,   is_info
warn,   is_warn
error,  is_error

such as Log::Any and may other perl logging modules use. You can also set it to a coderef such as:

my @messages;
sub { my ($level, $message)= @_;
  push @messages, [ $level, $message ]
    if grep { $level eq $_ } qw( info warn error );
};

for a simple way to capture the messages without involving a logging module. And for extra convenience, you can set it to an arrayref which will receive any message that would otherwise have gone to 'warn' or 'error'.

METHODS

detect_input_format

This is used internally to detect the format of a file, but you can call it manually if you like. The first argument (optional) is a file name, and the second argument (also optional) is the first few hundred bytes of the file. Missing arguments will be pulled from "input" if possible. The return value is the best guess of module name and constructor arguments that should be used to parse the file. However, this doesn't guarantee such module actually exists or is installed; it might just echo the file extension back to you.

find_table

Search through the input for the beginning of the records, identified by a header row matching the various constraints defined in "fields". If "header_row_at" is undef, then this does nothing and assumes success.

Returns a boolean of whether it succeeded. This method does not croak on failure like "iterator" does, on the assumption that you want to handle them gracefully. All diagnostics about the search are logged via "log".

col_map

This is a lazy attribute from table detection. After ccalling "find_table" you can inspect which fields were found for each column via this method. If called before find_table, this triggers it and throws an exception if one isn't found.

Returns an arrayref with one element for each column, each undefined or a reference to the Field object it matched.

field_map

This is another lazy attribute from table detection, mapping from field name to an array of column indicies which the field will be loaded from. If called before find_table, this triggers it and throws an exception if one isn't found.

iterator

Create an iterator. If the table has not been located, then find it and croak if it can't be found. Dependin on the decoder and input filehandle, you might only be able to have one instance of the iterator at a time.

AUTHOR

Michael Conrad <mike@nrdvana.net>

COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by Michael Conrad.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.