NAME

ETL::Pipeline::Input - Role for ETL::Pipeline input sources

SYNOPSIS

use Moose;
with 'ETL::Pipeline::Input';

sub next_record {
  # Add code to read your data here
  ...
}

DESCRIPTION

ETL::Pipeline reads data from an input source, transforms it, and writes the information to an output destination. This role defines the required methods and attributes for input sources. Every input source must implement ETL::Pipeline::Input.

ETL::Pipeline works by calling the methods defined in this role. The role presents a common interface. It acts as a shim, tying file parsing modules to ETL::Pipeline. For example, CSV files are parsed with the Text::CSV module. ETL::Pipeline::Input::DelimitedText wraps around Text::CSV, implementing this role by calling Text::CSV.

Adding a new input source

Out of the box, ETL::Pipeline provides input sources for Microsoft Excel and CSV (comma-separated values) files. To add your own formats...

1. Create a Perl module. Name it ETL::Pipeline::Input::....
2. Make it a Moose object: use Moose;.
3. Include the role: with 'ETL::Pipeline::Input';.
4. Add the "next_record" method: sub next_record { ... }.
5. Add the "get" method: sub get { ... }.
6. Add the "configure" method: sub configure { ... }.
7. Add the "finish" method: sub finish { ... }.

Ta-da! Your input source is ready to use:

$etl->input( 'YourNewSource' );
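
Putting the steps together, a bare-bones input source might look like the sketch below. The package name and the method bodies are placeholders - fill them in with whatever your format needs.

package ETL::Pipeline::Input::MyFormat;   # hypothetical name

use Moose;
with 'ETL::Pipeline::Input';

sub configure   { ... }   # open the data source
sub next_record { ... }   # read one record; true on success, false at the end
sub get         { ... }   # return the values for a field
sub finish      { ... }   # close the data source

no Moose;
__PACKAGE__->meta->make_immutable;

1;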

Does ETL::Pipeline::Input only work with files?

No. ETL::Pipeline::Input works with any source of data, such as SQL queries, CSV files, or network sockets. Write a "next_record" method using whatever technique suits your needs.

This documentation refers to files because that is what I use the most. Don't let that fool you! ETL::Pipeline::Input was designed to work seamlessly with files and non-files alike.

METHODS & ATTRIBUTES

pipeline

pipeline returns the ETL::Pipeline object using this input source. You can access information about the pipeline inside the methods.

"input" in ETL::Pipeline automatically sets this attribute.

Arguments for "input" in ETL::Pipeline

debug

While we expect perfect data, things go wrong. debug lets "process" in ETL::Pipeline peek into the raw data one record at a time. I use this when tracking down random problems in the middle of a 3,000-row spreadsheet.

"process" in ETL::Pipeline executes this code reference for every record. "process" in ETL::Pipeline ignores the return value.

The code reference receives the current ETL::Pipeline as its first parameter and in $_.

$etl->input( 'UnitTest', debug => sub { print $_->get( 'A' ) } );

filter

filter does extra processing on the file data. The default filter trims leading and trailing whitespace. You can use your own filter to handle special values like "N/A" or "NULL".

Assign a code reference to filter. Unlike the other code references, filter does not have access to the ETL::Pipeline object. The filter receives two array references as parameters. The first array holds the values for filtering. The second array holds the arguments passed to "get".

The filter returns a list of filtered values. The results should be in the same order as the values found in the input.

$etl->input( 'UnitTest', filter => sub {
  my ($values, $arguments) = @_;
  map { $_ eq 'NA' ? '' : $_ } @$values;
} );

skip_if

skip_if accepts a code reference. "process" in ETL::Pipeline executes this code for every input record. If this code returns true, "process" in ETL::Pipeline discards the record with no further processing.

Use skip_if to bypass bad data.

The code reference receives the current ETL::Pipeline as its first parameter and in $_.

Note: skip_if only works on data records. It is not applied to column headers.

$etl->input( 'UnitTest', skip_if => sub { $_->get( 'A' ) eq 'DELETED' } );

stop_if

Normally, ETL::Pipeline reads records until it reaches the end of the file. stop_if accepts a code reference that stops processing early. If the code reference returns true, ETL::Pipeline shuts down, just as if it had reached the end of the file.

I use this with report formats that have grand totals at the end. The totals aren't real data.

The code reference receives the current ETL::Pipeline as its first parameter and in $_.

$etl->input( 'UnitTest', stop_if => sub { $_->get( 'A' ) eq 'Totals' } );

Called from "process" in ETL::Pipeline

next_record

next_record reads the next record from the input source. "process" in ETL::Pipeline calls this method inside a loop. next_record returns a boolean flag. A true value means a record was read successfully. A false value indicates the end of the input - no more records.

The implementing class must define this method.

while ($input->next_record) {
  ...
}
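
A hedged sketch for a simple, line-oriented file follows. The "handle" and "current" attributes are hypothetical - store the record however your format requires.

sub next_record {
  my $self = shift;

  my $fh   = $self->handle;
  my $line = readline $fh;
  return 0 unless defined $line;   # end of input

  chomp $line;
  $self->current( $line );         # stash the record for "get"
  return 1;
}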

get

get returns a list of values from the matching fields of the current record. ETL::Pipeline::Input does not define how "next_record" stores its data internally. You should use the format that best suits your needs. For example, ETL::Pipeline::Input::Excel uses a Spreadsheet::XLSX object. Its get calls object methods to retrieve fields.

"process" in ETL::Pipeline passes in the value from "mapping" in ETL::Pipeline. That can be a scalar value (string), regular expression, or array reference. get returns a list of values from matching fields. "process" in ETL::Pipeline passes that list directly to "set" in ETL::Pipeline::Output.

Note: ETL::Pipeline::Input automatically passes the return values through "filter". You should not call "filter" from inside of the get method.

The implementing class must define this method.

# Retrieve one field named 'A'.
$etl->get( 'A' );

# Retrieve the field from the column 'ID Num'.
$etl->get( qr/id\s*num/i );

# A list is used to build composite field names.
$etl->get( '/root', '/first' );

NOTE: get returns a list - not an individual value. Even if only one field matches, get still returns a list. Calling it in scalar context returns the number of elements in the list - not a value. Keep this in mind when calling get from "stop_if" or "skip_if".
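
When you do call get from "stop_if" or "skip_if", parentheses force list context so you compare a field value instead of an element count. The field name and value below are just placeholders.

$etl->input( 'UnitTest', skip_if => sub {
  my ($value) = $_->get( 'A' );   # parentheses force list context
  return $value eq 'DELETED';
} );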

configure

configure prepares the input source. It can open files, make database connections, or anything else required before reading the first record.

Why not do this in the class constructor? Some roles add automatic configuration. Those roles use the usual Moose method modifiers, which would not work with the constructor.

This configure - for the input source - is called before "configure" in ETL::Pipeline::Output on the output destination. This method should not rely on the configuration of the output destination.

The implementing class must define this method.

$input->configure;
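
As a hedged illustration, a file-based input source might open its handle here. The "file" and "handle" attributes are hypothetical.

sub configure {
  my $self = shift;

  # Open the data file before "process" asks for the first record.
  open my $fh, '<', $self->file
    or die 'Unable to open ' . $self->file . ": $!";
  $self->handle( $fh );
}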

finish

finish shuts down the input source. It can close files, disconnect from the database, or anything else required to cleanly terminate the input.

Why not do this in the class destructor? Some roles add automatic functionality via Moose method modifiers. This would not work with a destructor.

This finish - for the input source - is called after "finish" in ETL::Pipeline::Output on the output destination. This method should not rely on the configuration of the output destination.

The implementing class must define this method.

$input->finish;
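
Continuing the file-based sketch from "configure", finish simply releases whatever configure acquired.

sub finish {
  my $self = shift;

  # Close the handle that "configure" opened.
  my $fh = $self->handle;
  close $fh if defined $fh;
}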

Other Methods & Attributes

record_number

The record_number attribute tells you how many total records have been read by "next_record". The count includes headers and "skip_if" records.

The first record is always 1.

ETL::Pipeline::Input automatically increments the counter after "next_record". The "next_record" method should not change record_number.

decrement_record_number

This method decreases "record_number" by one. It can be used to back out header records from the count.

$input->decrement_record_number;

increment_record_number

This method increases "record_number" by one.

$input->increment_record_number;

SEE ALSO

ETL::Pipeline, ETL::Pipeline::Output

AUTHOR

Robert Wohlfarth <robert.j.wohlfarth@vanderbilt.edu>

LICENSE

Copyright (c) 2016 Vanderbilt University Medical Center

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.