NAME
Table::BoxFormat - Parsing the tabular data format generated by database SELECTs
VERSION
Version 0.01
SYNOPSIS
use Table::BoxFormat;
# Reading input from a "dbox" temp file
my $dbx = Table::BoxFormat->new( input_file => '/tmp/select_result.dbox' );
my $data = $self->data; # array of arrays, header in first row
# Input dbox from a string
my $dbx = Table::BoxFormat->new( input_data => $dboxes_string );
my $data = $self->data; # array of arrays, header in first row
# input from dbox file, output directly to a tsv file
my $dbx = Table::BoxFormat->new();
$dbx->output_to_tsv( '/tmp/select_result.dbox', '/tmp/select_result.tsv' );
# input dbox from a string, output directly to a tsv file
$dbx = Table::BoxFormat->new( input_data => $dbox_string );
$dbx->output_to_tsv( $output_tsv_file );
DESCRIPTION
Table::BoxFormat is a module to work with data in the tabular text format(s) commonly used in database client shells (postgresql's "psql", mysql's "mysql", or sqlite's "sqlite3"), where a SELECT will typical display data in a form such as this (mysql):
+-----+------------+---------------+-------------+
| id | date | type | amount |
+-----+------------+---------------+-------------+
| 11 | 2010-09-01 | factory | 146035.00 |
| 15 | 2011-01-01 | factory | 191239.00 |
| 16 | 2010-09-01 | marketing | 467087.00 |
| 17 | 2010-10-01 | marketing | 409430.00 |
+-----+------------+---------------+-------------+
Or this (postgresql's "ascii" form):
id | date | type | amount
----+------------+-----------+--------
1 | 2010-09-01 | factory | 146035
4 | 2011-01-01 | factory | 191239
6 | 2010-09-01 | marketing | 467087
7 | 2010-10-01 | marketing | 409430
These formats are human-readable, but not suitable for other purposes such as feeding to a graphics program, or inserting into another database table.
This code presumes these text tables of "data boxes" are either stored in a string or saved to a file.
This code works with at least three different formats: mysql, psql and unicode psql.
implementation notes
The main method here is read_dbox, which works by first looking for a horizontal ruler line near the top of the data, for example:
+-----+------------+---------------+-------------+
----+------------+-----------+--------
────┼────────────┼───────────┼────────
These ruler lines are used to identify the boundary columns, afterwhich the header and data lines are treated as fixed-width fields. Leading and trailing whitespace are stripped from each value.
An earlier (now deprecated) method named read_simple takes an opposite approach, ignoring the horizontal rules entirely and doing regular expression matches looking for data delimiters on each line. In comparison, the read_dbox should run faster and be able to handle strings with delimiter characters embedded in them.
METHODS
- new
-
Creates a new Table::BoxFormat object.
Takes a list of attribute/setting pairs as an argument.
- input_encoding
-
Default's to "UTF-8". Change to suit text encoding (e.g. "ISO-8859-1"). Must work as a perl ":encoding(...)" layer.
- output_encoding
-
Like input_encoding. Default: "UTF-8".
- input_file
-
File to input data from. Can be supplied later, e.g. when read_dbox is called. Only required if input_data was not defined directly. (( TODO change this: make it required ? ))
- input_data
-
SQL SELECT output in the fixed-width-plus-delimiter form discussed above.
- the parsing regular expressions (type: RegexpRef)
-
- separator_rule
-
The column separators (vertical bar)
- ruler_line_rule
-
Matches the Horizontal ruler lines (typically just under the header line)
- cross_rule
-
Match cross marks the horizontal bars typically use to mark column boundaries.
- left_edge_rule
-
Left border delimiters (we strip these before processing).
- right_edge_rule
-
Right border delimiters (we strip these before processing).
- slurp_input_data
-
Example usage:
$self->slurp_input_data( $input_file_name );
- read_dbox
-
Given data in tabular boxes from a multiline string, convert it into an array of arrays.
my $data = $bxs->read_dbox();
Converts the boxdata from the object's input_data into an array of arrays, with the field names included in the first row.
As a side-effect, copies the header (first row of returned data) in the object's header, and puts some format metadata in the object's meta.
- analyze_ruler
-
Internal method that analyzes the given ruler line and location to determine column widths and the dbox format.
Returns an ordered list like so:
format: 'mysql', 'postgres', 'postgres_unicode', 'sqlite' header location: a row number: 0 or 1 first_data: the row number where data begins: 2 or 3 positions: a list of column boundary positions
Example usage:
( $format, $header_loc, $first_data, @pos ) = $self->analyze_ruler( $line, $i );
- read_simple
-
This is DEPRECATED. See read_dbox.
Given data in tabular boxes from a multiline string, convert it into an array of arrays.
my $data = $bxs->read_simple();
Goes through the boxdata slurped into the object field input_data, returns it as an array of arrays, including the field names in the first row.
As a side-effect, stores the header (first row of boxdata) in the object's header.
- output_to_tsv
-
A convenience method that runs read_dbox and writes the data to a tsv file specified by the given argument.
Returns a reference to the data (array of arrays).
Example usage:
$dbx->output_to_tsv( $input_dbox_file, $output_tsv_file );
Or:
$dbx = Table::BoxFormat->new( input_file => $input_dbox_file ); $dbx->output_to_tsv( $output_tsv_file );
Or:
$dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_tsv( $output_tsv_file );
- output_to_csv
-
A convenience method that runs read_dbox and writes the data to a csv file specified by the given argument.
Example usage:
$dbx->output_to_csv( $input_dbox_file, $output_csv_file );
Or:
$dbx = Table::BoxFormat->new( input_file => $input_dbox_file ); $dbx->output_to_csv( $output_csv_file );
Or:
$dbx = Table::BoxFormat->new( input_data => $dbox_string ); $dbx->output_to_csv( $output_csv_file );
AUTHOR
Joseph Brenner, <doom@kzsu.stanford.edu>, 05 Jun 2016
LIMITATIONS
memory limited
As implemented, this presumes the entire data set can be held in memory. Future versions may be more stream-oriented: there's no technical reason this couldn't be done.
what you get is what you get
This code is only guaranteed to cover input formats from mysql, psql and some from sqlite3. It may work with other databases, but hasn't been tested.
At present it is not easily extensible (implementing a plugin system ala DBI/DBD seemed like overkill).
sqlite3
This code does not support the default output from sqlite3, only a variation with these settings:
.header on
.mode column
While sqlite3 is very flexible, unfortunately the default output does not seem very useable:
SELECT * from expensoids;
|2010-09-01|factory|146035.0
|2010-11-01|factory|218866.0
|2011-01-01|factory|191239.0
|2010-10-01|marketing|409430.0
This is separated by the traditional ascii vertical bar, but without the usual bracketing spaces, and without any attempt at using fixed width columns. Somewhat oddly, the left edge has a vertical bar, but the right edge does not, but worse there's no header that provides column labels.
If I were actually working with sqlite a lot I would turn on the header display and switch to fixed-width columns:
.header on
.mode column
That yields output that looks like this:
id date type amount
---------- ---------- ---------- ----------
1 2010-09-01 factory 146035.0
2 2010-10-01 factory 208816.0
3 2010-11-01 factory 218866.0
That's very similar to the psql format using "\pset border 0" (which has one space column breaks instead of two): both are supported by read_dbox using the analyze_ruler routine.
COPYRIGHT AND LICENSE
Copyright (C) 2016 by Joseph Brenner
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.