NAME
DBIx::DataAudit - summarize column data for a table
SYNOPSIS
use DBIx::DataAudit;
warn "Running audit for table $table";
my $audit = DBIx::DataAudit->audit( dsn => 'dbi:SQLite:dbname=test.sqlite', table => 'test' );
print $audit->as_text;
# or
print $audit->as_html;
This module provides a summary about the data contained in a table. It provides the descriptive statistics for every column. It's surprising how much bad data you find by looking at the minimum and maximum values of a column alone.
It tries to get the information in one table scan.
HOW IT WORKS
The module works by constructing an SQL statement that collects the information about the columns in a single full table scan.
COLUMN TRAITS
You can specify which information is collected about every column by specifying the traits. The hierarchy of traits is as follows:
any < ordered < numeric
< string
The following traits are collected for every column by default:
total
Number of rows in the column
values
Number of distinct values in the column
null
Number of
NULL
values for the column
For columns that are recognized as ordered, the following additional traits are collected:
min
Minimum value for the column
max
Maximum value for the column
For columns that are recognized as numeric, the following additional traits are collected:
avg
Average value for the column
For columns that are recognized as string, the following additional traits are collected:
blank
Number of values that consist only of blanks (
chr 32
)empty
Number of values that consist only of the empty string (
''
)missing
Number of values that consist only of the empty string (
''
), are blank (chr 32
) or areNULL
GLOBAL VARIABLES
To customize some default behaviour, the some global variables are defined. Read the source to find their names.
METHODS
The class implements the following methods:
__PACKAGE__->audit ARGS
Performs the data audit. Valid arguments are:
table
Name of the table to audit. No default.
traits
Array reference to the traits. Default traits are
min max count null avg blank empty missing
columns
Names of the columns to audit. Default are all columns of the table.
dbh
Database handle. If missing, hopefully you have specified the
dsn
.dsn
DSN to use. Can be omitted if you pass in a valid
dbh
instead.column_info
Column information, in the same format as the DBI returns it. By default, this will be read in via DBI.
$audit->as_text RESULTS
Returns a table drawn as text with the results.
$audit->as_html RESULTS, TEMPLATE
Returns a HTML page with the results.
You can pass in a custom resultset or undef
if you want the module to determine the results.
You can pass in a custom (Template) template if you want fancier rendering.
$audit->template_data
Returns a hash with the following three keys, suitable for using with whatever templating system you have:
table
- the name of the tableheadings
- the headings of the columnsrows
- the values of the traits of every column
$audit->run_audit
Actually runs the SQL in the database.
$audit->column_type COLUMN
Returns the type for the column. The four valid types are any
, ordered
, numeric
and string
.
$audit->get_columns TABLE
Returns the names of the columns for the table TABLE
. By default, the value of TABLE
will be taken from the value passed to the constructor audit
.
$audit->collect_column_info TABLE
Collects the information about the columns for the table TABLE
from the DBI. By default, TABLE
will be taken from the value passed to the constructor audit
.
If your database driver does not implement the ->column_info
method you are out of luck. A fatal error is raised by this method if ->column_info
does not return anything.
This method will raise warnings if it encounters a data type that it doesn't know yet. You can either patch the global variable %sql_type_map
to add the type or submit a patch to me to add the type and its interpretation.
$audit->get_sql TABLE
Creates the SQL statement to collect the information. The default value for TABLE
will be the table passed to the constructor audit
.
If you encounter errors from your SQL engine, you may want to print the result of this method out.
$audit->trait_applies TRAIT, COLUMN
Checks whether a trait applies to a column.
A trait applies to a column if the trait type is any
or if it is the same type as the column type as returned by get_column_type
.
The method will raise an error if it is passed an unknown trait name. See the source code for how to add custom traits.
COMMAND LINE USAGE
You can use this mail from the command line if you need a quick check of data:
perl -MDBIx::DataAudit=dbi:SQLite:dbname=some/db.sqlite my_table [traits]
This could also incredibly useful if you want a breakdown of a csv-file:
perl -MDBIx::DataAudit=dbi:AnyData:dbname=some/db.sqlite my_table [traits]
Unfortunately, that does not work yet, as I haven't found a convenient oneliner way to make a CSV file appear as database.
PLANNED FEATURES
Show the value distribution per column. This will mean running an SQL statement per column that does another full table scan, or at least a full index scan, unless somebody tells me how to do such without a
GROUP BY
clause.Fancy HTML bar charts showing the value distribution
AUTHOR
Max Maischein corion@cpan.org
COPYRIGHT (c)
Copyright 2008-2009 by Max Maischein corion@cpan.org
.
LICENSE
This module is released under the same terms as Perl itself.