NAME

DBIx::SQLCrosstab - creates a server-side cross tabulation from a database

SYNOPSIS

use DBIx::SQLCrosstab;
my $dbh=DBI->connect("dbi:driver:database"
    "user","password", {RaiseError=>1})
        or die "error in connection $DBI::errstr\n";

my $params = {
    dbh    => $dbh,
    op     => [ ['SUM', 'salary'] ], 
    from   => 'person INNER JOIN departments USING (dept_id)',
    rows   => [
                { col => 'country'},
              ],
    cols   => [
                {
                   id => 'dept',
                   value =>'department',
                   from =>'departments'
                },
                {
                    id => 'gender', from => 'person'
                }
              ]
};
my $xtab = DBIx::SQLCrosstab->new($params)
    or die "error in creation ($DBIx::SQLCrosstab::errstr)\n";

my $query = $xtab->get_query("#")
    or die "error in query building $DBIx::SQLCrosstab::errstr\n";

# use the query or let the module do the dirty job for you
my $recs = $xtab->get_recs
    or die "error in execution $DBIx::SQLCrosstab::errstr\n";

# do something with records, or use the child class 
# DBIx::SQLCrosstab::Format to produce well 
# formatted HTML or XML output
#

my $xtab = DBIx::SQLCrosstab::Format->new($params)
    or die "error in creation ($DBIx::SQLCrosstab::errstr)\n";
if ($xtab->get_query and $xtab->get_recs) { 
    print $xtab->as_html;
    my $xml_data = $xtab->as_xml;
}

DESCRIPTION

DBIx::SQLCrosstab produces a SQL query to interrogate a database and generate a cross-tabulation report. The amount of parameters needed to achieve the result is kept to a minimum. You need to indicate which columns and rows to cross and from which table(s) they should be taken. Acting on your info, DBIx::SQLCrosstab creates an appropriate query to get the desired result. Compared to spreadsheet based cross-tabulations, DBIx::SQLCrosstab has two distinct advantages, i.e. it keeps the query in the database work space, fully exploiting the engine capabilities, and does not limit the data extraction to one table.

See http://gmax.oltrelinux.com/cgi-bin/xtab.cgi for an interactive example.

Cross tabulation basics

Cross tabulations are statistical reports where the values from one or more given columns are used as column headers, and GROUP functions are applied to retrieve totals that apply to such values.

SELECT
   id, name, gender, dept
FROM
   person
   INNER JOIN depts ON (depts.dept_id=perspn.dept_id)

+----+--------+--------+-------+
| id | name   | gender | dept  |
+----+--------+--------+-------+
|  1 | John   | m      | pers  |
|  2 | Mario  | m      | pers  |
|  7 | Mary   | f      | pers  |
|  8 | Bill   | m      | pers  |
|  3 | Frank  | m      | sales |
|  5 | Susan  | f      | sales |
|  6 | Martin | m      | sales |
|  4 | Otto   | m      | dev   |
|  9 | June   | f      | dev   |
+----+--------+--------+-------+

A simple example will clarify the concept. Given the above raw data, a count of employees by dept and gender would look something like this:

+-------+----+----+-------+
| dept  | m  | f  | total |
+-------+----+----+-------+
| dev   |  1 |  1 |     2 |
| pers  |  3 |  1 |     4 |
| sales |  2 |  1 |     3 |
+-------+----+----+-------+

The query to create this result is

SELECT
    dept,
    COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) as m,
    COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) as f,
    COUNT(*) as total
FROM
    person
    INNER JOIN depts ON (person.dept_id = depts.dept_id)
GROUP BY
    dept

Although this query doesn't look easy, it is actually quite easy to create and the resulting data is straightforward. Creating the query requires advance knowledge of the values for the "gender" column, which can be as easy as m/f or as complex as male/female/unknown/undeclared/ former male/former female/pending (don't blame me. This is a "real" case!). Give the uncertainity, the method to get the column values id to issue a preparatory query

SELECT DISTINCT gender FROM person

Then we can use the resulting values to build the final query

my $query = "SELECT dept \n";
$query .=
        ",COUNT(CASE WHEN gender = '$_' THEN id ELSE NULL END) AS $_ \n"
          for   @$columns;
$query .= ",COUNT(*) as total \n"
          . "FROM person INNER JOIN depts \n"
          . "ON (person.dept_id=depts.dept_id) \n"
          . "GROUP BY dept\n";

If you have to do it once, you can just use the above idiom and you are done. But if you have several cases, and your cross-tab has more than one level, then you could put this module to good use. Notice that, to create this query, you needed also the knowledge of which column to test (gender) and to which column apply the GROUP function (id)

Multi-level cross tabulations

If single-level cross tables haven't worried you, multiple level tables should give you something to think. In addition to everything said before, multiple level crosstabs have:

  - query composition complexity. Each column is the combination
    of several conditions, one for each level;
  - column subtotals, to be inserted after the appropriate section;
  - row subtotals, to be inserted after the relevant rows;
  - explosive increase of column number. For a three-level crosstab
    where each level has three values you get 27 columns. If you
    include sub-totals, your number rises to 36. If you have just a few
    levels with five or six values , you may be counting rows by the 
    hundreds;    
  - visualization problems. While the result set from the DBMS
    is a simple matrix, the conceptual table has a visualization tree
    at the top (for columns) and a visualization tree at the left
    side (for rows).

+----+----+--------------------+--------------------+--+
| A  | B  |        C1          |       C2           |  |  1
|    |    +--------------------+--------------------+  |  
|    |    |   D1   |   D2   |  |   D1   |   D2   |  |  |  2
|    |    +--------+--------|  +--------+--------+  |  |  
|    |    |E1 E2 T |E1 E2 T |T |E1 E2 T |E1 E2 T |T |T |  3
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| A1 | B1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  4
|    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|    | B2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  5
|    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|    |  T |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  6
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| A2 | B1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  7
|    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|    | B2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  8
|    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|    |  T |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  9
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| T  | -- |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 10
+----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 a    b    c  d  e  f  g  h  i  g  k  l  m  n  o  p  q

Some definitions

columns headers               : 1-3
column headers 1st level      : 1
column headers 2nd level      : 2
column headers 3rd level      : 3

row headers                   : a, b
row header 1st level          : a
row header 2nd level          : b

row sub totals                : 6, 9
row total                     : 10
column sub totals             : e, h, i, l, o, p
column total                  : q

Column headers choice strategies

The easiest way of choosing columns is to tell DBIx::SQLCrosstab to get the values from a given column and let it extract them and combine the values from the various levels to make the appropriate conditions. Sometimes this is not desirable. For instance, if your column values come from the main dataset, the one that will be used for the crosstab, you are querying the database twice with two possibly expensive queries. Sometimes you can't help it, but there are a few workarounds.

If uou know the values in advance, you can pass them to the SQLCrosstab object, saving one query. This is when the values come from a constraint, for example. You may have a "grade" column and you know that the values can only be "A" to "F", or a "game result" column with values "1", "2", "x", "suspended", "camncelled". Or you can run the full query once, and when you are satisfied that the column values are the ones you know they should be, you pass the values for the subsequent calls.

The list option is also useful when you only want to make a crosstab for a given set of values, rather than having a mammoth result table with values you don't need.

Hidden JOINs

The normal case, in a well normalized database, should be to get the column values from a lookup table. If such table was created dynamically, so that it only contains values referred from the main table, then there is no problem. If, on the contrary, the lookup table was pre-loaded with all possible values for a given column, you may come out with a huge number of values, of which only a few were used. In this case, you need to run the query with a JOIN to the main table, to exclude the unused values.

When yoo use a lookup table, though, you can optimize the main query, by removing a JOIN that you have already used in the preparatory query. Let's see an example.

You have the table "person", which includes "dept_id", a foreign key referencing the table "depts". If you pass SQLCrosstab a column description including a {value} key, it will get from the lookup table both {id} and {value}, so that, instead of creating a main query with columns like

,COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END) AS 'pers' 

It will create this:

,COUNT(CASE WHEN dept_id = '1' THEN id ELSE NULL END) AS 'pers'

 or (depending on the parameters you passed) this:

,COUNT(CASE WHEN dept_id = '1' THEN id ELSE NULL END) AS fld001 -- pers 

The difference is that in the first case your final query needs a JOIN to depts, while in the second case it won't need it. Therefore the final query, the expensive one, will be much faster. The reasoning is, once you went through a lookup table to get the distinct values, you should not use that table again in the main query.

Class methods

new

Create a new DBIx::SQLCrosstab object.

my $xtab = DBIx::SQLCrosstab->new($params)
    or die "creation error $DBIx::SQLCrosstab::errstr\n"

$params is a hash reference containing at least the following parameters:

dbh     
        either a valid database handler (DBI::db) or a hash reference
        with the appropriate parameters to create one

        dbh =>  {
                    dsn      => "dbi:driver:database",
                    user     => "username",
                    password => "secretpwd",
                    params   => {RaiseError => 1}
                }

op      
        the operation to perform (SUM, COUNT, AVG, MIN, MAX, STD,
        VAR, provided that your DBMS supports them)
        and the column to summarize. It must be an array reference,
        with each item a pair of operation/column.
        E.g.: 
        op => [ [ 'COUNT', 'id'], ['SUM', 'salary'] ],
    
        *** WARNING ***
        Use of this parameter as a scalar is still supported
        but it is DEPRECATED.


op_col  
        The column on which to perform the operation
        *** DEPRECATED ***
        Use {op} as an array reference instead.

from    
        Where to get the data. It could be as short as
        a table name or as long as a comlex FROM statement
        including INNER and OUTER JOINs. The syntax is not 
        checked. It must be accepted by the DBMS you are 
        using underneath.

rows    
        a reference to an array of hashes, each
        defining one header for a crosstab row level.
        The only mandatory key is  
            {col}  identifying the column name
        Optionally, you can add an {alias} key, to be used
        with the AS keyword. 

cols    
        a reference to an array of hashes, each defining
        one header for a cross tab column level.
        Two keys are mandatory
            {id}      the column name
            {from}    where to get it from.
                      If the {group} option is
                      used, the other columns can have
                      a value of "1" instead.

        Optionally, the following keys can be added

            {group}   If this option is set, then all the
                      columns are queried at once, with the
                      {from} statement of the first column
                      definition. 

            {alias}   an alias for the column name. Useful
                      for calculated fields.
            {value}   an additional column, related to {id}
                      whose values you want to use instead of
                      {id} as column headers. See below "The
                      hidden join" for more explanation.
            {col_list}
                     Is a referenece to an array of values
                     that will be used as column headers, instead
                     of querying the database. If you know the
                     values in advance, or if you want to use only
                     a few known ones, then you can specify them
                     in this list. Each element in col_list must be
                     a hash reference with at least an {id} key. A 
                     {value} key is optional.
            {exclude_value}
                     Is a reference to an array of values to exclude
                     from the headers. Unlike the general option
                     "col_exclude", this option will remove all the
                     combinations containing the given values.

            {where}   to limit the column to get
            {orderby} to order the columns in a different
                      way.


The following parameters are optional.

where   
        a WHERE clause that will be added to the resulting query
        to limit the amount of data to fetch.

having  
        Same as WHERE, but applies to the grouped values

add_op  
        either a scalar or an array reference containing one or
        more functions to be used in addition to the main 'op'.
        For example, if 'op' is 'COUNT', you may set add_op to
        ['SUM', 'AVG'] and the crosstab engine will produce 
        a table having the count, sum and average of the 
        value in 'op_col'.
        *** DEPRECATED *** Use {op} as an array reference instead.

title   
        A title for the crosstab. Will be used for HTML and XML
        creation

remove_if_null
remove_if_zero
        Remove from the record set all the columns where all
        values are respectively NULL or zero. Notice that there 
        is a difference between a column with all zeroes and a 
        column with all NULLs.
        All zeroes with a SUM means that all the values were 0,
        while all NULLs means that no records matching the given 
        criteria were met.
        However, it also depends on the DBMS. According to ANSI
        specifications, SUM/AVG(NULL) should return NULL, while
        COUNT(NULL) should return 0. Rather than assuming
        strict compliance, I leave you the choice.

col_exclude
        Is a reference to an array of columns to be excluded from
        the query. The values must be complete column names.
        To know the column names, you can use the "add_real_names"
        option and then the get_query method.        

add_real_names
        Add the real column names as comments to the query text.
        In order to avoid conflicts with the database, the default
        behavior is to create fake column names (fld001, fld002, etc)
        that will  be later replaced. 
        This feature may cause problems with the database engines
        that don't react well to embedded comments.

use_real_names
        use the real column values as column names. This may be a
        problem if the column value contains characters that are not
        allowed in column names. Even though the names are properly
        quoted, it id not 100% safe.

row_total
row_sub_total
        If activated, adds a total row at the end of the result set
        or the total rows at the end of each row level. Your DBMS
        must support the SQL UNION keyword for this option to work.

        ********
        CAVEAT!
        ********

        Be aware that these two options will double the server load
        for each row level beyond 1, plus one additional query for
        the grand total.
        The meaning of this warning is that the query generated
        by DBIx::SQLCrosstab will contain one UNION query with a
        different GROUP BY clause for each row level. The grand
        total is a UNION query without GROUP BY clause. If your
        dataset is several million records large, you may consider
        skipping these options and perform subtotals and grand 
        total in the client.
        For less than one million records, any decent database
        engine should be able to execute the query in an acceptable
        timeframe.

col_total
col_sub_total
        If activated, add a total column at the end of the result set
        or the total columns at the end of each column level. 

RaiseError
        If actviated, makes all errors fatal. Normally, errors are 
        trapped and recorded in $DBIx::SQLCrosstab. RaiseError will 
        raise an exception instead.

PrintError
        If activated, will issue a warning with the message that 
        caused the exception, but won't die.

************************************************
The following options only apply when creating a 
DBIx::SQLCrosstab::Format object.
************************************************

commify
        Used for HTML and XML output. If true, will insert commas
        as thousand separators in all recordset numbers.

complete_html_page
        Returns HTML header and end tags, so that the resulting
        text is a complete HTML page.

only_html_header
        Returns only the header part of the table, without records.
        Useful to create templates.

add_colors
        If true, default colors are applied to the resulting
        table.
        text    => "#009900", # green
        number  => "#FF0000", # red
        header  => "#0000FF", # blue
        footer  => "#000099", # darkblue

text_color
number_color
header_color
footer_color
        Change the default colors to custom ones  

 table_border       
 table_cellspacing 
 table_cellpadding 
        Change the settings for HTML table borders. Defaults are:
        border      => 1
        cellspacing => 0
        cellpadding => 2
set_param

Allows to set one or more parameters that you couldn't pass with the constructor.

$xtab->set_param( cols => [ { id => 'dept', from => 'departments' } ]  )
     or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";

$xtab->set_param( 
                     remove_if_null => 1,
                     remove_if_zero => 1,
                     title          => 'Some nice number crunching'
                 )
     or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";

You can use this method together with a dummy constructor call:

my $xtab = DBIx::SQLCrosstab->new ('STUB')
    or die "can't create ($DBIx::SQLCrosstab::errstr)\n";

$xtab->set_param( 
                  dbh    => $dbh,
                  op     => 'SUM',
                  op_col => 'amount',
                  cols   => $mycolumns,
                  rows   => $myrows,
                  from   => 'mytable'
                  )
    or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";
get_params

Returns a string containing te parameters to replicate the current DBIx::SQLCrosstab object. The data is represented as Perl code, and it can be evaluated as such. The variable's name is 'params'. It does not include the 'dbh' parameter.

my $params = $xtab->get_params
    or warn "can't get params ($DBIx::SQLCrosstab::errstr)";
save_params

Saves the parameters necessary to rebuild the current object to a given file. This function stores what is returned by get_params into a text file. Notice that the 'dbh' option is not saved.

unless ($xtab->save_params('myparams.pl')
    die "can't save current params ($DBIx::SQLCrosstab::errstr)";
load_params

Loads previously saved parameters into the current object. Remember that 'dbh' is not restored, and must be set separately with set_param().

my $xtab = DBIx::SQLCrosstab->new('stub')
    or die "$DBIx::SQLCrosstab::errstr";
$xtab->load_params('myparams.pl')
    or die "$DBIx::SQLCrosstab::errstr";
$xtab->set_param( dbh => $dbh )
    or die "$DBIx::SQLCrosstab::errstr";
get_query

Returns the query to get the final cross-tabulation, or undef in case of errors. Check $DBIx::SQLCrosstab::errstr for the reason. You may optionally pass a parameter for the character to be used as separator between column names. The default is a pound sign ('#'). If the separator character is present in any of the column values (i.e. the values from a candidate column header), the engine will try in sequence '#', '/', '-', '=', doubling them if necessary, and eventually giving up only if all these characters are present in any column values. If this happens, then you need to pass an appropriate character, or group of charecters that you are reasonably sure doesn't recur in column values.

get_recs

Executes the query and returns the recordset as an arrayref of array references, or undef on failure. After this method is called, several attributes become available: - recs the whole recordset - NAME an arrayref with the list of column names - LENGTH an arrayref with the maximum size of each column - NUM_OF_FIELDS an integer with the number of felds in the result set

Class attributes

There are attributes that are available for external consumption. Like the DBI, these attributes become available after a given event.

{NAME}

This attribute returns the raw column names for the recordset as an array reference. Even if {use_real_names} was not defined, this attribute returns the real names rather than fld001, fld002, and so on. It is available after get_recs() was called.

my $fnames = $xtab->{NAMES};
print "$_\n" for @{$fnames};
{recs}

This attribute contains the raw recordset as returned from the database. Available after get_recs().

{NUM_OF_FIELDS}

The number of fields in the recordset. Available after get_recs().

{LENGTH}

Contains an array reference to the maximum lengths of each column. The length is calculated taking into account the length of the column name and the length of all values in that column. Available after get_recs().

DEPENDENCIES

DBI
a DBD driver
Tree::DAG_Node

EXPORT

None by default.

AUTHOR

Giuseppe Maxia (<gmax_at_cpan.org>)

SEE ALSO

DBI

An article at OnLamp, "Generating Database Server-Side Cross Tabulations" (http://www.onlamp.com/pub/a/onlamp/2003/12/04/crosstabs.html) and one at PerlMonks, "SQL Crosstab, a hell of a DBI idiom" (http://www.perlmonks.org/index.pl?node_id=313934).

COPYRIGHT

Copyright 2003 by Giuseppe Maxia (<gmax_at_cpan.org>)

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.