NAME

DBD::RAM - a DBI driver for files and data structures

SYNOPSIS

use DBI;
my $dbh = DBI->connect('DBI:RAM:','usr','pwd',{RaiseError=>1});
$dbh->func({
   table_name  => 'my_phrases',
   col_names   => 'id,phrase',
   data_type   => 'PIPE',
   data_source => [<DATA>],
}, 'import' );
print $dbh->selectcol_arrayref(qq[
  SELECT phrase FROM my_phrases WHERE id = 1
])->[0];
__END__
1 | Hello, New World
2 | Some other Phrase

This sample creates a database table from data, uses SQL to make a selection from the database and prints out the results. While this table is in-memory only and uses pipe "delimited" formating, many other options are available including local and remote file access and many different data formats.

DESCRIPTION

DBD::RAM allows you to import almost any type of Perl data structure into an in-memory table and then use DBI and SQL to access and modify it. It also allows direct access to almost any kind of file, supporting SQL manipulation of the file without converting the file out of its native format.

The module allows you to prototype a database without having an rdbms system or other database engine and can operate either with or without creating or reading disk files. If you do use disk files, they may, in most cases, either be local files or any remote file accessible via HTTP or FTP.

OVERVIEW

This modules allows you to work with a variety of data formats and to treat them like standard DBI/SQL accessible databases. Currently supported formats include:

FORMATS:

  XML    Extended Markup Language (XML)
  FIXED  fixed-width records
  INI    name=value pairs
  PIPE   pipe "delimited" text
  TAB    tab "delimited" text
  CSV    Comma Separated Values or other "delimited" text
  MP3    MP3 music binary files
  ARRAY  Perl array
  HASH   Perl associative array
  DBI    DBI database connection
  USR    user defined formats

The data you use may come form several kinds of sources:

SOURCES

  DATA         Perl data structures: strings, arrays, hashes
  LOCAL FILE   a file stored on your local computer hard disk
  REMOTE FILE  a remote file accessible via HTTP or FTP

If you modify the data in a table, the modifications may be stored in several ways. The storage can be temporary, i.e. in memory only with no disk storage. Or several modifications can be done in memory and then stored to disk once at the end of the processing. Or modifications can be stored to disk continuously, similarly to the way other DBDs operate.

STORAGE

  RAM          in-memory processing only, no storage
  ONE-TIME     processed in memory, stored to disk on command
  CONTINUOUS   all modifications stored to disk as they occur

Here is a summary of the SOURCES, FORMATS, and STORAGE capabilities of DBD::RAM. (x = currently supported, - = notsupported, * = support in progress)

                                      FORMAT
                  CSV PIPE TAB FIXED INI XML MP3 ARRAY HASH DBI USR
INPUT
array/hash/string  x    x   x    x    x   x   -    x     x   -   x
local file         x    x   x    x    x   x   x    -     -   x   x
remote file        x    x   x    x    x   x   *    -     -   *   x
OUTPUT
ram table          x    x   x    x    x   x   x    x     x   x   x
file (1-time)      x    x   x    x    x   x   -    -     -   *   *
file (continuous)  x    x   x    x    x   *   -    -     -   x   *

Please note that any ram table, regardless of original source can be stored in any of the supported file output formats. So, for example, a table of MP3 information could be stored as a CSV file, the "-" in the MP3 column only indicates that the information from the MP3 table can not (for obvious reasons) be written back to an MP3 file.

INSTALLATION & PREREQUISITES

This module should work on any any platform that DBI works on.

You don't need an external SQL engine or a running server, or a compiler. All you need are Perl itself and installed versions of DBI and SQL::Statement. If you do not also have DBD::CSV installed you will need to either install it, or simply copy File.pm into your DBD directory.

You can either use the standard makefile method, or just copy RAM.pm into your DBD directory.

Some features require installation of extra modules. If you wish to work with the XML format, you will need to install XML::Parser. If you wish to use the ability to work with remote files, you will need to install the LWP (libnet) modules. Other features of DBD::RAM work fine without these additional modules.

SQL & DBI

This module, like other DBD database drivers, works with the DBI methods listed in DBI.pm and its documentation. Please see the DBI documentation for details of methods such as connecting, preparing, executing, and fetching data. Currently only a limited subset of SQL commands are supported. Here is a brief synopsis, please see the documentation for SQL::Statement for a more comple description of these commands.

     CREATE  TABLE $table 
                   ( $col1 $type1, ..., $colN $typeN,
                   [ PRIMARY KEY ($col1, ... $colM) ] )

      DROP  TABLE  $table

      INSERT  INTO $table 
                   [ ( $col1, ..., $colN ) ]
                   VALUES ( $val1, ... $valN )

      DELETE  FROM $table 
                   [ WHERE $wclause ]

           UPDATE  $table 
                   SET $col1 = $val1, ... $colN = $valN
                   [ WHERE $wclause ]

SELECT  [DISTINCT] $col1, ... $colN 
                   FROM $table
                   [ WHERE $wclause ] 
                   [ ORDER BY $ocol1 [ASC|DESC], ... $ocolM [ASC|DESC] ]

         $wclause  [NOT] $col $op $val|$col
                   [ AND|OR $wclause2 ... AND|OR $wclauseN ]

              $op  = |  <> |  < | > | <= | >= 
                   | IS NULL | IS NOT NULL | LIKE | CLIKE

WORKING WITH FILES & TABLES:

This module supports working with both in-memory and disk-based databases. In order to allow quick testing and prototyping, the default behavior is for tables to be created in-memory from in-memory data but it is easy to change this behavior so that tables can also be created, manipulated, and stored on disk or so that there is a combination of in-memory and disk-based manipulation.

There are three methods unique to the DBD::RAM module to allow you to specify which mode of operation you use for each table or operation:

1) import()  imports data either from memory or from a file into an 
             in-memory table

2) export()  exports data from an in-memory table to a file regardless of
             the original source of the data

3) catalog() sets up an association between a file name and a table name
              such that all operations on the table are done continuously
              on the file

With the import() method, standard DBI/SQL commands like select, update, delete, etc. apply only to the data that is in-memory. If you want to save the modifications to a file, you must explcitly call export() after making the changes.

On the other hand, the catalog() method sets up an association between a file and a tablename such that all DBI/SQL commands operate on the file continuously without needing to explicitly call export(). This method of operation is similar to other DBD drivers.

Here is a rough diagram of how the three methods operate:

disk -> import() -> RAM

                    select
                    update
                    delete
                    insert
                    (multiple times)

disk <- export() <- RAM

catlog()
disk <-> select
disk <-> update
disk <-> delete
disk <-> insert   

Regardless of which method is chosen, the same set of DBI and SQL commands may be applied to all tables.

See below for details of import(), export() and catalog() and for specifics of naming files and directories.

Creating in-memory tables from data and files: import()

In-memory tables may be created using standard CREATE/INSERT statements, or using the DBD::RAM specific import method:

$dbh->func( $args, 'import' );

The $args parameter is a hashref which can contain several keys, most of which are optional and/or which contain common defaults.

These keys can either be specified or left to default behaviour:

table_name   string: name of the table
 col_names   string: column names for the table
 data_type   string: format of the data (e.g. XML, CSV...)

The table_name key to the import() method is either a string, or if it is omitted, a default table name will be automatically supplied, starting at table1, then table2, etc.

   table_name => 'my_test_db',

OR simply omit the table_names key

If the col_names key to the import() method is omitted, the column names will be automatically supplied, starting at col1, then col2, etc. If the col_names key is the string 'first_line', the column names will be taken from the first line of the data. If the col_names key is a comma separated list of column names, those will be taken (in order) as the names of the columns.

    col_names => 'first_line',

OR  col_names => 'name,address,phone',

OR  simply omit the col_names key

If table_name or col_names are specified, they must comply with SQL naming rules for identifiers: start with an alphabetic character; contain nothing but alphabetic characters, numbers, and/or underscores; be less than 128 characters long; not be the same as a SQL reserved keyword. If the table refers to a file that has a period in its name (e.g. my_data.csv), this can be handled with the catalog() method, see below.

The table_name and col_names, if specified, *are* case sensititive, so that "my_test_db" is not the same as "my_TEST_db".

The data_type key to the import() method specifies the format of the data as already described above in the general description. It must be one of:

  data_type => 'CSV',
  data_type => 'TAB',
  data_type => 'PIPE',
  data_type => 'INI',
  data_type => 'FIXED',
  data_type => 'XML',
  data_type => 'MP3',
  data_type => 'DBI',
  data_type => 'USR',
  data_type => 'ARRAY',
  data_type => 'HASH',

OR simply omit the data_type key

If no data_type key is supplied, the default format CSV will be used.

The import() keys must specify a source of the data for the table, which can be any of:

  file_source   string: name of local file to get data from
remote_source   string: url of remote file to get data from
  data_source   string or arrayref: the actual data

The file_source key is the name of local file. It's location will be taken to be relative to the f_dir specified in the database connection, see the connect() method above. Whether or not the file name is case sensitive depends on the operating system the script is running on e.g. on Windows case is ignored and on UNIX it is not ignored. For maximum portability, it is safest to assume that case matters.

file_source => 'my_test_db.csv'

The remote_source key is a URL (Uniform Resource Locator) to a file located on some other computer. It may be any kind of URL that is supported by the LWP module includding http and FTP. If username and password are required, they can be included in the URL.

   remote_source => 'http://myhost.com/mypath/myfile.myext'

OR remote_source => 'ftp://user:password@myhost.com/mypath/myfile.myext'

The data_source key to the import() tag contains the actual data for the table. in cases where the data comes from the Perl script itself, rather than from a file. The method of specifying the data_source depends entirely on the format of the data_type. For example with data_type of XML or CSV, the data_source is a string in XML or CSV format but with data_type ARRAY, the data_source is a reference to an array of arrayrefs. See below under each data_type for details.

The following keys to the import() method apply only to specific data formats, see the sections on the specific formats (listed in parens) for details:

    pattern   (FIXED only)
   sep_char   (CSV only)
        eol   (CSV only)
   read_sub   (USR and XML only)
       attr   (XML only)
 record_tag   (XML only)
   fold_col   (XML only)
col_mapping   (XML only)
       dirs   (MP3 only)

Saving in-memory tables to disk: export()

The export() method creates a file from an in-memory table. It takes a very similar set of keys as does the import() method. The data_type key specifies the format of the file to be created (CSV, PIPE, TAB, XML, FIXED-WIDTH, etc.). The same set of specifiers available for the import method for these various formats are also available (e.g. sep_char to set the field separator for CSV files, or pattern to set the fixed-width pattern).

The data_source key for the export() method is a SQL select statement in relation to whatever in-memory table is chosen to export. The data_target key specifies the name of the file to put the results in. Here is an example:

$dbh->func( {
    data_source => 'SELECT * FROM table1',
    data_target => 'my_db.fix',
    data_type => 'FIXED',
    pattern   => 'A2 A19',
},'export' );
       

That statement creates a fixed-width record database file called "my_db.fix" and puts the entire contents of table1 into the file using the specified field widths and whatever column names alread exist in table1.

See specific data formats below for details related to the export() method.

Continuous file access: catalog()

The catalog() method creates an association between a specific table name, a specific data type, and a specific file name. You can create those associations for several files at once. The parameter to the catalog() method is a reference to an array of arrayrefs. Each of the arrayrefs should contain a table name, a data type, and a file name and can optionally inlcude other paramtets specific to specific data types. Here is an example:

$dbh->func([
    [ 'my_csv_table', 'CSV',   'test_db.csv'  ],
 ],'catalog');

This example creates an association to a CSV file. Once the catalog() statement has been issued, any DBI/SQL commands relating to "my_csv_table" will operate on the file "test_db.csv". If the command is a SELECT statement, the file witll be opened and searched. If the command is an INSERT statement, the file will be opened and the new data row(s) inserted and saved into the file.

One can also pass additional information such as column names, fixed-width patterns, field and record separators to the export method(). See the import() information above for the meanings of these additional parameters. They should be passed as a hashref:

$dbh->func([
    [ 'table1', 'FIXED', 'test_db.fix',{pattern => 'A2 A19'} ],
    [ 'table2', 'INI',   'test_db.ini',{col_names => 'id,phrase,name' ],
 ],'catalog');

Specifying file and directory names

All filenames are relative to a user-specified file directory: f_dir. The f_dir parameter may be set in the connect statement:

my $dbh=DBI->connect("dbi:RAM:f_dir=/mypath/to-files/" ...

The f_dir parameter may also be set or reset anywhere in the program after the database connection:

$dbh->{f_dir} = '/mypath/to-files'

If the f_dir parameter is not set explicitly, it defaults to "./" which will be wherever your script thinks it is running from (which, depending on server setup may or may not be the physical location of your script so use this only if you know what you are doing).

All filenames are relative to the f_dir directory. It is not possible to use an absolute path to a file.

WARNING: no taint checking is performed on the filename or f_dir, this is the responsiblity of the programmer. Since the filename is relative to the f_dir directory, a filename starting with "../" will lead to files above or outside of the f_dir directory so you should exclude those from filenames if the filenames come from user input.

Using defaults for quick testing & prototyping

If no table_name is specified, a numbered table name will be supplied (table1, or if that exists table2, etc.). The same also applies to column names (col1, col2, etc.). If no data_type is supplied, CSV will be assumed. If the entire hashref parameter to import is missing and an arrayref of data is supplied instead, then defaults for all values will be used, the source will be defaulted to data and the contents of the array will be treated as the data source. For CSV file, a field separator of comma and a record separator of newline are the default. Thus, assuming there are no already exiting in-memory tables, the two statements below have the same effect:

$dbh->func( [<DATA>], 'import' );

$dbh->func({
    table_name  => 'table1',
    data_type   => 'CSV',
    col_names   => 'col1,col2',
    sep_char    => ',',
    eol         => "\n",
    data_source => [<DATA>],
},'import' );

It is also possible to rely on some of the defaults, but not all of them. For example:

$dbh->func({
    data_type   => 'PIPE',
    file_source => 'my_db.pipe',
},'import' );

DATA FORMATS

CSV / PIPE / TAB / INI (Comma,Pipe,Tab,INI & other "delimited" formats)

DBD::RAM can import CSV (Comma Separated Values) from strings, from local files, or from remote files into database tables and export tables from any source to CSV files. It can also store and update CSV files continuously similarly to the way other DBD drivers operate.

If you wish to use remote CSV files, you also need the LWP module installed. It is available from www.activestate.com for windows, and from www.cpan.org for other platforms.

CSV is the format of files commonly exported by such programs as Excel, Access, and FileMakerPro. Typically newlines separate records and commas separate fields. Commas may also be included inside fields if the field itself is surrounded by quotation marks. Quotation marks may be included in fields by doubling them. Although some types of CSV formats allow newlines inside fields, DBD::RAM does not currently support that. If you need that feature, you should use DBD::CSV.

Here are some typical CSV fields:

hello,1,"testing, ""george"", 1,2,3",junk

Note that numbers and strings that don't contain commas do not need quotation marks around them. That line would be parsed into four fields:

	hello
        1
        testing, "george", 1,2,3
        junk

To import that string of CSV into a DBD::RAM table:

$dbh->func({ 
    data_source => qq[hello,1,"testing, ""george"", 1,2,3",junk]
},'import');

Of if one wanted to continuously update a file similarly to the way DBD::CSV works:

$dbh->func([ 'table1', 'CSV',  'my_test.csv' ],'catalog');

Or if that string and others like it were in a local file called 'my_test.csv':

$dbh->func({ file_source => 'my_test.csv' },'import');

Or if that string and others like it were in a remote file at a known URL:

$dbh->func({ remote_source => 'http://www.foo.edu/my_test.csv' },'import');

Note that these forms all use default behaviour since CSV is the default data_type. These methods also use the default table_name (table1,table2,etc.) and default column_names (col1,col2, etc.). The same functions can specify a table_name and can either specify a list of column names or specify that the column names should be taken from the first line of data:

$dbh->func({ 
    file_source => 'my_test.csv',
     table_name => 'my_table',
      col_names => 'name,phone,address',
 },'import');

It is also possible to define other field separators (e.g. a semicolon) with the "sep_char" key and define other record separators with the "eol" key. For example:

sep_char => ';',
eol      => '~',

Adding those to the import() hash would define data that has a semicolon between every field and a tilde between every record.

For convenience shortcuts have been provided for PIPE and TAB separators. The data_type "PIPE" indicates a separator of the pipe character '|' which may optionally have blank spaces before or afer it. The TAB data_type indicates fields that are separated by tabs. In both cases newlines remain the default record separator unless specifically set to something else.

Another shortcut is the INI data_type. This expects to see data in name=value pairs like this:

	resolution = 640x480
        colors     = 256

Currently the INI type does not support sections within the .ini file, but that will change in future releases of this module.

The PIPE, TAB, and INI formats all behave like the CSV format. Defaults may be used for assigning column names from the first line of data, in which case the column names should be separated by the appropriate symbol (e.g. col1|col2 for PIPE, and col1=col2 for INI, and column names separated by tabs for TAB).

In the examples above using data_source the data was a string with newlines separating the records. It is also possible to use an reference to an array of lines as the data_source. This makes it easy to use the DATA segment of a script or to import an array from some other part of a script:

$dbh->func({ data_source => [<DATA>] },'import' );

ARRAYS & HASHES

DBD::RAM can import data directly from references to arrays of arrayrefs and references to arrays of hashrefs. This allows you to easily import data from some other portion of a perl script into a database format and either save it to disk or simply manipulate it in memory.

$dbh->func({
    data_type   => 'ARRAY',
    data_source =>  [
       ['1','CSV:Hello New World!'],
       ['2','CSV:Cool!']
    ],
},'import');

$dbh->func({
    data_type   => 'HASH',
    data_source => [
        {id=>1,phrase=>'Hello new world!'},
        {id=>2,phrase=>'Junkity Junkity Junk'},
    ],
},'import');

FIXED-WIDTH RECORDS

Fixed-width records (also called fixed-length records) do not use character patterns to separate fields, rather they use a preset number of characters in each field to determine where one field begins and another ends. DBD::RAM can import fixed-width records from strings, arrayrefs, local files, and remote files and can export data from any source to fixed-width record fields. The module also allows continuous disk-based updating of fixed-width format files similarly to other DBDs.

The fixed-width format behaves exactly like the CSV formats mentioned above with the exception that the data_type is "FIXED" and that one must supply a pattern key to describe the width of the fields. The pattern should be in Perl unpack format e.g. "A2 A7 A14" would indicate a table with three columns with widths of 2,7,14 characters. When data is inserted or updated, it will be truncated or padded to fill exactly the amount of space alloted to each field.

$dbh->func({ 
    table_name => 'phrases',
    col_names  => 'id,phrase',
    data_type  => 'FIXED',
    pattern    => 'A1 A20',
    data_source => [ '1Hello new world!    ',
                     '2Junkity Junkity Junk',
                   ],
 },'import' );

XML

DBD::RAM can import XML (Extended Markup Language) from strings, from local files, or from remote files into database tables and export tables from any source to XML files.

You must have XML::Parser installed in order to use the XML feature of DBD::RAM. If you wish to use remote XML files, you also need the LWP module installed. Both are available from www.activestate.com for windows, and from www.cpan.org for other platforms.

Support is provided for information in tag attributes and tag text and for multiple levels of nested tags. There are several options on how to treat tag names that occur multiple times in a single record including a variety of relationships between XML tags and database columns: one-to-one, one-to-many, and many-to-one. Tag attributes can be made to apply to multiple records nested within the tag. There is also support for alternate character encodings and other XML::Parser parameter attributes.

See below for details.

XML Import

To start with a very simple example, consider this XML string:

 <staff>
     <person name='Joe' title='bottle washer'/>
     <person name='Tom' title='flunky'/>
     <person name='Bev' title='chief cook'/>
     <person name='Sue' title='head honcho'/>
 </staff>

Assuming you have that XML structure in a variable $str, you can import it into a DBD::RAM table like this:

$dbh->func({
    data_source => $str
    data_type   => 'XML',
    record_tag  => 'staff person',
    col_names   => 'name,title'
},'import');

Which would produce this SQL/DBI accessible table:

name | title
-----+--------------
Joe  | bottle washer
Tom  | flunky
Bev  | chief cook
Sue  | head honcho

If the XML data is in a local or remote file, rather than a string, simply change the "data_source" to "file_source" (for local files) or "remote_source" (for remote files) an everything else mentioned in this section works the same as if the data was imported from a string.

Notice that the "record_tag" is a space separated list of all of the tags that enclose the fields you want to capture starting at the highest level with the <staff> tag. In this example there is only one level of nesting, but there could be many levels of nesting in actual practice.

DBD::RAM can treat both text and tag attributes as fields. So the following three records could produce the same database row:

<person name='Tom' title='flunky'/>

<person name='Tom'> 
   <title>flunky</title>
</person>

<person>
  <name>Tom</name>
  <title>flunky</title>
</person>

The database column names should be specified as a comma-separated string, in the order you want them to appear in the database:

col_names => 'name,state,year'

If you want the database column names to be the same as the XML tag names you do not need to do anything further.

NOTE: you *must* speficy the column names for XML data, you can not rely on automatic default column names (col1,col2,etc.) or on reading the column names from the "first line" of the data as you can with most other data types.

Alternate relationships between XML tags & database columns

If you want the database column names to be different from the XML tag names, you need to add a col_mapping parameter which should be a hash with the XML tag as the key and the database column as the value:

col_mapping => {
    name  => 'Member_Name',
    state => 'Location',
    year =>  'Year',
}

('name' is the XML tag, 'Member_Name' is the database column)

If a given tag occurs more than once in an XML record, it can be mapped onto a single column name (in which case all of the values for it will be concatenated with spaces into the single column), or it can be mapped onto an array of column names (in which case each succeeding instance of the tag will be entered into the succeeding column in the array).

For example, given this XML snippet:

<person name='Joe' state='OR'>
    <year>1998</year>
    <year>1999</year>
</person>
<person name='Sally' state='WA'>
    <year>1998</year>
    <year>1999</year>
    <year>2000</year>
</person>

This column mapping:

col_mapping => {
    name  => 'Member_Name',
    state => 'Location',
    year =>  ['Year1','Year2','Year3'],
}

Would produce this table:

Member_Name | Location | Year1 | Year2 | Year3
------------+----------+-------+-------+------
Joe         | OR       | 1998  | 1999  |
Sally       | WA       | 1998  | 1999  | 2000

And this column mapping:

col_mapping => {
    name  => 'Member_Name',
    state => 'Location',
    year =>  'Year',
}

Would produce this table:

Member_Name | Location | Year
------------+----------+----------------
Joe         | OR       | 1998 1999
Sally       | WA       | 1998 1999 2000

It is also possible to map several differnt tags into a single column, e.g:

<person name='Joe' state='OR'>
  <year1>1998</year1>
  <year2>1999</year2>
</person>
<person name='Sally' state='WA'>
   <year1>1998</year1>
   <year2>1999</year2>
   <year3>2000</year3>
</person>

col_mapping => {
    name  => 'Member_Name',
    state => 'Location',
    year1 => 'Year',
    year2 => 'Year',
    year3 => 'Year',
}

Member_Name | Location | Year
------------+----------+----------------
Joe         | OR       | 1998 1999
Sally       | WA       | 1998 1999 2000

Nested attributes that apply to multiple records

It is also possible to use nested record attributes to create column values that apply to multiple records. Consider the following XML:

<staff>
  <office location='Portland'>
    <person name='Joe'>
    <person name='Tom'/>
  </office>
  <office location='Seattle'>
    <person name='Bev'/>
    <person name='Sue'/>
  </office>
</staff>

One might like to associate the office location with all of the staff members in that office. This is how that would be done:

record_tag  => 'staff office person',
col_names   => 'location,name',
fold_col    => { 'staff office' => 'location' },

That fold-col specification in the import() method would "fold in" the attribute for location and apply it to all records nested within the office tag and produce the following table:

location | name
---------+-----
Portland | Joe
Portland | Tom
Seattle  | Bev
Seattle  | Sue

You may use several levels of folded columns, for example, to capture both the office location and title in this XML:

<staff>
  <office location='Portland'>
    <title id='manager'>
      <person name='Joe'/>
    </title>
    <title id='flunky'>
      <person name='Tom'/>
    </title>
  </office>
  <office location='Seattle'>
    <title id='flunky'>
      <person name='Bev'/>
      <person name='Sue'/>
    </title>
  </office>
</staff>

You would use this fold_col key:

fold_col => { 'staff office'       => 'location',
              'staff office title' => 'id',
            },

And obtain this table:

location | title   | name
---------+---------+-----
Portland | manager | Joe
Portland | flunky  | Tom
Seattle  | flunky  | Bev
Seattle  | flunky  | Sue

If you need to grab more than one attribute from a single tag, you need to put one or more carets (^) on the end of the fold_col key. For example:

<office type='branch' location='Portland' manager='Sue'> ...</office>

fold_col => { 'office'   => 'branch',
              'office^'  => 'location',
              'office^^' => 'manager',

Character Encoding and Unicode issues

The attr key can be used to pass extra information to XML::Parser when it imports a database. For example, if the XML file contains latin-1 characters, one might like to pass the parser an encoding protocol like this:

attr => {ProtocolEncoding => 'ISO-8859-1'},

Attributes passed in this manner are passed straight to the XML::Parser.

Since the results of XML::Parser are returned as UTF-8, one might also like to translate from UTF-8 to something else when the data is entered into the database. This can be done by passing a pointer to a subroutine in the read_sub key. For example:

read_sub    => \&utf8_to_latin1,

For this to work, there would need to be a subroutine utf8_to_latin1 in the main module that takes a UTF8 string as input and returns a latin-1 string as output. Similar routines can be used to translate the UTF8 characters into any other character encoding.

Apologies for being Euro-centric, but I have included a short-cut for Latin-1. One can omit the attr key and instead of passing a pointer to a subroutine in the read_sub key, if one simply puts the string "latin1", the module will automatically perform ISO-8859-1 protocol encoding on reading the XML file and automatically translate from UTF-8 to Latin-1 as the values are inserted in the database, that is to say, a shortcut for the two keys mentioned above.

Other features of XML import

* Tags, attributes, and text that are not specifically referred to in the import() parameters are ignored when creating the database table.

* If a column name is listed that is not mapped onto a tag that occurs in the XML source, a column will be created in the database for that name and it will be given a default value of NULL for each record created.

XML Export

Any DBD::RAM table, regardless of its original source or its original format, can be exported to an XML file.

The export() parameters are the same as for other types of export() -- see the above for details. Additionally there are some export parameters specific to XML files which are the same as the import() parameters for XML files mentioned above. The col_names parameter is required, as is the record_tag parameter. Additionally one may optionally pass a header and/or a footer parameter which will be material that goes above and below the records in the file. If no header is passed, a default header consisting of

<?xml version="1.0" ?>

will be created at the top of the file.

Given a datbase like this:

location | name
---------+-----
Portland | Joe
Seattle  | Sue

And an export() call like this:

$dbh->func({
    data_type   => 'XML',
    data_target => 'test_db.new.xml',
    data_source => 'SELECT * FROM table1',
    record_tag  => 'staff person',
    col_names   => 'name,location',
},'export');

Would produce a file called 'test_db.xml' containing text like this:

<?xml version="1.0" ?>
<staff>
<office>
<person>
  <name>Joe</name>
  <location>Portland</location>
</person>
<person>
  <name>Sue</name>
  <location>Seattle</location>
</person>
</office>
</staff>

The module does not currently support exporting tag attributes or "folding out" nested column information, but those are planned for future releases.

USER-DEFINED DATA STRUCTURES

DBD::RAM can be extended to handle almost any type of structured information with the USR data type. With this data type, you define a subroutine that parses your data and pass that to the import() command and the module will use that routine to create a database from your data. The routine can be as simple or as complex as you like. It must accept a string and return an array with the fields of the array in the same order as the columns in your database. Here is a simple example that works with data separated by double tildes. In reality, you could just do this with the bulit-in CSV type, but here is how you could recreate it with the USR type:

$dbh->func({       
     data_type   => 'USR',
     data_source => "1~~2~~3\n4~~5~~6\n",
     read_sub    => sub { split /~~/,shift },
},'import' );

That would build a table with two rows of three fields each. The first row would contain the values 1,2,3 and the second row would contain the values 4,5,6.

Here is a more complex example that handles a simplified address book. It assumes that your data is a series of addresses separated by blank lines and that the address has the name on the first line, the street on the second line and the town, state, and zipcode on the third line. (Apologies to those in countries that don't have states or zipcodes in this format). Here is an example of the kind of data it would handle:

Fred Bloggs
123 Somewhere Lane
Sometown OR 97215

Joe Blow
567 Anywhere Street
OtherTown WA 98006

Note that the end-of-line separator (eol) has been changed to be a blank line rather than a simple newline and that the parsing routine is more than a simple line by line parser, it splits the third line into three fields for town, state, and zip.

$dbh->func({
  data_type   => 'USR',
  data_source => join('',<DATA>),
  col_names   => 'name,street,town,state,zip',
  eol         => '/^\s*$/',
  read_sub    => sub {
      my($name,$street,$stuff) = split "\n", $_[0];
      my @ary   = split ' ',$stuff;
      my $zip   = $ary[-1];
      my $state = $ary[-2];
      my $town  = $stuff;
      $town =~ s/^(.*)\s+$state\s+$zip$/$1/;
      return($name,$street,$town,$state,$zip);
    },
  },'import');

Given the data above, this routine would create a table like this:

name        | street              | town      | state | zip
------------+---------------------+-----------+-------+------
Fred Bloggs | 123 Somewhere Lane  | Sometown  | OR    | 97215
Joe Blow    | 567 Anywhere Street | OtherTown | WA    | 98006

These are just samples, the possiblities are fairly unlimited.

PLEASE NOTE: If you develop generally useful parser routines that others might also be able to use, send them to me and I can encorporate them into the DBD itself (with proper credit, of course).

DBI DATABASE RECORDS

You can import information from any other DBI accessible database with the data_type set to 'DBI' in the import() method. First connect to the other database via DBI and get a database handle for it separate from the database handle for DBD::RAM. Then do a prepare and execute to get a statement handle for a SELECT statement into that database. Then pass the statement handle to the DBD::RAM import() method as the data_source. This will perform the fetch and insert the fetched fields and records into the DBD::RAM table. After the import() statement, you can then close the database connection to the other database if you are not going to be using it for anything else.

Here's an example using DBD::mysql --

use DBI;
my $dbh_ram   = DBI->connect('dbi:RAM:','','',{RaiseError=>1});
my $dbh_mysql = DBI->connect('dbi:mysql:test','','',{RaiseError=>1});
my $sth_mysql = $dbh_mysql->prepare("SELECT * FROM cars");
$sth_mysql->execute;
$dbh_ram->func({
    data_type   => 'DBI',
    data_source => $sth_mysql,
},'import' );
$dbh_mysql->disconnect;

MP3 MUSIC FILES

Most mp3 (mpeg three) music files contain a header describing the song name, artist, and other information about the music. This shortcut will collect all of the header information in all mp3 files in a group of directories and turn it into a searchable database:

$dbh->func(
    { data_type => 'MP3', dirs => $dirlist }, 'import'
);

$dirlist should be a reference to an array of absolute paths to
directories containing mp3 files.  Each file in those directories
will become a record containing the fields:  file_name, song_name,
artist, album, year, comment,genre. The fields will be filled
in automatically from the ID3v1x header information in the mp3 file
itself, assuming, of course, that the mp3 file contains a
valid ID3v1x header.

USING MULTIPLE TABLES

A single script can create as many tables as your RAM will support and you can have multiple statement handles open to the tables simultaneously. This allows you to simulate joins and multi-table operations by iterating over several statement handles at once. You can also mix and match databases of different formats, for example gathering user info from .ini and .config files in many different formats and putting them all into a single table.

TO DO

Lots of stuff. Allow users to specify a file where catalog information is stored so that one could record file types once and thereafter automatically open the files with the correct data type. A convert() function to go from one format to another. Support for a variety of other easily parsed formats such as Mail files, web logs, and for various DBM formats. Support for HTML files with the directory considered as a table, each HTML file considered as a record and the filename, <TITLE> tag, and <BODY> tags considered as fields. More robust SQL (coming when I update Statement.pm) including RLIKE (a regex-based LIKE), joins, alter table, typed fields?, authorization mechanisms? transactions? Allow remote exports (e.g. with LWP POST/PUT).

Let me know what else...

AUTHOR

Jeff Zucker <jeff@vpservices.com>

Copyright (c) 2000 Jeff Zucker. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself as specified in the Perl README file.

No warranty of any kind is implied, use at your own risk.

SEE ALSO

DBI, SQL::Statement, DBD::File