NAME

Pg::BulkCopy - Bulk Data Load/Dump for Postgres.

VERSION

Version 0.20

pg_BulkCopy.pl

The utility script pg_BulkCopy.pl was written to provide PostgreSQL with a convenient bulk loading utility. It is implemented as a thin wrapper (pg_BulkCopy.pl) around a module (Pg::BulkCopy), so that other programmers may easily incorporate the two useful methods LOAD and DUMP directly into their own Perl scripts.

The advantage of this script over others written for the same purpose is that if you can connect to your database and perform insert and delete operations through the standard DBI interface (it shouldn't matter which of the several Postgres drivers you are using), pg_BulkCopy should just work.

The DUMP method invokes Postgres' COPY TO command and adds little beyond copying the dump out of the temp directory (Postgres may not have permission to write where you want the file). You can choose tab-delimited text or CSV, with whatever delimiter you want and a null string of your choice.

The LOAD method is much more interesting. It breaks the load file into chunks of 10,000 (configurable) records in the temp directory and tries to COPY FROM each chunk; if a chunk fails, it parses the error message for the line number, moves the failed line to a rejects file, and tries again. As with DUMP you can select the options supported by the Postgres COPY command, and you can also set a limit on bad records (the default is 10).
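For illustration, the retry loop might look like the following minimal sketch. It is not the module's implementation (pg_BulkCopy stages its chunks as files for COPY FROM); this version streams one chunk through DBD::Pg's COPY-from-STDIN calls and relies on the "line N" context Postgres includes in its error messages:

use DBI;

# Load one chunk, dropping any row the server rejects.
# Assumes $dbh was connected with RaiseError => 1 and that
# each element of @lines ends in "\n".
sub load_chunk {
    my ( $dbh, $table, $rejects, @lines ) = @_;
    while (@lines) {
        my $ok = eval {
            $dbh->do("COPY $table FROM STDIN");
            $dbh->pg_putcopydata($_) for @lines;
            $dbh->pg_putcopyend();
            1;
        };
        return if $ok;
        # e.g. 'CONTEXT:  COPY testing, line 42, column ...'
        my ($bad) = $@ =~ /line (\d+)/ or die $@;
        push @$rejects, splice( @lines, $bad - 1, 1 );    # remove it, retry
    }
    return;
}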

Command Line Arguments to script:

file|filename|f 
table|t 
load|l  
dump|d  
iscsv|csv
dbistring|dbistr|ds
dbiuser|dbiusr|du 
dbipass|dp 
workingdir|working|w 
tmpdir|tmp 
batchsize|batch|b 
errorlog|error|e 
maxerrors|errors|max 
debug|dbg 
trunc|truncate|tr 
help|h|? 
read|r [to read additional variables out of a file]

Format of a Parameter file (specified with --read):

[options]
filename : blob1.tsv
load : 1
dump :
iscsv :
dbistring : DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1
dbiuser : postgres
dbipass : postgres
table : testing
workingdir : /psql_scr/testing/tdata/
tmpdir : /psql_scr/testing/tmp/
batchsize :
errorlog : 
maxerrors : 50
debug : 2
trunc : 1
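For illustration, this format is simple enough to read by hand. The following is a sketch, not the script's actual parser (the key names come from the list above):

open my $fh, '<', 'bulkcopy.conf' or die "bulkcopy.conf: $!";
my %opt;
while (<$fh>) {
    next if /^\s*\[/ || /^\s*$/;           # skip the [options] header and blank lines
    my ( $key, $value ) = /^\s*(\w+)\s*:\s*(.*?)\s*$/ or next;
    $opt{$key} = $value if length $value;  # blank values are left to their defaults
}
close $fh;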

Example command line

pg_bulkcopy.pl --filename more1.tsv --iscsv 0 --dbistring "DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1" --dbiuser postgres --dbipass postgres --table testing --workingdir /tempdata --tmpdir /mytempfiles --debug 1

Command Line to load all values from bulkcopy.conf

pg_bulkcopy.pl --read "bulkcopy.conf"

Command Line to load values from bulkcopy.conf but provide or override some values from the command line. Values given on the command line take precedence over conflicting values read from file.

pg_bulkcopy.pl --read "bulkcopy.conf" --dbistring "DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1" --dbiuser postgres --dbipass postgres 
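In Perl terms the precedence rule amounts to a hash merge where the command line wins; a one-line sketch (%file_opts and %cli_opts are hypothetical names):

my %opts = ( %file_opts, %cli_opts );   # later keys overwrite earlier ones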

Description of Command Line Parameters

dbistr, dbiuser, dbipass

These are the parameters needed to establish a DBI connection. Please refer to the DBI documentation on establishing a database connection.
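For reference, these are exactly the three values handed to DBI->connect. A minimal sketch, using the values from the example configuration above:

use DBI;

my $dbh = DBI->connect(
    'DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1',   # dbistring
    'postgres',                                        # dbiuser
    'postgres',                                        # dbipass
    { RaiseError => 1 },
) or die $DBI::errstr;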

filename, table

filename is a tab- or comma-separated values text file containing the data to be imported or exported. table indicates the table in the connected database to be used for the operation.

iscsv, load and dump, trunc

Boolean values of 0 or 1. An iscsv value of 1 indicates the file is CSV; the default iscsv value of 0 indicates tab-separated. The default operation is load (load = 1); setting dump to 1 will set load to 0 and cause the program to dump instead. trunc causes an explicit truncation (deletion) of all data in the table prior to the requested operation (not useful with dump).

workingdir, tmpdir, errorlog

workingdir is where the reject and log files will be written; unless a full path/filename is specified, the data file for the operation is also expected to be found (or written) here. errorlog is the name of a file to write information about problems to; it defaults to <filename>.log.

tmpdir is where the temporary working files will be written. It defaults to /tmp. Do not overlook tmpdir: the user executing the script and the uid that Postgres runs under must both have read/write permission here, and the default creation mask must permit access to each other's newly created files!

batchsize, maxerrors, debug

batchsize controls the size of the chunks used for loading; the default is 10,000. With clean data a larger batch size will speed processing; with dirty data smaller batches will improve performance, because every time an error is encountered the offending record must be eliminated from the batch, which is currently done inefficiently by rewriting the file.

maxerrors tells the program to abort if too many errors are found. The default is 10.

debug can disable or increase the amount of logging done: 0 disables logging, 1 is normal, and 2 logs everything.

Module Pg::BulkCopy

All methods used by pg_BulkCopy.pl are provided by Pg::BulkCopy. The method names follow the convention that explicitly defined methods are in caps and methods provided by Moose are in lowercase.

Systems Supported

This utility is specific to PostgreSQL. It is a console application meant to run on the server itself: the Postgres process must be able to access the data files through the server's local file system. The utility targets recent versions of Postgres running on Unix-like operating systems; if you need to run it on Windows, good luck, and let me know if you succeed!

Using Pg::BulkCopy

my $PGBCP = Pg::BulkCopy->new(
    dbistring  => $dbistr,
    dbiuser    => $dbiuser,
    dbipass    => $dbipass,
    table      => $table,
    filename   => $filename,
    workingdir => "$tdata/",
    iscsv      => 1,
    maxerrors  => 10,
    errorlog   => 'myload.log',
);

The above example shows the creation of a new BulkCopy object. It requires the DBI information and the name of a table; workingdir will default to /tmp, and filename is required. The default behaviour is tab-separated, so iscsv is only required for a CSV import; iscsv => 0 will explicitly request TSV. errorlog defaults to $workingdir/pg_BulkCopy.ERR and can be safely omitted. To disable all logging set debug => 0; to log everything set it to 2. maxerrors defaults to 10; setting it to 0 changes it to an arbitrarily large number. All properties have a getter: $PGBCP->dbistring() will tell you what the string is. While dbistring is read-only, most properties also have a setter, so you can change the value without creating a new object.

The methods DUMP, TRUNC, and LOAD do most of the work. You can use errcode and errstr to find out about the results.
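A typical run might look like the following sketch (TRUNC, LOAD, errcode, and errstr are the documented methods; the surrounding scaffolding is illustrative):

$PGBCP->TRUNC();                 # optional: empty the table first
$PGBCP->LOAD();
if ( $PGBCP->errcode ) {
    warn 'load failed: ' . $PGBCP->errstr . "\n";
}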

The tests in the distributions t2 folder can be referred to for additional examples.

Methods for Pg::BulkCopy

CONN

Returns the DBI connection, initializing it if necessary.
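Because CONN returns an ordinary DBI handle, it can also be reused for ad hoc queries; an illustrative sketch against the testing table from the examples:

my $dbh = $PGBCP->CONN();
my ($rows) = $dbh->selectrow_array('SELECT count(*) FROM testing');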

TRUNC

Deletes all records from the table with the Postgres TRUNCATE command. When the trunc option is specified, this happens prior to the requested LOAD operation (it is not useful with DUMP).

LOAD

The main subroutine for importing bulk data into postgres.

DUMP

The main subroutine for exporting bulk data out of Postgres.

LOG

Write to the log file being used by Pg::BulkCopy. Takes a scalar value or an array; items in an array are written on separate lines. Remember that if debug is 0, nothing will ever be logged.
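For example (a sketch with made-up messages):

$PGBCP->LOG('starting load');
$PGBCP->LOG( 'phase one complete', 'phase two starting' );   # written on separate lines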

maxerrors

Gets or sets the maximum errors in a job. Setting 0 actually sets an arbitrarily large number instead. The default is 10.

batchsize

Gets or sets the batch size. If there are few errors in the source a large batch size is appropriate; if there are many errors a smaller batch will speed processing. The default is 10,000.

errcode, errstr

Returns the last error in numeric or string form. Generally this is just passed back from DBI.

iscsv

Toggles between comma- and tab-separated input. The default is tab-separated (0).

workingdir, tempdir, and filename

workingdir is where Pg::BulkCopy will look for the data file and where it will write any reject or log files. tempdir is a scratch directory to which both the script user and the postgres user must have read/write access. Both workingdir and tempdir default to /tmp. Finally, a file name for input or output is needed.

Private subroutines

BUILD

A Moose hook; it runs after new().

_DBG

Used internally for writing output to STDERR and the log file.

Troubleshooting and Issues:

Permissions

The most persistent problem in getting Pg::BulkCopy to work correctly is permissions. First you must deal with pg_hba.conf. Then, once you are able to connect as the script user both through psql and through a DBI connection, you must deal with the additional issue that you are probably not running the script as the account Postgres runs under. The account executing the script must be able to read and execute the script directories and read and write the working directory and the temp directory. Finally, the account running the Postgres server must be able to read and write in the temp directory (which defaults to /tmp).

To deal proactively with permissions issues I recommend the following steps. Check umask in /etc/profile and change it to something like 002 (which gives owner and group read/write and other read). Create a group containing the users of the script and the postgres user. On the directory where you are running the scripts, the temp directory, and the one containing data, use chmod to set the setgid bit (chmod g+s). Make sure that the directory and any pre-existing files have the correct group set. Touch a file as a user in the group and confirm that the file's group is set to the group and not the user. Other options are to use the ACL feature to manage permissions or to try running the script as the postgres user.
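As a quick sanity check before a run, something like this sketch can verify the script side of those permissions (it cannot check the postgres account's access for you):

my $tmp = '/tmp';                        # or whatever your tempdir is
die "$tmp is not read/write for this user\n" unless -r $tmp && -w $tmp;
my $mode = ( stat $tmp )[2];
warn "setgid bit is not set on $tmp\n" unless $mode & 02000;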

Other Issues

There is currently an unresolved issue with a quoted CSV input test file. The next features I expect to work on involve supporting CSV headers and field reordering, which will also make those features available for TSV files.

Options

No CSV Headers

CSV headers are not supported yet; you'll need to chop them off yourself. Field reordering also isn't supported. These features will be reconsidered for later versions.

Testing

To properly test the module and script it is necessary to have an available, configured database. So that the bundle can be installed silently through a cpan utility session, very few tests are run during installation; proper testing must be done manually. Due to its size, the test data has been moved to a separate archive, Pg-BulkCopyTest, which must be downloaded separately from CPAN. Normally its contents would be restored to the tdata directory.

Create and connect to the database

First make sure that the account you are using for testing has sufficient rights on the server. The sql directory contains a few useful scripts for creating a test database. On Linux a command like this should be able to create the database: psql postgres < create_test.sql. dbitest.pl adds a row to your new database and then deletes it; use it to verify your DBI string and that it can access the database.

The real tests are in t2.

Edit the file t2/test.conf. You will need to provide the necessary dsn values for the dbi connection.

If necessary modify harness.sh from the distribution directory as appropriate and execute it to run the tests.

AUTHOR

John Karr, <brainbuz at brainbuz.org>

BUGS

Please report any bugs or feature requests to bug-pg-bulkcopy at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Pg-BulkCopy. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Pg::BulkCopy


ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2010 John Karr.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 3 or at your option any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

A copy of the GNU General Public License is available in the source tree; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.