NAME

Data::Sync - A simple metadirectory/datapump module

SYNOPSIS

 use Data::Sync;

 my $sync = Data::Sync->new(log=>"STDOUT",[configfile=>"config.dds"],[jobname=>"MyJob"]);

 $sync->source($dbhandle,{
                                select=>"select * from testtable",
                                index=>"NAME",
                                hashattributes=>["ADDRESS","PHONE"]
                        });

 or

 $sync->source($ldaphandle,{filter=>"(cn=*)",
                                scope=>"sub",
                                base=>"ou=testcontainer,dc=test,dc=org"});

 $sync->target($dbhandle,{table=>'targettable',
                                index=>'NAME'});

 or

 $sync->target($ldaphandle);

 $sync->mappings(FIRSTNAME=>'givenName',SURNAME=>'sn');

 $sync->buildattributes(dn=>"cn=%NAME%,ou=testcontainer,dc=test,dc=org",
                        objectclass=>"organizationalUnit");

 $sync->transforms(     PHONE=>'s/0(\d{4})/\+44 \($1\) /',
                        ADDRESS=>sub{my $address=shift;
                                        $address=~s/\n/\<BR\>/g;
                                        return $address});

 $sync->validation(     address=>"/street/i",
                        name=>"/(Dr|Mr|Mrs|Ms|Miss)/" );

 $sync->save("filename");

 $sync->load("filename");

 $sync->run();

 print $sync->error();

 print $sync->lastruntime();

 print $sync->commit();

DESCRIPTION

Data::Sync is a simple metadirectory/data pump module. It automates a number of the common tasks required when writing code to migrate/sync information from one datasource to another.

In order to use Data::Sync, you must define a source and a target. The first parameter to the source & target methods is a bound DBI/Net::LDAP handle.

Having defined your datasources, define how attributes map between them with mappings. If an attribute returned from the data source has no entry in the mapping table, it will be assumed that the name is the same in both databases.

Attributes can be built up from multiple other attributes using buildattributes. This uses a simple template, %FIELDNAME%, which is replaced at run time with the value of that field from the current record. More complex modifications to data can be made with transforms, which runs after the built attributes are created.

Transforms can be made with the method transforms, which takes a hash of FIELDNAME=>transformation. This transformation can be one of three things: a regular expression in string form (see synopsis), the name of a predefined transformation supplied in Data::Sync, or a code ref.

Finally, if you are confident your data is all in the right format, use run. That will read the data from the source, modify it as you have specified, validate it against the pattern matches you've specified (if any) and write it to the target.

WARNING! There is no implied or real warranty associated with the use of this software. That's fairly obvious, but worth repeating here. Metadirectory applications have the potential to destroy data in a very big way - they must be constructed carefully, and thoroughly tested before use on a live system.

CONSTRUCTOR

 my $sync = Data::Sync->new(log=>"STDOUT");
 my $sync = Data::Sync->new(log=>$fh);
 my $sync = Data::Sync->new(configfile=>"config.dds");
 my $sync = Data::Sync->new(jobname=>"MyJob");

The constructor returns a Data::Sync object. Optionally, to use logging, pass either the string STDOUT (to print logging to STDOUT) or a lexical filehandle as the log parameter. You can specify a config file to load the configuration from, in which case you don't need to call mappings/transforms etc., although you'll still need to pass the db/ldap handles (only) to source & target.

If you are using attribute hashing to minimise unnecessary writes, you should specify a jobname, as this is the name given to the SQLite hash database.

METHODS

source

 $sync->source($dbhandle,{select=>"select * from testtable"});

 or

 $sync->source($ldaphandle,{filter=>"(cn=*)",
                                scope=>"sub",
                                base=>"ou=testcontainer,dc=test,dc=org"});

 or

 $sync->source($dbhandle); # only if loading config file

Requires a valid, bound (i.e. logged in) Net::LDAP or DBI handle, and a hash of parameters for the data source (assuming you aren't loading the config from a file). LDAP parameters are: filter, scope, base, attrs, controls.

(See Net::LDAP for more details of these parameters).

DBI parameters are: select

Other source options:

By default, the source method will define the read operation as 'all in one'. If you want to handle data in batches, specify

 batchsize=>x

in the hash of read criteria. This will read a batch from the handle, perform the operation, read the next batch from the handle, and so on. Note that this will still be working against an entire record set matching your criteria, so the memory advantages are limited.

Attribute hashing can be specified with the keys:

 index=>"index/key attribute"
 hashattributes=>["attrib","attrib","attrib"]

When running, this will create an MD5 hash of the concatenation of the specified attributes and store it in a database under the specified index. The next time the job is run, it will hash the value again and compare it with the last hashed value; if they are the same, the record will not be written to the target. These entries are stored in a SQLite database - if you want to manipulate the database directly, you can do so with a sqlite3 client. The SQLite database takes its name from the 'jobname' attribute specified in $sync->new. If you didn't specify a jobname, it will default to 'noname' - so if you are running multiple jobs with attribute hashing in the same directory on your disk, it's important to make sure they have distinct names.
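The hashing step can be sketched in core Perl. This is an illustrative sketch of the logic only, not the module's internal code: the record data is invented, and the %lasthash hash stands in for the SQLite table keyed on the index attribute.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);    # core module

# A record as read from the source (hypothetical data)
my %record = (
    NAME    => 'J Bloggs',
    ADDRESS => '1 High Street',
    PHONE   => '01234 567890',
);

# Hash the concatenation of the watched attributes
my $hash = md5_hex( join( '', @record{ 'ADDRESS', 'PHONE' } ) );

# %lasthash stands in for the SQLite store, keyed on the index attribute
my %lasthash;
my $changed = !defined $lasthash{ $record{NAME} }
           || $lasthash{ $record{NAME} } ne $hash;
$lasthash{ $record{NAME} } = $hash;

print $changed ? "write record\n" : "skip unchanged record\n";
```

On a second run with unchanged ADDRESS and PHONE values, the hashes match and the record is skipped.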

target

 $sync->target($dbhandle,{table=>'targettable',
                                index=>'NAME'});

 or

 $sync->target($ldaphandle);

 or

 $sync->target($db); # only if loading config from a file

Requires a valid, bound (i.e. logged in) DBI or Net::LDAP handle, and a hash of parameters (unless you are loading the config from a file). No parameters are required for LDAP data targets, but a dn must have been either read from the data source or constructed using buildattributes. Valid DBI parameters are

 table - the table you wish to write to on the data target
 index - the attribute you wish to use as an index

There is no 'pre check' on datatypes or lengths, so if you attempt to write a record with an oversized or mismatched data type, it will fail with an error.

Note: if you are writing from DB to LDAP, you must construct all mandatory attributes using buildattributes, or additions will fail.

mappings

 $sync->mappings(FIRSTNAME=>'givenName',SURNAME=>'sn');

Maps individual field names from the data source, to their corresponding field names in the data target.
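The effect of a mapping table can be sketched in plain Perl (illustrative only; the record and field names are invented):

```perl
use strict;
use warnings;

my %map    = ( FIRSTNAME => 'givenName', SURNAME => 'sn' );
my %source = ( FIRSTNAME => 'Joe', SURNAME => 'Bloggs', PHONE => '555-0100' );

# Rename each mapped field; unmapped names pass through unchanged
my %target;
for my $field ( keys %source ) {
    my $name = $map{$field} // $field;
    $target{$name} = $source{$field};
}

print "$_ => $target{$_}\n" for sort keys %target;
```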

buildattributes

 $sync->buildattributes(dn=>"cn=%NAME%,ou=testcontainer,dc=test,dc=org",
                        objectclass=>"organizationalUnit");

Builds new target attributes up from existing source attributes. A simple template form is used - the template should be a string variable, containing the source field name between % delimiters. If no % delimiters are found, the string will be written as a literal.
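The template expansion amounts to a single substitution, which can be sketched like this (illustrative, not the module's internal code; the record data is invented):

```perl
use strict;
use warnings;

my %record   = ( NAME => 'testgroup' );
my $template = 'cn=%NAME%,ou=testcontainer,dc=test,dc=org';

# Replace each %FIELD% token with the value from the current record;
# a template with no % tokens is left as a literal
( my $dn = $template ) =~ s/%(\w+)%/$record{$1}/g;

print "$dn\n";
```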

transforms

 $sync->transforms(     PHONE=>'s/0(\d{4})/\+44 \($1\)/',
                        OFFICE=>"stripspaces",
                        ADDRESS=>sub{my $address=shift;
                        $address=~s/\n/\<BR\>/g;
                        return $address});

Converts each field in the source data using the parameters passed. Each parameter pair is the target field name, along with a regex (in a string), a coderef, or the name of a standard function. The following transformation functions are supplied in this version:

 stripspaces
 stripnewline
 uppercase
 lowercase
 concatenate

concatenate joins together the values of a multi-valued attribute with the content of $sync->{mvseparator} - this defaults to | but can be changed with:

 $sync->mvseparator("<separator>");

Transformations are recursive, so if you are importing some form of hierarchical data, the transformation will walk the tree until it finds a scalar (or a list, in the case of concatenate) that it can perform the transformation on.
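That recursive walk can be pictured as follows. This is a sketch under the assumption that multi-valued attributes arrive as array references; walk is an invented helper for illustration, not part of the module's API.

```perl
use strict;
use warnings;

my $mvseparator = '|';    # Data::Sync's default separator

# Walk nested array refs, applying the transform at each scalar leaf
sub walk {
    my ( $value, $transform ) = @_;
    return ref $value eq 'ARRAY'
        ? [ map { walk( $_, $transform ) } @$value ]
        : $transform->($value);
}

my $upper  = sub { uc shift };
my $result = walk( [ 'one', [ 'two', 'three' ] ], $upper );

# concatenate, by contrast, joins a whole list using the separator
my $joined = join $mvseparator, 'home', 'work', 'mobile';
print "$joined\n";
```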

Note: If passing a regex in a string, make sure you use single quotes. Double quotes will invite perl to interpolate the contents, with unexpected results.
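The difference is easy to demonstrate. In single quotes the $1 below survives as literal text until the substitution actually runs; in a double-quoted string perl would interpolate $1 (empty or stale) before Data::Sync ever saw the regex. The eval here is only a sketch of how a string regex is eventually applied - internally the module wraps it in a coderef.

```perl
use strict;
use warnings;

my $phone = '01234 567890';

# Single quotes keep the regex (and its $1) intact for later application
my $regex = 's/0(\d{4})/+44 ($1) /';

# Apply the string regex to the field (sketch of the effect only)
eval "\$phone =~ $regex";
die $@ if $@;

print "$phone\n";
```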

validation

 $sync->validation(     address=>"/street/i",
                        name=>"/(Dr|Mr|Mrs|Ms|Miss)/" );

Validation defines pattern matches for attributes. Validation is the last step called before writing. If any of the specified fields fails to match its pattern, the whole validation will fail and the write will not happen. Validation is optional - you don't have to specify a validation set - but it's useful to ensure that the constructed record set is what you were expecting before you write it out. Validation is also recursive, so it will handle multi-valued attributes and subtrees in LDAP.
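The pass/fail logic can be sketched like this (illustrative only; the field names and data are invented, and the real module also recurses into multi-valued attributes):

```perl
use strict;
use warnings;

my %rules = (
    address => qr/street/i,
    name    => qr/(Dr|Mr|Mrs|Ms|Miss)/,
);

my %record = (
    name    => 'Mr J Bloggs',
    address => '1 High Street',
    phone   => '01234 567890',    # no rule, so not checked
);

# Every ruled field must match its pattern, or the record fails
my @failures = grep {
    !defined $record{$_} || $record{$_} !~ $rules{$_}
} keys %rules;

my $valid = @failures ? 0 : 1;
print $valid ? "valid\n" : "invalid\n";
```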

save

 $sync->save("filename");

Saves the config to a Data::Dump::Streamer file. Returns 1 on success.

load

 $sync->load("filename");

Loads the config from a Data::Dump::Streamer file previously created with save. You still need to define the source and target db/ldap handles with source & target, but if you've loaded the config from a file you can omit the hash of options.

run

 $sync->run() or die $sync->error."\n";

No parameters. Reads the data from the source, converts and renames it as defined in mappings, buildattributes and transforms, and writes it to the target.

error

 print $sync->error;

Returns the last error encountered by the module. This is set e.g. when a file fails to load correctly or a SQL error is encountered. When this occurs, the return value from the called function will be zero, and error() should be called to identify the problem.

lastruntime

 print $sync->lastruntime;

Returns the last time the job was run as YYYYMMDDHHMMSS. This is saved in the config file.

mvseparator

 $sync->mvseparator("<separator>");

 print $sync->mvseparator();

Sets or returns the multi-valued attribute separator (defaults to |).

commit

 $sync->commit();

Calls commit on the write handle, where the write handle is DBI (there's no rollback/commit available in LDAP). This is provided as a convenience, in case you have autocommit turned off on your db handle.

PREREQS

Data::Dump::Streamer

If you are using DBI datasources, you will need DBI & the appropriate DBI drivers.

If you are using LDAP datasources, you will need Net::LDAP.

If you are using attribute hashing, you will also need DBI & DBD::SQLite.

VERSION

0.06

BUGS & CAVEATS

Data::Sync handles column/attribute names as case sensitive. PostgreSQL (at least at time of writing) appears to return column names in lower case, whether they were created in lower case or not. I make no guarantees about this, but using lower-case column names in all Data::Sync code seems to work OK. Please ensure that any code you write using this module with PostgreSQL is particularly thoroughly tested.

TODO

Friendly CSV/TD source/target methods

Modular datasource/targets for including non dbi/ldap datasources?

Example using AnyData & XML

Deletion support (somehow, anyhow....)

Delta support/timestamp detection/changelog & persistent search

Multiple sources in a single job?

Multiple targets in a single job?

Caching?

UTF8/ANSI handling.

Perltidy the tests (thanks for spotting the mess Gavin)

Use SQL::Abstract instead of constructing statements?

CHANGES

v0.06

Implemented a commit method for the writehandle.

Implemented validate function

v0.05

Fixed some 0E0 problems with return values from DBI.

Added postgresql caveat.

Extended developer test suite to include MySQL & Postgresql

v0.04

Implemented basic attribute hashing

Added concatenate function for multivalued ldap attributes

v0.03

Added uppercase and lowercase transformations

Moved read and write subs out of anonymous blocks

hid raw regex in #!#<regex>#!# inside coderef for regex transformations (can be parsed out for display/edit in gui)

implemented batch updating

V0.02

Implemented load & save functions.

Implemented error function

Modified stripnewlines to replace with whitespace.

COPYRIGHT

Copyright (c) 2004-2005 Charles Colbourn. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Charles Colbourn

charlesc@g0n.net