NAME
DBIx::Compare::ContentChecksum::mysql - Extension to DBIx::Compare that enables a more detailed comparison of MySQL databases.
SYNOPSIS
use DBIx::Compare::ContentChecksum::mysql;
my $oDB_Comparison = db_comparison->new($dbh1,$dbh2);
$oDB_Comparison->group_concat_max_len(10000000);
$oDB_Comparison->compare;
DESCRIPTION
DBIx::Compare::ContentChecksum::mysql takes two MySQL database handles and performs a low-level comparison of their table content. It was developed to compare databases before and after table optimisation, but would be useful in any scenario where you need to check that two databases are identical.
DBIx::Compare::ContentChecksum::mysql utilises the MySQL functions 'GROUP_CONCAT' and 'MD5' to generate an MD5 checksum of each field of each table, with the rows ordered by primary key, both ascending and descending. It then simply compares the checksums returned from each database and prints out a list of the tables/fields in which differences were found.
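The SQL involved is roughly of the following form. This is an illustrative sketch only, not the module's exact statement; 'my_table', 'my_field' and 'id' are placeholder names, and the separator handling is an assumption.

# Sketch: one checksum per sort order, joined for comparison.
# Placeholder names; the module builds the real SQL internally.
my ($asc) = $dbh1->selectrow_array(
    "SELECT MD5(GROUP_CONCAT(my_field ORDER BY id ASC SEPARATOR '')) FROM my_table"
);
my ($desc) = $dbh1->selectrow_array(
    "SELECT MD5(GROUP_CONCAT(my_field ORDER BY id DESC SEPARATOR '')) FROM my_table"
);
my $checksum = $asc . $desc;   # compared against the same value from $dbh2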
MySQL has a built-in variable called group_concat_max_len. This limits the length of the value returned by the GROUP_CONCAT function, truncating longer values. Helpfully, MySQL will issue a warning to let you know that the returned value has been truncated. The default value for group_concat_max_len is a paltry 1024 bytes, which isn't useful for much. If your database is likely to generate larger concatenated values, then calculate/determine/guess the group_concat_max_len you'll need and set it with the method "group_concat_max_len()". See the "DISCLAIMER" below.
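One rough way to estimate the value you need is to measure the total length of the largest field you intend to concatenate. The sketch below is only an illustration; 'my_table' and 'my_field' are placeholder names, and the added COUNT(*) is a crude allowance for separator characters.

# Rough estimate of the concatenated length of one field.
my ($len) = $dbh1->selectrow_array(
    "SELECT SUM(LENGTH(my_field)) + COUNT(*) FROM my_table"
);
$oDB_Comparison->group_concat_max_len($len);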
Caveats
It's worth noting that while this process can conclusively prove that two databases are different, it can only indicate that two databases are likely to be identical. My understanding of the way MD5 works is that identical checksums provide a very high probability that the inputs to the algorithm were identical. DBIx::Compare::ContentChecksum::mysql attempts to improve upon this probability by generating two separate checksums for each field, sorted by ascending and descending primary keys, which it concatenates prior to comparison. (It's conjecture on my part that this improves the comparison, but please feel free to let me know if that's not the case.)
Regardless, there could always be occasions where the table contents are indeed different, but the same checksum is produced. For instance, if a field in table 1 holds the values (1,11) across two rows, whereas the same field in table 2 holds (11,1), then the GROUP_CONCAT function will return '111' in both cases and in both sort orders, and the resulting MD5 checksums will be identical. You'll have to judge for yourself whether this kind of issue is likely in your schema, and whether this is a sensible approach to testing your database content.
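You can reproduce a collision like this directly against a database handle. The sketch below is purely illustrative; the table and column names are invented, and it assumes the values are concatenated without a separator, as in the example above.

# t1.val in primary-key order: 1, 11
# t2.val in primary-key order: 11, 1
# Both concatenate to '111' in either sort order, so the checksums match.
my ($c1) = $dbh1->selectrow_array(
    "SELECT MD5(GROUP_CONCAT(val ORDER BY id ASC SEPARATOR '')) FROM t1"
);
my ($c2) = $dbh1->selectrow_array(
    "SELECT MD5(GROUP_CONCAT(val ORDER BY id ASC SEPARATOR '')) FROM t2"
);
print "false match\n" if $c1 eq $c2;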
One issue I discovered during testing of this module is that in some cases, identical data in two tables of different engine types (MyISAM and InnoDB) will return different MD5 checksums. I've no idea how that happens, but hopefully someone will tell me. Currently the module does not check the engine types, and can't bring this to your attention other than by telling you the tables are different.
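If you want to rule this out yourself, you can check each table's storage engine before comparing. This is a hedged sketch using the standard information_schema view; the schema name is a placeholder.

# List the storage engine of every table in the given schema.
my $sth = $dbh1->prepare(
    "SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = ?"
);
$sth->execute('my_database');
while (my ($table, $engine) = $sth->fetchrow_array) {
    print "$table\t$engine\n";
}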
METHODS
- compare
Performs the comparison. Calls the methods compare_table_lists, compare_row_counts and compare_fields_checksum in order. In scalar context, returns a hashref of the differences found. In void context, this method outputs a report to STDOUT. (See the usage sketch after this list.)
- compare_fields_checksum
Comparison of the MD5 checksum of each concatenated field. Can be passed a table name, or will compare all tables. Returns true if no differences are found, otherwise returns undef. An array ref of the fields that return different checksums can be recovered with get_differences(), using the hash key 'Table [table name] fields'.
- field_checksum($table,$field)
Returns the combined MD5 checksum for the given table/field. This is actually two joined checksums - one of the group_concat sorted by ascending primary key, and the other sorted by descending key (or by descending each key in the case of a compound primary key).
- group_concat_max_len()
Set the MySQL variable group_concat_max_len. This defaults to 1024, but you might need a larger value depending on your data.
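The following is a hedged sketch of how these methods might be used together; 'my_table' and 'my_field' are placeholder names, and the value passed to group_concat_max_len() is arbitrary.

my $oDB_Comparison = db_comparison->new($dbh1, $dbh2);

# Raise the GROUP_CONCAT limit before comparing large fields.
$oDB_Comparison->group_concat_max_len(10_000_000);

# Void context: prints a report to STDOUT.
$oDB_Comparison->compare;

# Scalar context: returns a hashref of the differences found.
my $hDiffs = $oDB_Comparison->compare;

# Compare the fields of a single table (all tables if no name is given).
$oDB_Comparison->compare_fields_checksum('my_table');

# Combined ascending/descending checksum for one field.
my $checksum = $oDB_Comparison->field_checksum('my_table', 'my_field');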
DISCLAIMER
I have no idea how big a concatenated value would have to be to cause havoc with your system. I took care to gradually test the limits of my own system before I risked crashing everything spectacularly, and would suggest you do the same. I accept no responsibility for the consequences of using this module to test a database with billions of rows, huge text fields or massive blobs, and then wondering why it dies. (In fact, this module has not even been tested on blobs, and I don't even know if you can concatenate them.)
Having said that, a dual quad-core Xeon MySQL server with 4GB of RAM and a group_concat_max_len value of 1,000,000,000 was happy (albeit a bit slow) concatenating 100,000,000 rows of a varchar(20) with an average length of 7.5. This is only a guide to the possibilities, and is in no way a recommendation or formula for success.
SEE ALSO
DBIx::Compare
AUTHOR
Christopher Jones, Gynaecological Cancer Research Laboratories, UCL EGA Institute for Women's Health, University College London.
c.jones@ucl.ac.uk
COPYRIGHT AND LICENSE
Copyright 2008 by Christopher Jones, University College London
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.