NAME
Text::RecordDeduper - Remove duplicate records from a text file
SYNOPSIS
use Text::RecordDeduper;
my $deduper = Text::RecordDeduper->new;
# Find and remove entire lines that are duplicated
$deduper->dedupe_file("orig.txt");
# Dedupe comma separated records, duplicates defined by several fields
$deduper->field_separator(',');
$deduper->add_key(field_number => 1, ignore_case => 1 );
$deduper->add_key(field_number => 2);
# Find 'near' dupes by allowing for given name aliases
my %nick_names = (Bob => 'Robert', Rob => 'Robert');
my $near_deduper = Text::RecordDeduper->new();
$near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;
$near_deduper->dedupe_file("names.txt");
DESCRIPTION
This module allows you to take a text file of records and split it into a file of unique records and a file of duplicate records.
Records are defined as a set of fields. Fields may be separated by spaces, commas, tabs or any other delimiter. Records are separated by a newline.
If no options are specified, a record is treated as a duplicate only when the entire record is repeated.
By specifying options, you define a duplicate record by the fields, or parts of fields, whose values must not occur more than once in the file. There are also options to ignore white space and case sensitivity.
Additionally, 'near' or 'fuzzy' duplicates can be defined. This is done by creating aliases, such as Bob => Robert.
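The key options described above can be pictured as a small normalisation step applied to each key field before records are compared. The following is a minimal sketch in plain Perl, not the module's actual implementation; the helper name normalise_key and the option name ignore_whitespace are assumptions made for illustration (ignore_case and alias are the option names used in this document).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::RecordDeduper.
# Reduces one field value to the form used when comparing records.
sub normalise_key {
    my ($value, %opt) = @_;

    # Replace a known alias with its canonical value (exact match only)
    $value = $opt{alias}{$value}
        if $opt{alias} && exists $opt{alias}{$value};

    # Assumed option name: strip all white space from the value
    $value =~ s/\s+//g if $opt{ignore_whitespace};

    # Compare without regard to case
    $value = lc $value if $opt{ignore_case};

    return $value;
}

my %nick_names = (Bob => 'Robert', Rob => 'Robert');
print normalise_key('Bob', alias => \%nick_names, ignore_case => 1), "\n";   # prints robert
```

Two records whose normalised key fields are equal would then count as duplicates of each other. In this sketch the alias lookup runs before case folding, so an alias only matches when the value is spelled exactly as in the alias table.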
Example
Given a text file names.txt with space separated values and duplicates defined by the second and third columns:
100 Robert Smith
101 Bob Smith
102 John Brown
103 Jack White
104 Bob Smythe
105 Robert Smith
use Text::RecordDeduper;
my %nick_names = (Bob => 'Robert', Rob => 'Robert');
my $near_dedupes = Text::RecordDeduper->new();
$near_dedupes->field_separator(' ');
$near_dedupes->add_key(field_number => 2, alias => \%nick_names) or die;
$near_dedupes->add_key(field_number => 3) or die;
$near_dedupes->dedupe_file("names.txt");
Text::RecordDeduper will produce a file of unique records, names_uniqs.txt:
100 Robert Smith
102 John Brown
103 Jack White
104 Bob Smythe
and a file of duplicates, names_dupes.txt:
101 Bob Smith
105 Robert Smith
The original file, names.txt, is left intact.
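The split in this example can be reproduced in plain Perl with a hash of keys already seen. This is only a sketch of the general idea, not the module's actual implementation; the helper name split_dupes and its argument layout are assumptions made for illustration, and for simplicity the alias table is applied to every key field rather than to one chosen field.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::RecordDeduper.
# $records    - reference to a list of space separated records
# $alias      - reference to a hash of alias => canonical value
# @key_fields - 1-based field numbers that together form the key
sub split_dupes {
    my ($records, $alias, @key_fields) = @_;
    my (%seen, @uniqs, @dupes);

    for my $record (@$records) {
        my @fields = split ' ', $record;

        # Build the key from the chosen fields, resolving aliases
        my @key_parts;
        for my $n (@key_fields) {
            my $value = $fields[$n - 1];
            $value = $alias->{$value} if exists $alias->{$value};
            push @key_parts, $value;
        }
        my $key = join "\0", @key_parts;

        # The first record with a given key is unique, later ones are dupes
        if ($seen{$key}++) { push @dupes, $record }
        else               { push @uniqs, $record }
    }
    return (\@uniqs, \@dupes);
}

my %nick_names = (Bob => 'Robert', Rob => 'Robert');
my @records = (
    '100 Robert Smith', '101 Bob Smith',  '102 John Brown',
    '103 Jack White',   '104 Bob Smythe', '105 Robert Smith',
);
my ($uniqs, $dupes) = split_dupes(\@records, \%nick_names, 2, 3);
print "unique:    $_\n" for @$uniqs;
print "duplicate: $_\n" for @$dupes;
```

Records 101 and 105 land in the duplicate list because, once Bob is resolved to Robert, their second and third fields match those of record 100.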
METHODS
new
The new method creates an instance of a deduping object. This must be called before any of the following methods are invoked.
field_separator
Sets the token to use as the field delimiter. Accepts any character, as well as Perl escape sequences such as \t. If this method is not called, the deduper assumes you have fixed-width fields.
$deduper->field_separator(',');
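To illustrate what a field delimiter does to a record, here is a minimal plain Perl sketch that splits a record on a literal separator. It does not show how the module handles the separator internally; the helper name split_record is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::RecordDeduper.
sub split_record {
    my ($record, $separator) = @_;
    # \Q...\E quotes regex metacharacters, so '|' or '.' match literally.
    # The limit of -1 keeps trailing empty fields.
    return split /\Q$separator\E/, $record, -1;
}

my @fields = split_record('100,Robert,Smith', ',');
print join(' | ', @fields), "\n";

# A tab written as the Perl escape "\t" works the same way
my @tabbed = split_record("100\tRobert\tSmith", "\t");
print scalar(@tabbed), " fields\n";
```

Quoting the separator matters for delimiters such as the pipe character, which would otherwise be read as regex alternation.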
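add_key
Adds a field to the key that defines a duplicate record. The examples in this document use the options field_number, ignore_case and alias.
$deduper->add_key(field_number => 1, ignore_case => 1 );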
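dedupe_file
Dedupes the named file, writing one file of unique records and one file of duplicate records and leaving the original file intact.
$deduper->dedupe_file("orig.txt");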
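The example in this document shows names.txt producing names_uniqs.txt and names_dupes.txt. The sketch below derives output names in that pattern in plain Perl. It is an assumption about the naming rule based on that one example, not the module's code, and the helper name output_names is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::RecordDeduper.
# Inserts _uniqs and _dupes before the file extension, if there is one.
sub output_names {
    my ($file) = @_;
    # Split off a final ".ext" that contains no further dot or slash
    my ($base, $ext) = $file =~ /^(.*?)(\.[^.\/]*)?$/;
    $ext = '' unless defined $ext;
    return ("${base}_uniqs$ext", "${base}_dupes$ext");
}

my ($uniq_file, $dupe_file) = output_names('names.txt');
print "$uniq_file\n$dupe_file\n";
```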
TO DO
Allow for multi line records
Ignore leading and trailing white space in fields
Add batch mode driven by a config file
Allow user to be warned when overwriting output files
Allow user to customise suffix for uniq and dupe output files
SEE ALSO
sort(1), uniq(1), Text::RecordParser, Text::xSV
AUTHOR
RecordDeduper was written by Kim Ryan <kimryan at cpan d o t org>
COPYRIGHT AND LICENSE
Copyright (C) 2005 Kim Ryan.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.