NAME

Text::RecordDeduper - Separate complete, partial and near duplicate text records

SYNOPSIS

use Text::RecordDeduper;

my $deduper = Text::RecordDeduper->new;

# Find and remove entire lines that are duplicated
$deduper->dedupe_file("orig.txt");

# Dedupe comma separated records, duplicates defined by several fields
$deduper->field_separator(',');
$deduper->add_key(field_number => 1, ignore_case => 1 );
$deduper->add_key(field_number => 2, ignore_whitespace => 1);

# Find 'near' dupes by allowing for given name aliases
my %nick_names = (Bob => 'Robert',Rob => 'Robert');
my $near_deduper = Text::RecordDeduper->new();
$near_deduper->field_separator(' ');
$near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;
$near_deduper->dedupe_file("names.txt");

DESCRIPTION

This module allows you to take a text file of records and split it into a file of unique and a file of duplicate records.

Records are defined as a set of fields. Fields may be separated by spaces, commas, tabs or any other delimiter. Records are separated by a new line.

If no options are specified, a record is treated as a duplicate only when the entire record is duplicated.

By specifying options, you can define a duplicate by the fields, or partial fields, that must not occur more than once in the file. There are also options to ignore case and to ignore leading and trailing white space.

Additionally 'near' or 'fuzzy' duplicates can be defined. This is done by creating aliases, such as Bob => Robert.

Example

Given a text file names.txt with space separated values and duplicates defined by the second and third columns:

100 Robert   Smith    
101 Bob      Smith    
102 John     Brown    
103 Jack     White   
104 Bob      Smythe    
105 Robert   Smith    

use Text::RecordDeduper;

my %nick_names = (Bob => 'Robert',Rob => 'Robert');
my $near_deduper = Text::RecordDeduper->new();
$near_deduper->field_separator(' ');
$near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;
$near_deduper->add_key(field_number => 3) or die;
$near_deduper->dedupe_file("names.txt");

Text::RecordDeduper will produce a file of unique records, names_uniqs.txt

100 Robert   Smith    
102 John     Brown    
103 Jack     White   
104 Bob      Smythe    

and a file of duplicates, names_dupes.txt

101 Bob      Smith    
105 Robert   Smith   

The original file, names.txt is left intact.
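
The split above can be traced with a short, self-contained sketch. It does not use Text::RecordDeduper and is not the module's implementation; split_records is a hypothetical helper that works on an in-memory list, builds a key from fields 2 and 3, and applies the alias table to field 2 before comparing.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::RecordDeduper: split records into
# uniques and duplicates, keyed on fields 2 and 3, with aliases on field 2.
sub split_records {
    my ($records, $alias) = @_;
    my (%seen, @uniqs, @dupes);
    for my $record (@$records) {
        # split ' ' breaks on runs of white space
        my @fields = split ' ', $record;
        my $given  = $fields[1];
        $given = $alias->{$given} if exists $alias->{$given};
        # Join the key fields with a character that cannot occur in the data
        my $key = join "\0", $given, $fields[2];
        if ($seen{$key}++) { push @dupes, $record }
        else               { push @uniqs, $record }
    }
    return (\@uniqs, \@dupes);
}

my %nick_names = (Bob => 'Robert', Rob => 'Robert');
my @records = (
    '100 Robert   Smith',
    '101 Bob      Smith',
    '102 John     Brown',
    '103 Jack     White',
    '104 Bob      Smythe',
    '105 Robert   Smith',
);

my ($uniqs, $dupes) = split_records(\@records, \%nick_names);
print "unique:    $_\n" for @$uniqs;
print "duplicate: $_\n" for @$dupes;
```

The first record seen with a given key is kept as unique and every later record with the same key is counted as a duplicate, which is why 101 and 105 end up in the duplicates list.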

METHODS

new

The new method creates an instance of a deduping object. This must be called before any of the following methods are invoked.

field_separator

Sets the token to use as the field delimiter. Accepts any character, as well as Perl escaped characters such as \t. If this method is not called, the deduper assumes you have fixed width fields.
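
As an illustration of splitting a record on a delimiter, here is a minimal sketch in plain Perl. split_fields is a hypothetical helper, and treating the separator as a literal string (rather than a regular expression) is an assumption, not a statement about how the module parses records.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record on a literal separator string.
sub split_fields {
    my ($record, $separator) = @_;
    # \Q...\E quotes regex metacharacters such as '|' or '.';
    # the -1 limit keeps trailing empty fields
    return split /\Q$separator\E/, $record, -1;
}

my @comma = split_fields('100,Robert,Smith', ',');
my @tab   = split_fields("100\tRobert\tSmith", "\t");
my @pipe  = split_fields('100|Robert|', '|');

print join(' / ', @comma), "\n";
print join(' / ', @tab), "\n";
print scalar(@pipe), " fields in the pipe separated record\n";
```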

$deduper->field_separator(',');

add_key

Lets you add a field to the definition of a duplicate record. If no keys have been added, the entire record becomes the key, so that only records duplicated in their entirety are removed.

$deduper->add_key(
    field_number      => 1,
    key_length        => 5,
    ignore_case       => 1,
    ignore_whitespace => 1,
    alias             => \%nick_names,
);

field_number

Specifies the number of the field in the record to add to the key (1, 2, ...). Note that this option only applies to character separated data. You will get a warning if you try to specify a field_number for fixed width data.

start_pos

Specifies the position of the field, in characters, to add to the key. Note that this option only applies to fixed width data. You will get a warning if you try to specify a start_pos for character separated data. You must also specify a key_length.

key_length

The length of a key field. This must be specified if you are using fixed width data (along with a start_pos). It is optional for character separated data.
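
For fixed width data, a key field is a slice of the record identified by a starting position and a length. The sketch below shows that idea with Perl's substr. fixed_width_key is a hypothetical helper, and it assumes start_pos counts from 1; the module's own convention may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: extract a key field from a fixed width record.
# Assumes start_pos counts from 1 (an assumption, not confirmed by the module).
sub fixed_width_key {
    my ($record, $start_pos, $key_length) = @_;
    # substr uses 0-based offsets, so shift the 1-based position down by one
    return substr($record, $start_pos - 1, $key_length);
}

my $record = '100 Robert   Smith';
my $given  = fixed_width_key($record, 5, 9);    # columns 5 to 13
my $family = fixed_width_key($record, 14, 5);   # columns 14 to 18

print "[$given] [$family]\n";
```

The padding stays part of the slice, so a key taken this way is compared with its trailing spaces unless white space is stripped first.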

ignore_case

When defining a duplicate, ignore the case of characters, so Robert and ROBERT are equivalent.

ignore_whitespace

When defining a duplicate, ignore white space that leads or trails a field's data.

alias

When defining a duplicate, allow for alias substitution. For example:

my %nick_names = (Bob => 'Robert',Rob => 'Robert');
$near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;

Whenever field 2 contains 'Bob', it will be treated as a duplicate of a record where field 2 contains 'Robert'.
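
One way to picture how ignore_whitespace, alias and ignore_case combine is as a normalisation step applied to a field before it is compared. The sketch below is a stand-alone illustration only: normalise_field is a hypothetical helper, and the order in which the options are applied (trim, then alias, then case folding) is an assumption rather than the module's documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise one field value before comparing keys.
# The order of the steps is an assumption made for this sketch.
sub normalise_field {
    my ($value, %opt) = @_;
    if ($opt{ignore_whitespace}) {
        $value =~ s/^\s+|\s+$//g;          # strip leading and trailing white space
    }
    if ($opt{alias} && exists $opt{alias}{$value}) {
        $value = $opt{alias}{$value};      # replace a nickname with its full form
    }
    $value = uc $value if $opt{ignore_case};
    return $value;
}

my %nick_names = (Bob => 'Robert', Rob => 'Robert');

my $first  = normalise_field('  Bob ', ignore_whitespace => 1, alias => \%nick_names);
my $second = normalise_field('Robert', ignore_whitespace => 1, alias => \%nick_names);

print $first eq $second ? "duplicate\n" : "unique\n";
```

With this ordering the alias lookup is case sensitive, so 'BOB' would not be mapped to 'Robert' unless the alias table also lists it.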

dedupe_file

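Takes the name of a text file and splits its records into a file of unique records and a file of duplicates. The output files are named after the input file with _uniqs and _dupes added before the extension, for example names_uniqs.txt and names_dupes.txt for names.txt. The original file is left intact.

$deduper->dedupe_file("orig.txt");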

TO DO

Allow for multi line records
Add batch mode driven by config file or command line options
Warn the user when overwriting output files
Allow user to customise suffix for uniq and dupe output files

SEE ALSO

sort(1), uniq(1), Text::ParseWords, Text::RecordParser, Text::xSV

AUTHOR

RecordDeduper was written by Kim Ryan <kimryan at cpan d o t org>

COPYRIGHT AND LICENSE

Copyright (C) 2005 Kim Ryan.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.