NAME

Text::RecordDeduper - Remove duplicate records from a text file

SYNOPSIS

use Text::RecordDeduper;

my $deduper = Text::RecordDeduper->new;

# Find and remove entire lines that are duplicated
$deduper->dedupe_file("orig.txt");

# Dedupe comma separated records, duplicates defined by several fields
$deduper->field_separator(',');
$deduper->add_key(field_number => 1, ignore_case => 1 );
$deduper->add_key(field_number => 2);

# Find 'near' dupes by allowing for given name aliases
my %nick_names = (Bob => 'Robert',Rob => 'Robert');
my $near_deduper = Text::RecordDeduper->new();
$near_deduper->add_key(field_number => 2, alias => \%nick_names) or die;
$near_deduper->dedupe_file("names.txt");

DESCRIPTION

This module allows you to take a text file of records and split it into a file of unique and a file of duplicate records.

Records are defined as a set of fields. Fields may be separated by spaces, commas, tabs or any other delimiter. Records are separated by a newline.

If no options are specified, a record counts as a duplicate only when the entire record is repeated.

By specifying options, a duplicate record is defined by which fields, or parts of fields, must not occur more than once in the file. There are also options to ignore white space and case.

Additionally, 'near' or 'fuzzy' duplicates can be defined. This is done by creating aliases, such as Bob => Robert.
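The idea of a duplicate key built from selected fields can be sketched in plain Perl. This is an illustration only, not the module's implementation: the helper make_key and the layout of its arguments are hypothetical, and the option names are borrowed from the add_key calls in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::RecordDeduper: build a comparison
# key from selected fields of one record.
sub make_key {
    my ($record, $separator, @key_specs) = @_;
    my @fields = split /\Q$separator\E/, $record;
    my @parts;
    for my $spec (@key_specs) {
        # field_number counts from 1, as in the add_key examples
        my $value = $fields[ $spec->{field_number} - 1 ];
        $value = '' unless defined $value;
        # Map a nickname to its canonical form when an alias table is given
        if ($spec->{alias} && exists $spec->{alias}{$value}) {
            $value = $spec->{alias}{$value};
        }
        $value = lc $value if $spec->{ignore_case};
        push @parts, $value;
    }
    # Join with a control character unlikely to occur in the data
    return join "\x1F", @parts;
}

my %nick_names = (Bob => 'Robert', Rob => 'Robert');
my @specs = (
    { field_number => 2, alias => \%nick_names },
    { field_number => 3, ignore_case => 1 },
);

# Both records reduce to the same key, so the second is a near duplicate
my $same = make_key('100 Robert Smith', ' ', @specs)
        eq make_key('101 Bob SMITH',    ' ', @specs);
print $same ? "duplicate\n" : "unique\n";
```

Two records are then duplicates exactly when their keys are equal, whatever the rest of the record contains.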

Example

Given a text file names.txt with space separated values and duplicates defined by the second and third columns:

100 Robert Smith
101 Bob Smith
102 John Brown
103 Jack White
104 Bob Smythe
105 Robert Smith

use Text::RecordDeduper;

my %nick_names = (Bob => 'Robert',Rob => 'Robert');
my $near_dedupes = Text::RecordDeduper->new();
$near_dedupes->field_separator(' ');
$near_dedupes->add_key(field_number => 2, alias => \%nick_names) or die;
$near_dedupes->add_key(field_number => 3) or die;
$near_dedupes->dedupe_file("names.txt");

Text::RecordDeduper will produce a file of unique records, names_uniqs.txt

100 Robert Smith
102 John Brown
103 Jack White
104 Bob Smythe

and a file of duplicates, names_dupes.txt

101 Bob Smith
105 Robert Smith

The original file, names.txt, is left intact.
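The split into unique and duplicate records shown above can be reproduced with a small in-memory sketch. It assumes that the first record seen for a key is kept as unique and later ones are treated as duplicates, which matches the output shown here. It does not use the module, and the variable names are illustrative.

```perl
use strict;
use warnings;

my @records = (
    '100 Robert Smith', '101 Bob Smith',  '102 John Brown',
    '103 Jack White',   '104 Bob Smythe', '105 Robert Smith',
);
my %nick_names = (Bob => 'Robert', Rob => 'Robert');

my (%seen, @uniqs, @dupes);
for my $record (@records) {
    # The key is the second and third space separated fields
    my (undef, $given, $surname) = split / /, $record;
    # Replace a nickname with its canonical form, if one is defined
    $given = $nick_names{$given} if exists $nick_names{$given};
    my $key = "$given\t$surname";
    # The first record with a given key is unique; later ones are duplicates
    if ($seen{$key}++) { push @dupes, $record }
    else               { push @uniqs, $record }
}

print "uniqs: $_\n" for @uniqs;
print "dupes: $_\n" for @dupes;
```

Record 104 stays unique because Smythe differs from Smith, even though Bob is aliased to Robert.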

METHODS

new

The new method creates an instance of a deduping object. This must be called before any of the following methods are invoked.

field_separator

Sets the token to use as the field delimiter. Accepts any character, as well as Perl escape sequences such as \t. If this method is not called, the deduper assumes you have fixed-width fields.

$deduper->field_separator(',');
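As a general Perl note, a separator such as '|' or '.' is a regular expression metacharacter, so code that splits on a user supplied separator normally quotes it first. The sketch below shows that technique with a hypothetical helper, split_record. Whether the module splits fields this way internally is an assumption, not something stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record on a literal separator string
sub split_record {
    my ($record, $separator) = @_;
    # \Q...\E quotes regex metacharacters such as '|' or '.'
    # A limit of -1 keeps trailing empty fields
    return split /\Q$separator\E/, $record, -1;
}

my @csv  = split_record('100,Robert,Smith', ',');
my @tsv  = split_record("100\tRobert\tSmith", "\t");
my @pipe = split_record('100|Robert|', '|');

# Each record yields three fields; the last pipe field is empty
print scalar(@csv), " ", scalar(@tsv), " ", scalar(@pipe), "\n";
```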

add_key

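Adds a field to the key that defines a duplicate record. The field_number option selects the field, counting from 1. The ignore_case and alias options, shown in the examples above, relax the comparison for that field. Call add_key once for each field in the key.

$deduper->add_key(field_number => 1, ignore_case => 1 );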

dedupe_file

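Reads the given file and writes the unique records to one file and the duplicates to another. The output files are named after the input file with the suffixes _uniqs and _dupes, for example names_uniqs.txt and names_dupes.txt for names.txt. The input file is left intact.

$deduper->dedupe_file("orig.txt");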

TO DO

Allow for multi line records
Ignore leading and trailing white space in fields
Add batch mode driven by a config file
Warn the user when overwriting output files
Allow user to customise suffix for uniq and dupe output files

AUTHOR

RecordDeduper was written by Kim Ryan <kimryan at cpan d o t org>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.