NAME

Text::DeDuper - near-duplicate detection module

SYNOPSIS

use Text::DeDuper;

$deduper = new Text::DeDuper();
$deduper->add_doc("doc1", $doc1text);
$deduper->add_doc("doc2", $doc2text);

@similar_docs = $deduper->find_similar($doc3text);

...

# delete near duplicates from an array of texts
$deduper = new Text::DeDuper();
$i = 0;
foreach $text (@texts)
{
    # skip texts that resemble one already kept
    next if $deduper->find_similar($text);

    $deduper->add_doc($i++, $text);
    push @no_near_duplicates, $text;
}

DESCRIPTION

This module uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.
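For reference, the cited paper defines the resemblance of two documents A and B as the Jaccard coefficient of their n-gram ("shingle") sets. In the formula below, S(D) stands for the set of word n-grams in document D; the notation follows the paper, and how closely the module's internals follow it is not stated here.

```latex
r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}
```

The value is 1 when both documents have the same n-gram set and 0 when they share no n-gram.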

Note of caution: The module only works correctly with languages in which text can be tokenised into words by detecting sequences of alphabetic characters. It may therefore give poor results for languages such as Chinese.
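The limitation can be illustrated with a minimal sketch of this kind of tokenisation. The tokenise helper and the lowercasing step are assumptions made for the example, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::DeDuper): split text into
# lowercase words by taking maximal runs of alphabetic characters.
sub tokenise {
    my ($text) = @_;
    return map { lc($_) } $text =~ /(\p{Alpha}+)/g;
}

# English: spaces and punctuation separate the words.
my @english = tokenise("She sells sea-shells, doesn't she?");
print join('|', @english), "\n";

# A seven-character Chinese sentence, written with escapes. Every
# character is alphabetic and nothing separates the words, so the
# whole sentence comes back as a single token.
my @chinese = tokenise("\x{5979}\x{5728}\x{6D77}\x{8FB9}\x{5356}\x{8D1D}\x{58F3}");
print scalar(@chinese), "\n";
```

With a single token per sentence, hardly any n-grams can be built, which is why the results degrade for such languages.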

METHODS

new (CONSTRUCTOR)

$deduper = new Text::DeDuper(<attribute-value-pairs>);

Create a new DeDuper instance. Supported attributes are described below, in the ATTRIBUTES section.

add_doc

$deduper->add_doc($document_id, $document_text);

Add a new document to the DeDuper's database. The $document_id must be unique for each document.

find_similar

$deduper->find_similar($document_text);

Returns a (possibly empty) array of the IDs of documents in the DeDuper's database that are similar to $document_text. This makes it simple to test whether the database contains a near duplicate:

if ($deduper->find_similar($document_text))
{
    print "at least one near duplicate found";
}

clean

$deduper->clean();

Removes all documents from DeDuper's database.

ATTRIBUTES

Attributes can be set using the constructor:

$deduper = new Text::DeDuper(
    ngram_size => 4,
    encoding   => 'iso-8859-1'
);

... or using the object methods:

$deduper->ngram_size(4);
$deduper->encoding('iso-8859-1');

The object methods can also be used for retrieving the values of the attributes:

$ngram_size = $deduper->ngram_size();
@stoplist   = $deduper->stoplist();

encoding

The character encoding of the processed texts. It must be set correctly so that alphabetic characters can be detected. Accepted values are those supported by the Encode module (see Encode::Supported).

default: 'utf8'
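A small sketch of why the encoding matters, using only the core Encode module. The sample text and variable names are illustrative, and the word-matching regex is an assumption about how alphabetic sequences are detected.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "cafe au lait" with an e-acute, stored as ISO-8859-1 bytes.
my $bytes = "caf\xE9 au lait";

# Decoded with the correct encoding, the accented letter is alphabetic.
my $right = decode('iso-8859-1', $bytes);
my @right_words = $right =~ /(\p{Alpha}+)/g;

# Decoded as UTF-8, the byte 0xE9 is malformed; with the default CHECK
# value Encode substitutes a replacement character, which is not a
# letter, so the first word is cut short.
my $wrong = decode('UTF-8', $bytes);
my @wrong_words = $wrong =~ /(\p{Alpha}+)/g;

printf "%d vs %d characters in the first word\n",
    length $right_words[0], length $wrong_words[0];
```

Words that are tokenised differently produce different n-grams, so a wrong encoding setting can hide real near duplicates.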

sim_trsh

The similarity threshold defines how similar two documents must be to be considered near duplicates. Similarity values range from 0 to 1: a value of 1 indicates that the documents are exactly the same, and a value of 0 means that they do not share any n-gram.

Any two documents will have a similarity value below the default threshold unless they share a significant part of their text.

default: 0.2
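The similarity value and the threshold check can be sketched as follows. The resemblance helper is hypothetical, 2-grams are used only to keep the data short, and whether the module compares with >= or > is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: shared n-grams divided by all distinct n-grams
# of two documents, each given as an array ref of n-gram strings.
sub resemblance {
    my ($ngrams_a, $ngrams_b) = @_;
    my %in_a = map { $_ => 1 } @$ngrams_a;
    my %in_b = map { $_ => 1 } @$ngrams_b;
    my $shared = grep { $in_b{$_} } keys %in_a;
    my $total  = keys(%in_a) + keys(%in_b) - $shared;
    return $total ? $shared / $total : 0;
}

my $sim_trsh = 0.2;    # the documented default

my $sim = resemblance(
    [ 'a b', 'b c', 'c d' ],
    [ 'b c', 'c d', 'd e' ],
);

# 2 shared n-grams out of 4 distinct ones
print "similarity: $sim\n";
print "near duplicates\n" if $sim >= $sim_trsh;
```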

ngram_size

Document similarity is based on how many n-grams the documents have in common. An n-gram is a sequence of n immediately consecutive words. For example, the text

she sells sea shells on the sea shore

contains the following 5-grams:

she sells sea shells on
sells sea shells on the
sea shells on the sea
shells on the sea shore

This attribute specifies the value of n (the size of n-gram).

default: 5
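The 5-grams listed above can be reproduced with a short sketch. The ngrams helper is hypothetical and only illustrates the sliding window; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of $n consecutive words,
# joined by single spaces. Returns an empty list when there are
# fewer than $n words.
sub ngrams {
    my ($n, @words) = @_;
    return map { join(' ', @words[ $_ .. $_ + $n - 1 ]) } 0 .. @words - $n;
}

my @words   = qw(she sells sea shells on the sea shore);
my @ngrams5 = ngrams(5, @words);

print "$_\n" for @ngrams5;
```

A text of w words yields w - n + 1 n-grams, so larger values of n give fewer n-grams and make matches stricter.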

stoplist

The stoplist is a list of very frequent words in a given language (for English, e.g. a, the, is, ...). It is a good idea to remove stoplist words from texts before similarity is computed, because two documents are quite likely to share n-grams of frequent words even if they are not similar at all.

The stoplist can be specified either as an array of words or as the name of a file in which the words are stored one per line:

$deduper->stoplist('a', 'the', 'is', @next_stopwords);
$deduper->stoplist('/path/to/english_stoplist.txt');

Do not worry if you do not have a stoplist for your language. DeDuper does a fairly good job even without one.

default: empty
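Stoplist filtering amounts to dropping the listed words before the n-grams are built. The sketch below assumes the one-word-per-line file format described above and reads it from an in-memory filehandle so that it runs without a real file; the variable names are illustrative.

```perl
use strict;
use warnings;

# Stand-in for a stoplist file with one word per line.
my $stoplist_text = "a\nthe\nis\non\n";
open my $fh, '<', \$stoplist_text
    or die "cannot open in-memory stoplist: $!";
chomp(my @stopwords = <$fh>);
close $fh;

# A hash gives constant-time membership tests.
my %is_stopword = map { $_ => 1 } @stopwords;

my @words    = qw(she sells sea shells on the sea shore);
my @filtered = grep { !$is_stopword{$_} } @words;

print "@filtered\n";
```

After filtering, n-grams are made only of the less frequent words, so unrelated documents are less likely to share them by chance.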

MODULE DEPENDENCIES

Encode

For decoding texts in various character encodings into Perl's internal form.

Digest::MD4

For optimisation by hashing n-grams.
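One plausible reading of this optimisation is that each n-gram is stored as a fixed-size digest rather than as its full text. The sketch below shows the idea with the core Digest::MD5 module as a stand-in for Digest::MD4; the ngram_key helper and the use of hex digests are assumptions, not the module's actual storage format.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);    # stand-in for Digest::MD4
use Encode qw(encode_utf8);

# Hypothetical helper: map an n-gram of any length to a fixed-size key.
# Digest functions work on bytes, so the text is encoded first.
sub ngram_key {
    my ($ngram) = @_;
    return md5_hex(encode_utf8($ngram));
}

my @ngrams = (
    'she sells sea shells on',
    'sells sea shells on the',
    'she sells sea shells on',    # repeated n-gram
);

# Store only the digests; a repeated n-gram maps to the same key.
my %seen;
$seen{ ngram_key($_) } = 1 for @ngrams;

print scalar(keys %seen), " distinct n-grams\n";
```

Fixed-size keys keep memory use independent of n-gram length, at the cost of a very small chance of collisions.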

BUGS

Please report any bugs or feature requests to bug-Text-DeDuper@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SEE ALSO

Encode, Encode::Supported, Digest::MD4

Andrei Z. Broder et al., Syntactic Clustering of the Web

http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html

Contains among other things definition of the resemblance measure.

AUTHOR

Jan Pomikalek, <xpomikal@fi.muni.cz>

COPYRIGHT & LICENSE

Copyright 2006 Jan Pomikalek, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
