NAME
Text::DeDuper - near-duplicate detection module
SYNOPSIS
use Text::DeDuper;
$deduper = new Text::DeDuper();
$deduper->add_doc("doc1", $doc1text);
$deduper->add_doc("doc2", $doc2text);
@similar_docs = $deduper->find_similar($doc3text);
...
# delete near duplicates from an array of texts
$deduper = new Text::DeDuper();
foreach $text (@texts)
{
    next if $deduper->find_similar($text);
    $deduper->add_doc($i++, $text);
    push @no_near_duplicates, $text;
}
DESCRIPTION
This module uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.
Note of caution: The module only works correctly with languages whose texts can be tokenised into words by detecting sequences of alphabetical characters. It might therefore not give very good results for languages such as Chinese.
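Broder's resemblance of two documents is the number of n-grams they share divided by the number of distinct n-grams in both taken together. The following is a minimal, self-contained sketch of that computation, not the module's actual implementation. The helper names tokenise, ngrams, and resemblance are hypothetical, and lowercasing the words is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a byte string and split it into lowercase words, where a word
# is a run of alphabetical characters (everything else is a separator).
sub tokenise {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);
    return map { lc($_) } $text =~ /(\p{Alpha}+)/g;
}

# Return the set of word n-grams of a token list as a hash reference
# whose keys are the n-grams joined by single spaces.
sub ngrams {
    my ($n, @tokens) = @_;
    my %set;
    for my $i (0 .. @tokens - $n) {
        $set{ join ' ', @tokens[$i .. $i + $n - 1] } = 1;
    }
    return \%set;
}

# Resemblance: size of the intersection divided by size of the union,
# a value between 0 and 1.
sub resemblance {
    my ($set_a, $set_b) = @_;
    my $shared = grep { exists $set_b->{$_} } keys %$set_a;
    my $union  = scalar(keys %$set_a) + scalar(keys %$set_b) - $shared;
    return $union ? $shared / $union : 0;
}

my $first  = ngrams(3, tokenise('The quick brown fox jumps over the lazy dog', 'utf8'));
my $second = ngrams(3, tokenise('The quick brown fox leaps over the lazy dog', 'utf8'));

# 4 shared 3-grams out of 10 distinct ones
printf "%.2f\n", resemblance($first, $second);   # prints 0.40
```

Changing a single word alters every n-gram that contains it, which is why the two sentences above share only four of their seven 3-grams each.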
METHODS
new (CONSTRUCTOR)
$deduper = new Text::DeDuper(<attribute-value-pairs>);
Create a new DeDuper instance. Supported attributes are described below, in the Attributes section.
add_doc
$deduper->add_doc($document_id, $document_text);
Add a new document to the DeDuper's database. The $document_id must be unique for each document.
find_similar
$deduper->find_similar($document_text);
Returns a (possibly empty) array of the IDs of documents in the DeDuper's database that are similar to $document_text. This makes it easy to test whether a near-duplicate document is in the database:
if ($deduper->find_similar($document_text))
{
    print "at least one near duplicate found";
}
clean
$deduper->clean();
Removes all documents from the DeDuper's database.
ATTRIBUTES
Attributes can be set using the constructor:
$deduper = new Text::DeDuper(
    ngram_size => 4,
    encoding => 'iso-8859-1'
);
... or using the object methods:
$deduper->ngram_size(4);
$deduper->encoding('iso-8859-1');
The object methods can also be used for retrieving the values of the attributes:
$ngram_size = $deduper->ngram_size();
@stoplist = $deduper->stoplist();
- encoding
-
The character encoding of the processed texts. It must be set to the correct value so that alphabetical characters can be detected. Accepted values are those supported by the Encode module (see Encode::Supported).
default: 'utf8'
- sim_trsh
-
The similarity threshold defines how similar two documents must be to be considered near duplicates. The similarity value ranges from 0 to 1. A value of 1 indicates that the documents are exactly the same; a value of 0 means that the documents do not share any n-gram.
With the default threshold, two documents are not considered near duplicates unless they share a significant part of their text.
default: 0.2
- ngram_size
-
Document similarity is based on how many n-grams the documents have in common. An n-gram is a sequence of n immediately consecutive words. For example, the text
she sells sea shells on the sea shore
contains the following 5-grams:
she sells sea shells on
sells sea shells on the
sea shells on the sea
shells on the sea shore
This attribute specifies the value of n (the size of n-gram).
default: 5
- stoplist
-
The stoplist is a list of very frequent words for a given language (for English, e.g. a, the, is, ...). It is a good idea to remove the stoplist words from the texts before similarity is computed, because two documents are quite likely to share n-grams of frequent words even if they are not similar at all.
The stoplist can be specified either as a list of words or as the name of a file in which the words are stored one per line:
$deduper->stoplist('a', 'the', 'is', @next_stopwords);
$deduper->stoplist('/path/to/english_stoplist.txt');
Do not worry if you do not have a stoplist for your language. DeDuper will do a pretty good job even without one.
default: empty
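To make the interplay of ngram_size and stoplist concrete, here is a small stand-alone sketch that reproduces the 5-gram example from the ngram_size entry and then repeats it with two stoplist words removed. It does not call the module; word_ngrams is a hypothetical helper, and removing stoplist words before the n-grams are built is an assumption based on the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: list the word n-grams of a text in order,
# optionally dropping stoplist words before the n-grams are built.
sub word_ngrams {
    my ($text, $n, @stoplist) = @_;
    my %stop   = map { $_ => 1 } @stoplist;
    my @tokens = grep { !$stop{$_} } map { lc($_) } $text =~ /(\p{Alpha}+)/g;
    return map { join ' ', @tokens[$_ .. $_ + $n - 1] } 0 .. @tokens - $n;
}

my $text = 'she sells sea shells on the sea shore';

# n = 5 with an empty stoplist: eight words give four 5-grams.
my @plain = word_ngrams($text, 5);

# With 'on' and 'the' in the stoplist, six words remain, giving two 5-grams.
my @filtered = word_ngrams($text, 5, 'on', 'the');

print "$_\n" for @plain;
print "--\n";
print "$_\n" for @filtered;
```

A text with fewer than n words left after filtering yields no n-grams at all, so an aggressive stoplist combined with a large n-gram size can leave short texts without anything to compare.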
MODULE DEPENDENCIES
- Encode
-
For decoding texts in various character encodings into Perl's internal form.
- Digest::MD4
-
For optimising the hashing of n-grams.
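The documentation does not say how the digest is used. One plausible reading, offered here as an assumption, is that each n-gram is stored as a fixed-length digest rather than as the full string, which keeps the keys short. The sketch below uses Digest::MD5 from the Perl core as a stand-in for Digest::MD4 so that it runs without extra modules; ngram_key is a hypothetical helper.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

# Map an n-gram (a character string) to a 16-byte binary digest.
# The digest function operates on bytes, so encode to UTF-8 first.
sub ngram_key {
    my ($ngram) = @_;
    return md5(encode_utf8($ngram));
}

# Store only the digests of the n-grams seen so far.
my %seen;
$seen{ ngram_key($_) } = 1
    for 'she sells sea shells on', 'sells sea shells on the';

# Membership is tested on the digest, not on the original string.
my $status = exists $seen{ ngram_key('she sells sea shells on') } ? 'known' : 'new';
print "$status\n";

# Every key has the same length regardless of the n-gram's length.
print length(ngram_key('shells on the sea shore')), "\n";
```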
BUGS
Please report any bugs or feature requests to bug-Text-DeDuper@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SEE ALSO
Encode, Encode::Supported, Digest::MD4
- Andrei Z. Broder et al., Syntactic Clustering of the Web
-
http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html
Contains, among other things, the definition of the resemblance measure.
AUTHOR
Jan Pomikalek, <xpomikal@fi.muni.cz>
COPYRIGHT & LICENSE
Copyright 2006 Jan Pomikalek, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.