NAME
Text::Bloom - Evaluate Bloom signature of a set of terms
SYNOPSIS
my $b = Text::Bloom->new();
$b->Compute( qw( foo bar baz ) );
my $sig = $b->WriteToString();
$b->WriteToFile( 'afile.sig' );
my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
my $b3 = Text::Bloom->new();
$b3->Compute( qw( foo bar barbaz ) );
my $sim = $b->Similarity( $b2 );
my $b4 = Text::Bloom::NewFromString( $sig );
DESCRIPTION
Text::Bloom
applies the Bloom filtering technique to the statistical analysis of documents.
The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.
Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions, computed by the HashV
function.
Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.
The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.
The signature may be used for:
testing whether a given set of terms is present in the document,
computing which fraction of terms are common to two documents.
The bit representation may be written to and read from a file. Text::Bloom
prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib
is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.
The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression /[0-9a-z]+/
.
There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV
.
FORESEEN REUSE
The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a V
suffix. Redefining other methods will require greater attention.
CLASS METHODS
new
The constructor. No arguments are required.
$b = Text::Bloom->new();
NewFromString
Take a string written by WriteToString
(see below) and create a new Text::Bloom
with the same contents; call die
whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable.
my $b = Text::Bloom::NewFromString( $str );
The return value is a blessed reference; put in another way, this is an alternative contructor.
The string should have been written by WriteToString
; you may of course tweak the string contents, but at this point you're entirely on you own.
NewFromFile
Utility function that reads a binary file and performs a NewFromString
on its content; see its counterpart, WriteToFile
.
my $b2 = Text::Document::NewFromFile( 'foo.sig' );
INSTANCE METHODS
Size
Set and get the size of the filter, in bits. The default size is currently 128K.
print 'size is ' . $b->Size() . "\n";
$b->Size( 65536 );
The Size
method must be called before the Compute
method in order to have effect.
Compute
Compute the Bloom signature from the given set of words and store it internally.
$b->Compute( qw( foo bar baz foobar bazbaz ) );
Makes use of the QuantizeV
method.
QuantizeV
Convert a term into an integer; must return an integer in the range 0 .. $Text::Bloom::p-1
.
It is called as
my $hash = $b->QuantizeV( $term );
The current version is designed for strings matching /[a-z0-9]+/
. Other characters do not cause errors, but degrade the hash function performance.
This function is a likely candidate for redefinition.
HashV
Convert an integer to a (smaller) integer, according to one of a class of similar functions.
It is internally called as:
my $index = $b->HashV( $order, $value );
The $value
must belong to the interval 0..$Text::Bloom::p-1
, while the index must lie in 0..size-1. $order
is a small integer from 0 to d-1.
The default implementation is
index = m[order] * value + q[order] (mod size)
the values of m and q are taken from the array @Text::Bloom::hashParam
; the form of the function is taken from [2].
WriteToString
Convert the Bloom signature into a string which can be saved and later restored with NewFromString
. Compute
must have been called previously.
my $str = $b->WriteToString();
The string begins with a header which encodes the originating package, its version, the parameters of the current instance.
Whenever possible, Compress::Zlib
is used in order to compress the bit vector in the most efficient way. On systems without Compress::Zlib
, the bit string is saved uncompressed.
WriteToFile
These convenience functions just call their String counterparts and read/write the file specified in the argument.
$b->WriteToFile( 'foo.sig' );
AUTHORS
spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)
BIBLIOGRAPHY
- [1]
-
Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13, 7, July 1970, pages 422-426. (available electronically from ACM Digital Library).
- [2]
-
M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel Free-Text Searching", Communications of the ACM, 32, 10, October 1989, pages 1237-1239. (available electronically from ACM Digital Library).
BUGS
On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence.
HISTORY
2001-11-02 - initial revision