NAME

Algorithm::RabinKarp - rabin-karp streaming hash

SYNOPSIS

my $text = "A do run run run, a do run run";
my $kgram = Algorithm::RabinKarp->new($window, $text);

or

my $kgram2 = Algorithm::RabinKarp->new($window, $fh);

or my $kgram3 = Algorithm::RabinKarp->new($window, sub { ... return $num, $position; });

my ($first, $start_position, $end_position) = $kgram->next;

my @values = $kgram->values;

my %occurances; # a dictionary of all kgrams.
while (my ($hash, @pos) = @{shift @values}) {
  push @{$occurances{$hash}}, \@pos; 
}

my $needle = Algorithm::RabinKarp->new(6, "needle");
open my $fh, '<', "haystack.txt";
my $haystack = Algorithm::RabinKarp->new(6, $fh);
my $needle_hash = $needle->next;

while (my ($hay_hash, @pos) = $haystack->next) {
  warn "Possible match for 'needle' at @pos" 
    if $needle_hash eq $hay_hash;
}

DESCRIPTION

This is an implementation of Rabin and Karp's streaming hash, as described in "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer, Wilkerson, and Aiken. Following the suggestion of Schleimer, I am using their second equation:

$H[ $c[2..$k + 1] ] = (( $H[ $c[1..$k] ] - $c[1] ** $k ) + $c[$k+1] ) * $k

The results of this hash encodes information about the next k values in the stream (hense k-gram.) This means for any given stream of length n integer values (or characters), you will get back n - k + 1 hash values.

For best results, you will want to create a code generator that filters your data to remove all unnecessary information. For example, in a large english document, you should probably remove all white space, as well as removing all capitalization.

METHODS

new($k, [FileHandle|Scalar|Coderef] )

Creates a new hash generator. If you provide a callback function, it must return the next integer value in the stream. Additionally, you may return the original position of the value in the stream (ie, you may have been filtering characters out because they're redundant.)

next()

Returns an array of three values for each call. The first element is the k-gram hash value. The second and third elements are the start and end positions, inclusive, as provided by the generator stream.

next() requires $k iterations to warm up on first call (don't worry, you don't need to remember to do that, it's handled internally.) Each successive call to next() has a complexity of O(1).

values

Returns an array containing all n - k + 1 hash values contained within the data stream, and the positions associated with them (in the same format as yielded by next.)

After calling values() the stream will be completely exhausted, causing subsequent calls to values and next() to return undef.

NOTE: You should use next if the stream you are generating hash codes for is infinite. Failure to do so will yield unexpected results.

BUGS

No explicit guards against overflow have been taken, nor any attempts to use clever bitwise operators. I recommend either using small values of k, or including a 'use bignum' line before before iterating with larger ks. In a future version of this module, I will shift to using modulo arithmetic, which will not have the same sort of overflow problems.

SEE ALSO

"Winnowing: Local Algorithms for Document Fingerprinting"
L<http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf>

AUTHOR

Norman Nunley E<lt>nnunley@gmail.comE<gt>
Nicholas Clark (Who paired with me)