NAME

Text::SpeedyFx - tokenize/hash large amounts of strings efficiently

VERSION

version 0.005

SYNOPSIS

use Data::Dumper;
use Text::SpeedyFx;

my $sfx = Text::SpeedyFx->new;

my $words_bag = $sfx->hash('To be or not to be?');
print Dumper $words_bag;
#$VAR1 = {
#          '1422534433' => '1',
#          '4120516737' => '2',
#          '1439817409' => '2',
#          '3087870273' => '1'
#        };

my $feature_vector = $sfx->hash_fv("thats the question", 8);
print unpack('b*', $feature_vector);
# 01001000

DESCRIPTION

XS implementation of a very fast combined parser/hasher which works well on a variety of bag-of-words problems.

The original implementation is in Java; it was adapted here for better Unicode compliance.

METHODS

new([$seed])

Initializes the parser/hasher, optionally using the specified $seed (default: 1).

hash($string)

Parses $string and returns a hash reference whose keys are the hashed tokens and whose values are their respective counts. Note that this is the slowest form, due to the (computational) complexity of the Perl hash structure itself: hash_fv() is 147% faster, while hash_min() is 175% faster.
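As an illustration of what can be done with the returned bag, here is a minimal sketch that compares two bags by cosine similarity. It does not call the module: the hash references are stand-in literals shaped like hash() output (the first is copied from the SYNOPSIS, the second is made up), and cosine_similarity() is a hypothetical helper, not part of Text::SpeedyFx.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::SpeedyFx): cosine similarity
# of two { token_hash => count } bags.
sub cosine_similarity {
    my ($bag_a, $bag_b) = @_;
    my ($dot, $norm_a, $norm_b) = (0, 0, 0);
    for my $key (keys %$bag_a) {
        $norm_a += $bag_a->{$key} ** 2;
        $dot    += $bag_a->{$key} * $bag_b->{$key} if exists $bag_b->{$key};
    }
    $norm_b += $_ ** 2 for values %$bag_b;
    return 0 unless $norm_a && $norm_b;    # an empty bag matches nothing
    return $dot / sqrt($norm_a * $norm_b);
}

# Stand-in for $sfx->hash('To be or not to be?'), copied from the SYNOPSIS.
my $bag_a = {
    '1422534433' => 1,
    '4120516737' => 2,
    '1439817409' => 2,
    '3087870273' => 1,
};

# Made-up second bag that shares two keys with the first.
my $bag_b = {
    '4120516737' => 1,
    '1439817409' => 1,
    '1111111111' => 1,
};

printf "%.3f\n", cosine_similarity($bag_a, $bag_b);
```

Because the keys are hashed tokens rather than the tokens themselves, the comparison never needs the original words, only the counts.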

hash_fv($string, $n)

Parses $string and returns a feature vector (a string of bits) of length $n. $n should be a multiple of 8, since the resulting feature vector is ceil($n / 8) bytes long. The feature vector format can be useful for implementing a Bloom filter, for instance.
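The bit string can be inspected with core Perl alone. The sketch below rebuilds the 8-bit vector shown in the SYNOPSIS with pack() instead of calling hash_fv(), then counts and lists its set bits and runs a Bloom-filter-style membership check. The $filter value and the maybe_contains() helper are made up for illustration and are not part of Text::SpeedyFx.

```perl
use strict;
use warnings;

# Stand-in for $sfx->hash_fv("thats the question", 8), rebuilt from the
# bit pattern printed in the SYNOPSIS.
my $fv = pack('b*', '01001000');

my $bytes    = length $fv;              # ceil(8 / 8) bytes
my $set_bits = unpack('%32b*', $fv);    # number of bits set to 1

# vec() uses the same bit order as unpack('b*', ...).
my @on = grep { vec($fv, $_, 1) } 0 .. 8 * $bytes - 1;

# Hypothetical Bloom-style check: true when every bit set in $query is
# also set in $filter. Both arguments must be plain strings so that "&"
# acts as a bitwise string operator.
sub maybe_contains {
    my ($filter, $query) = @_;
    return (($filter & $query) eq $query) ? 1 : 0;
}

my $filter = pack('b*', '01101010');    # made-up filter contents

print "bytes=$bytes set_bits=$set_bits on=@on\n";
print "possibly present\n" if maybe_contains($filter, $fv);
```

As with any Bloom filter, a positive answer only means "possibly present", while a negative answer is definite.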

hash_min($string)

Parses $string and returns the token hash with the lowest value. Useful in MinHash implementations. See also the included minhash_cmp utility.
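One way hash_min() might feed a MinHash comparison is sketched below. It assumes, without confirmation from this documentation, that a signature is built from one hasher per seed; the signature values are made up so that the example runs without the module, and signature_similarity() is a hypothetical helper, not the logic of minhash_cmp.

```perl
use strict;
use warnings;

# Hypothetical helper: the fraction of positions at which two MinHash
# signatures agree, which estimates the Jaccard similarity of the
# underlying token sets.
sub signature_similarity {
    my ($sig_a, $sig_b) = @_;
    die "signatures differ in length\n" unless @$sig_a == @$sig_b;
    return 0 unless @$sig_a;
    my $matches = grep { $sig_a->[$_] == $sig_b->[$_] } 0 .. $#$sig_a;
    return $matches / @$sig_a;
}

# With the module installed, a signature could be built like this
# (assumption: one hasher per seed):
#   my @sig = map { Text::SpeedyFx->new($_)->hash_min($text) } 1 .. 4;
# Made-up values stand in for real minimum hashes here.
my @sig_a = (17, 204, 33, 951);
my @sig_b = (17, 204, 78, 951);

print signature_similarity(\@sig_a, \@sig_b), "\n";
```

Longer signatures (more seeds) give a tighter estimate at the cost of one extra pass over the text per seed.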

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.