NAME
Text::SpeedyFx - tokenize/hash large numbers of strings efficiently
VERSION
version 0.005
SYNOPSIS
use Data::Dumper;
use Text::SpeedyFx;
my $sfx = Text::SpeedyFx->new;
my $words_bag = $sfx->hash('To be or not to be?');
print Dumper $words_bag;
#$VAR1 = {
# '1422534433' => '1',
# '4120516737' => '2',
# '1439817409' => '2',
# '3087870273' => '1'
# };
my $feature_vector = $sfx->hash_fv("thats the question", 8);
print unpack('b*', $feature_vector);
# 01001000
DESCRIPTION
An XS implementation of a very fast combined parser/hasher which works well on a variety of bag-of-words problems.
The original implementation is in Java; this version was adapted for better Unicode compliance.
METHODS
new([$seed])
Initializes the parser/hasher, optionally using the specified $seed (default: 1).
hash($string)
Parses $string and returns a hash reference whose keys are the hashed tokens and whose values are their respective counts. Note that this is the slowest form, due to the (computational) complexity of the Perl hash structure itself: hash_fv() is 147% faster, while hash_min() is 175% faster.
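As an illustration of working with the returned bag, here is a minimal sketch that computes the cosine similarity of two hash references shaped like the return value of hash(). The helper name bag_cosine and the second bag are hypothetical and not part of the module; the first bag is copied from the SYNOPSIS output.

```perl
use strict;
use warnings;

# Cosine similarity between two bags shaped like the return value of hash():
# { hashed_token => count, ... }. Hypothetical helper, not provided by the module.
sub bag_cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    for my $k (keys %$x) {
        $nx  += $x->{$k} ** 2;
        $dot += $x->{$k} * $y->{$k} if exists $y->{$k};
    }
    $ny += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;    # an empty bag is treated as similar to nothing
    return $dot / sqrt($nx * $ny);
}

# Bag copied from the SYNOPSIS output for 'To be or not to be?'.
# With the module installed it would come from Text::SpeedyFx->new->hash(...).
my $bag = {
    '1422534433' => 1,
    '4120516737' => 2,
    '1439817409' => 2,
    '3087870273' => 1,
};

# Hypothetical second bag that shares two of the token hashes above.
my $other = {
    '4120516737' => 1,
    '1439817409' => 1,
};

printf "%.3f\n", bag_cosine($bag, $bag);      # 1.000
printf "%.3f\n", bag_cosine($bag, $other);    # 0.894
```

Two bags are only comparable this way if they were produced with the same $seed, since the keys are hashed tokens rather than the tokens themselves.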
hash_fv($string, $n)
Parses $string and returns a feature vector (a string of bits) of length $n. $n is supposed to be a multiple of 8, as the resulting string is ceil($n / 8) bytes long. The feature vector format can be useful in a Bloom filter implementation, for instance.
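The following sketch shows how such a packed bit string might be inspected with core Perl only. The helpers fv_popcount and fv_contains are hypothetical names, and the subset test is just one simple Bloom-filter-style check, not necessarily how the author intended the vectors to be used. The sample vector is rebuilt from the bit string printed in the SYNOPSIS.

```perl
use strict;
use warnings;

# Number of bits set in a feature vector (a packed byte string).
sub fv_popcount {
    my ($fv) = @_;
    return unpack('%32b*', $fv);    # checksum over the bits counts the set ones
}

# True if every bit set in $query is also set in $filter, which is the
# "possibly contains" test of a Bloom filter. Both must have equal length.
sub fv_contains {
    my ($filter, $query) = @_;
    return 0 unless length($filter) == length($query);
    return ( ($filter & $query) eq $query ) ? 1 : 0;    # string bitwise AND
}

# Vector from the SYNOPSIS, rebuilt from its printed form. With the module
# installed it would come from $sfx->hash_fv("thats the question", 8).
my $fv = pack('b*', '01001000');

# Hypothetical query vectors, made up for this illustration.
my $present = pack('b*', '01000000');
my $absent  = pack('b*', '00000001');

print length($fv), "\n";                   # 1 (ceil(8 / 8) bytes)
print fv_popcount($fv), "\n";              # 2
print vec($fv, 1, 1), "\n";                # 1 (bit at index 1 is set)
print fv_contains($fv, $present), "\n";    # 1
print fv_contains($fv, $absent), "\n";     # 0
```

The index passed to vec() with a width of 1 matches the character position in the string produced by unpack('b*', ...).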
hash_min($string)
Parses $string and returns the lowest hash value among its tokens. Useful in a MinHash implementation. See also the included minhash_cmp utility.
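A minimal MinHash sketch built on this method might look as follows. It assumes that instances created with different seeds behave as sufficiently independent hash functions, which this documentation does not state; the helper names minhash_signature and signature_similarity are hypothetical, and the bundled minhash_cmp utility may work differently. The sample signatures are made up so that the comparison step runs without the module.

```perl
use strict;
use warnings;

# Builds a signature of $k minima, one per seed (seeds 1 .. $k).
# Assumption: different seeds give usable, independent hash functions.
# The module is loaded lazily so the rest of this sketch runs without it.
sub minhash_signature {
    my ($text, $k) = @_;
    require Text::SpeedyFx;
    return [ map { Text::SpeedyFx->new($_)->hash_min($text) } 1 .. $k ];
}

# Fraction of positions where two equal-length signatures agree.
# In MinHash this fraction estimates the Jaccard similarity of the token sets.
sub signature_similarity {
    my ($sig_a, $sig_b) = @_;
    die "signatures differ in length\n" unless @$sig_a == @$sig_b;
    return 0 unless @$sig_a;
    my $same = grep { $sig_a->[$_] == $sig_b->[$_] } 0 .. $#$sig_a;
    return $same / @$sig_a;
}

# Hypothetical signatures, made up for this illustration. With the module
# installed they would come from minhash_signature($text, 4).
my $sig_a = [ 3, 7, 9, 4 ];
my $sig_b = [ 3, 8, 9, 1 ];

print signature_similarity($sig_a, $sig_b), "\n";    # 0.5
```

Longer signatures give a less noisy estimate at the cost of one parsing pass per seed. When many strings are compared, creating the $k instances once and reusing them avoids constructing a new object for every call.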
REFERENCES
"Extremely Fast Text Feature Extraction for Classification and Indexing" by George Forman and Evan Kirshenbaum.
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.