NAME

Text::SpeedyFx - tokenize/hash large amounts of text efficiently

VERSION

version 0.004

SYNOPSIS

use Data::Dumper;
use Text::SpeedyFx;

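# Create a parser/hasher with the default seed.
my $sfx = Text::SpeedyFx->new;

# Bag of words: hashed token => number of occurrences.
my $words_bag = $sfx->hash('To be or not to be?');
print Dumper $words_bag;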
#$VAR1 = {
#          '1422534433' => '1',
#          '4120516737' => '2',
#          '1439817409' => '2',
#          '3087870273' => '1'
#        };

# Feature vector with 5 elements.
my $feature_vector = $sfx->hash_fv("thats the question", 5);
print Dumper $feature_vector;
#$VAR1 = [
#          '0',
#          '1',
#          '0',
#          '1',
#          '0'
#        ];

DESCRIPTION

An XS implementation of a very fast combined parser/hasher that works well on a variety of bag-of-words problems.

The original implementation is in Java; this version was adapted for better Unicode compliance.

METHODS

new([$seed])

Initialize parser/hasher, optionally using a specified $seed (default: 1).

hash($string)

Parses $string and returns a hash reference whose keys are the hashed tokens and whose values are the respective counts.
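
Because two bags produced by the same object use hashed tokens as keys, they can be compared directly. The following is a minimal sketch of cosine similarity over two such hash references. cosine_similarity is a hypothetical helper, not part of Text::SpeedyFx, and the module itself is only called if it happens to be installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::SpeedyFx): cosine similarity of two
# bags of the shape hash() returns, i.e. { hashed_token => count }.
sub cosine_similarity {
    my ($x, $y) = @_;

    my ($dot, $norm_x, $norm_y) = (0, 0, 0);
    for my $key (keys %$x) {
        $norm_x += $x->{$key} ** 2;
        $dot    += $x->{$key} * $y->{$key} if exists $y->{$key};
    }
    $norm_y += $_ ** 2 for values %$y;

    # An empty bag has no direction; treat its similarity as 0.
    return 0 unless $norm_x && $norm_y;
    return $dot / sqrt($norm_x * $norm_y);
}

# Only runs when the module is available, so the helper can be tried without it.
if (eval { require Text::SpeedyFx; 1 }) {
    my $sfx = Text::SpeedyFx->new;
    my $sim = cosine_similarity(
        $sfx->hash('To be or not to be?'),
        $sfx->hash('To be, or not?'),
    );
    printf "similarity: %.3f\n", $sim;
}
```

Both bags should come from objects created with the same seed, on the assumption that the seed changes the resulting hash values.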

hash_fv($string, $n)

Parses $string and returns a feature vector with $n elements.
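
The idea of a fixed-length feature vector can be sketched in pure Perl. This is a conceptual illustration only, not the module's algorithm: Digest::MD5 stands in for the module's own hash function, the tokenizer is a plain \w+ match, and the rule that each token sets slot (hash mod $n) to 1 is an assumption that merely agrees with the 0/1 values shown in the SYNOPSIS. sketch_feature_vector is a hypothetical name.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode_utf8);

# Conceptual sketch only: Digest::MD5 stands in for the module's own hash
# function, and the tokenizer is a plain \w+ match on the lowercased input.
sub sketch_feature_vector {
    my ($string, $n) = @_;

    my @vector = (0) x $n;
    for my $token (lc($string) =~ /\w+/g) {
        # Take the first 32 bits of the digest as an unsigned integer.
        my $hash = unpack 'N', md5(encode_utf8($token));
        # Assumption: a token marks slot (hash mod n); collisions share a slot.
        $vector[$hash % $n] = 1;
    }
    return \@vector;
}

my $fv = sketch_feature_vector('thats the question', 5);
print join(',', @$fv), "\n";
```

With a small $n, distinct tokens can land in the same slot, so the number of set elements may be lower than the number of distinct tokens.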

hash_min($string)

Parses $string and returns the token hash with the lowest value.
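
As a rough illustration of what "lowest value" means, the sketch below picks the smallest key from a bag shaped like the output of hash(). It assumes hash_min() draws on the same token hashes that hash() uses as keys, which the source does not state. min_key is a hypothetical helper and the module is not called.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not part of Text::SpeedyFx): numerically lowest key
# of a bag shaped like hash() output. Returns undef for an empty bag.
sub min_key {
    my ($bag) = @_;
    return min(keys %$bag);
}

# The bag printed in the SYNOPSIS for 'To be or not to be?'.
my %bag = (
    1422534433 => 1,
    4120516737 => 2,
    1439817409 => 2,
    3087870273 => 1,
);
print min_key(\%bag), "\n";    # prints 1422534433
```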

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.