NAME

Text::SpeedyFx - tokenize/hash large numbers of strings efficiently

VERSION

version 0.005

SYNOPSIS

use Data::Dumper;
use Text::SpeedyFx;

my $sfx = Text::SpeedyFx->new;

my $words_bag = $sfx->hash('To be or not to be?');
print Dumper $words_bag;
#$VAR1 = {
#          '1422534433' => '1',
#          '4120516737' => '2',
#          '1439817409' => '2',
#          '3087870273' => '1'
#        };

my $feature_vector = $sfx->hash_fv("thats the question", 8);
print unpack('b*', $feature_vector);
# 01001000

DESCRIPTION

An XS implementation of a very fast combined parser/hasher that works well on a variety of bag-of-words problems.

The original implementation is in Java; it was adapted here for better Unicode compliance.

METHODS

new([$seed])

Initializes the parser/hasher, optionally using the specified $seed (default: 1).

hash($string)

Parses $string and returns a hash reference whose keys are the hashed tokens and whose values are their respective counts. Note that this is the slowest form, due to the (computational) complexity of the Perl hash structure itself: hash_fv() is 147% faster, while hash_min() is 175% faster.
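As an illustration of a bag-of-words use, the sketch below compares two bags of the shape hash() returns (hashed token => count) by cosine similarity. The helper name bag_cosine and the second bag are hypothetical; the first bag is copied from the SYNOPSIS output. In real use the two hash references would come from $sfx->hash($text).

```perl
use strict;
use warnings;

# Cosine similarity between two bags of words shaped like the
# return value of hash(): { hashed_token => count, ... }.
# bag_cosine is a hypothetical helper, not part of Text::SpeedyFx.
sub bag_cosine {
    my ($x, $y) = @_;
    my ($dot, $norm_x, $norm_y) = (0, 0, 0);

    $norm_x += $_ ** 2 for values %$x;
    $norm_y += $_ ** 2 for values %$y;

    # Only tokens present in both bags contribute to the dot product.
    for my $token (keys %$x) {
        $dot += $x->{$token} * $y->{$token} if exists $y->{$token};
    }

    return 0 unless $norm_x && $norm_y;    # an empty bag matches nothing
    return $dot / sqrt($norm_x * $norm_y);
}

# First bag: the SYNOPSIS output for 'To be or not to be?'.
my $bag_a = {
    '1422534433' => 1,
    '4120516737' => 2,
    '1439817409' => 2,
    '3087870273' => 1,
};

# Second bag: made-up sample data sharing two hashed tokens with $bag_a.
my $bag_b = {
    '4120516737' => 1,
    '1439817409' => 1,
};

printf "%.3f\n", bag_cosine($bag_a, $bag_b);    # prints 0.894
```

Because the keys are already hashed, the comparison never needs the original tokens, only the two hash references.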

hash_fv($string, $n)

Parses $string and returns a feature vector (a string of bits) with length $n. $n should be a multiple of 8, because the resulting feature vector occupies ceil($n / 8) bytes. The feature vector format can be useful in a Bloom filter implementation, for instance.
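The following sketch shows a few ways such a packed bit string can be handled with core Perl only. It assumes each token sets one bit position, as the Bloom filter remark suggests; the helper names (fv_bytes, fv_popcount, fv_contains) and the query vector are hypothetical, and the document vector reuses the bits printed in the SYNOPSIS. In real use both vectors would come from $sfx->hash_fv($text, $n) with the same $n.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Byte length of a feature vector requested with $n bits.
sub fv_bytes {
    my ($n) = @_;
    return ceil($n / 8);
}

# Number of set bits, using unpack's checksum prefix.
sub fv_popcount {
    my ($fv) = @_;
    return unpack('%32b*', $fv);
}

# Bloom-filter-style test: true if every bit set in $query is also
# set in $filter. Both must be plain strings of the same length;
# "&" on two strings is a bitwise string AND.
sub fv_contains {
    my ($filter, $query) = @_;
    return ($filter & $query) eq $query;
}

# Same bit pattern as the SYNOPSIS output (unpack 'b*' order).
my $doc   = pack('b*', '01001000');
# Hypothetical query vector with a single bit set.
my $query = pack('b*', '01000000');

print fv_bytes(8), "\n";                               # 1
print fv_popcount($doc), "\n";                         # 2
print fv_contains($doc, $query) ? "maybe\n" : "no\n";  # maybe
```

As with any Bloom filter, a positive answer only means the query may be present, while a negative answer is definite.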

hash_min($string)

Parses $string and returns the token hash with the lowest value. Useful in MinHash implementations. See also the included minhash_cmp utility.
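A minimal MinHash sketch along these lines: build one signature per text by calling hash_min() under several seeds, then estimate similarity as the fraction of matching signature positions. This assumes that objects created with different seeds behave as independent hash functions, which this documentation does not state. The helper minhash_similarity and the sample signatures are hypothetical, and this is not necessarily how minhash_cmp works.

```perl
use strict;
use warnings;

# Estimated Jaccard similarity of two MinHash signatures: the share
# of positions where both texts produced the same minimum hash.
# minhash_similarity is a hypothetical helper.
sub minhash_similarity {
    my ($sig_a, $sig_b) = @_;
    die "signatures must have the same non-zero length\n"
        unless @$sig_a && @$sig_a == @$sig_b;

    my $same = grep { $sig_a->[$_] == $sig_b->[$_] } 0 .. $#$sig_a;
    return $same / @$sig_a;
}

# With the module installed, signatures could be built like this
# (one hasher per seed, reused for every text):
#
#   my @hashers = map { Text::SpeedyFx->new($_) } 1 .. 4;
#   my @sig     = map { $_->hash_min($text) } @hashers;

# Made-up signatures standing in for two texts.
my @sig_a = (11, 42, 7, 93);
my @sig_b = (11, 40, 7, 95);

print minhash_similarity(\@sig_a, \@sig_b), "\n";    # prints 0.5
```

More seeds give a finer-grained estimate at the cost of one extra pass over each text per seed.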

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
