NAME
minhash_cmp - uses MinHash & SpeedyFx to compare large text data
VERSION
version 0.014
SYNOPSIS
minhash_cmp [options] FILE1 FILE2
DESCRIPTION
MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
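To make the idea concrete, here is a minimal Python sketch of MinHash similarity estimation. It is an illustration of the general technique only, not the implementation used by minhash_cmp: the function names (tokens, signature, estimate_similarity), the word-level tokenization, and the use of salted BLAKE2b in place of SpeedyFx are all assumptions made for this example.

```python
import hashlib
import re


def tokens(text):
    """Split text into a set of lowercase word tokens (hypothetical tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def signature(items, k, seed=0):
    """Return a k-element MinHash signature for a non-empty set of strings.

    Each of the k hash functions is simulated by salting BLAKE2b with a
    different value; the signature keeps the minimum hash seen per function.
    """
    sig = []
    for i in range(k):
        salt = (seed + i).to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in items
        ))
    return sig


def estimate_similarity(a, b, k=400, seed=0):
    """Estimate the Jaccard similarity of sets a and b.

    The fraction of signature positions where both sets share the same
    minimum approximates |a & b| / |a | b|.
    """
    sig_a = signature(a, k, seed)
    sig_b = signature(b, k, seed)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / k


if __name__ == "__main__":
    left = tokens("the quick brown fox jumps over the lazy dog")
    right = tokens("the quick brown cat jumps over the lazy dog")
    print(estimate_similarity(left, right))
```

The estimate improves as k grows, at the cost of computing more hashes per token.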
OPTIONS
- --help
-
Show this help message.
- --epsilon
-
Expected error value used to compute the number of different hash functions (default: 0.05).
- --k
-
Number of different hash functions to use (default: 400; overrides --epsilon).
- --seed
-
Custom seed (integer).
- --bits
-
Number of bits used to represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 covers the Basic Multilingual, Supplementary Multilingual, and Supplementary Ideographic planes.
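The defaults for --epsilon (0.05) and --k (400) are consistent with the standard MinHash relation k = 1/epsilon^2, where the expected error shrinks roughly as 1/sqrt(k). The Python sketch below assumes that relation; the function name hash_count_for_error is hypothetical, and the exact formula used by minhash_cmp is not stated in this document.

```python
import math


def hash_count_for_error(epsilon):
    """Number of hash functions for a target expected error.

    Assumes the standard MinHash relation k = 1 / epsilon**2.
    A tiny tolerance is subtracted before rounding up so that
    floating-point noise does not push exact results to k + 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return math.ceil(1.0 / epsilon ** 2 - 1e-9)


if __name__ == "__main__":
    for eps in (0.1, 0.05, 0.025):
        print(eps, hash_count_for_error(eps))
```

Under this assumption, halving the expected error quadruples the number of hash functions.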
CAVEATS
Under the bits=18 setting, each initialized hash function consumes ~500KB.
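As a rough guide, the total footprint can be estimated by multiplying the per-function cost by the number of hash functions. The Python sketch below assumes that all k functions are initialized separately and simultaneously, and that each costs the ~500KB quoted above; the function name estimated_memory_mb is hypothetical.

```python
def estimated_memory_mb(k, kb_per_function=500):
    """Rough memory estimate in MB (1 MB = 1000 KB).

    Assumes every one of the k hash functions holds its own
    table of about kb_per_function kilobytes.
    """
    return k * kb_per_function / 1000


if __name__ == "__main__":
    # Default k of 400 with bits=18.
    print(estimated_memory_mb(400))
```

With the default k of 400, this estimate comes to about 200 MB, so lowering k (or raising epsilon) may be worthwhile when bits=18 is needed.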
SEE ALSO
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2021 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.