NAME

Text::Prefix::XS - Fast prefix searching

SYNOPSIS

use Text::Prefix::XS;
my @haystacks = qw(
    garbage
    blarrgh
    FOO
    meh
    AA-ggrr
    AB-hi!
);

my @needles = qw(AAA AB FOO FOO-BAR);

my $search = prefix_search_create( map uc($_), @needles );

my %seen_hash;

foreach my $haystack (@haystacks) {
    if(my $prefix = prefix_search($search, $haystack)) {
        $seen_hash{$prefix}++;
    }
}

$seen_hash{'FOO'} == 1;

#Compare to:
my $re = join('|', map quotemeta $_, @needles);
$re = qr/^($re)/;

foreach my $haystack (@haystacks) {
    my ($match) = ($haystack =~ $re);
    if($match) {
        $seen_hash{$match}++;
    }
}
$seen_hash{'FOO'} == 1;

DESCRIPTION

This module implements something of an trie algorithm for matching (and extracting) prefixes from text strings.

A common application I had was to pre-filter lots and lots of text for a small amount of preset prefixes.

Interestingly enough, the quickest solution until I wrote this module was to use a large regular expression (as in the synopsis)

FUNCTIONS

The interface is relatively simple. This is alpha software and the API is subject to change

prefix_search_create(@prefixes)

Create an opaque prefix search handle. It returns a thingy, which you should keep around.

Internally it will order the elements in the list, with the longest prefix being first.

It will then construct a search trie using a variety of caching and lookup layers.

prefix_search($thingy, $haystack)

Will check $haystack for any of the prefixes in @needles passed to "prefix_search_create". If $haystack has a prefix, it will be returned by this function; otherwise, the return value is undef

PERFORMANCE

In most normal use cases, Text::Prefix::XS will outperform any other module or search algorithm.

Specifically, this module is intended for a pessimistic search mechanism, where most of the input is assumed not to match (which is usually the case anyway).

The ideal position of Text::Prefix::XS would reside between raw but delimited user input, and more complex searching and processing algorithms. This module acts as a layer between those.

In addition to a trie, this module also uses a very fast sparse array to check characters in the input against an index of known characters at the given position. This is much quicker than a hash lookup.

See the trie.pl script included with this distribution for detailed benchmark comparison methods

Here are a bunch of numbers. The entries are in the format of

[capture (Y/N)] NAME DURATION MATCHES

Where capture means whether the test was also able to return the prefix which matched. MATCHES is the amount of matches returned.

Additionally, each test has a few parameters defining the input. These are:

TERMS: The amount of search terms
TERM_MIN: The minimum length of a term
TERM_MAX: The maximum length of a term
INPUT: The count of input strings which will be checked to see if they are prefixed with any of the TERMS. The strings are each exactly one character longer than TERM_MAX

Sample input is taken by making a sha1_hex string of each number from 0 until TERMS, and then encoding that output into Base64, ensuring that both the terms and the input get a diversity of the ASCII charset.

A few methods were benchmarked, and are listed as keys:

Perl-Trie: A generic implementation of a search trie in pure perl
TMFA: Text::Match::FastAlternatives match_at function
perl-re: Generic perl regex. The capturing version is qr/^(term1|term2)/, and the non-capturing version is qr/^(?:term1|term2)/, where the terms are joined together in a list2re fashion.
RE2: Same as perl-re, except using re::engine::RE2
TXS: This module.

Generated INPUT=2000000 TERMS=20 TERM_MIN=3 TERM_MAX=6
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	2.42s	M=34768
[N] TMFA       	1.11s	M=34768
[N] perl-re    	1.27s	M=34768
[N] RE2        	0.96s	M=34768
[Y] perl-re    	1.45s	M=34768
[Y] RE2        	2.95s	M=34768
[Y] TXS        	0.53s	M=34768

Generated INPUT=2000000 TERMS=50 TERM_MIN=10 TERM_MAX=16
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	2.32s	M=50
[N] TMFA       	1.07s	M=50
[N] perl-re    	1.27s	M=50
[N] RE2        	0.93s	M=50
[Y] perl-re    	1.47s	M=50
[Y] RE2        	1.14s	M=50
[Y] TXS        	0.55s	M=50

Generated INPUT=2000000 TERMS=49 TERM_MIN=2 TERM_MAX=16
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	17.70s	M=420699
[N] TMFA       	1.10s	M=420699
[N] perl-re    	1.35s	M=420699
[N] RE2        	0.97s	M=420699
[Y] perl-re    	1.70s	M=420699
[Y] RE2        	4.98s	M=420699
[Y] TXS        	1.62s	M=420699


Generated INPUT=2000000 TERMS=10 TERM_MIN=5 TERM_MAX=10
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	1.99s	M=265
[N] TMFA       	1.07s	M=265
[N] perl-re    	1.20s	M=265
[N] RE2        	0.90s	M=265
[Y] perl-re    	1.43s	M=265
[Y] RE2        	2.70s	M=265
[Y] TXS        	0.45s	M=265

Generated INPUT=2000000 TERMS=100 TERM_MIN=3 TERM_MAX=25
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	10.35s	M=22269
[N] TMFA       	1.15s	M=22269
[N] perl-re    	1.40s	M=22269
[N] RE2        	1.06s	M=22269
[Y] perl-re    	1.58s	M=22269
[Y] RE2        	2.06s	M=22269
[Y] TXS        	1.10s	M=22269

Generated INPUT=2000000 TERMS=200 TERM_MIN=5 TERM_MAX=25
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	3.81s	M=1325
[N] TMFA       	1.16s	M=1325
[N] perl-re    	1.30s	M=1325
[N] RE2        	1.07s	M=1325
[Y] perl-re    	1.56s	M=1325
[Y] RE2        	1.38s	M=1325
[Y] TXS        	0.63s	M=1325

Generated INPUT=2000000 TERMS=8 TERM_MIN=2 TERM_MAX=5
CAP   NAME       DUR	MATCH
[Y] Perl-Trie  	1.79s	M=22168
[N] TMFA       	1.15s	M=22168
[N] perl-re    	1.26s	M=22168
[N] RE2        	1.01s	M=22168
[Y] perl-re    	1.56s	M=22168
[Y] RE2        	2.48s	M=22168
[Y] TXS        	0.49s	M=22168

I've mainly tested this on Debian's 5.10 - for newer perls, this module performs better, and for el5 5.8, The differences are a bit lower. TBC

CAVEATS

I have yet to figure out a way to test this properly with threads. Currently the trie data structure is stored as a private perl HV, and I'm not sure what happens when it's cloned across threads.

This algorithm performs quite poorly when matches are more likely. HOWEVER, in the case where there is a desire to extract the matched prefix, the overhead in doing so with Regular Expressions outweighs the performance hit of Text::Prefix::XS, making Text::Prefix::XS still effectively faster.

Search prefixes and search input is currently restricted to printable ASCII characters

Search terms may not exceed 256 characters. You can increase this limit (at the cost of more memory) by changing the #define of CHARTABLE_MAX in the XS code and recompiling.

AUTHOR AND COPYRIGHT

You may use and distribute this software under the same terms, conditions, and licensing as Perl itself.

To install Text::Prefix::XS, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Text::Prefix::XS

CPAN shell

perl -MCPAN -e shell
install Text::Prefix::XS

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)