NAME

Regex::PreSuf - create regular expressions from word lists

SYNOPSIS

use Regex::PreSuf;

my $re = presuf(qw(foobar fooxar foozap));

# $re should now be 'foo(?:zap|[bx]ar)'

DESCRIPTION

The presuf() subroutine builds regular expressions out of 'word lists', that is, lists of strings. The regular expression matches the same words as the word list. These regular expressions normally run a few dozen percent faster than a simple-minded '|'-concatenation of the words.

Examples:

  • 'foobar fooxar' => 'foo[bx]ar'
  • 'foobar foozap' => 'foo(?:bar|zap)'
  • 'foobar fooar'  => 'foob?ar'
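
The claim that the generated expression matches the same words as the word list can be checked by hand. The following is a minimal sketch that does not call the module at all: it compares the pattern quoted in the SYNOPSIS against a plain '|'-concatenation. The helper same_matches() is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if both patterns accept and reject exactly
# the same candidates when anchored to the whole string.
sub same_matches {
    my ($pat_a, $pat_b, @candidates) = @_;
    for my $s (@candidates) {
        my $a_ok = $s =~ /\A(?:$pat_a)\z/ ? 1 : 0;
        my $b_ok = $s =~ /\A(?:$pat_b)\z/ ? 1 : 0;
        return 0 if $a_ok != $b_ok;
    }
    return 1;
}

my @words    = qw(foobar fooxar foozap);
my $naive    = join '|', @words;        # foobar|fooxar|foozap
my $factored = 'foo(?:zap|[bx]ar)';     # pattern quoted in the SYNOPSIS

# Compare on the words themselves plus a few strings that should not match.
print same_matches($naive, $factored, @words, qw(foo fooqar barzap))
    ? "same matches\n"
    : "patterns differ\n";
```

A check like this only covers the candidates it is given, so it is a sanity test rather than a proof of equivalence.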

The downsides:

  • The original order of the words is not necessarily respected, for example because the character class matches are collected together, separate from the '|' alternations.

  • The module blithely ignores any special meaning of regular expression metacharacters such as *?+{}[]. Please do not use them in the words: the resulting regular expression will most likely be invalid.

For the second downside there is an exception. The module has some rudimentary grasp of how to use the 'any character' metacharacter. If you call presuf() like this:

my $re = presuf({ anychar=>1 }, qw(foobar foo.ar fooxar));

# $re should now be 'foo.ar'
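
Because metacharacters in the words are not handled, it may help to screen a word list before passing it to presuf(). The following is a minimal sketch under that assumption; risky_words() and its allow_dot option are hypothetical names and not part of the module. The set of characters checked is also an assumption and can be extended.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Regex::PreSuf: return the words that
# contain a regex metacharacter. With allow_dot set, '.' is tolerated,
# on the assumption that the list will be used with anychar=>1.
sub risky_words {
    my ($opt, @words) = @_;
    my $chars = '\\*?+{}[]()|^$';            # single-quoted: one backslash
    $chars .= '.' unless $opt->{allow_dot};
    my $meta = '[' . quotemeta($chars) . ']'; # character class of literals
    return grep { /$meta/ } @words;
}

my @bad = risky_words({ allow_dot => 1 }, qw(foobar foo.ar foo*ar));
print "not safe to pass to presuf(): @bad\n" if @bad;
```

Rejecting such words up front is simpler than trying to escape them, since the module works on the words character by character.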

The module finds the common prefixes and suffixes of the words and then recursively looks at the remaining differences. However, by default only common prefixes are used, because for many languages (natural or artificial) this seems to produce the fastest matchers. To also allow suffixes, use

my $re = presuf({ suffixes=>1 }, ...);

To use only suffixes use

my $re = presuf({ prefixes=>0 }, ...);

(this implicitly enables suffixes)
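
The idea of factoring out a common prefix or suffix can be sketched in a few lines. This is not the module's implementation: it performs only a single level of factoring, does not recurse into the remaining differences, and does not build character classes. The names common_prefix() and factor_once() are hypothetical; only the option names prefixes and suffixes are borrowed from above.

```perl
use strict;
use warnings;

# Longest string that every word starts with.
sub common_prefix {
    my ($first, @rest) = @_;
    my $p = defined $first ? $first : '';
    for my $w (@rest) {
        # Shorten $p until $w starts with it; the empty string always fits.
        chop $p while index($w, $p) != 0;
    }
    return $p;
}

# Hypothetical one-level factoring of a word list.
sub factor_once {
    my ($opt, @words) = @_;
    my ($pre, $suf) = ('', '');
    if ($opt->{prefixes}) {
        $pre   = common_prefix(@words);
        @words = map { substr($_, length($pre)) } @words;
    }
    if ($opt->{suffixes}) {
        # A common suffix is a common prefix of the reversed words.
        $suf   = scalar reverse common_prefix(map { scalar reverse $_ } @words);
        @words = map { substr($_, 0, length($_) - length($suf)) } @words;
    }
    return $pre . '(?:' . join('|', @words) . ')' . $suf;
}

print factor_once({ prefixes => 1 }, qw(foobar foozap)), "\n";
print factor_once({ suffixes => 1 }, qw(reading writing)), "\n";
```

With only prefixes enabled, the first call reproduces the 'foo(?:bar|zap)' shape from the examples above; the second shows how a shared ending is pulled out instead.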

Prefix and Suffix Length

Two auxiliary subroutines are optionally exportable. WARNING: strictly speaking, these routines are intended mainly for internal use by the module, and their interface and existence are subject to change without warning.

  • ($prefix_length, %diff_chars) = prefix_length(@word_list);

    prefix_length() takes a word list and returns the length of the prefix shared by all the words (such a prefix may not exist, in which case the length is zero), and a hash whose keys are the characters that made the prefix "stop". For example, for qw(foobar fooxar) the list (3, 'b', ..., 'x', ...) will be returned.

  • ($suffix_length, %diff_chars) = suffix_length(@word_list);

    suffix_length() takes a word list and returns the length of the suffix shared by all the words (such a suffix may not exist, in which case the length is zero), and a hash whose keys are the characters that made the suffix "stop". For example, for qw(foobar barbar) the list (3, 'o', ..., 'r', ...) will be returned.
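
The return values described above can be illustrated with an independent reimplementation. This is a minimal sketch of the documented behavior, not the module's code: sketch_prefix_length() and sketch_suffix_length() are hypothetical names, and since the hash values are not specified above, the use of occurrence counts as values is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for prefix_length(): returns the length of the
# prefix shared by all words, followed by a hash keyed by the characters
# at which the words stop agreeing (values: occurrence counts, assumed).
sub sketch_prefix_length {
    my @words = @_;
    return (0) unless @words;
    my $n = 0;
    while (1) {
        my %seen;
        my $ended = 0;
        for my $w (@words) {
            if (length($w) > $n) { $seen{ substr($w, $n, 1) }++ }
            else                 { $ended = 1 }
        }
        # Stop when some word has ended or the words disagree at offset $n.
        return ($n, %seen) if $ended || keys(%seen) != 1;
        $n++;
    }
}

# The suffix case is the prefix case on reversed words.
sub sketch_suffix_length {
    return sketch_prefix_length(map { scalar reverse $_ } @_);
}

my ($plen, %pdiff) = sketch_prefix_length(qw(foobar fooxar));
my ($slen, %sdiff) = sketch_suffix_length(qw(foobar barbar));
print "prefix: $plen [", join(' ', sort keys %pdiff), "]\n";  # prefix: 3 [b x]
print "suffix: $slen [", join(' ', sort keys %sdiff), "]\n";  # suffix: 3 [o r]
```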

Debugging

In case you want to flood your session with debug messages, you can turn on debugging by saying

Regex::PreSuf::debug(1);

How to turn them off again is left as an exercise for the kind reader.

COPYRIGHT

Jarkko Hietaniemi

This code is distributed under the same copyright terms as Perl itself.