NAME

String::Approx - match and substitute approximately (aka fuzzy matching)

SYNOPSIS

use String::Approx qw(amatch asubstitute aregex);

DESCRIPTION

Approximate is defined here as k-differences. One difference is an insertion, a deletion, or a substitution of one character. The k in the k-differences is the maximum number of differences.

For example 1-difference means that a match is found if there is one character too many (insertion) or one character missing (deletion) or one character changed (substitution). Those are exclusive ors: that is, not one of each type of modification but exactly one.

The default approximateness

The default approximateness is 10 % of the length of the approximate pattern or at least 1: 0-differences being the exact matching which can be done very effectively using the usual Perl function index() or normal regular expression matching.

amatch

use String::Approx qw(amatch);

amatch("PATTERN");
amatch("PATTERN", @LIST);
amatch("PATTERN", [ @MODS ]);
amatch("PATTERN", [ @MODS ], @LIST);

The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...) will be understood as literal characters, that is, a * means in regex terms \*, not "match 0 or more times".

The LIST is the list of strings to match against the pattern. If no LIST is given matches against $_.

The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: The syntax really is [ @MODS ], the square brackets [ ] must be in there and it is not a string, no quotes of any kind around the [ ]. It is an anonymous array, see perlref. See below for MODS examples.

In scalar context amatch() returns the number of successful matches. In list context amatch() returns the strings that had matches.

Example:

use String::Approx qw(amatch);

open(WORDS, '/usr/dict/words') or die;

while (<WORDS>) {
    print if amatch('perl');
}

or the same ignoring case:

use String::Approx qw(amatch);

open(WORDS, '/usr/dict/words') or die;

while (<WORDS>) {
    print if amatch('perl', ['i']);
}

asubstitute

use String::Approx qw(asubstitute);

asubstitute("PATTERN", "SUBSTITUTION");
asubstitute("PATTERN", "SUBSTITUTION", @LIST);
asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ]);
asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ], @LIST);

The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...) will be understood as literal characters, that is, a * means in regex terms \*, not "match 0 or more times".

Also the SUBSTITUTION is a string, not a regular expression. Well, mostly. Most of the regular expression metanotation (., ?, *, +, ...) will be not understood as literal characters, that is, a * means in regex terms \*, not "match 0 or more times". The understood notations are

$`

the part before the approximate match

$&

the approximately matched part

$'

the part after the approximate match

The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: Yes, the syntax is really [ @MODS ], the square brackets [ ] must be in there and it is not a string, no quotes of any kind around the [ ]. It is an anonymous array, see perlref. See below for MODS examples.

The LIST is the list of strings to substitute against the pattern. If no LIST is given substitutes against $_.

In scalar context asubstitute() returns the number of successful substitutions. In list context asubstitute() returns the strings that had substitutions.

Examples:

use String::Approx qw(asubstitute);

open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
    print if asubstitute('perl', '($&)');
}

or the same ignoring case:

use String::Approx qw(asubstitute);

open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
    print if asubstitute('perl', '($&)', [ 'i' ]);
}

aregex

use String::Approx qw(aregex);

aregex("PATTERN");
aregex("PATTERN", [ @MODS ]);

Always returns a list: the list of regular expressions that can be used to approximately match the wanted pattern, given the possible modifiers. If the pattern is long or the modifiers call for large approximateness, the list will have more than one element. How to use the element past the first one, is dependent on the intended application of the regular expressions.

my (@regex) = aregex('Obi-wan Kenobi');
print $regex[0], "\n";
print $regex[1], "\n";

If the pattern is to be passed onto an external utility such as grep(1) the std modifier may be useful. If that is applied, the Perlisms (?:) and <(?=)> are avoided.

my (@regex) = aregex('Obi-wan Kenobi', [ 'std' ]);

Modifiers

The MODS argument in amatch(), asubstitute(), and aregex(), is an anonymous array (see perlref) of strings that control the matching of PATTERN. The first three modifiers, i, g, and ?, are the usual regular expression match/substitute modifiers, the rest are special for approximate matching/substitution.

i

Match/Substitute ignoring case, case-insensitively.

g

Substitute globally, that is, all the approximate matches, not just the first one.

?

Stingy match: instead of matching greedily, as much as possibly, match as little as possible.

std

Used usually only with the aregex() interface. Uses the () instead of the (?:) and does not use the ?= in the result pattern. Useful if the pattern is going to be fed to an external tool that does not understand these Perl extensions. Produces shorter regular expressions -- but they might match slower.

k

The maximum number of differences. For example 2.

Ik

The maximum number of insertions. For example 'I2'.

Dk

The maximum number of deletions. For example 'D2'.

Sk

The maximum number of substitutions. For example 'S2'.

k%

The maximum relative number of differences. For example '10%'.

Ik%

The maximum relative number of insertions. For example 'I5%'.

Dk%

The maximum relative number of deletions. For example 'D5%'.

Sk%

The maximum relative number of substitutions. For example 'S5%'.

The regular expression modifiers o m s x are not supported because their definitions for approximate matching are less than clear.

The relative number of differences is relative to the length of the PATTERN, rounded up: if, for example, the PATTERN is 'bouillabaise' and the MODS is ['20%'] the k becomes 3.

If you want to disable a particular kind of difference you need to explicitly set it to zero: for example 'D0' allows no deletions.

In case of conflicting definitions the later ones silently override, for example:

[2, 'I3', 'I1']

equals

['I1', 'D2', 'S2']

EXAMPLES

The following examples assume the following template:

use String::Approx qw(amatch asubstitute);

open(WORDS, "/usr/dict/words") or die;
while (<WORDS>) {
	# <---
}

and the following examples just replace the above '# <---' line.

Matching from the $_

Match 'perl' with one difference
print if amatch('perl');

The one difference is automatically the result in this case because first the rule of the 10 % of the length of the pattern ('perl') is used and then the at least 1 rule.

Match 'perl' with case ignored
print if amatch('perl', [ 'i' ]);

The case is ignored in matching (i). NOTE: this option halves the speed.

Match 'perl' with one insertion
print if amatch('perl', [ '0', 'I1' ]);

The one insertion is easiest achieved with first disabling any approximateness (0) and then enabling one insertion (I1).

Match 'perl' with zero deletions
print if amatch('perl', [ 'D0' ]);

The zero deletion is easily achieved with simply disabling any deletions (D0), the other types of differences, the insertions and substitutions, are still enabled.

Stingy matching
print if amatch('perl', [ '?' ]);

Match stingily: as little is matched as possible, as opposed to the default greedy matching, where as much is matched as possible.

Substitute 'perl' approximately with HTML emboldening
print if asubstitute('perl', '<B>$&</B>', [ 'g' ]);

All (g) of the approximately matching parts of the input are surrounded by the HTML emboldening markup.

Stingy substitution
print if asubstitute('perl', '<B>$&</B>', [ '?' ]);

Substitution is now stingy: as little is substituted as possible, as opposed to the default greedy substitution, where as much is substituted as possible. When stingy the '$&' naturally tends to be shorter than when greedy -- and the '&`' and the '$'' respectively longer.

Matching from a list

The above examples match against the default variable $_. The rest of the examples show how the match from a list. The template is now:

use String::Approx qw(amatch asubstitute);

open(WORDS, "/usr/dict/words") or die;
@words = <words>;
# <---

and the examples still go where the '# <---' line is.

Match 'perl' with one difference from a list
@matched = amatch('perl', @words);

The @matched contains the elements of the @words that matched approximately.

Substitute 'perl' approximately with HTML emphasizing from a list
@substituted = asubstitute('perl', '<EM>$&</EM>', [ 'g' ], @words);

The @substituted contains with all (g) the substitutions the elements of the @words that matched approximately.

DIAGNOSTICS

amatch: $_ is undefined: what are you matching against?
asubstitute: $_ is undefined: what are you matching against?

These happen when you have nothing in $_ and try to amatch() or asubstitute(). Perhaps you are using the Perl option -e but you did forget the Perl option -n?

amatch: too long pattern.

This happens when the pattern is too long for matching.

When matching long patterns, String::Approx attempts to partition the match. In other words, it tries to do the matching incrementally in smaller parts.

If this fails the above message is shown. Please try using shorter match patterns.

See below for "Pattern length" in LIMITATIONS for more detailed explanation why this happens.

asubstitute: too long pattern.

This happens when the pattern is too long for substituting.

The partitioning scheme explained above that is used for matching long patterns cannot, sadly enough, be used substituting.

Please try using shorter substitution patterns.

See below for "Pattern length" in LIMITATIONS for more detailed explanation why this happens.

TIPS

transposes

To match transposed letters (as in "trasnposed") use substitutions = 2.

case ignorance

Avoid this, the speed is halved.

stingy matching

Also this tends to be somewhat slower.

VERSION

Version 2.4.

LIMITATIONS

Fixed pattern

The PATTERNs of amatch() and asubstitute() are fixed strings, they are not regular expressions. The SUBSTITUTION of asubstitute() is a bit more flexible than that but not by much.

Pattern length

The used approximate matching algorithm is very aggressive. In mathematical terms it is O(exp(n) * k**2). This means that when the pattern length and/or the approximateness grows the matching or substitution take much longer time and memory.

For amatch() this can be avoided by partitioning the pattern, matching it in shorter subpatterns. This makes matching a bit slower and a bit more fuzzier, more approximate. For asubstitute() this partitioning cannot be done, the absolute maximum for the substitution pattern length is 19 but sometimes, for example it the approximateness is increased, even shorter patterns are too much. When this happens, you must use shorter patterns.

Speed

We are still now at release 2.3 about 100 times slower than agrep.

If you do not know what agrep is: it is a program like the UNIX grep for searching text from within files but it knows, among other things, how to do approximate matching. NOTE: all these speeds were measured in one particular system using one particular set of tests: your mileage will vary.

Incompatibilities with String::Approx v1.*

If you have been using regular expression modifiers (i, g) you lose. Sorry about that. The syntax simply is not compatible. I had to choose between having amatch() match and asubstitute() substitute elsewhere than just in $_ and the old messy way of having an unlimited number of modifiers. The first need won.

There is a backward compability mode, though, if you do not want to change your amatch() and asubstitute() calls. You have to change your use line, however:

use String::Approx qw(amatch compat1);

That is, you must add the compat1 symbol if you want to be compatible with the String::Approx version 1 call syntax.

AUTHOR

Jarkko Hietaniemi <jhi@iki.fi>

ACKNOWLEDGEMENTS

Alberto Fontaneda C<E<lt>alberfon@ctv.esE<gt>>
Dmitrij Frishman C<E<lt>frishman@mips.biochem.mpg.deE<gt>>
Lars Gregersen C<E<lt>lars.gregersen@private.dkE<gt>>
Helmut Jarausch C<E<lt>jarausch@IGPM.Rwth-Aachen.DEE<gt>>
Håkan Kjellerstrand C<E<lt>hakank@netch.seE<gt>>
Nathan Torkington C<E<lt>gnat@frii.comE<gt>>

Alberto Fontaneda and Dmitrij Frishman found a bug in long patterns, suggested a test, and tested the patch.

Lars Gregersen saw String::Approx 2.2 and Håkan Kjellerstrand's MakeRegex and that moment he experienced a genuine Eureka. The result: up to thirty times faster String::Approx. (MakeRegex was used for completely other purposes).

Helmut Jarausch notice that 2.3 asubst() failed its test case in 5.004_50+.

Nathan Torkington is to blame for the new API of release 2. :-)

1 POD Error

The following errors were encountered while parsing the POD:

Around line 487:

Non-ASCII character seen before =encoding in 'Håkan'. Assuming CP1252