NAME
String::Approx - Perl extension for approximate matching (fuzzy matching)
SYNOPSIS
use String::Approx 'amatch';
print if amatch("foobar");
@matches = amatch("xyzzy", @inputs);
@matches = amatch("plugh", ['2'], @inputs);
DESCRIPTION
String::Approx lets you match and substitute strings approximately. With approximateness you can emulate errors: typing errorrs, speling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot).
The measure of approximateness is the Levenshtein edit distance. It is the total number of "edits": insertions,
word world
deletions,
monkey money
and substitutions
sun fun
required to transform a string to another string. For example, to transform "lead" to "gold", you need three edits:
lead gead goad gold
The edit distance of "lead" and "gold" is therefore three.
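The definition above can be checked with a few lines of plain Perl. The following is a minimal sketch of the textbook dynamic-programming recurrence for the Levenshtein distance; the function name edit_distance is made up for this illustration and is not part of String::Approx, which uses its own C implementation.

```perl
use strict;
use warnings;

# Levenshtein edit distance between two strings, computed with the
# classic dynamic-programming recurrence, keeping one row at a time.
# (Illustrative helper; not a String::Approx function.)
sub edit_distance {
    my ($from, $to) = @_;
    my @s = split //, $from;
    my @t = split //, $to;

    # Distance from the empty prefix of $from to each prefix of $to:
    # that many insertions.
    my @prev = (0 .. scalar @t);

    for my $i (1 .. @s) {
        my @cur = ($i);    # $i deletions reach the empty prefix of $to
        for my $j (1 .. @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                        # match or substitution
            $best = $prev[$j] + 1     if $prev[$j] + 1 < $best;      # deletion
            $best = $cur[$j - 1] + 1  if $cur[$j - 1] + 1 < $best;   # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

printf "%-6s -> %-6s : %d\n", @$_, edit_distance(@$_)
    for ["word", "world"], ["monkey", "money"], ["sun", "fun"], ["lead", "gold"];
```

The first three pairs are one edit apart (an insertion, a deletion, and a substitution respectively), and "lead" to "gold" takes three.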
MATCH
use String::Approx 'amatch';
amatch("pattern")
amatch("pattern", @inputs)
amatch("pattern", [ modifiers ])
amatch("pattern", [ modifiers ], @inputs)
Match pattern approximately. In list context, return the elements of @inputs that match. In scalar context, return true if any of the inputs match and false if none do. If no inputs are given, match against $_.
Notice that the pattern is a string. Not a regular expression. None of the regular expression notations (^, ., *, and so on) work. They are characters just like the others.
MODIFIERS
With the modifiers you can control the amount of approximateness and certain other variables. The modifiers are one or more strings, for example "i". The modifiers are inside an anonymous array: the [ ] in the syntax are not notational, they really do mean [ ], for example [ "i", "2" ]. ["2 i"] would be identical.
The implicit default approximateness is 10%, rounded up. In other words: every tenth character in the pattern may be an error, an edit. You can explicitly set the maximum approximateness by supplying a modifier like
number
number%
Examples: "3", "15%".
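As a rough illustration of how such a modifier translates into an edit budget, here is a small sketch. The helper max_edits is hypothetical and not part of the module, and it assumes that explicit percentages are rounded up in the same way as the documented 10% default.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of String::Approx): turn a modifier such
# as "3" or "15%" into a maximum number of edits for a given pattern.
# With no modifier, fall back to the documented default of 10%.
# Assumption: explicit percentages are rounded up like the default.
sub max_edits {
    my ($pattern, $modifier) = @_;
    $modifier = "10%" unless defined $modifier;

    my ($number, $percent) = $modifier =~ /^(\d+)(%?)$/
        or die "unrecognized modifier '$modifier'\n";

    # A plain number is already an edit count.
    return $number + 0 unless $percent;

    # Integer ceiling of length * number / 100, avoiding floating point.
    return int((length($pattern) * $number + 99) / 100);
}

print "abcd, default: ", max_edits("abcd"), "\n";
print "abcd, '3':     ", max_edits("abcd", "3"), "\n";
print "20 chars, 15%: ", max_edits("x" x 20, "15%"), "\n";
```

Under these assumptions a four-character pattern gets one edit by default, because 10% of four is rounded up to one.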
Using a similar syntax you can separately control the maximum number of insertions, deletions, and substitutions by prefixing the numbers with I, D, or S, like this:
Inumber
Inumber%
Dnumber
Dnumber%
Snumber
Snumber%
Examples: "I2", "D20%", "S0".
You can ignore case ("A" becomes equal to "a" and vice versa) by adding the "i" modifier.
For example
[ "i 25%", "S0" ]
means ignore case, allow every fourth character to be "an edit", but allow no substitutions. (See NOTES about disallowing substitutions or insertions.)
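To make the modifier syntax concrete, the following sketch splits modifier strings into their parts. It is a hypothetical parser written for this illustration (the name parse_modifiers and the hash keys are invented), it covers only the modifiers described in this section, and the module's own parsing may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical parser for the modifier syntax described above.
# Returns a hash reference with these (made-up) keys:
#   ignore_case  1 if "i" was given, else 0
#   max          overall limit, e.g. "2" or "25%"
#   I, D, S      per-operation limits, e.g. "0" or "20%"
sub parse_modifiers {
    my @strings = @_;
    my %opt  = (ignore_case => 0);
    my $text = join ' ', @strings;

    # Walk the text token by token. Whitespace between tokens is
    # optional, so both "2 i" and "2D1S1" are accepted.
    while ($text =~ /\G\s*(?:(i)|([IDS]?)(\d+%?))/gc) {
        if (defined $1) {
            $opt{ignore_case} = 1;
        }
        else {
            $opt{ $2 eq '' ? 'max' : $2 } = $3;
        }
    }

    # Anything left over (other than trailing whitespace) is an error.
    $text =~ /\G\s*\z/gc
        or die "unrecognized modifier in '$text'\n";

    return \%opt;
}

my $opt = parse_modifiers("i 25%", "S0");
print "$_ = $opt->{$_}\n" for sort keys %$opt;
```

Joining all strings before tokenizing is what makes [ "i", "2" ] and [ "2 i" ] equivalent in this sketch.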
SUBSTITUTE
use String::Approx 'asubstitute';
asubstitute("pattern", "replacement")
asubstitute("pattern", "replacement", @inputs)
asubstitute("pattern", "replacement", [ modifiers ])
asubstitute("pattern", "replacement", [ modifiers ], @inputs)
Substitute approximate pattern with replacement and return copies of @inputs, the substitutions having been made on the elements that did match the pattern. If no inputs are given, substitute in $_. The replacement can contain magic strings $&, $`, $' that stand for the matched string, the string before it, and the string after it, respectively. All the other arguments are as in amatch(), plus one additional modifier, "g", which means substitute globally (all the matches in an element and not just the first one, as is the default).
See "BAD NEWS" about the unfortunate stinginess of asubstitute().
NOTES
Because matching is by substrings, not by whole strings, insertions and substitutions often produce very similar results: "abcde" matches "axbcde" either by insertion or substitution of "x".
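Substring matching can be sketched with a small change to the usual edit-distance recurrence: a match may start anywhere in the text at no cost and may end anywhere. The code below is a textbook formulation written for illustration only; the name best_substring_distance is invented, and this is not claimed to be the algorithm the module uses.

```perl
use strict;
use warnings;

# Smallest edit distance between $pattern and any substring of $text.
# Same recurrence as the Levenshtein distance, except that the row for
# the empty pattern is all zeros (a match may start anywhere) and the
# result is the minimum of the last row (a match may end anywhere).
# (Illustrative helper; not a String::Approx function.)
sub best_substring_distance {
    my ($pattern, $text) = @_;
    my @p = split //, $pattern;
    my @t = split //, $text;

    my @prev = (0) x (@t + 1);
    for my $i (1 .. @p) {
        my @cur = ($i);    # pattern prefix against an empty substring
        for my $j (1 .. @t) {
            my $cost = $p[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                        # match or substitution
            $best = $prev[$j] + 1     if $prev[$j] + 1 < $best;      # pattern char unmatched
            $best = $cur[$j - 1] + 1  if $cur[$j - 1] + 1 < $best;   # extra text char
            push @cur, $best;
        }
        @prev = @cur;
    }

    my $min = $prev[0];
    for my $d (@prev) { $min = $d if $d < $min }
    return $min;
}

print best_substring_distance("abcde", "axbcde"), "\n";
print best_substring_distance("abcde", "xxabcdexx"), "\n";
```

For "abcde" against "axbcde" the best distance is one, and several substrings achieve it: "axbcde" (an inserted "x"), "xbcde" (a substituted "x"), and "bcde" (a missing "a"), which is why the individual edit types are hard to tell apart.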
The maximum edit distance is also the maximum number of edits. That is, the "I2" in
amatch("abcd", ["I2"])
is useless because the maximum edit distance is (implicitly) 1. You may have meant to say
amatch("abcd", ["2D1S1"])
or something like that.
If you want to simulate transposes
feet fete
you need to allow an edit distance of at least two, because in terms of our edit primitives a transpose is one deletion and one insertion.
VERSION
3.0.
CHANGES FROM VERSION 2
GOOD NEWS
- Version 3 is 2-3 times faster than version 2
- No pattern length limitation
  The algorithm is independent of the pattern length: its time complexity is O(kn), where k is the number of edits and n the length of the text (input).
BAD NEWS
- You do need a C compiler to install the module
  Perl's regular expressions are no longer used; instead a faster and more scalable algorithm written in C is used.
- asubstitute() is now always stingy
  The string matched and substituted is now always stingy, as short as possible. It used to be as long as possible. This is an unfortunate change stemming from switching the matching algorithm. Example: with an edit distance of two, substituting for "word" in "cork" and "wool" previously matched "cork" and "wool". Now it matches "or" and "wo": as little as possible or, in other words, with as much approximateness, as many edits, as possible. Because there is no need to match the "c" of "cork", it is not matched.
- No more aregex() because regular expressions are no longer used
- No more compat1 for String::Approx version 1 compatibility
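The difference between the old "as long as possible" and the new "as short as possible" behaviour of asubstitute() can be illustrated by brute force. The sketch below is not the algorithm of either version; approx_match and its arguments are invented for this illustration, and it simply tries every substring and keeps the shortest or the longest one within the allowed edit distance.

```perl
use strict;
use warnings;

# Plain Levenshtein distance, one row at a time.
sub edit_distance {
    my ($from, $to) = @_;
    my @s = split //, $from;
    my @t = split //, $to;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. @s) {
        my @cur = ($i);
        for my $j (1 .. @t) {
            my $best = $prev[$j - 1] + ($s[$i - 1] eq $t[$j - 1] ? 0 : 1);
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Brute-force illustration only: return the leftmost substring of $text
# within $k edits of $pattern, preferring the shortest candidate when
# $stingy is true and the longest one otherwise.
sub approx_match {
    my ($pattern, $text, $k, $stingy) = @_;
    my @lengths = (0 .. length($text));
    @lengths = reverse @lengths unless $stingy;

    for my $len (@lengths) {
        for my $start (0 .. length($text) - $len) {
            my $candidate = substr($text, $start, $len);
            return $candidate if edit_distance($pattern, $candidate) <= $k;
        }
    }
    return;    # no substring is close enough
}

for my $text ("cork", "wool") {
    printf "%s: longest '%s', stingy '%s'\n", $text,
        approx_match("word", $text, 2, 0),
        approx_match("word", $text, 2, 1);
}
```

With an edit distance of two, the longest candidates are "cork" and "wool", while the stingy ones are "or" and "wo", matching the example above.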
ACKNOWLEDGEMENTS
The following people provided valuable test cases and other feedback.
Jared August <rudeop@skapunx.net>
Steve A. Chervitz <sac@genome.Stanford.edu>
Alberto Fontaneda <alberfon@ctv.es>
Dmitrij Frishman <frishman@mips.biochem.mpg.de>
Lars Gregersen <lars.gregersen@private.dk>
Kevin Greiner <kgreiner@geosys.com>
Ricky Houghton <ricky.houghton@cs.cmu.edu>
Helmut Jarausch <jarausch@IGPM.Rwth-Aachen.DE>
Sergey Novoselov <snovo@usa.net>
Stewart Russell <stewart@ref.collins.co.uk>
Slaven Rezic <eserte@cs.tu-berlin.de>
Ilya Sandler <sandler@etak.com>
Bob J.A. Schijvenaars <schijvenaars@mi.fgg.eur.nl>
Greg Ward <greg@bic.mni.mcgill.ca>
Rick Wise <rwise@lcc.com>
AUTHOR
Jarkko Hietaniemi <jhi@iki.fi>