NAME
String::Approx - match and substitute approximately (aka fuzzy matching)
SYNOPSIS
use String::Approx qw(amatch asubstitute);
DESCRIPTION
Approximate is defined here as k-differences. One difference is an insertion, a deletion, or a substitution of one character. The k in the k-differences is the maximum number of differences.
For example 1-difference means that a match is found if there is one character too many (insertion) or one character missing (deletion) or one character changed (substitution). Those are exclusive ors: that is, not one of each type of modification but exactly one.
The default approximateness
The default approximateness is 10 % of the length of the approximate pattern or at least 1: 0-differences being the exact matching which can be done very effectively using the usual Perl function index()
or normal regular expression matching.
amatch
use String::Approx qw(amatch);
amatch("PATTERN");
amatch("PATTERN", @LIST);
amatch("PATTERN", [ @MODS ]);
amatch("PATTERN", [ @MODS ], @LIST);
The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...
) will be understood as literal characters, that is, a *
means in regex terms \*
, not "match 0 or more times".
The LIST is the list of strings to match against the pattern. If no LIST is given matches against $_
.
The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: The syntax really is [ @MODS ]
, the square brackets [ ]
must be in there. See below for examples.
In scalar context amatch()
returns the number of successful matches. In list context amatch()
returns the strings that had matches.
Example:
use String::Approx qw(amatch);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if amatch('perl');
}
or the same ignoring case:
use String::Approx qw(amatch);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if amatch('perl', ['i']);
}
asubstitute
use String::Approx qw(asubstitute);
asubstitute("PATTERN", "SUBSTITUTION");
asubstitute("PATTERN", "SUBSTITUTION", @LIST);
asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ]);
asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ], @LIST);
The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...
) will be understood as literal characters, that is, a *
means in regex terms \*
, not "match 0 or more times".
Also the SUBSTITUTION is a string, not a regular expression. Well, mostly. Most of the regular expression metanotation (.
, ?
, *
, +
, ...) will be not understood as literal characters, that is, a *
means in regex terms \*
, not "match 0 or more times". The understood notations are
$`
-
the part before the approximate match
$&
-
the approximately matched part
$'
-
the part after the approximate match
The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: Yes, the syntax is really [ @MODS ]
, the square brackets [ ]
must be in there. See below for examples.
The LIST is the list of strings to substitute against the pattern. If no LIST is given substitutes against $_
.
In scalar context asubstitute()
returns the number of successful substitutions. In list context asubstitute()
returns the strings that had substitutions.
Examples:
use String::Approx qw(asubstitute);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if asubstitute('perl', '($&)');
}
or the same ignoring case:
use String::Approx qw(asubstitute);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if asubstitute('perl', '($&)', [ 'i' ]);
}
Modifiers
The MODS argument both in amatch() and asubstitute() is a list of strings that control the matching of PATTERN. The first two, i and g, are the usual regular expression match/substitute modifiers, the rest are special for approximate matching/substitution.
- i
-
Match/Substitute ignoring case, case-insensitively.
- g
-
Substitute globally, that is, all the approximate matches, not just the first one.
- k
-
The maximum number of differences. For example 2.
- Ik
-
The maximum number of insertions. For example 'I2'.
- Dk
-
The maximum number of deletions. For example 'D2'.
- Sk
-
The maximum number of substitutions. For example 'S2'.
- k%
-
The maximum relative number of differences. For example '10%'.
- Ik%
-
The maximum relative number of insertions. For example 'I5%'.
- Dk%
-
The maximum relative number of deletions. For example 'D5%'.
- Sk%
-
The maximum relative number of substitutions. For example 'S5%'.
The regular expression modifiers o m s x
are not supported because their definitions for approximate matching are less than clear.
The relative number of differences is relative to the length of the PATTERN, rounded up: if, for example, the PATTERN is 'bouillabaise'
and the MODS is ['20%']
the k becomes 3.
If you want to disable a particular kind of difference you need to explicitly set it to zero: for example 'D0'
allows no deletions.
In case of conflicting definitions the later ones silently override, for example:
[2, 'I3', 'I1']
equals
['I1', 'D2', 'S2']
EXAMPLES
The following examples assume the following template:
use String::Approx qw(amatch asubstitute);
open(WORDS, "/usr/dict/words") or die;
while (<WORDS>) {
# <---
}
and the following examples just replace the above '# <---
' line.
Matching from the $_
- Match 'perl' with one difference
-
print if amatch('perl');
The one difference is automatically the result in this case because first the rule of the 10 % of the length of the pattern ('
perl
') is used and then the at least 1 rule. - Match 'perl' with case ignored
-
print if amatch('perl', [ 'i' ]);
The case is ignored in matching (
i
). - Match 'perl' with one insertion
-
print if amatch('perl', [ '0', 'I1' ]);
The one insertion is easiest achieved with first disabling any approximateness (
0
) and then enabling one insertion (I1
). - Match 'perl' with zero deletions
-
print if amatch('perl', [ 'D0' ]);
The zero deletion is easily achieved with simply disabling any deletions (
D0
), the other types of differences, the insertions and substitutions, are still enabled. - Substitute 'perl' approximately with HTML emboldening
-
print if amatch('perl', '<B>$&</B>', [ 'g' ]);
All (
g
) of the approximately matching parts of the input are surrounded by theHTML
emboldening markup.
Matching from a list
The above examples match against the default variable $_. The rest of the examples show how the match from a list. The template is now:
use String::Approx qw(amatch asubstitute);
open(WORDS, "/usr/dict/words") or die;
@words = <words>;
# <---
and the examples still go where the '# <---
' line is.
- Match 'perl' with one difference from a list
-
@matched = amatch('perl', @words);
The
@matched
contains the elements of the@words
that matched approximately. - Substitute 'perl' approximately with HTML emphasizing from a list
-
@substituted = asubstitute('perl', '<EM>$&</EM>', [ 'g' ], @words);
The
@substituted
contains with all (g
) the substitutions the elements of the@words
that matched approximately.
ERROR MESSAGES
- amatch: $_ is undefined: what are you matching against?
- asubstitute: $_ is undefined: what are you matching against?
-
These happen when you have nothing in
$_
and try toamatch()
orasubstitute()
. Perhaps you are using the Perl option-e
but you did forget the Perl option-n
? - amatch: too long pattern.
-
This happens when the pattern is too long for matching.
When matching long patterns,
String::Approx
attempts to partition the match. In other words, it tries to do the matching incrementally in smaller parts.If this fails the above message is shown. Please try using shorter match patterns.
See below for "Pattern length" in LIMITATIONS for more detailed explanation why this happens.
- asubstitute: too long pattern.
-
This happens when the pattern is too long for substituting.
The partitioning scheme explained above that is used for matching long patterns cannot, sadly enough, be used substituting.
Please try using shorter substitution patterns.
See below for "Pattern length" in LIMITATIONS for more detailed explanation why this happens.
VERSION
Version 2.1.
LIMITATIONS
Fixed pattern
The PATTERNs of amatch()
and asubstitute()
are fixed strings, they are not regular expressions. The SUBSTITUTION of asubstitute()
is a bit more flexible than that but not by much.
Pattern length
The approximate matching algorithm is very aggressive. In mathematical terms it is O(exp(n) * x**2). This means that when the pattern length and/or the approximateness grows the matching or substitution take much longer time and memory.
For amatch()
this can be avoided by partitioning the pattern, matching it in shorter subpatterns. This makes matching a bit slower and a bit more fuzzier, more approximate. For asubstitute()
this partitioning cannot be done, the absolute maximum for the substitution pattern length is 19 but sometimes, for example it the approximateness is increased, even shorter patterns are too much. When this happens, you must use shorter patterns.
Speed
Despite the about 20-fold speed increase from the String::Approx
version 1 agrep is still faster. If you do not know what agrep
is: it is a program like the UNIX grep but it knows, among other things, how to do approximate matching. agrep
is still about 30 times faster than Perl + String::Approx
. NOTE: all these speeds were measured in one particular system using one particular set of tests: your mileage will vary.
For long patterns, more than about 40, the first
Incompatibilities with String::Approx
v1.*
If you have been using regular expression modifiers (i, g) you lose. Sorry about that. The syntax simply is not compatible. I had to choose between having amatch()
match and asubstitute()
substitute elsewhere than just in $_ and the old messy way of having an unlimited number of modifiers. The first need won.
There is a backward compability mode, though, if you do not want to change your amatch()
and asubstitute()
calls. You have to change your use
line, however:
use String::Approx qw(amatch compat1);
That is, you must add the compat1
symbol if you want to be compatible with the String::Approx
version 1 call syntax.
AUTHOR
Jarkko Hietaniemi <jhi@iki.fi>
ACKNOWLEDGEMENTS
Nathan Torkington <gnat@frii.com>