NAME
String::Approx - match and substitute approximately (aka fuzzy matching)
SYNOPSIS
use String::Approx qw(amatch asubstitute);
DESCRIPTION
Approximate is defined here as k-differences. One difference is an insertion, a deletion, or a substitution of one character. The k in the k-differences is the maximum number of differences.
For example 1-difference means that a match is found if there is one character too many (insertion) or one character missing (deletion) or one character changed (substitution). Those are exclusive ors: that is, not one of each type of modification but exactly one.
The default approximateness
The default approximateness is 10 % of the length of the approximate pattern or at least 1: 0-differences being the exact matching which can be done very effectively using the usual Perl function index()
or normal regular expression matching.
amatch
use String::Approx qw(amatch);
amatch("PATTERN");
amatch("PATTERN", @LIST);
amatch("PATTERN", [ @MODS ]);
amatch("PATTERN", [ @MODS ], @LIST);
The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...
) will be understood as literal characters, that is, a *
means in regex terms \*
, not "match 0 or more times".
The LIST is the list of strings to match against the pattern. If no LIST is given matches against $_
.
The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: The syntax really is [ @MODS ]
, the square brackets [ ]
must be in there. See below for examples.
In scalar context amatch()
returns the number of successful substitutions. In list context amatch()
returns the strings that had substitutions.
Example:
use String::Approx qw(amatch);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if amatch('perl');
}
or the same ignoring case:
use String::Approx qw(amatch);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if amatch('perl', ['i']);
}
asubstitute
use String::Approx qw(asubstitute);
asubstitute("PATTERN", "SUBSTITUTION");
asubstitute("PATTERN", "SUBSTITUTION", @LIST);
asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ]);
asubstitute("PATTERN", "SUBSTITUTION", [ @MODS ], @LIST);
The PATTERN is a string, not a regular expression. The regular expression metanotation (. ? * + {...,...} ( ) | [ ] ^ $ \w ...
) will be understood as literal characters, that is, a *
means in regex terms \*
, not "match 0 or more times".
Also the SUBSTITUTION is a string, not a regular expression. Well, mostly. Most of the regular expression metanotation (.
, ?
, *
, +
, ...) will be not understood as literal characters, that is, a *
means in regex terms \*
, not "match 0 or more times". The understood notations are
$`
-
the part before the approximate match
$&
-
the approximately matched part
$'
-
the part after the approximate match
The MODS are the modifiers that tell how approximately to match. See below for more detailed explanation. NOTE: Yes, the syntax is really [ @MODS ]
, the square brackets [ ]
must be in there. See below for examples.
The LIST is the list of strings to substitute against the pattern. If no LIST is given substitutes against $_
.
In scalar context asubstitute()
returns the number of successful substitutions. In list context asubstitute()
returns the strings that had substitutions.
Examples:
use String::Approx qw(asubstitute);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if asubstitute('perl', '($&)');
}
or the same ignoring case:
use String::Approx qw(asubstitute);
open(WORDS, '/usr/dict/words') or die;
while (<WORDS>) {
print if asubstitute('perl', '($&)', [ 'i' ]);
}
Modifiers
The MODS argument both in amatch() and asubstitute() is a list of strings that control the matching of PATTERN. The first two, i and g, are the usual regular expression match/substitute modifiers, the rest are special for approximate matching/substitution.
- i
-
Match/Substitute ignoring case, case-insensitively.
- g
-
Substitute globally, that is, all the approximate matches, not just the first one.
- k
-
The maximum number of differences. For example 2.
- Ik
-
The maximum number of insertions. For example 'I2'.
- Dk
-
The maximum number of deletions. For example 'D2'.
- Sk
-
The maximum number of substitutions. For example 'S2'.
- k%
-
The maximum relative number of differences. For example '10%'.
- Ik%
-
The maximum relative number of insertions. For example 'I5%'.
- Dk%
-
The maximum relative number of deletions. For example 'D5%'.
- Sk%
-
The maximum relative number of substitutions. For example 'S5%'.
The regular expression modifiers o m s x
are not supported because their definitions for approximate matching are less than clear.
The relative number of differences is relative to the length of the PATTERN, rounded up: if, for example, the PATTERN is 'bouillabaise'
and the MODS is ['20%']
the k becomes 3.
If you want to disable a particular kind of difference you need to explicitly set it to zero: for example 'D0'
allows no deletions.
In case of conflicting definitions the later ones silently override, for example:
[2, 'I3', 'I1']
equals
['I1', 'D2', 'S2']
EXAMPLES
The following examples assume the following template:
use String::Approx qw(amatch asubstitute);
open(WORDS, "/usr/dict/words") or die;
while (<WORDS>) {
# <---
}
and the following examples just replace the above '# <---
' line.
Matching from the $_
- Match 'perl' with one difference
-
print if amatch('perl');
The one difference is automatically the result in this case because first the rule of the 10 % of the length of the pattern ('
perl
') is used and then the at least 1 rule. - Match 'perl' with case ignored
-
print if amatch('perl', [ 'i' ]);
The case is ignored in matching (
i
). - Match 'perl' with one insertion
-
print if amatch('perl', [ '0', 'I1' ]);
The one insertion is easiest achieved with first disabling any approximateness (
0
) and then enabling one insertion (I1
). - Match 'perl' with zero deletions
-
print if amatch('perl', [ 'D0' ]);
The zero deletion is easily achieved with simply disabling any deletions (
D0
), the other types of differences, the insertions and substitutions, are still enabled. - Substitute 'perl' approximately with HTML emboldening
-
print if amatch('perl', '<B>$&</B>', [ 'g' ]);
All (
g
) of the approximately matching parts of the input are surrounded by theHTML
emboldening markup.
Matching from a list
The above examples match against the default variable $_. The rest of the examples show how the match from a list. The template is now:
use String::Approx qw(amatch asubstitute);
open(WORDS, "/usr/dict/words") or die;
@words = <words>;
# <---
and the examples still go where the '# <---
' line is.
- Match 'perl' with one difference from a list
-
@matched = amatch('perl', @words);
The
@matched
contains the elements of the@words
that matched approximately. - Substitute 'perl' approximately with HTML emphasizing from a list
-
@substituted = asubstitute('perl', '<EM>$&</EM>', [ 'g' ], @words);
The
@substituted
contains with all (g
) the substitutions the elements of the@words
that matched approximately.
ERROR MESSAGES
- amatch: $_ is undefined: what are you matching against?
- asubstitute: $_ is undefined: what are you matching against?
-
Happen when you have nothing in $_ and try to
amatch()
orasubstitute()
. Perhaps you did forget the Perl option-n
?
VERSION
Version 2.0_01.
LIMITATIONS
Fixed Pattern
The PATTERNs of amatch()
and asubstitute()
are fixed strings, they are not regular expressions. The SUBSTITUTION of asubstitute()
is a bit more flexible than that but not by much.
Speed
Despite the about 20-fold speed increase from the String::Approx
version 1 agrep is still faster. If you do not know what agrep
is: it is a program like the UNIX grep but it knows, among other things, how to do approximate matching. agrep
is still about 30 times faster than Perl + String::Approx
. NOTE: all these speeds were measured in one particular system using one particular set of tests: your mileage will vary.
Incompatibilities with String::Approx
v1.*
If you have been using regular expression modifiers (i, g) you lose. Sorry about that. The syntax simply is not compatible. I had to choose between having amatch()
match and asubstitute()
substitute elsewhere than just in $_ and the old messy way of having an unlimited number of modifiers. The first need won.
There is a backward compability mode, though, if you do not want to change your amatch()
and asubstitute()
calls. You have to change your use
line, however:
use String::Approx qw(amatch compat1);
That is, you must add the compat1
symbol if you want to be compatible with the String::Approx
version 1 call syntax.
AUTHOR
Jarkko Hietaniemi <jhi@iki.fi>
ACKNOWLEDGEMENTS
Nathan Torkington <gnat@frii.com>