NAME

String::Approx - approximate matching and substitution

SYNOPSIS

use String::Approx qw(amatch asubstitute);

# amatch() and asubstitute() are imported into the current namespace;
# by default _nothing_ is imported

DESCRIPTION

String::Approx is a Perl module for matching and substituting strings in a fuzzy way - approximately.

amatch

amatch($approximate_string[, ...]);

Only $approximate_string is required; all the other amatch() arguments are optional.

The additional parameters are strings of any of the forms:

number		e.g. '1', the maximum number of transformations
		for all the transformation types,
		the types being insert/delete/substitute

number%		e.g. '15%', the relative maximum number of
		all the transformations is 15% of the
		approximating string

[IDS]number	e.g. 'I2', the maximum number of insertions

[IDS]number%	e.g. 'D20%', the relative maximum number of
		deletions is 20% of the length of the
		approximating string

[gimosx]	e.g. 'im', the usual m// modifiers
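
To make the parameter forms above concrete, here is a minimal sketch of how such a string could be split into its parts. The sub parse_approx_param() and the hash layout it returns are hypothetical and only illustrate the grammar; they are not the module's own parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of String::Approx: split one parameter
# string into per-type limits and m// modifiers.
sub parse_approx_param {
    my ($param) = @_;
    my %limit;
    my $modifiers = '';
    pos($param) = 0;
    while (pos($param) < length($param)) {
        if ($param =~ /\G([IDS]?)(\d+)(%?)/gc) {
            my ($type, $amount, $percent) = ($1, $2, $3);
            # No type letter means the limit applies to all three types.
            my @types = $type ne '' ? ($type) : qw(I D S);
            for my $t (@types) {
                $limit{$t} = {
                    amount   => $amount,
                    relative => ($percent eq '%' ? 1 : 0),
                };
            }
        }
        elsif ($param =~ /\G([gimosx]+)/gc) {
            $modifiers .= $1;
        }
        else {
            die "unrecognised text at offset " . pos($param) . " in '$param'\n";
        }
    }
    return (\%limit, $modifiers);
}

my ($limit, $modifiers) = parse_approx_param('I1D0S30%');
print "S: $limit->{S}{amount}", ($limit->{S}{relative} ? '%' : ''), "\n";
```

Under this sketch, a bare number sets the same limit for insertions, deletions and substitutions, while a leading I, D or S restricts it to one type.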

The default parameter is '10%'. Two noteworthy points:

Relative amounts, especially the default 10%,
would often result in the number of allowed 'errors' being
less than 1. This, however, does not happen: internally
the minimum is forced to be 1.
(0 can be, and must be, explicitly asked for.)

Relative amounts are rounded to the nearest whole
number in the standard way, e.g. 10% of 15 will end up
being 2.
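
The two points above can be sketched as a small calculation. The sub allowed_errors() is a hypothetical helper written for this illustration, and treating an explicit 0% as "no errors allowed" is an assumption based on the note about asking for 0 explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the rule above; not the module's own code.
sub allowed_errors {
    my ($percent, $length) = @_;
    # Assumption: zero is only returned when it was asked for explicitly.
    return 0 if $percent == 0;
    # Round half up; the numerator is a whole number, so no drift creeps in.
    my $rounded = int(($percent * $length + 50) / 100);
    # Any non-zero relative amount allows at least one error.
    return $rounded < 1 ? 1 : $rounded;
}

printf "%d\n", allowed_errors(10, 15);   # 1.5 is rounded to 2
printf "%d\n", allowed_errors(10, 4);    # 0.4 would round to 0, forced to 1
```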

You can combine all the number parameter types into a single string, e.g. '15%i2'.

An example:

use String::Approx qw(amatch);

open(WORDS, '/usr/dict/words') or die;

while (<WORDS>) {
  print if amatch('perl');
}

asubstitute

asubstitute($approximate_string, $substitute[, ...]);

The parameters are identical to those of amatch(), except that the second argument is the substitution string, $substitute.

The substitution string may contain the special marker $&, which represents the approximately matched string.

An example:

use String::Approx qw(asubstitute);

open(WORDS, '/usr/dict/words') or die;

while (<WORDS>) {
  print if asubstitute('perl', '<$&>');
}

RETURN VALUES

In scalar context amatch() and asubstitute() return the number of possible matches and substitutions. In list context they return a list of all the possible matches and substitutions. Note that in the case of asubstitute() the list of possible substitutions may be longer than the list of substitutions actually done, because possible substitutions may overlap. The first and the longest substitutions are done first; the rest are done only if they do not overlap the substitutions already done.
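
The overlap rule can be illustrated with a short sketch. It reads "first and longest" as "earliest start, with ties going to the longer candidate", which is one plausible reading rather than a statement of the module's internals. The sub pick_non_overlapping() and the [start, length] representation are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: each candidate is an array ref [start, length].
# Returns the candidates that would be carried out under the assumed rule.
sub pick_non_overlapping {
    # Earliest start first; among equal starts, the longest first.
    my @candidates = sort { $a->[0] <=> $b->[0] or $b->[1] <=> $a->[1] } @_;
    my @done;
    my $next_free = 0;    # first offset not covered by an accepted candidate
    for my $candidate (@candidates) {
        my ($start, $length) = @$candidate;
        next if $start < $next_free;    # overlaps something already done
        push @done, $candidate;
        $next_free = $start + $length;
    }
    return @done;
}

my @possible = ([0, 4], [0, 3], [2, 4], [4, 2], [10, 1]);
my @done     = pick_non_overlapping(@possible);
print scalar(@possible), " possible, ", scalar(@done), " done\n";
```

Here five substitutions are possible but only three can be done, which is why the returned list may be longer than the set of changes visible in $_.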

As a side-effect asubstitute() may change the value of $_ if approximate matches are found.

Note that error messages and warnings come from amatch(), not from asubstitute().

More Examples

amatch($s);		# the maximum amount of approximateness
			# is max(1,10_%_of_length($s))
amatch($s, 1);		# the maximum number of any
			# insertions/deletes/substitutions
			# (_separately_) is 1
amatch($s, 'I1D0S30%');	# the maximum amount for insertions is 1,
			# deletions are not allowed, the maximum
			# amount of substitutions
			# is max(1,30_%_of_length($s))

asubstitute($s, '($&)', 'g');
			# surround in $_ all ('g') the approximate
			# matches by parentheses

asubstitute($s, '&func', 'e');
			# substitute in $_ the first approximate
			# match with the result of &func (without
			# the 'e' literal string '&func' would be
			# the substitute)

LIMITATIONS

You cannot mix approximate matching and normal Perl regular expressions (see perlre). Please do not even think about it.

Do not use characters .?*+{}[](|)^$\ (that is, any characters that have special meaning in regular expressions) in your approximate strings.
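
A caller can guard against this before calling amatch(). The check below is a minimal sketch; the sub has_regex_metachar() is hypothetical and not provided by String::Approx.

```perl
use strict;
use warnings;

# Hypothetical guard: true if the string contains any of the
# characters listed above.
sub has_regex_metachar {
    my ($string) = @_;
    return $string =~ /[.?*+{}\[\]()|^\$\\]/ ? 1 : 0;
}

for my $candidate ('perl', 'a.b', 'cost$') {
    printf "%-6s %s\n", $candidate,
        has_regex_metachar($candidate) ? 'unsafe' : 'ok';
}
```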

Matching and substitution are always done on $_. The =~ binding operator (see perlop) can only be used with the Perl builtins m//, s///, and tr///, not for user-defined functions such as amatch().
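
Because everything works on $_, a string held in another variable has to be made the current topic first. The sketch below shows two standard Perl idioms for that. It uses a hypothetical stand-in, topic_contains(), which reads $_ implicitly the way amatch() does but only performs an exact substring test. The word list and $title are made-up data.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a function that, like amatch(), reads $_.
sub topic_contains {
    my ($needle) = @_;
    return index($_, $needle) >= 0;
}

my @words = ('perl', 'pearl', 'python');
my $title = 'learning perl';

# grep sets $_ to each list element in turn.
my @hits = grep { topic_contains('perl') } @words;

# A one-element for loop aliases $_ to a named variable, so a function
# that modifies $_ (as asubstitute() may) would change $title itself.
my $found;
for ($title) {
    $found = topic_contains('perl');
}

print "hits: @hits\n";
print "title matches\n" if $found;
```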

agrep is faster. Searching for 'perl' with one each of [IDS] allowed in a wordlist of 25486 words took 656 seconds with amatch() on a RISC box, while agrep took 0.77 seconds. This is mainly because String::Approx does the same things in an interpreted language, Perl, whereas agrep does them in a compiled language, C, and because approximate matching is a very demanding operation, especially the substitutions. String::Approx does it by (ab)using regular expressions, which is quite wasteful; approximate matching would have to be built into Perl for it to be fast. Of the time taken, the insertion operation, I, accounts for about 30%, the deletion, D, for about 20%, and the substitution, S, for about 50%. (In case you are wondering: yes, agrep and amatch() did agree on the list of matching words.)

VERSION

v1.6

AUTHOR

Jarkko Hietaniemi, jhi@iki.fi