NAME
Text::Fuzzy - partial or fuzzy string matching using edit distances
SYNOPSIS
use Text::Fuzzy;
my $tf = Text::Fuzzy->new ('boboon');
print "Distance is ", $tf->distance ('babboon'), "\n";
# Prints "Distance is 2"
my @words = qw/the quick brown fox jumped over the lazy dog/;
my $nearest = $tf->nearest (\@words);
print "Nearest array entry is ", $words[$nearest], "\n";
# Prints "Nearest array entry is brown"
DESCRIPTION
This module calculates the Levenshtein edit distance between words, and searches arrays and files for the entry with the smallest edit distance. It can handle either byte strings or character strings (strings containing Unicode), treating each Unicode character as a single entity.
It is designed for high performance when searching an array of words or a file for the nearest match to a particular search term, which it achieves by reducing the number of calculations that need to be performed.
It supports either bytewise edit distances or Unicode-based edit distances:
use utf8;
my $tf = Text::Fuzzy->new ('あいうえお☺');
print $tf->distance ('うえお☺'), "\n";
# prints "2".
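To make the difference between the two kinds of distance concrete, the following is a minimal pure-Perl sketch of the textbook Levenshtein calculation. It is not the module's own implementation; the function name levenshtein and the use of Encode to produce UTF-8 bytes are assumptions made for illustration only. Run on the strings above, it counts two edits when comparing characters and six when comparing the UTF-8 bytes, since each of the deleted characters occupies three bytes.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use List::Util qw(min);

# Illustrative helper, not part of Text::Fuzzy: textbook two-row
# Levenshtein distance over the units of each string (characters for
# decoded strings, bytes for encoded ones).
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my ($x, $y) = ('あいうえお☺', 'うえお☺');

# Character strings: each Unicode character is one unit.
my $char_distance = levenshtein($x, $y);

# Byte strings: the same text, compared as UTF-8 encoded bytes.
my $byte_distance = levenshtein(encode('UTF-8', $x), encode('UTF-8', $y));

print "characters: $char_distance, bytes: $byte_distance\n";
# prints "characters: 2, bytes: 6"
```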
METHODS
new
my $tf = Text::Fuzzy->new ('bibbety bobbety boo');
Create a new Text::Fuzzy object from the supplied word.
distance
my $dist = $tf->distance ($word);
Return the edit distance to $word from the word used to create the object in "new".
nearest
my $index = $tf->nearest (\@words);
Return the index of the element of the array nearest to the argument to "new". If none of the elements is less than the maximum distance away from the word, $index is -1.
if ($index >= 0) {
print "Found at $index.\n";
}
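The behaviour described above can be sketched in pure Perl as follows. This is an illustration only, not the module's implementation: nearest_index and levenshtein are hypothetical names, the strict "less than" comparison follows the wording above, and skipping entries by length difference is just one plausible way of reducing the number of distance calculations.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Illustrative helper: textbook two-row Levenshtein distance.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1,
                           $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper, not part of Text::Fuzzy: return the index of the
# closest entry whose distance is strictly below $max, or -1 if none is.
sub nearest_index {
    my ($target, $words, $max) = @_;
    my ($best_index, $best) = (-1, $max);
    for my $i (0 .. $#$words) {
        # The length difference is a lower bound on the edit distance, so
        # entries that cannot beat the current best are skipped cheaply.
        next if abs(length($words->[$i]) - length($target)) >= $best;
        my $d = levenshtein($target, $words->[$i]);
        ($best_index, $best) = ($i, $d) if $d < $best;
    }
    return $best_index;
}

my @words = qw/the quick brown fox jumped over the lazy dog/;
my $found   = nearest_index('boboon', \@words, 10);  # "brown" is close enough
my $missing = nearest_index('boboon', \@words, 2);   # nothing within the limit
print "found=$found missing=$missing\n";
# prints "found=2 missing=-1"
```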
last_distance
my $last_distance = $tf->last_distance ();
Return the edit distance found by the previous match. This is usually called after "nearest" to get the edit distance to the nearest entry.
get_max_distance
# Get the maximum edit distance.
print "The max distance is ", $tf->get_max_distance (), "\n";
Get the maximum edit distance of $tf. The default is 10.
set_max_distance
# Set the max distance.
$tf->set_max_distance (3);
Set the maximum edit distance of $tf. The default is 10. If this is called with an undefined value, the maximum edit distance is switched off.
scan_file
$tf->scan_file ('/usr/share/dict/words');
Scan a file to find the line nearest to the word used in "new". The file is assumed to contain lines of text separated by newlines, with one entry per line.
This does not currently support Unicode-encoded files.
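A line-by-line scan of this kind might look like the following pure-Perl sketch. It is not how the module itself is implemented; scan_handle and levenshtein are hypothetical names, returning both the line and its distance is a choice made for this sketch, and an in-memory filehandle stands in for a real word list. Lines are compared exactly as read, with no decoding layer, in keeping with the limitation noted above.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Illustrative helper: textbook two-row Levenshtein distance.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1,
                           $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper, not the module's implementation: read
# newline-separated entries from a filehandle and return the closest
# one together with its distance.
sub scan_handle {
    my ($target, $fh) = @_;
    my ($best_line, $best);
    while (my $line = <$fh>) {
        chomp $line;
        my $d = levenshtein($target, $line);
        ($best_line, $best) = ($line, $d) if !defined $best || $d < $best;
    }
    return ($best_line, $best);
}

# An in-memory file stands in for a list such as /usr/share/dict/words.
my $dictionary = "baboon\nballoon\nbassoon\nmonsoon\n";
open my $fh, '<', \$dictionary or die "Cannot open in-memory file: $!";
my ($match, $distance) = scan_handle('boboon', $fh);
close $fh;
print "Nearest line is $match at distance $distance\n";
# prints "Nearest line is baboon at distance 1"
```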
EXAMPLES
misspelt-web-page.cgi
The file examples/misspelt-web-page.cgi is an example of a CGI script which does something similar to the Apache mod_speling module, offering spelling corrections for mistyped URLs and sending the user to a correct page.
See the file in the distribution for details. See also http://www.lemoda.net/perl/perl-mod-speling/ for how to set up .htaccess to use the script.
spell-check.pl
The file examples/spell-check.pl is a spell checker. It uses a dictionary of words specified by a command-line option "-d":
spell-check.pl -d /usr/dict/words file1.txt file2.txt
It prints out any words which look like spelling mistakes, using the dictionary.
Because the usual Unix dictionary doesn't have plurals, it uses Lingua::EN::PluralToSingular to convert nouns into singular forms. Unfortunately it still misses past participles and past tenses of verbs.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2013 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.