NAME

Text::Fuzzy - partial or fuzzy string matching using edit distances

SYNOPSIS

use Text::Fuzzy;
my $tf = Text::Fuzzy->new ('boboon');
print "Distance is ", $tf->distance ('babboon'), "\n";
# Prints "Distance is 2"
my @words = qw/the quick brown fox jumped over the lazy dog/;
my $nearest = $tf->nearest (\@words);
print "Nearest array entry is ", $words[$nearest], "\n";
# Prints "Nearest array entry is brown"

DESCRIPTION

This module calculates the Levenshtein edit distance between words, and does edit-distance-based searching of arrays and files to find the nearest entry. It can handle either byte strings or character strings (strings containing Unicode), treating each Unicode character as a single entity.

It is designed for high performance in searching for the nearest to a particular search term over an array of words or a file, by reducing the number of calculations which needs to be performed.

It supports either bytewise edit distances or Unicode-based edit distances:

use utf8;
my $tf = Text::Fuzzy->new ('あいうえお☺');
print $tf->distance ('うえお☺'), "\n";
# prints "2".

The default edit distance is the Levenshtein edit distance, which applies an equal weight of one to additions (cat -> cart), substitutions (cat -> cut), and deletions (carp -> cap). Optionally, the Damerau-Levenshtein edit distance, which additionally allows transpositions (salt -> slat) may be selected using the method "transpositions_ok".

METHODS

new

my $tf = Text::Fuzzy->new ('bibbety bobbety boo');

Create a new Text::Fuzzy object from the supplied word.

distance

my $dist = $tf->distance ($word);

Return the edit distance to $word from the word used to create the object in "new".

nearest

my $index = $tf->nearest (\@words);

This returns the index of the nearest element in the array to the argument to "new". If none of the elements are less than the maximum distance away from the word, $index is -1.

if ($index >= 0) {
    printf "Found at $index, distance was %d.\n",
        $tf->last_distance ();
}

Use "set_max_distance" to alter the maximum distance used.

If there is more than one word with the same edit distance in @words, this returns the last one found, unless it is an exact match, in which case it returns the first one found. To get all matches, call it in array context:

my @nearest = $tf->nearest (\@words);

If there are no matches within minimum distance, it returns an empty list. If there is one or more match, it returns the array offset of it, not the value itself.

use Text::Fuzzy;

my @funky_words = qw/
nice
funky
rice
gibbon
lice
graham
garden
/;

my $tf = Text::Fuzzy->new ('dice');

my @nearest = $tf->nearest (\@funky_words);

print "The nearest words are ";
print join ", ", (map {$funky_words[$_]} @nearest);
printf ", distance %d.\n", $tf->last_distance ();

# Prints out "The nearest words are nice, rice, lice."

last_distance

my $last_distance = $tf->last_distance ();

The distance from the previous match closest match. This is used in conjunction with "nearest" to find the edit distance to the previous match.

set_max_distance

# Set the max distance.
$tf->set_max_distance (3);

Set the maximum edit distance of $tf. The default maximum distance is 10. Set the maximum distance to a low value to improve the speed of searches over lists with "nearest", or to reject unlikely matches. When searching for a near match, anything with an edit distance of a value at least as high as the maximum is rejected without computing the exact distance. To compute exact distances, call this method with zero or undefined, the maximum edit distance is switched off, and whatever the nearest match is is accepted.

get_max_distance

# Get the maximum edit distance.
print "The max distance is ", $tf->get_max_distance (), "\n";

Get the maximum edit distance of $tf. The default is set to 10. The maximum distance may be set with "set_max_distance".

scan_file

$tf->scan_file ('/usr/share/dict/words');

Scan a file to find the nearest match to the word used in "new". This assumes that the file contains lines of text separated by newlines and finds the closest match in the file.

This does not currently support Unicode-encoded files.

transpositions_ok

$tf->transpositions_ok (1);

A true value in the argument changes the type of edit distance used to allow transpositions, such as clam and calm. Initially transpositions are not allowed, giving the Levenshtein edit distance. If transpositions are used, the edit distance becomes the Damerau-Levenshtein edit distance. A false value disallows transpositions:

$tf->transpositions_ok (0);

no_exact

$tf->no_exact (1);

Use this to make "nearest" ignore exact matches. For example,

use Text::Fuzzy;

my @words = qw/bibbity bobbity boo/;
for my $word (@words) {
    my $tf = Text::Fuzzy->new ($word);
    $tf->no_exact (0);
    my $nearest = $tf->nearest (\@words);
    if ($nearest >= 0) {
	print "With exact, nearest to $word is $words[$nearest]\n";
    }
    # Make "$word" not match itself.
    $tf->no_exact (1);
    my $nearest = $tf->nearest (\@words);
    if ($nearest >= 0) {
	print "Without exact, nearest to $word is $words[$nearest]\n";
    }
}

This prints

With exact, nearest to bibbity is bibbity
Without exact, nearest to bibbity is bobbity
With exact, nearest to bobbity is bobbity
Without exact, nearest to bobbity is bibbity
With exact, nearest to boo is boo
Without exact, nearest to boo is bobbity

EXAMPLES

misspelt-web-page.cgi

The file examples/misspelt-web-page.cgi is an example of a CGI script which does something similar to the Apache mod_speling module, offering spelling corrections for mistyped URLs and sending the user to a correct page.

use Text::Fuzzy;

# The directory of files served by the web server.

my $web_root = '/usr/local/www/data';

# If the query is "http://www.example.com/abc/xyz.html", $path_info is
# "abc/xyz.html".

my $path_info = $ENV{REQUEST_URI};

if (! defined $path_info) {
    fail ("No path info");
}

if ($0 =~ /$path_info/) {
    fail ("Don't redirect to self");
}

# This is the list of files under the main page.

my @allfiles = get_all_files ($web_root, '');

# This is our spelling search engine.

my $tf = Text::Fuzzy->new ($path_info);

my $nearest = $tf->nearest (\@allfiles);

if ($nearest >= 0) {
    redirect ($allfiles[$nearest]);
}
else {
    fail ("Nothing like $path_info was found");
}
exit;

# Read all the files under "$root/$dir". This is recursive. The return
# value is an array containing all files found.

sub get_all_files
{
    my ($root, $dir) = @_;
    my @allfiles;
    my $full_dir = "$root/$dir";
    if (! -d $full_dir) {
        fail ("$full_dir is not a directory");
    }
    opendir DIR, $full_dir or fail ("Can't open directory $full_dir: $!");
    my @files = grep !/^\./, readdir DIR;
    closedir DIR or fail ("Can't close $full_dir: $!");
    for my $file (@files) {
        my $dir_file = "$dir/$file";
        my $full_file = "$root/$dir_file";
        if (-d $full_file) {
            push @allfiles, get_all_files ($root, $dir_file);
        }
        else {
            push @allfiles, $dir_file;
        }
    }
    return @allfiles;
}

# Print a "permanent redirect" to the respelt name, then exit.

sub redirect
{
    my ($url) = @_;
    print <<EOF;
Status: 301
Location: $url

EOF
    exit;
}

# Print an error message for the sake of the requester, and print a
# message to the error log, then exit.

sub fail
{
    my ($error) = @_;
    print <<EOF;
Content-Type: text/plain

$error
EOF
    # Add the name of the program and the time to the error message,
    # otherwise the error log will get awfully confusing-looking.
    my $time = scalar gmtime ();
    print STDERR "$0: $time: $error\n";
    exit;
}

See also http://www.lemoda.net/perl/perl-mod-speling/ for how to set up .htaccess to use the script.

spell-check.pl

The file examples/spell-check.pl is a spell checker. It uses a dictionary of words specified by a command-line option "-d":

spell-check.pl -d /usr/dict/words file1.txt file2.txt

It prints out any words which look like spelling mistakes, using the dictionary.

use Getopt::Long;
use Text::Fuzzy;
use Lingua::EN::PluralToSingular 'to_singular';

my $dict = '/usr/share/dict/words';
GetOptions (
    "dict=s" => \$dict,
);
my @words;
my %words;
my $min_length = 4;
read_dictionary ($dict, \@words, \%words);
# Known mistakes, don't repeat.
my %known;
# Spell-check each file on the command line.
for my $file (@ARGV) {
    open my $input, "<", $file or die "Can't open $file: $!";
    while (<$input>) {
        my @line = split /[^a-z']+/i, $_;
        for my $word (@line) {
            my $clean_word = to_singular (lc $word);
            if ($words{$clean_word}) {
                # It is in the dictionary.
                next;
            }
            if (length $word < $min_length) {
                # Very short words are ignored.
                next;
            }
            if ($word eq uc $word) {
                # Acronym like BBC, IRA, etc.
                next;
            }
            if ($known{$clean_word}) {
                # This word was already given to the user.
                next;
            }
            my $tf = Text::Fuzzy->new ($clean_word);
            my $nearest = $tf->nearest (\@words);
            if ($nearest >= 0) {
                my $correction = $words[$nearest];
                print "$file:$.: '$word' may be $correction.\n";
                $known{$clean_word} = $correction;
            }
            else {
                print "$file:$.: $word may be a spelling mistake.\n";
                $known{$clean_word} = 1;
            }
        }
    }
    close $input or die $!;
}

exit;

sub read_dictionary
{
    my ($dict, $words_array, $words_hash) = @_;    
    open my $din, "<", $dict or die "Can't open dictionary $dict: $!";
    while (<$din>) {
        chomp;
        push @$words_array, lc $_;
        $words_hash->{$_} = 1;
    }
    close $din or die $!;
}

Because the usual Unix dictionary doesn't have plurals, it uses Lingua::EN::PluralToSingular, to convert nouns into singular forms. Unfortunately it still misses past participles and past tenses of verbs.

extract-kana.pl

The file examples/extract-kana.pl extracts the kana entries from "edict", a freely-available Japanese to English electronic dictionary, and does some fuzzy searches on them. It requires a local copy of the file to run. This script demonstrates the use of Unicode searches with Text::Fuzzy.

use Lingua::JA::Moji ':all';
use Text::Fuzzy;
binmode STDOUT, ":utf8";
my $infile = '/home/ben/data/edrdg/edict';
open my $in, "<:encoding(EUC-JP)", $infile or die $!;
my @kana;
while (<$in>) {
    my $kana;
    if (/\[(\p{InKana}+)\]/) {
	$kana = $1;
    }
    elsif (/^(\p{InKana}+)/) {
	$kana = $1;
    }
    if ($kana) {
	$kana = kana2katakana ($kana);
	push @kana, $kana;
    }
}
printf "Starting fuzzy searches over %d lines.\n", scalar @kana;
search ('ウオソウコ');
search ('アイウエオカキクケコバビブベボハヒフヘホ');
search ('アルベルトアインシュタイン');
search ('バババブ');
search ('バババブアルベルト');
exit;

sub search
{
    my ($silly) = @_;
    my $search = Text::Fuzzy->new ($silly);
    my $n = $search->nearest (\@kana);
    if ($n >= 0) {
	printf "$silly nearest is $kana[$n] (distance %d)\n",
	    $search->last_distance ();
    }
    else {
	printf "Nothing like '$silly' was found within the edit distance %d.\n",
	    $search->get_max_distance ();
    }
}

SUPPORT

Reporting a bug

There is a bug tracker for the module at https://github.com/benkasminbullock/Text-Fuzzy/issues

Mailing list

There is a mailing list at Google Groups at https://groups.google.com/group/textfuzzy

Testing

The CPAN tester results are at http://www.cpantesters.org/distro/T/Text-Fuzzy.html. The ActiveState tester results are at http://code.activestate.com/ppm/Text-Fuzzy/.

PRIVATE METHODS

These methods are not expected to be useful for the general user. They may be useful in benchmarking the module and checking its correctness.

no_alphabet

$tf->no_alphabet (1);

This turns off alphabetizing of the string. Alphabetizing is a filter used in "nearest" where the intersection of all the characters in the two strings is computed, and if the alphabetical difference of the two strings is greater than the maximum distance, the match is rejected without applying the dynamic programming algorithm. This increases speed, because the dynamic programming algorithm is slow.

The alphabetizing should not ever reject anything which is a legitimate match, and it should make the program run faster in almost every case. The only envisaged uses of switching this off are checking that the algorithm is working correctly, and benchmarking performance.

get_trans

my $trans_ok = $tf->get_trans ();

This returns the value set by "transpositions_ok".

unicode_length

my $length = $tf->unicode_length ();

This returns the length in characters (not bytes) of the string used in "new". If the string is not marked as Unicode, it returns the undefined value. In the following, $l1 should be equal to $l2.

use utf8;
my $word = 'ⅅⅆⅇⅈⅉ';
my $l1 = length $word;
my $tf = Text::Fuzzy->new ($word);
my $l2 = $tf->unicode_length ();

ualphabet_rejections

my $rejected = $tf->ualphabet_rejections ();

After running "nearest" over an array, this returns the number of entries of the array which were rejected using only the alphabet. Its value is reset to zero each time "nearest" is called.

length_rejections

my $rejected = $tf->length_rejections ();

After running "nearest" over an array, this returns the number of entries of the array which were rejected because the length difference between them and the target string was larger than the maximum distance allowed.

ACKNOWLEDGEMENTS

The edit distance including transpositions was contributed by Nick Logan (UGEXE). Some of the tests in t/trans.t are taken from the Text::Levenshtein::Damerau::XS module.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.

To install Text::Fuzzy, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Text::Fuzzy

CPAN shell

perl -MCPAN -e shell
install Text::Fuzzy

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)