NAME
Text::WagnerFischer::Armenian - a subclass of Text::WagnerFischer for Armenian-language strings
SYNOPSIS
use Text::WagnerFischer::Armenian qw( distance );
use utf8;
print distance("ձեռն", "ձեռան") . "\n";
# "dzerrn" -> "dzerran"; prints 1
print distance("ձեռն", "ձերն") . "\n";
# "dzerrn" -> "dzern"; prints 0.5
print distance("կինք", "կին") . "\n";
# "kink'" -> "kin"; prints 0.5
my @words = qw( զօրսն Զորս զզօրսն );
my @distances = distance( "զօրս", @words );
print "@distances\n";
# "zors" -> "zorsn, Zors, zzorsn"
# prints "0.5 0.25 1"
# Change the cost of a letter case substitution to 1
my $edit_values = [ [ 0, 1, 1, 1, 0.5, 0.5, 1 ],   # string-beginning values
                    [ 0, 1, 1, 1, 0.5, 1, 1 ],     # string-middle values
                    [ 0, 1, 1, 1, 0.5, 1, 0.5 ],   # string-end values
                  ];
print distance( $edit_values, "ձեռն", "Ձեռն" ) . "\n";
# prints 1
DESCRIPTION
This module implements the Wagner-Fischer distance algorithm, modified for Armenian strings. The Armenian language has a number of single-letter prefixes and suffixes which, while not changing the basic meaning of the word, function as definite articles, prepositions, or grammatical markers. These changes, and letter substitutions that represent vocalic equivalence, should count as a smaller edit distance than an ordinary character substitution.
The Armenian weight function recognizes four extra edit types:
            / a: x = y  (cost for letter match)
            | b: x = - or y = -  (cost for letter insertion/deletion)
w( x, y ) = | c: x != y  (cost for letter mismatch)
            | d: x = X  (cost for case mismatch)
            | e: x ~ y  (cost for letter vocalic equivalence)
            | f: x = (z|y|ts) && y = -  (or vice versa)
            |    (cost for grammatic prefix)
            | g: x = (n|k'|s|d) && y = -  (or vice versa)
            \    (cost for grammatic suffix)
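The cases above can be sketched as a small classifier that returns the edit-type letter for a pair of symbols. This is an illustration only, not the module's own code: the name edit_type, the lookup tables, the mapping of the transliterations z, y, ts, n, k', s, d to specific Armenian letters, and the single vocalic pair are all assumptions made for the example.

```perl
use strict;
use warnings;
use utf8;

# Assumed Armenian letters for the transliterations listed above.
my %PREFIX = map { $_ => 1 } ( 'զ', 'յ', 'ց' );         # z, y, ts
my %SUFFIX = map { $_ => 1 } ( 'ն', 'ք', 'ս', 'դ' );    # n, k', s, d

# Illustrative only: one vocalic pair; the real set is not given here.
my %VOCALIC = ( 'ո' => 'օ', 'օ' => 'ո' );

# Return the edit-type letter (a..g) for symbols $x and $y.
# A gap (insertion or deletion) is written as '-'.
sub edit_type {
    my ( $x, $y ) = @_;
    return 'a' if $x eq $y;                      # letter match
    if ( $x eq '-' or $y eq '-' ) {
        my $letter = $x eq '-' ? $y : $x;
        return 'f' if $PREFIX{$letter};          # grammatic prefix
        return 'g' if $SUFFIX{$letter};          # grammatic suffix
        return 'b';                              # plain insertion/deletion
    }
    return 'd' if lc($x) eq lc($y);              # case mismatch
    return 'e' if ( $VOCALIC{ lc $x } // '' ) eq lc $y;   # vocalic equivalence
    return 'c';                                  # ordinary mismatch
}

print edit_type( 'ձ', 'Ձ' ), "\n";    # d
print edit_type( 'ն', '-' ), "\n";    # g
```

The classifier only names the edit type. Whether a prefix or suffix letter actually gets the reduced cost depends on where in the string the edit occurs.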
These distance weights can be changed, although the prefix/suffix part of the algorithm currently requires that they be specified three times: once each for the start, middle, and end of the string. The weight arrays can be passed in, as a reference to an array of three arrays, as the first argument to distance.
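One way to picture the three sets of weights is as a table with one row per string position and one column per edit type a through g. The sketch below uses the values from the SYNOPSIS. The helper cost_for and its rule for choosing a row are hypothetical, and the module's internal lookup may well differ.

```perl
use strict;
use warnings;

# Column index of each edit type within one row of weights.
my %COLUMN = ( a => 0, b => 1, c => 2, d => 3, e => 4, f => 5, g => 6 );

# Rows for string beginning, middle, and end (values from the SYNOPSIS).
my $edit_values = [
    [ 0, 1, 1, 1, 0.5, 0.5, 1 ],      # beginning: prefix (f) is cheap
    [ 0, 1, 1, 1, 0.5, 1,   1 ],      # middle: no affix discount
    [ 0, 1, 1, 1, 0.5, 1,   0.5 ],    # end: suffix (g) is cheap
];

# Hypothetical helper: cost of an edit of $type at 0-based position
# $pos in a string of $len characters.
sub cost_for {
    my ( $weights, $type, $pos, $len ) = @_;
    # Assumption: a one-character string counts as "beginning".
    my $row = $pos == 0 ? 0 : $pos == $len - 1 ? 2 : 1;
    return $weights->[$row][ $COLUMN{$type} ];
}

print cost_for( $edit_values, 'f', 0, 5 ), "\n";    # 0.5
print cost_for( $edit_values, 'f', 2, 5 ), "\n";    # 1
print cost_for( $edit_values, 'g', 4, 5 ), "\n";    # 0.5
```

Read this way, a prefix letter is discounted only in the first row and a suffix letter only in the last, which would explain why the same seven values have to be given three times.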
BUGS
There are many cases of Armenian word equivalence that are not perfectly handled by this; it is meant to be a rough heuristic for comparing transcriptions of handwriting. In particular, multi-letter suffixes and some orthographic equivalences, e.g. "o" -> "aw", are not handled at all.
AUTHOR
Tara L Andrews, aurum@cpan.org
SEE ALSO
"Text::WagnerFischer"