NAME
Lingua::Ident -- Statistical language identification
SYNOPSIS
use Lingua::Ident;
$i = Lingua::Ident->new("filename 1" ... "filename n");
$lang = $i->identify("text to classify");
DESCRIPTION
This module implements a statistical language identifier.
The filename arguments to the constructor must refer to files containing tables of n-gram probabilities for languages. These tables can be generated using the trainlid(1) utility program.
RETURN VALUE
The identify() method returns the value specified in the _LANG field of the probabilities table of the language to which the text most likely belongs (see "WARNINGS").
This value should be a POSIX locale name constructed from an ISO 639 2-letter language code, possibly extended by an ISO 3166 2-letter country code and a character set identifier. Example: de_DE.iso88591.
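If the _LANG field follows this recommendation, the returned value can be split into its components. The following is a minimal sketch; parse_locale() is a hypothetical helper and not part of Lingua::Ident, and it assumes exactly the form described above (2-letter language code, optional 2-letter country code, optional character set identifier).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Ident): split a POSIX-style
# locale name such as "de_DE.iso88591" into its components.
sub parse_locale {
    my ($name) = @_;

    # 2 lowercase letters, optional "_" + 2 uppercase letters,
    # optional "." + character set identifier.
    return unless $name =~ /\A([a-z]{2})(?:_([A-Z]{2}))?(?:\.([A-Za-z0-9-]+))?\z/;

    # country and charset are undef when the name does not include them.
    return { language => $1, country => $2, charset => $3 };
}

my $parts = parse_locale("de_DE.iso88591");
print "$parts->{language} $parts->{country} $parts->{charset}\n";
```

For names that do not match the pattern, the helper returns undef in scalar context, so callers should check the result before using it.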
WARNINGS
Since Lingua::Ident is based on statistics, it cannot be 100% accurate. More precisely, Dunning (see below) reports that his implementation achieves 92% accuracy with 50K of training text for 20-character strings when discriminating between English and Spanish. This implementation should be as accurate as Dunning's. However, the quality of the training text matters as well as its size.
The current implementation doesn't use a threshold to determine whether the most probable language has a high enough probability. If you try to classify a text in a language for which there is no probability table, identify() therefore still returns one of the known languages, which is incorrect.
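To illustrate what such a threshold could look like, here is a minimal sketch. It is not how Lingua::Ident works: pick_language() is a hypothetical helper, and the per-language scores it takes are made-up values, since the interface described here only returns the winning language.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Ident): return the
# best-scoring language, or undef when even the best score falls
# below a rejection threshold.
# $scores maps language names to scores where higher means more
# probable (for example, average log-probabilities).
sub pick_language {
    my ($scores, $threshold) = @_;

    # Sort language names by descending score and keep the first.
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;

    return unless defined $best && $scores->{$best} >= $threshold;
    return $best;
}

my %scores = ('en_US' => -2.1, 'es_ES' => -3.4);   # made-up numbers
my $lang = pick_language(\%scores, -2.5);
print defined $lang ? "$lang\n" : "unknown\n";
```

A suitable threshold would have to be chosen empirically, because scores depend on the training text and on the length of the string being classified.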
AUTHOR
Lingua::Ident was developed by Michael Piotrowski <mxp@dynalabs.de>.
SEE ALSO
Dunning, Ted (1994). Statistical Identification of Language. Technical report CRL MCCS-94-273. Computing Research Lab, New Mexico State University.