NAME
Lingua::Guess - Guess the language of text
SYNOPSIS
use utf8;
use Lingua::Guess;
my $guesser = Lingua::Guess->new ();
my @lines = split (/\n/, <<EOF);
This is a test of the language checker
Verifions que le détecteur de langues marche
Sprawdźmy, czy odgadywacz języków pracuje
EOF
for my $line (@lines) {
    my $guess = $guesser->simple_guess ($line);
    print "'$line' was $guess\n";
}
This produces the following output:
'This is a test of the language checker' was english
'Verifions que le détecteur de langues marche' was french
'Sprawdźmy, czy odgadywacz języków pracuje' was polish
(This example is included as synopsis.pl in the distribution.)
DESCRIPTION
This module attempts to guess what human language a piece of text is written in.
It is a fork of a module called Language::Guess, which was deleted from CPAN by its author.
METHODS
new
my $lg = Lingua::Guess->new ();
Make a new object. This takes an optional hash of arguments with the following key:
- modeldir
  The location of the training data. If this is not supplied, the training data supplied with the module is used.
guess
my $guess = $lg->guess ($text);
This method returns an arrayref of hashes in the form
[{name => NAME, score => SCORE, code2 => 'en', code3 => 'eng'}]
where score represents the likelihood of the language given by name as a fractional probability, and code2 and code3 are the two-letter and three-letter ISO language codes of the language.
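As a rough illustration of working with the structure described above, the following sketch filters and ranks a list of guesses by score. The sample data is invented for the example (it is not real output from the module), and likely_languages is a hypothetical helper, not part of Lingua::Guess. The sketch sorts explicitly because the ordering of the returned list is not stated here.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return value of guess.
# The names, codes and scores are invented for illustration only.
# In real use this would be: my $guesses = $lg->guess ($text);
my $guesses = [
    {name => 'french',  score => 0.25, code2 => 'fr', code3 => 'fra'},
    {name => 'english', score => 0.60, code2 => 'en', code3 => 'eng'},
    {name => 'polish',  score => 0.15, code2 => 'pl', code3 => 'pol'},
];

# Return the entries whose score is at least $min, highest score first.
# "likely_languages" is a helper made up for this example.
sub likely_languages {
    my ($list, $min) = @_;
    return sort { $b->{score} <=> $a->{score} }
        grep { $_->{score} >= $min } @$list;
}

# Print one line per language that passes the threshold.
for my $g (likely_languages ($guesses, 0.2)) {
    printf "%-10s %s/%s %.2f\n",
        $g->{name}, $g->{code2}, $g->{code3}, $g->{score};
}
```

The threshold of 0.2 is arbitrary; a caller who only wants the single best candidate may find simple_guess more convenient.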
simple_guess
my $name = $lg->simple_guess ($text);
This method returns a one-word string containing the English name of the guessed language.
DEPENDENCIES
- Carp
  "croak" in Carp is used to report errors.
- File::Spec::Functions
  "catfile" in File::Spec::Functions is used in reading the training data files.
- JSON::Parse
  "read_json" in JSON::Parse is used to read in configuration information.
- Unicode::Normalize
  "NFC" in Unicode::Normalize is used to normalize inputs.
- Unicode::UCD
  "charinfo" in Unicode::UCD is used to get information about characters.
SEE ALSO
- Text::Guess::Language
- Text::Language::Guess
- Language::Guess at backpan.perl.org
  This is the module which Lingua::Guess was forked from.
BUGS
The module has a number of oddities, which I'll slowly be trying to resolve.
The module will instantly decide that something is Korean or Japanese if it has just one Korean or Japanese letter in it. That means that, for example, if you give it a Wikipedia page to guess the language, it will seize upon the text in the interlanguage link and instantly proclaim it to be in Korean or Japanese, regardless of how much other text there may be.
STANDALONE SCRIPT
A standalone script called linguaguess is installed with the module. It requires Unicode::UTF8 and File::Slurper to be installed locally. These modules are not dependencies of Lingua::Guess.
HISTORY
This module used to be called Language::Guess. It was released in 2004 and deleted from CPAN at an unknown date, but not before it had been forked twice in Python (https://pypi.python.org/pypi/guess-language and https://bitbucket.org/spirit/guess_language) and once in C++ (https://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp). It was restored to CPAN under the name Lingua::Guess by Ben Bullock on 17th April 2017.
Changes to the original module include:
- Removal of Unicode upgrading
  For some reason, the module itself, which contains no non-ASCII characters, had "use utf8;" at the top of it, and the test file, which contained various languages, had no "use utf8;". The module was also upgrading all input bytes with "_utf8_on" in Encode rather than using "decode_utf8" in Encode. This behaviour was completely excised from the module.
- Training data moved into module's space
  The training data was moved into the module's space using the method described in Acme::Include::Data. "modeldir" was given a default value of this directory.
- Documentation expanded
  Most of the methods weren't documented at all in the original module.
- Bugs fixed
  Two eleven-year-old bugs in the bug queue for Language::Guess were fixed.
- train method removed
  An undocumented, unused, and untested method called "train" was removed from the module.
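The change list above notes that the module no longer upgrades input bytes itself. One plausible consequence, stated here as an assumption rather than documented behaviour, is that text read as raw UTF-8 bytes should be decoded into characters by the caller before it is passed to guess or simple_guess. A minimal sketch of that decoding step using "decode_utf8" in Encode follows; to_characters is a hypothetical helper, not part of Lingua::Guess.

```perl
use strict;
use warnings;
use Encode 'decode_utf8';

# Raw UTF-8 bytes, as they might arrive from a file or a socket.
# The e-acute is encoded as the two bytes C3 A9.
my $bytes = "d\xC3\xA9tecteur";

# Decode UTF-8 octets into a Perl character string.
# "to_characters" is a helper made up for this example.
sub to_characters {
    my ($octets) = @_;
    return decode_utf8 ($octets);
}

my $text = to_characters ($bytes);

# The byte string is one unit longer than the decoded character string.
printf "%d bytes became %d characters\n", length ($bytes), length ($text);
```

Strings written literally in a source file under "use utf8;", as in the synopsis, are already character strings and would not need this step.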
COPYRIGHT
(c) 2004 National Institute for Technology and Liberal Education
(c) 2017-2021 Ben Bullock
LICENSE
This software is released under version 2 of the GNU General Public License.
MAINTAINER
This module is maintained by Ben Bullock <bkb@cpan.org>.
AUTHOR
Maciej Ceglowski