NAME
Lingua::Lid - Interface to the language and encoding identifier "lid"
SYNOPSIS
use Lingua::Lid qw/:all/;
# Identify the language and character encoding of...
# ...a string
$result = lid_fstr("This is a short English sentence.");
# ...a plain text file
$result = lid_ffile("/path/to/a/file.txt");
print "Lingua::Lid v$Lingua::Lid::VERSION, using lid v",
lid_version(), "\n";
DESCRIPTION
The Perl extension Lingua::Lid provides a Perl interface to Lingua-Systems' language and character encoding identification library lid, which is required to build and use this extension.
The interface is implemented in the XS language and makes the functions of the lid C library available to Perl applications and modules in a simple, easy-to-use way.
This man page covers the usage of the Lingua::Lid Perl extension only. For more information on lid and a list of supported languages and character encodings, have a look at its manual, which is both included in its distribution and freely available at http://www.lingua-systems.com/language-identifier/lid-library/.
Lingua::Lid aims to stick to the C interface as closely as is reasonable while respecting common Perl conventions. Have a look at "COMPARISON TO THE C INTERFACE" for details.
EXPORTS
No symbols are exported by default.
Import the functions you need explicitly, or use the export tag :all to import all provided functions:
use Lingua::Lid qw/lid_ffile lid_fstr/; # or
use Lingua::Lid qw/:all/;
FUNCTIONS
lid_fstr( $string )
Mnemonic: "Language and encoding identification... from string"
This function takes a $string as an argument and identifies its language and encoding. It returns a hash reference containing the results. See "IDENTIFICATION RESULTS DATA STRUCTURE" for details.
If an error occurs, the function returns undef and sets $Lingua::Lid::errstr to an appropriate message describing the error.
lid_ffile( $file )
Mnemonic: "Language and encoding identification... from file"
This function takes the path of a plain text $file as an argument and identifies the file's language and encoding. It returns a hash reference containing the results. See "IDENTIFICATION RESULTS DATA STRUCTURE" for details.
If an error occurs, the function returns undef and sets $Lingua::Lid::errstr to an appropriate message describing the error.
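When several files need to be identified, the results of lid_ffile() can be collected in one pass. The following is a minimal sketch only: group_by_isocode() is a hypothetical helper, not part of Lingua::Lid, and the key 'failed' is an arbitrary choice for paths that returned undef. The identification function is passed in as a code reference so the helper itself does not depend on the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Lid): identify each path with
# the given function and group the paths by ISO 639-3 code. Paths for
# which the function returned undef are collected under 'failed'.
sub group_by_isocode
{
    my ($identify, @paths) = @_;
    my %groups;

    foreach my $path (@paths)
    {
        my $r   = $identify->($path);
        my $key = defined $r ? $r->{isocode} : 'failed';
        push @{ $groups{$key} }, $path;
    }

    return \%groups;
}
```

With lid_ffile imported from Lingua::Lid, a call would look like group_by_isocode(\&lid_ffile, @ARGV).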
lid_version( )
This function returns the version of the underlying lid C library.
IDENTIFICATION RESULTS DATA STRUCTURE
The functions lid_fstr() and lid_ffile() return a hash reference containing the results of the language and encoding identification.
The hash reference contains the following keys:
- language: The language's name (in English), e.g. "German", "French", "English".
- isocode: The language's ISO 639-3 code, e.g. "deu", "fra", "eng".
- encoding: The character encoding, e.g. "UTF-8", "ISO-8859-1", "UTF-32BE".
$result = {
    'language' => 'English',
    'isocode'  => 'eng',
    'encoding' => 'ASCII'
};
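To illustrate how the three keys might be consumed, here is a small sketch that formats a result for display. The helper format_result() is hypothetical and not part of Lingua::Lid; the sample hash is copied from the structure shown above rather than produced by an actual call.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Lid): render a result hash
# reference as "language (isocode), encoding".
sub format_result
{
    my ($r) = @_;
    return 'unidentified' unless ref($r) eq 'HASH';

    # Fall back to '?' for any key that is missing or undefined.
    my ($language, $isocode, $encoding) =
        map { defined $r->{$_} ? $r->{$_} : '?' } qw/language isocode encoding/;

    return "$language ($isocode), $encoding";
}

# Sample data in the documented shape (not the output of a real call).
my $result = {
    'language' => 'English',
    'isocode'  => 'eng',
    'encoding' => 'ASCII'
};

print format_result($result), "\n";    # prints "English (eng), ASCII"
```

Passing undef, as returned by lid_fstr() or lid_ffile() on error, yields the string "unidentified" in this sketch.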
ERROR HANDLING
The functions lid_fstr() and lid_ffile() return undef if an error occurs and set Lingua::Lid's package variable $errstr ($Lingua::Lid::errstr) to an appropriate message describing the error.
Have a look at lid's manual for a list of all error messages.
- NOTE: The $Lingua::Lid::errstr variable is reset to undef whenever lid_fstr() or lid_ffile() is called.
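Because the error message is reset on every call, it is safest to copy it immediately after a failed call. The sketch below wraps that pattern and converts the undef return value into an exception. identify_or_die() is a hypothetical helper, not part of Lingua::Lid, and the wording of the exception text is an assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Lid): call an identification
# function and turn the undef/$errstr convention into an exception.
sub identify_or_die
{
    my ($identify, $input) = @_;

    my $result = $identify->($input);
    return $result if defined $result;

    # Copy the message right away: the next lid_fstr() or lid_ffile()
    # call resets $Lingua::Lid::errstr to undef.
    no warnings 'once';
    my $message = $Lingua::Lid::errstr;
    $message = 'unknown error' unless defined $message;

    die "identification failed: $message\n";
}
```

With lid_fstr imported from Lingua::Lid, a call such as identify_or_die(\&lid_fstr, $text) can then be wrapped in eval { ... } wherever exceptions are preferred over checking for undef.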
COMPARISON TO THE C INTERFACE
Lingua::Lid's functions lid_fstr() and lid_ffile() behave exactly like their lid counterparts in C.
The C functions lid_fnstr() and lid_fwstr() are not needed. Use the Lingua::Lid function lid_fstr() in any Perl code instead.
The C function lid_strerror() and the global C variable lid_errno are not needed. Rather than returning a NULL pointer, Lingua::Lid's lid_fstr() and lid_ffile() return undef on errors and set $Lingua::Lid::errstr to an appropriate message describing the error.
The C define LID_VERSION is not available in Lingua::Lid. Use lid_version() instead.
Lingua::Lid's results data structure sticks to the C lid_t * structure as closely as possible. See "IDENTIFICATION RESULTS DATA STRUCTURE" above.
EXAMPLES
use strict;
use warnings;
use Lingua::Lid qw/lid_fstr lid_version/;

# Report the versions of the extension and the underlying C library
print "Lingua::Lid v$Lingua::Lid::VERSION, using lid v",
    lid_version(), "\n";

my @strings =
(
    "This is a short English sentence.",
    "Dies ist ein kurzer deutscher Satz.",
    "Too short."
);

foreach my $string (@strings)
{
    # lid_fstr() returns a hash reference on success, undef on error
    if (my $r = lid_fstr($string))
    {
        print join(" - ", $r->{language}, $r->{isocode},
            $r->{encoding}), "\n";
    }
    else
    {
        print "lid_fstr() failed: $Lingua::Lid::errstr\n";
    }
}
The program above produces the following output:
Lingua::Lid v0.01, using lid v2.0.2
English - eng - ASCII
German - deu - ASCII
lid_fstr() failed: Insufficient input length
BUGS
None known.
Please report bugs either via CPAN's bug tracker or by email to <perl@lingua-systems.com>.
SEE ALSO
Lingua::Lid's website: http://www.lingua-systems.com/language-identifier/Lingua-Lid-Perl-extension/
lid's website: http://www.lingua-systems.com/language-identifier/lid-library/
lid's manual (available in English and German)
AUTHOR
Alex Linke, <alinke@lingua-systems.com>
COPYRIGHT AND LICENSE
Copyright (C) 2009 Lingua-Systems Software GmbH
This extension is free software. It may be used, redistributed and/or modified under the terms of the zlib license. For details, see the full text of the license in the file LICENSE.