NAME
Text::Ngram::LanguageDetermine - Guess the language of text using ngrams
SYNOPSIS
use Text::Ngram::LanguageDetermine;
NOTE: First build language profiles from source text. Sample text is easy to obtain from places like Wikipedia; the subject matter doesn't matter, but it should all be in the target language and saved in UTF-8 format.
my %lang_profiles = (
    english => create_language_profile(source_filename => 'english.txt'),
    french  => create_language_profile(source_filename => 'french.txt'),
    german  => create_language_profile(source_filename => 'german.txt'),
);
Get the profile of the text we're wondering about.
my $text_profile = create_text_profile(source_filename => 'query.txt');
Score the text profile against all of the language profiles.
my %scores = map { $_ => compare_profiles(language_profile => $lang_profiles{$_}, comparison_profile => $text_profile) } keys %lang_profiles;
The smallest score is the most likely answer. The score values themselves aren't relevant, only their ordering: lowest score = most likely, highest score = least likely.
print "Language is: " . (sort { $scores{$a} <=> $scores{$b} } keys %scores)[0] . "\n";
DESCRIPTION
This module guesses what language a document is written in by comparing an ngram profile of the query text against ngram profiles built from a large sample text of each language.
For each language, it counts the ngrams in the sample text, ranks them by frequency, and keeps only the most frequent ones, which removes most subject-specific ngrams. It then ranks the query text's ngrams by frequency in the same way and compares each ngram's position in the query ranking to its position in the language ranking. The sum of these differences is the "Out of Place" measure, which indicates how far the query text's ngrams are out of place with regard to a language's ngrams.
The language that produces the lowest "Out of Place" measure is most likely the language the text is written in.
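The comparison described above can be illustrated with a small standalone sketch. It is not this module's implementation: the helper name out_of_place, the zero-based ranks, the toy profiles, and the penalty value are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of this module's API.
# $lang_rank   : hash ref mapping ngram => rank (0 = most frequent; assumed)
# $text_ngrams : array ref of the text's ngrams, most frequent first
# $penalty     : distance charged for an ngram missing from the language profile
sub out_of_place {
    my ($lang_rank, $text_ngrams, $penalty) = @_;
    my $score = 0;
    for my $pos (0 .. $#{$text_ngrams}) {
        my $ngram = $text_ngrams->[$pos];
        if (exists $lang_rank->{$ngram}) {
            # Known ngram: add how far its two positions are apart.
            $score += abs($lang_rank->{$ngram} - $pos);
        }
        else {
            # Unknown ngram: add the fixed penalty.
            $score += $penalty;
        }
    }
    return $score;
}

# Toy profiles invented for illustration only.
my %english = (th => 0, he => 1, in => 2, er => 3);
my %german  = (en => 0, er => 1, ch => 2, de => 3);
my @text    = qw(he th er in);

# Assumed penalty: twice the number of ngrams in a profile.
my $penalty = 2 * scalar(keys %english);

printf "english: %d\n", out_of_place(\%english, \@text, $penalty);
printf "german:  %d\n", out_of_place(\%german,  \@text, $penalty);
```

With these toy inputs the English profile yields the lower total, because every ngram of the text appears in it close to its expected position, while three of the four ngrams are missing from the German profile and each incurs the penalty.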
This module was written after reading the paper "N-Gram-Based Text Categorization" by William B. Cavnar and John M. Trenkle, see: http://citeseer.csail.mit.edu/68861.html
BUGS
Might be some.
SUPPORT
Please send any patches or bug reports to the author via email.
AUTHOR
Rusty Conover
CPAN ID: RCONOVER
InfoGears Inc.
rconover@infogears.com
http://www.infogears.com
COPYRIGHT
Copyright 2005 InfoGears Inc. http://www.infogears.com All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
SEE ALSO
perl(1). Text::Ngram
create_language_profile
Usage : create_language_profile(source_filename => 'english.sample',
destination_filename => 'english.profile', frequency_cutoff => 300,
ngram_max_length => 5).
Purpose : This function creates a language profile for future
comparison to text. It is best to pass in a good 10k to 20k byte
sample of the language. It reads that data, creates ngrams of
various lengths from 1 to ngram_max_length and calculates the
frequency of each ngram throughout the entire text. After all of the
ngrams have been created and sorted by their frequency it truncates
the list to frequency_cutoff entries.
The frequency_cutoff serves the purpose of keeping only the ngrams
that aren't specific to the subject of the text. 300 seems to be a
good default but it is open to tuning.
ngram_max_length is the maximum length of an ngram to be generated.
Again this length is open to tuning, but 5 characters seems to be a
good number.
Returns : This function returns a hash reference of ngrams with
ngrams as the key and their frequency rank as the value.
Argument :
source_filename - the filename where to read the source text
source_data - a scalar that contains the source text
destination_filename - the filename where to store the profile using
Storable.
frequency_cutoff - the number of most frequent ngrams to keep
ngram_max_length - the maximum length of the ngrams.
Throws : uses Carp::confess to complain about errors.
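The profile-building steps described under Purpose can be sketched in a few lines of standalone Perl. This is only an illustration under stated assumptions, not the module's code: the helper name ranked_ngrams is hypothetical, ngrams are taken with a plain sliding window over the raw characters, ranks start at 0, and ties are broken alphabetically. The module itself may tokenize, rank, or break ties differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation.
# Returns a hash ref mapping ngram => frequency rank (0 = most frequent).
sub ranked_ngrams {
    my ($text, $max_len, $cutoff) = @_;

    # Count every ngram of length 1 .. $max_len with a sliding window.
    my %count;
    for my $len (1 .. $max_len) {
        for my $start (0 .. length($text) - $len) {
            $count{ substr($text, $start, $len) }++;
        }
    }

    # Most frequent first; alphabetical tie-break keeps the order stable.
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

    # Keep only the top $cutoff entries.
    splice(@sorted, $cutoff) if @sorted > $cutoff;

    my %rank;
    @rank{@sorted} = (0 .. $#sorted);
    return \%rank;
}

my $profile = ranked_ngrams('banana', 2, 3);
for my $ngram (sort { $profile->{$a} <=> $profile->{$b} } keys %$profile) {
    print "$profile->{$ngram}: $ngram\n";
}
```

For the string 'banana' with a maximum length of 2 and a cutoff of 3, the surviving ngrams are 'a', 'an' and 'n' in that order. A real sample of 10k to 20k bytes with a cutoff of 300 works the same way, only with many more candidates competing for the retained slots.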
create_text_profile
Usage : create_text_profile(source_filename => 'interesting.txt',
ngram_max_length => 5)
Purpose : This function creates the comparison profile for an
arbitrary piece of text. It calculates all of the ngrams for the text
and then sorts them by frequency.
Returns : An array reference containing the ngrams sorted by
frequency of occurrence in the passed text.
Argument :
source_filename - The filename where to read the source text
source_data - The data to use as the source passed as a scalar
ngram_max_length - The maximum ngram length
Throws : uses Carp::confess for errors.
compare_profiles
Usage : compare_profiles(comparison_profile
=> $compare_profile, language_profile => $lang_profile)
Purpose : This function compares a language profile to a text profile
and calculates a score determining if the text's ngram frequency
matches well with the language's frequency. This is called the
"Out-of-Place" measure.
Returns : An integer score measuring how much the ngrams in the text
profile are out of place with the ngrams in the language profile.
Argument :
comparison_profile - the comparison profile created by create_text_profile()
language_profile - the language profile created by create_language_profile()
ngram_not_found_distance - the distance value used for ngrams not
found in the language profile. By default this is 2 * the number of
ngrams in the language profile.
Throws : uses Carp::confess for bad arguments.
Comments : To determine which language the text is written in, the
best guess of this algorithm is the language with the lowest score
returned by this function.
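Since only the ordering of the scores matters, it can be convenient to look at the whole ordering rather than just the winner. The following standalone sketch shows one way to do that. The helper name rank_languages and the score values are made up for the example; in practice the scores would come from compare_profiles().

```perl
use strict;
use warnings;

# Hypothetical helper: order language names from best to worst guess.
# Lower score means a better match; ties are broken by name.
sub rank_languages {
    my ($scores) = @_;
    my @ranked = sort { $scores->{$a} <=> $scores->{$b} || $a cmp $b }
                 keys %$scores;
    return @ranked;
}

# Made-up scores for illustration only.
my %scores = (english => 1200, french => 4100, german => 3900);

my @ranked = rank_languages(\%scores);
print "Best guess: $ranked[0]\n";
print "Full order: @ranked\n";
```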