NAME
Lingua::NATools::Lexicon - Encapsulates NATools Lexicon files
SYNOPSIS
use Lingua::NATools::Lexicon;
$lex = Lingua::NATools::Lexicon->new("file.lex");
$word = $lex->word_from_id(2);
$id = $lex->id_from_word("cavalo");
$ids = $lex->sentence_to_ids("era uma vez um gato maltez"); # array reference
$sentence = $lex->ids_to_sentence(10,2,3,2,5,4,3,2,5);
$lex->size;
$lex->id_count(2);
$lex->close;
DESCRIPTION
This module encapsulates NATools lexicon files, making them accessible from Perl through an object-oriented interface. First, open a lexicon file:
$lex = Lingua::NATools::Lexicon->new("lexicon.file.lex");
When you are done, do not forget to close it. Closing frees memory and makes room for opening other lexicon files.
$lex->close;
Lexicon files map words to identifiers and vice versa. Usage is simple: use
$lex->id_from_word($word)
to get an id for a word. Use
$lex->word_from_id($id)
to get the word back from its id. If you need many conversions at once, to build or to parse a sentence, use ids_to_sentence or sentence_to_ids, respectively.
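The round trip between a sentence and its identifiers can be sketched as follows. The round_trip helper is hypothetical (it is not part of this module) and takes an already opened lexicon object. It relies only on the behaviour documented for sentence_to_ids (returns an array reference) and ids_to_sentence (takes a flat list).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools::Lexicon.
# $lex is an already opened lexicon object.
sub round_trip {
    my ($lex, $sentence) = @_;
    my $ids = $lex->sentence_to_ids($sentence);   # array reference
    return $lex->ids_to_sentence(@$ids);          # expects a flat list
}

# Usage, assuming "file.lex" exists:
#   my $lex = Lingua::NATools::Lexicon->new("file.lex");
#   print round_trip($lex, "era uma vez um gato"), "\n";
#   $lex->close;
```

How words missing from the lexicon are mapped is not documented here, so the output may differ from the input for such words.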
new
This is the Lingua::NATools::Lexicon constructor. Pass it the path to a lexicon file. These files usually have a .lex extension:
my $lexicon = Lingua::NATools::Lexicon->new("file.lex");
save
This method saves the current lexicon object to the supplied file:
$lexicon->save("/there/lexicon.lex");
close
Call this method to close a lexicon. This is important to free resources: both memory and lexicon slots, since only a limited number of lexicons can be open at a time.
$lexicon->close;
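Because an unclosed lexicon keeps its resources, it can help to pair new and close in one place. The sketch below uses a hypothetical with_lexicon helper (not part of this module) that opens a file, runs a callback, and always closes the lexicon, even if the callback dies. It assumes that a failed new either dies or returns a false value; this documentation does not say which.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools::Lexicon.
# Opens $file, passes the lexicon to $code, and closes it afterwards.
sub with_lexicon {
    my ($file, $code) = @_;
    require Lingua::NATools::Lexicon;
    my $lex = Lingua::NATools::Lexicon->new($file)
        or die "could not open lexicon '$file'\n";   # assumption: false on failure

    my @result;
    my $ok  = eval { @result = $code->($lex); 1 };   # callback runs in list context
    my $err = $@;
    $lex->close;                                     # always release the lexicon
    die $err unless $ok;                             # re-throw the callback's error
    return wantarray ? @result : $result[0];
}

# Usage, assuming "file.lex" exists:
#   my $types = with_lexicon("file.lex", sub { $_[0]->size });
```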
word_from_id
This method converts a word identifier (word-id) to the corresponding word:
my $word = $lexicon->word_from_id($word_id);
ids_to_sentence
This method calls word_from_id for each argument. It receives a list of word identifiers and returns the corresponding string, with words separated by a single space character.
my $sentence = $lexicon->ids_to_sentence(1,3,5,2,3,6);
id_from_word
This method is used to convert one word to its corresponding identifier (word-id).
my $word_id = $lexicon->id_from_word( $word );
sentence_to_ids
This method calls id_from_word for each word in a sentence. Note that it does not tokenize: it just splits the sentence on the space character, so you must first preprocess the string with an NLP tokenizer.
The method returns a reference to the list of identifiers.
my $wid_list = $lexicon->sentence_to_ids("a sentence");
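Since the method only splits on spaces, punctuation glued to a word would be looked up as part of that word. The following is a deliberately naive pre-processing sketch; naive_tokenize is a hypothetical helper, not part of this module and not a substitute for a real tokenizer. It also splits hyphens and apostrophes, which may or may not match how the corpus behind your lexicon was tokenized.

```perl
use strict;
use warnings;

# Hypothetical pre-processing step, not part of Lingua::NATools::Lexicon:
# isolate punctuation and normalise whitespace so that a plain split on
# spaces yields one token per word or punctuation mark.
sub naive_tokenize {
    my ($text) = @_;
    $text =~ s/([[:punct:]])/ $1 /g;   # put spaces around each punctuation character
    $text =~ s/\s+/ /g;                # collapse runs of whitespace into one space
    $text =~ s/^ | $//g;               # trim leading and trailing space
    return $text;
}

# Usage with an opened lexicon $lexicon:
#   my $wid_list = $lexicon->sentence_to_ids(naive_tokenize("Era uma vez, um gato."));
```

For reliable lookups, the tokenization applied here should be the same one that was applied to the corpus from which the lexicon was built.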
id_count
This method returns the number of occurrences for a specific word. Note that the word must be supplied as its identifier, and not the string itself.
my $count = $lexicon->id_count( 45 );
occurrences
This method returns the size of the corpus (number of tokens) from which the lexicon was built: it sums the occurrence counts of all words and returns the total.
my $total = $lexicon->occurrences;
size
This method returns the number of different words (types) in the corpus from which the lexicon was built.
my $size = $lexicon->size;
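Taken together, id_count, occurrences and size are enough for simple corpus statistics. The two helpers below are hypothetical (not part of this module) and use only those three documented methods: one computes the relative frequency of a word, the other the type/token ratio of the corpus.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Lingua::NATools::Lexicon.
# $lex is an already opened lexicon object.

# Relative frequency of the word with identifier $id: count / total tokens.
sub relative_frequency {
    my ($lex, $id) = @_;
    my $total = $lex->occurrences;
    return 0 unless $total;            # avoid division by zero
    return $lex->id_count($id) / $total;
}

# Type/token ratio: number of distinct words / total tokens.
sub type_token_ratio {
    my ($lex) = @_;
    my $total = $lex->occurrences;
    return 0 unless $total;
    return $lex->size / $total;
}

# Usage with an opened lexicon $lexicon:
#   my $id = $lexicon->id_from_word("cavalo");
#   printf "%.6f\n", relative_frequency($lexicon, $id);
```

Since occurrences sums over all words on each call, keep its result in a variable if you need it for many words.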
add_word
This method adds a new word to the lexicon. The word is created with an occurrence count of 1.
Note that lexicon files cannot be created from scratch with this module, which is intended to manipulate existing lexicon files. A standard lexicon file has no room for new words, so you need to enlarge it first. Use the size method to find the current size, and the enlarge method to add some empty space.
$lexicon->add_word("dog");
set_id_count
After adding a new word (or for an existing one) you might want to change its occurrence count. Call this method with the word identifier and the new occurrence count.
This method is permissive: it lets you set a negative occurrence count. Setting an occurrence count to 0 does not delete the word entry.
$lexicon->set_id_count( $wid, ++$count);
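A common use is to increase the count of a word by some amount. The bump_count helper below is hypothetical (not part of this module) and combines id_from_word, id_count and set_id_count. It assumes the word is already in the lexicon, because the result of id_from_word for unknown words is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools::Lexicon.
# Adds $by (default 1) to the occurrence count of $word and returns
# the new count. Assumes $word is already present in the lexicon.
sub bump_count {
    my ($lex, $word, $by) = @_;
    $by = 1 unless defined $by;
    my $id  = $lex->id_from_word($word);
    my $new = $lex->id_count($id) + $by;
    $lex->set_id_count($id, $new);
    return $new;
}

# Usage with an opened lexicon $lexicon:
#   bump_count($lexicon, "dog");       # one more occurrence
#   bump_count($lexicon, "dog", 10);   # ten more occurrences
```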
enlarge
This method creates extra space for new words. You do not need to know the current size, just the number of words you want to add: pass that number as the argument. The returned object should accommodate that many additional words. Try to call this method as few times as possible: first work out how many words you need to add, then enlarge the lexicon once.
$lexicon->enlarge( 100 ); # 100 more words
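The advice to enlarge once can be sketched as a batch insert. The add_words helper below is hypothetical (not part of this module). Because the text above mentions a returned object while the example ignores the return value, the sketch uses the return value of enlarge only if it is an object with an add_word method, and otherwise keeps working on the original lexicon. It does not check whether a word is already present, so pass only new words.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper, not part of Lingua::NATools::Lexicon.
# Enlarges the lexicon once for the whole batch, then adds each word.
# Returns the lexicon object the words were added to.
sub add_words {
    my ($lex, @words) = @_;
    return $lex unless @words;

    my $ret = $lex->enlarge(scalar @words);   # a single call for the whole batch

    # Assumption: enlarge may or may not hand back a lexicon object.
    my $target = (blessed($ret) && $ret->can('add_word')) ? $ret : $lex;

    $target->add_word($_) for @words;
    return $target;
}

# Usage with an opened lexicon $lexicon:
#   $lexicon = add_words($lexicon, qw(dog cat horse));
#   $lexicon->save("/there/lexicon.lex");
```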
SEE ALSO
See perl(1) and NATools documentation.
AUTHOR
Alberto Manuel Brandao Simoes, <albie@alfarrabio.di.uminho.pt>
COPYRIGHT AND LICENSE
Copyright 2002-2012 by NATURA Project
This library is free software; you can redistribute it and/or modify it under the GNU General Public License version 2, a copy of which you should find in the parent directory. This module should be distributed together with the whole NATools package, including the respective copyright notice.