NAME
Alvis::TermTagger - Perl extension for tagging terms in a corpus
SYNOPSIS
use Alvis::TermTagger;
Alvis::TermTagger::termtagging($termlist, $outputfile);
DESCRIPTION
This module tags a corpus with terms. The corpus, read from STDIN, is a file with one sentence per line. The term list ($termlist) is a file containing one term per line. For each term, additional information (such as a canonical form) can be given after a column. Each line of the output file ($outputfile) contains the sentence number, the term, and the additional information, separated by tab characters.
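To illustrate the output format described above, the following sketch splits one output line back into its three fields. The sample line is invented for illustration and is not taken from real output of the module.

```perl
use strict;
use warnings;

# Sample output line (invented values): sentence number, term, additional information.
my $line = "3\tgene expression\tgene expression\n";

chomp $line;
# The -1 limit keeps a trailing empty field when no additional information is present.
my ($sentence_id, $term, $info) = split /\t/, $line, -1;
print "sentence $sentence_id: $term ($info)\n";
```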
This module is mainly used in the Alvis NLP Platform.
METHODS
termtagging()
termtagging($term_list_filename, $output_filename);
This is the main method of the module. It loads the term list ($term_list_filename) and tags the corpus ($corpus_filename). It produces the list of matching terms together with the sentence offset where each term is found and any additional information given in the input file. This output is written to the file $output_filename.
load_TermList()
load_TermList($term_list_filename,\@term_list);
This method loads the term list ($term_list_filename is the file name) into the array given by reference (\@term_list). Each element of the term list is a reference to a two-element array (the term and its canonical form).
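The structure described above can be pictured as follows. This is a minimal sketch of the in-memory shape only; the sample terms are invented, and the list is built by hand rather than loaded by load_TermList().

```perl
use strict;
use warnings;

# Assumed in-memory shape of the term list: one array reference per term,
# holding the term and its canonical form (sample values are invented).
my @term_list = (
    [ 'transcription factors', 'transcription factor' ],
    [ 'B. subtilis',           'Bacillus subtilis'    ],
);

# Walk the list and dereference each two-element entry.
for my $entry (@term_list) {
    my ($term, $canonical_form) = @{$entry};
    print "$term => $canonical_form\n";
}
```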
get_Regex_TermList()
get_Regex_TermList(\@term_list, \@regex_term_list);
This method generates regular expressions from the term list (\@term_list) and stores them in the array given by reference (\@regex_term_list).
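One plausible way to derive a regular expression from a term is sketched below. The helper term_to_regex is hypothetical, and the choices made here (case-insensitive matching, flexible whitespace, word-boundary checks) are assumptions, not necessarily what get_Regex_TermList() does.

```perl
use strict;
use warnings;

# Hypothetical helper: turns a term into a case-insensitive regex that
# tolerates any run of whitespace between the words of the term and
# refuses matches glued to a neighbouring word character.
sub term_to_regex {
    my ($term) = @_;
    my $pattern = join '\s+', map { quotemeta($_) } split /\s+/, $term;
    return qr/(?<!\w)$pattern(?!\w)/i;
}

my @term_list       = ( [ 'gene expression', 'gene expression' ] );
my @regex_term_list = map { term_to_regex($_->[0]) } @term_list;

print "match\n" if 'Gene   Expression levels' =~ $regex_term_list[0];
```

Escaping each word with quotemeta keeps punctuation in a term, such as the dot in an abbreviation, from being read as a regex metacharacter.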
load_Corpus()
load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
This method loads the corpus ($corpus_filename) into a hash table (\%corpus) and prepares a lower-case version of the corpus, recorded in a separate hash table (\%lc_corpus).
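The two hash tables can be sketched as follows. The code reads sample sentences from an in-memory filehandle rather than a file; numbering sentences from 1 and using that number as the hash key are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = "The SpoIIE protein is required.\nGene Expression was measured.\n";
my (%corpus, %lc_corpus);

# Read the sample text through an in-memory filehandle, one sentence per line.
open my $fh, '<', \$text or die "Cannot open in-memory corpus: $!";
my $sentence_id = 0;
while (my $sentence = <$fh>) {
    chomp $sentence;
    $sentence_id++;                           # assumed numbering: 1, 2, 3, ...
    $corpus{$sentence_id}    = $sentence;     # original text
    $lc_corpus{$sentence_id} = lc $sentence;  # lower-case copy used for matching
}
close $fh;
```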
corpus_Indexing()
corpus_Indexing(\%lc_corpus, \%corpus_index);
This method indexes the lower-case version of the corpus (\%lc_corpus) according to its words. The index (\%corpus_index) is a hash table given by reference.
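A word index of this kind can be sketched as an inverted index from words to sentence ids. The nested-hash layout and the tokenisation on non-word characters are assumptions; the module's actual index structure may differ.

```perl
use strict;
use warnings;

my %lc_corpus = (
    1 => 'the spoiie protein is required.',
    2 => 'gene expression was measured.',
);
my %corpus_index;

# Map each word to the set of sentences in which it occurs.
for my $sentence_id (keys %lc_corpus) {
    for my $word (split /\W+/, $lc_corpus{$sentence_id}) {
        next if $word eq '';    # skip the empty field produced by leading punctuation
        $corpus_index{$word}{$sentence_id} = 1;
    }
}
```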
term_Selection()
term_Selection(\%corpus_index, \@term_list, \%idtrm_select);
This method selects the terms from the term list (\@term_list) that potentially appear in the corpus (that is, in the indexed corpus \%corpus_index). Results are recorded in the hash table \%idtrm_select.
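The selection step can be sketched as a cheap filter that runs before the more expensive regex matching. The criterion used here (every word of the term must occur somewhere in the index) and keying the result by term position are assumptions made for illustration.

```perl
use strict;
use warnings;

my %corpus_index = (
    gene       => { 2 => 1 },
    expression => { 2 => 1 },
    protein    => { 1 => 1 },
);
my @term_list = (
    [ 'gene expression', 'gene expression' ],
    [ 'sigma factor',    'sigma factor'    ],
);
my %idtrm_select;

# Keep the position of every term whose words all occur in the index.
for my $term_id (0 .. $#term_list) {
    my @words   = split /\s+/, lc $term_list[$term_id][0];
    my $missing = grep { !exists $corpus_index{$_} } @words;
    $idtrm_select{$term_id} = 1 unless $missing;
}
```

A term that passes this filter is only a candidate: its words may occur in different sentences, which is why a matching step still follows.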
term_tagging_offset()
term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename);
This method tags the corpus \%corpus with the terms from the term list \@term_list (\@regex_term_list is the term list as regular expressions) that were selected in a previous step (\%idtrm_select). The selected terms are recorded, with their offset and additional information, in the file $output_filename.
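The tagging step can be sketched as matching each selected regular expression against each sentence and writing one tab-separated line per match. This is an illustration under assumptions, not the module's code: the output goes to an in-memory string instead of $output_filename, and the sample data are invented.

```perl
use strict;
use warnings;

my @term_list       = ( [ 'gene expression', 'gene expression' ] );
my @regex_term_list = ( qr/gene\s+expression/i );
my %idtrm_select    = ( 0 => 1 );
my %corpus          = (
    1 => 'The protein is required.',
    2 => 'Gene expression was measured.',
);

# Collect the output in a string instead of a real file, for illustration.
my $output = '';
open my $out, '>', \$output or die "Cannot open in-memory output: $!";
for my $sentence_id (sort { $a <=> $b } keys %corpus) {
    for my $term_id (sort { $a <=> $b } keys %idtrm_select) {
        next unless $corpus{$sentence_id} =~ $regex_term_list[$term_id];
        # One line per match: sentence id, term, additional information.
        print {$out} join("\t", $sentence_id, @{ $term_list[$term_id] }), "\n";
    }
}
close $out;
```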
term_tagging_offset_tab()
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results);
or
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results);
This method tags the corpus \%corpus with the terms from the term list \@term_list (\@regex_term_list is the term list as regular expressions) that were selected in a previous step (\%idtrm_select). The selected terms are recorded, with their offset and additional information, either in the array @tab_results (each value holds the sentence id, the selected term and the additional information, separated by tab characters) or in the hash table %tabh_results (keys have the form "sentenceid_selectedterm"; each value is an array reference containing the sentence id, the selected term and the additional information).
printMatchingTerm()
printMatchingTerm($descriptor, $ref_matching_term, $sentence_id);
This method prints the sentence id ($sentence_id) and the matching term (given by its reference $ref_matching_term) to the file descriptor $descriptor. Both values are written on one line, separated by a tab character.
printMatchingTerm_tab()
printMatchingTerm_tab($ref_matching_term, $sentence_id, $ref_tab_results);
This method stores the sentence id ($sentence_id) and the matching term (given by its reference $ref_matching_term) into $ref_tab_results, which can be a reference to an array or to a hash table. For an array, both values are concatenated into one line, separated by a tab character. For a hash table, both values are stored in an array, and the hash key is the concatenation of the sentence id and the matching term.
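The array and hash table cases described above can be sketched by dispatching on the type of the reference. The helper store_match is hypothetical and only mirrors the documented behaviour. The underscore in the hash key follows the "sentenceid_selectedterm" form documented for term_tagging_offset_tab().

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the behaviour described above.
sub store_match {
    my ($ref_matching_term, $sentence_id, $ref_results) = @_;
    my @fields = ($sentence_id, @{$ref_matching_term});
    if (ref($ref_results) eq 'ARRAY') {
        # Array case: one tab-separated line per match.
        push @{$ref_results}, join("\t", @fields);
    }
    elsif (ref($ref_results) eq 'HASH') {
        # Hash case: key built from sentence id and term, value is an array reference.
        my $key = $sentence_id . '_' . $ref_matching_term->[0];
        $ref_results->{$key} = \@fields;
    }
    return;
}

my (@tab_results, %tabh_results);
store_match([ 'gene expression', 'gene expression' ], 2, \@tab_results);
store_match([ 'gene expression', 'gene expression' ], 2, \%tabh_results);
```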
SEE ALSO
Alvis web site: http://www.alvis.info
AUTHORS
Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr>
LICENSE
Copyright (C) 2006 by Thierry Hamon
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.