NAME
Alvis::TermTagger - Perl extension for tagging terms in a corpus
SYNOPSIS
use Alvis::TermTagger;
Alvis::TermTagger::termtagging($termlist, $outputfile);
DESCRIPTION
This module tags a corpus with terms. The corpus (given on STDIN) is a file with one sentence per line. The term list ($termlist) is a file containing one term per line. For each term, additional information (such as a canonical form or a semantic tag) can be given after the first column. Columns can be separated by either a colon or a vertical bar. Each line of the output file ($outputfile) contains the sentence number, the term, and the additional information, all separated by tab characters.
This module is mainly used in the Alvis NLP Platform.
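The input and output formats described above can be illustrated with a minimal sketch. The helpers parse_term_line and format_output_line are hypothetical names introduced for this example and are not part of Alvis::TermTagger; the sketch assumes the separator is a colon or a vertical bar, optionally surrounded by spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: split one term-list line into the term and its
# additional information. Assumes ':' or '|' separates the columns.
sub parse_term_line {
    my ($line) = @_;
    chomp $line;
    my ($term, @info) = split /\s*[:|]\s*/, $line;
    return ($term, @info);
}

# Hypothetical helper: build one output line (sentence number, term,
# additional information), joined by tab characters.
sub format_output_line {
    my ($sentence_id, $term, @info) = @_;
    return join("\t", $sentence_id, $term, @info);
}

my ($term, @info) = parse_term_line("amino acids | amino acid\n");
print format_output_line(3, $term, @info), "\n";
```

The sentence number 3 and the sample term are made up for the illustration.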
METHODS
termtagging()
termtagging($term_list_filename, $output_filename);
This is the main method of the module. It loads the term list ($term_list_filename) and tags the corpus (given on STDIN). It produces the list of matching terms, together with the sentence offset where each term was found and any additional information given in the term list. The output is written to the file $output_filename.
load_TermList()
load_TermList($term_list_filename, \@term_list);
This method loads the term list from the file $term_list_filename into the array given by reference (\@term_list). Each element of the term list is a reference to a two-element array (the term and its canonical form).
get_Regex_TermList()
get_Regex_TermList(\@term_list, \@regex_term_list);
This method generates the regular expressions from the term list (\@term_list) and stores them in the array given by reference (\@regex_term_list).
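As a rough illustration of turning a term list into regular expressions, the sketch below escapes each word of a term and allows any run of whitespace between words. This is an assumed strategy chosen for the example, not necessarily the one used by get_Regex_TermList(); the sample terms are invented.

```perl
use strict;
use warnings;

# Hypothetical list in the shape described for load_TermList():
# each element is [ term, canonical form ].
my @term_list = (
    [ 'amino acid', 'amino acid' ],
    [ 'E. coli',    'Escherichia coli' ],
);
my @regex_term_list;

for my $i (0 .. $#term_list) {
    # Escape regex metacharacters word by word, then let any run of
    # whitespace match between the words.
    my $pattern = join '\s+', map { quotemeta($_) } split ' ', $term_list[$i][0];
    $regex_term_list[$i] = qr/\b$pattern\b/i;   # same index as the term
}

print "match\n" if 'Growth of E.  coli cells' =~ $regex_term_list[1];
```

Keeping the compiled expressions at the same index as their terms means a match can be mapped back to the term and its canonical form without a second lookup.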
load_Corpus()
load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
This method loads the corpus ($corpus_filename) into a hash table (\%corpus) and prepares a lower-case version of the corpus, recorded in a separate hash table (\%lc_corpus).
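A minimal sketch of this loading step follows. It assumes that the hash keys are sentence numbers starting at 1, which the text above does not state, and it reads from an in-memory filehandle instead of a real corpus file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the corpus file: one sentence per line,
# read here from an in-memory filehandle.
my $text = "Amino Acids are organic compounds.\nE. coli is a bacterium.\n";
open my $fh, '<', \$text or die "cannot open in-memory corpus: $!";

my (%corpus, %lc_corpus);
my $sentence_id = 0;
while (my $line = <$fh>) {
    chomp $line;
    $sentence_id++;                       # assumption: ids start at 1
    $corpus{$sentence_id}    = $line;     # original casing
    $lc_corpus{$sentence_id} = lc $line;  # lower-case copy for matching
}
close $fh;

print "$_\t$lc_corpus{$_}\n" for sort { $a <=> $b } keys %lc_corpus;
```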
corpus_Indexing()
corpus_Indexing(\%lc_corpus, \%corpus_index);
This method indexes the lower-case version of the corpus (\%lc_corpus) by word. The index (\%corpus_index) is a hash table given by reference.
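One possible shape for such a word index is sketched below: each word maps to the set of sentence ids in which it occurs. Both the index layout and the tokenisation on non-word characters are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical lower-case corpus: sentence id => sentence.
my %lc_corpus = (
    1 => 'amino acids are organic compounds.',
    2 => 'e. coli is a bacterium.',
);

# Index: word => { sentence id => 1 }. Splitting on non-word
# characters is an assumption of this sketch.
my %corpus_index;
for my $id (keys %lc_corpus) {
    for my $word (grep { length $_ } split /\W+/, $lc_corpus{$id}) {
        $corpus_index{$word}{$id} = 1;
    }
}

print join(',', sort keys %{ $corpus_index{coli} }), "\n";
```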
print_corpus_index()
print_corpus_index(\%corpus_index);
This method prints the corpus index (\%corpus_index) to STDERR.
term_Selection()
term_Selection(\%corpus_index, \@term_list, \%idtrm_select);
This method selects the terms of the term list (\@term_list) that potentially appear in the corpus, using the corpus index (\%corpus_index). The results are recorded in the hash table \%idtrm_select.
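The selection step can be thought of as a cheap filter applied before the more expensive regular-expression matching. The sketch below keeps a term only if every one of its words occurs in the index; this criterion, and the use of term positions as keys of %idtrm_select, are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical index: word => { sentence id => 1 }.
my %corpus_index;
$corpus_index{$_}{1} = 1 for qw(amino acids are organic compounds);

# Hypothetical term list: [ term, canonical form ].
my @term_list = (
    [ 'amino acids',   'amino acid' ],
    [ 'nucleic acids', 'nucleic acid' ],
);

# Keep a term only if every one of its words occurs in the index.
# Keys of %idtrm_select are positions in @term_list (an assumption).
my %idtrm_select;
for my $i (0 .. $#term_list) {
    my @words   = split ' ', lc $term_list[$i][0];
    my @missing = grep { !exists $corpus_index{$_} } @words;
    $idtrm_select{$i} = 1 unless @missing;
}

print join(',', sort keys %idtrm_select), "\n";
```

A term that passes this filter is only a candidate: its words may occur in different sentences, or not next to each other, so the tagging step still has to confirm the match.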
term_tagging_offset()
term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename);
This method tags the corpus (\%corpus) with the terms that were selected in a previous step (\%idtrm_select). The terms come from the term list (\@term_list); \@regex_term_list is the same term list expressed as regular expressions. The selected terms are written, with their offset and additional information, to the file $output_filename.
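The tagging loop described above might look roughly like the following sketch. It is not the module's implementation: the data are invented, the sentence id stands in for the offset, and the output goes to an in-memory filehandle rather than to $output_filename.

```perl
use strict;
use warnings;

# Hypothetical inputs in the shapes described in this section.
my %corpus = (
    1 => 'Amino acids are organic compounds.',
    2 => 'E. coli is a bacterium.',
);
my @term_list = (
    [ 'amino acids',   'amino acid' ],
    [ 'nucleic acids', 'nucleic acid' ],
);
my @regex_term_list = map { my $t = quotemeta $_->[0]; qr/\b$t\b/i } @term_list;
my %idtrm_select    = ( 0 => 1 );   # only the first term was selected

# Write to an in-memory "file" so the sketch needs no real path.
my $output = '';
open my $out, '>', \$output or die "cannot open in-memory output: $!";
for my $sentence_id (sort { $a <=> $b } keys %corpus) {
    for my $i (sort { $a <=> $b } keys %idtrm_select) {
        next unless $corpus{$sentence_id} =~ $regex_term_list[$i];
        # sentence id, term, additional information, tab-separated
        print {$out} join("\t", $sentence_id, @{ $term_list[$i] }), "\n";
    }
}
close $out;

print $output;
```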
term_tagging_offset_tab()
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results);
or
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results);
This method tags the corpus (\%corpus) with the terms that were selected in a previous step (\%idtrm_select). The terms come from the term list (\@term_list); \@regex_term_list is the same term list expressed as regular expressions. The selected terms are recorded, with their offset and additional information, either in the array @tab_results (each value holds the sentence id, the selected term and the additional information, separated by tab characters) or in the hash table %tabh_results (each key has the form "sentenceid_selectedterm", and each value is an array reference containing the sentence id, the selected term and the additional information).
printMatchingTerm()
printMatchingTerm($descriptor, $ref_matching_term, $sentence_id);
This method prints the sentence id ($sentence_id) and the matching term (given by the reference $ref_matching_term) to the file descriptor $descriptor. Both values are printed on one line, separated by a tab character.
printMatchingTerm_tab()
printMatchingTerm_tab($ref_matching_term, $sentence_id, $ref_tab_results);
This method stores the sentence id ($sentence_id) and the matching term (given by the reference $ref_matching_term) in $ref_tab_results, which can refer to an array or a hash table. For an array, both values are concatenated into one line, separated by a tab character. For a hash table, both values are stored in an array, and the hash key is the concatenation of the sentence id and the matching term.
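The two storage modes can be illustrated with a small sketch. The function store_match is a hypothetical stand-in written for this example, not the module's printMatchingTerm_tab(); it assumes the matching term is a reference to a [ term, canonical form ] array and uses the "sentenceid_selectedterm" key format described for term_tagging_offset_tab().

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the behaviour described above: store a
# match in an array (tab-joined line) or a hash (array reference).
sub store_match {
    my ($ref_matching_term, $sentence_id, $ref_results) = @_;
    my @fields = ($sentence_id, @$ref_matching_term);
    if (ref($ref_results) eq 'ARRAY') {
        push @$ref_results, join("\t", @fields);
    }
    elsif (ref($ref_results) eq 'HASH') {
        # Key format "sentenceid_selectedterm".
        my $key = $sentence_id . '_' . $ref_matching_term->[0];
        $ref_results->{$key} = \@fields;
    }
    return;
}

my (@tab_results, %tabh_results);
my $match = [ 'amino acids', 'amino acid' ];   # invented sample match
store_match($match, 3, \@tab_results);
store_match($match, 3, \%tabh_results);

print "$tab_results[0]\n";
print join('|', @{ $tabh_results{'3_amino acids'} }), "\n";
```

The array form keeps every match in order of discovery, while the hash form keeps one entry per sentence and term, so a repeated match overwrites the earlier one in this sketch.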
SEE ALSO
Alvis web site: http://www.alvis.info
AUTHORS
Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr>
LICENSE
Copyright (C) 2006 by Thierry Hamon
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.