NAME

Alvis::TermTagger - Perl extension for tagging terms in a corpus

SYNOPSIS

use Alvis::TermTagger;

Alvis::TermTagger::termtagging($termlist, $outputfile);

DESCRIPTION

This module tags a corpus with terms. The corpus, read from STDIN, contains one sentence per line. The term list ($termlist) is a file containing one term per line; additional information about a term can follow it after a colon. Each line of the output file ($outputfile) contains the sentence number and the matching term, separated by a tab character.
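The two file formats described above can be illustrated with a minimal sketch. It assumes that the additional information follows the first colon on the line; parse_term_line and format_output_line are hypothetical helpers written for this example and are not part of Alvis::TermTagger.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): split one term-list line
# into the term and the optional information after the first colon.
sub parse_term_line {
    my ($line) = @_;
    chomp $line;
    my ($term, $info) = split /\s*:\s*/, $line, 2;
    return ($term, defined $info ? $info : '');
}

# Hypothetical helper: build one output record,
# "sentence number<TAB>term", terminated by a newline.
sub format_output_line {
    my ($sentence_number, $term) = @_;
    return join("\t", $sentence_number, $term) . "\n";
}

my ($term, $info) = parse_term_line("gene expression : some information\n");
print format_output_line(3, $term);
```

Only the first colon is treated as the separator here, so any colon inside the additional information is preserved.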

This module is mainly used in the Alvis NLP Platform.

METHODS

termtagging()

termtagging($term_list_filename, $output_filename);

This is the main method of the module. It loads the term list ($term_list_filename) and tags the corpus, which is read from STDIN. It produces the list of matching terms together with the offset of the sentence in which each term is found, and writes this output to the file $output_filename.

load_TermList()

load_TermList($term_list_filename,\@term_list);

This method loads the term list from the file $term_list_filename into the array passed by reference (\@term_list).

get_Regex_TermList()

get_Regex_TermList(\@term_list, \@regex_term_list);

This method generates regular expressions from the term list (\@term_list) and stores them in the array given by reference (\@regex_term_list).
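The exact expressions built by the module are not documented here. As a rough sketch of the idea, the hypothetical function build_term_regexes below lower-cases each term, escapes regex metacharacters in every word, and allows any run of whitespace between words. These choices are assumptions made for illustration, not the module's actual rules.

```perl
use strict;
use warnings;

# Hypothetical sketch: build one regex source string per term.
# Assumptions: terms are lower-cased, metacharacters are escaped,
# and words may be separated by any amount of whitespace.
sub build_term_regexes {
    my ($term_list, $regex_term_list) = @_;
    for my $term (@$term_list) {
        my @words = map { quotemeta($_) } split ' ', lc($term);
        push @$regex_term_list, join('\s+', @words);
    }
    return;
}

my @term_list = ('Gene expression', 'p53 (protein)');
my @regex_term_list;
build_term_regexes(\@term_list, \@regex_term_list);
print "$_\n" for @regex_term_list;
```

Keeping the two arrays parallel means that a position in \@term_list identifies both the term and its expression.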

load_Corpus()

load_Corpus($corpus_filename, \%corpus, \%lc_corpus);

This method loads the corpus ($corpus_filename) into a hash table (\%corpus) and stores a lower-case copy of it in a second hash table (\%lc_corpus).
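A minimal sketch of this loading step follows. It assumes that sentences are keyed by their line number, starting at 1, and it reads from a filehandle rather than a file name so that it can run on an in-memory string. The function load_sentences is hypothetical and is not the module's load_Corpus().

```perl
use strict;
use warnings;

# Hypothetical sketch: store each line under its sentence number,
# once as read and once in lower case.
sub load_sentences {
    my ($fh, $corpus, $lc_corpus) = @_;
    my $sent_id = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $sent_id++;
        $corpus->{$sent_id}    = $line;
        $lc_corpus->{$sent_id} = lc($line);
    }
    return $sent_id;    # number of sentences read
}

my $text = "Gene Expression rises.\nNo term here.\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my (%corpus, %lc_corpus);
my $count = load_sentences($fh, \%corpus, \%lc_corpus);
close $fh;
print "$count sentences loaded\n";
```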

corpus_Indexing()

corpus_Indexing(\%lc_corpus, \%corpus_index);

This method indexes the lower-case version of the corpus (\%lc_corpus) by its words. The index (\%corpus_index) is a hash table given by reference.
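One plausible shape for such an index is an inverted index that maps each word to the set of sentences containing it. The sketch below assumes that layout and a simple split on non-word characters. The function index_corpus is hypothetical, and the module's own index structure may differ.

```perl
use strict;
use warnings;

# Hypothetical sketch: map each word to the set of sentence
# identifiers in which it occurs (word => { sentence id => 1 }).
sub index_corpus {
    my ($lc_corpus, $corpus_index) = @_;
    for my $sent_id (keys %$lc_corpus) {
        for my $word (split /\W+/, $lc_corpus->{$sent_id}) {
            next if $word eq '';    # leading separator yields an empty field
            $corpus_index->{$word}{$sent_id} = 1;
        }
    }
    return;
}

my %lc_corpus = (
    1 => 'gene expression rises.',
    2 => 'the gene is silent.',
);
my %corpus_index;
index_corpus(\%lc_corpus, \%corpus_index);
print join(',', sort keys %{ $corpus_index{gene} }), "\n";
```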

term_Selection()

term_Selection(\%corpus_index, \@term_list, \%idtrm_select);

This method selects the terms of the term list (\@term_list) that may appear in the corpus, using the corpus index (\%corpus_index). The results are recorded in the hash table \%idtrm_select.
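The selection criterion is not spelled out here. As an illustration only, the sketch below keeps a term when every one of its words occurs in the index, and records the selected terms by their position in the term list. Both the criterion and the function select_terms are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical sketch: a term is a candidate only if all of its
# words are present in the corpus index.
sub select_terms {
    my ($corpus_index, $term_list, $selected) = @_;
    for my $id (0 .. $#$term_list) {
        my @words = grep { $_ ne '' } split /\W+/, lc($term_list->[$id]);
        next unless @words;
        next if grep { !exists $corpus_index->{$_} } @words;
        $selected->{$id} = 1;
    }
    return;
}

my %corpus_index = (gene => { 1 => 1 }, expression => { 1 => 1 });
my @term_list    = ('gene expression', 'protein folding');
my %idtrm_select;
select_terms(\%corpus_index, \@term_list, \%idtrm_select);
print join(',', sort keys %idtrm_select), "\n";
```

A filter of this kind is cheap and avoids running regular expressions for terms that cannot match.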

term_tagging_offset()

term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename);

This method tags the corpus (\%corpus) with the terms of the term list (\@term_list) that were selected in the previous step (\%idtrm_select); \@regex_term_list holds the regular expressions built from the term list. The matching terms are recorded with their offsets in the file $output_filename.
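The matching step might look like the following sketch. It assumes case-insensitive matching on whole words and returns the records as "sentence number<TAB>term" strings instead of writing them to a file. The function tag_sentences is hypothetical and does not reproduce the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical sketch: apply the regex of each selected term to every
# sentence and collect one record per (sentence, term) match.
sub tag_sentences {
    my ($term_list, $regex_term_list, $selected, $corpus) = @_;
    my @records;
    for my $id (sort { $a <=> $b } keys %$selected) {
        my $source = $regex_term_list->[$id];
        # Assumption: the term must not touch a word character on either side.
        my $regex = qr/(?<!\w)$source(?!\w)/i;
        for my $sent_id (sort { $a <=> $b } keys %$corpus) {
            push @records, "$sent_id\t$term_list->[$id]"
                if $corpus->{$sent_id} =~ $regex;
        }
    }
    return @records;
}

my %corpus          = (1 => 'Gene expression rises.', 2 => 'No term here.');
my @term_list       = ('gene expression');
my @regex_term_list = ('gene\s+expression');
my %idtrm_select    = (0 => 1);
print "$_\n" for tag_sentences(\@term_list, \@regex_term_list,
                               \%idtrm_select, \%corpus);
```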

SEE ALSO

Alvis web site: http://www.alvis.info

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2006 by Thierry Hamon

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.