NAME

Alvis::TermTagger - Perl extension for tagging terms in a corpus

SYNOPSIS

use Alvis::TermTagger;

Alvis::TermTagger::termtagging($termlist, $outputfile);

DESCRIPTION

This module tags a corpus with terms. The corpus, given on STDIN, is a file with one sentence per line. The term list ($termlist) is a file containing one term per line. For each term, additional information (such as a canonical form or a semantic tag) can be given after the first column. The fields can be separated by either a colon or a vertical bar. Each line of the output file ($outputfile) contains the sentence number, the term, and the additional information, all separated by tab characters.
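The input and output formats described above can be illustrated with a minimal sketch. It does not call Alvis::TermTagger; the helpers parse_term_line and format_output_line are hypothetical names introduced only for this example, and the exact separator handling (optional whitespace around a colon or vertical bar) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): split one term-list
# line into the term and its optional additional fields.
sub parse_term_line {
    my ($line) = @_;
    chomp $line;
    # Assumption: fields are separated by a colon or a vertical bar,
    # with optional whitespace around the separator.
    my ($term, @info) = split /\s*[:|]\s*/, $line;
    return ($term, @info);
}

# Hypothetical helper: build one output record as tab-separated fields.
sub format_output_line {
    my ($sentence_id, $term, @info) = @_;
    return join("\t", $sentence_id, $term, @info);
}

my ($term, @info) = parse_term_line("cell lines : cell line\n");
print format_output_line(3, $term, @info), "\n";
```

Here the term "cell lines" with canonical form "cell line", found in sentence 3, yields a single line with three tab-separated fields.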

This module is mainly used in the Alvis NLP Platform.

METHODS

termtagging()

termtagging($term_list_filename, $output_filename);

This is the main method of the module. It loads the term list ($term_list_filename) and tags the corpus (given on STDIN). It produces the list of matching terms together with the offset of the sentence where each term is found and any additional information given in the term list. The output is written to the file $output_filename.

load_TermList()

load_TermList($term_list_filename,\@term_list);

This method loads the term list (from the file named $term_list_filename) into the array given by reference (\@term_list). Each element of the term list is a reference to a two-element array (the term and its canonical form).

get_Regex_TermList()

get_Regex_TermList(\@term_list, \@regex_term_list);

This method generates the regular expressions from the term list (\@term_list) and stores them in the given array (\@regex_term_list).
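As a rough illustration of turning terms into regular expressions, the sketch below escapes each word of a term and allows any run of whitespace between words. The helper term_to_regex is hypothetical, and the matching policy (case-insensitive, word boundaries) is an assumption rather than the module's documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a term into a case-insensitive pattern that
# tolerates any run of whitespace between its words.
sub term_to_regex {
    my ($term) = @_;
    # Escape regex metacharacters in each word, then rejoin with \s+.
    my $pattern = join '\s+', map { quotemeta($_) } split /\s+/, $term;
    return qr/\b$pattern\b/i;
}

# Each element is [term, canonical form], as in the term list structure.
my @term_list = (['T cell', 't cell'], ['NF-kappa B', 'NF-kappa B']);
my @regex_term_list = map { term_to_regex($_->[0]) } @term_list;

print "match\n" if 'Activated T   cell lines' =~ $regex_term_list[0];
```

Keeping the regular expressions in an array parallel to the term list means a term and its pattern share the same index.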

load_Corpus()

load_Corpus($corpus_filename, \%corpus, \%lc_corpus);

This method loads the corpus ($corpus_filename) into a hash table (\%corpus) and prepares a lower-case version of the corpus, recorded in a separate hash table (\%lc_corpus).
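The two-hash layout can be sketched as follows. The helper load_sentences is hypothetical, it reads from an in-memory file handle instead of a named file, and the 1-based sentence numbering is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: read one sentence per line and fill both hashes,
# keyed by sentence number (1-based numbering is an assumption).
sub load_sentences {
    my ($fh, $corpus, $lc_corpus) = @_;
    my $id = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $id++;
        $corpus->{$id}    = $line;      # original sentence
        $lc_corpus->{$id} = lc $line;   # lower-case copy
    }
    return $id;
}

my $text = "The T cell receptor binds antigen.\nNF-kappa B is activated.\n";
open my $fh, '<', \$text or die "Cannot open in-memory corpus: $!";
my (%corpus, %lc_corpus);
my $count = load_sentences($fh, \%corpus, \%lc_corpus);
close $fh;
print "$count sentences loaded\n";
```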

corpus_Indexing()

corpus_Indexing(\%lc_corpus, \%corpus_index);

This method indexes the lower-case version of the corpus (\%lc_corpus) by its words. The index is stored in \%corpus_index, a hash table given by reference.
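A word index of this kind can be sketched as a hash that maps each word to the set of sentence ids containing it. The helper index_words and the tokenisation on non-word characters are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: map each word to the ids of the sentences that
# contain it. Tokenising on non-word characters is an assumption.
sub index_words {
    my ($lc_corpus, $corpus_index) = @_;
    for my $id (keys %$lc_corpus) {
        for my $word (split /\W+/, $lc_corpus->{$id}) {
            next if $word eq '';    # skip an empty leading field
            $corpus_index->{$word}{$id} = 1;
        }
    }
}

my %lc_corpus = (
    1 => 'the t cell receptor binds antigen.',
    2 => 'nf-kappa b is activated.',
);
my %corpus_index;
index_words(\%lc_corpus, \%corpus_index);
print join(',', sort keys %{ $corpus_index{'cell'} }), "\n";
```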

print_corpus_index()

print_corpus_index(\%corpus_index);

This method prints the corpus index (\%corpus_index) to STDERR.

term_Selection()

term_Selection(\%corpus_index, \@term_list, \%idtrm_select);

This method selects the terms from the term list (\@term_list) that potentially appear in the corpus, as represented by the corpus index (\%corpus_index). The results are recorded in the hash table \%idtrm_select.
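One plausible way to pre-select terms with the index is sketched below: a term is kept only if every one of its words occurs somewhere in the corpus. The helper select_terms and this selection criterion are assumptions; the module may apply a different rule.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the index of every term whose words all occur
# in the corpus index. The "all words present" rule is an assumption.
sub select_terms {
    my ($corpus_index, $term_list, $idtrm_select) = @_;
    for my $i (0 .. $#$term_list) {
        my @words = grep { $_ ne '' } split /\W+/, lc $term_list->[$i][0];
        next unless @words;
        # Count the words that are absent from the index.
        my $missing = grep { !exists $corpus_index->{$_} } @words;
        $idtrm_select->{$i} = 1 unless $missing;
    }
}

my %corpus_index = (t => { 1 => 1 }, cell => { 1 => 1 }, receptor => { 1 => 1 });
my @term_list    = (['T cell', 't cell'], ['B cell', 'b cell']);
my %idtrm_select;
select_terms(\%corpus_index, \@term_list, \%idtrm_select);
print join(',', sort keys %idtrm_select), "\n";
```

This filter is only a cheap first pass: a selected term may still fail to match any sentence, which is why the regular expressions are applied afterwards.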

term_tagging_offset()

term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename);

term_tagging_offset_tab()

term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results);

term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results);

This method tags the corpus (\%corpus) with the terms from the term list (\@term_list; \@regex_term_list holds the corresponding regular expressions) that were selected in a previous step (\%idtrm_select). The selected terms are recorded with their offset and additional information either in the array @tab_results (each value holds the sentence id, the selected term, and the additional information, separated by tab characters) or in the hash table %tabh_results (each key has the form "sentenceid_selectedterm"; each value is an array reference containing the sentence id, the selected term, and the additional information).
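The tagging step can be sketched as a loop over sentences and pre-selected terms. The helper tag_sentences is hypothetical and returns tab-separated lines; the iteration order and the use of one match test per sentence are assumptions, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: test every selected term against every sentence and
# return one tab-separated record per match.
sub tag_sentences {
    my ($term_list, $regex_term_list, $idtrm_select, $corpus) = @_;
    my @results;
    for my $id (sort { $a <=> $b } keys %$corpus) {
        for my $i (sort { $a <=> $b } keys %$idtrm_select) {
            next unless $corpus->{$id} =~ $regex_term_list->[$i];
            # Record: sentence id, term, additional information.
            push @results, join("\t", $id, @{ $term_list->[$i] });
        }
    }
    return @results;
}

my %corpus          = (1 => 'The T cell receptor binds antigen.', 2 => 'Nothing here.');
my @term_list       = (['T cell', 't cell']);
my @regex_term_list = (qr/\bT\s+cell\b/i);
my %idtrm_select    = (0 => 1);

my @tagged = tag_sentences(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus);
print "$_\n" for @tagged;
```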

printMatchingTerm()

printMatchingTerm($descriptor, $ref_matching_term, $sentence_id);

This method prints the sentence id ($sentence_id) and the matching term (passed by reference as $ref_matching_term) to the file descriptor $descriptor. Both values are written on one line, separated by a tab character.

printMatchingTerm_tab()

printMatchingTerm_tab($ref_matching_term, $sentence_id, $ref_tab_results);

This method stores the sentence id ($sentence_id) and the matching term (passed by reference as $ref_matching_term) in $ref_tab_results, which can be a reference to an array or to a hash table. For an array, both values are concatenated into one line, separated by a tab character. For a hash table, both values are stored in an array, and the hash key is the concatenation of the sentence id and the matching term.
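The two storage modes can be sketched with a single function that inspects the type of the reference it receives. The helper store_match is hypothetical; the underscore in the hash key follows the "sentenceid_selectedterm" form mentioned for term_tagging_offset_tab, and the exact field layout is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: store a match either as a tab-separated line (array
# reference) or as an array of fields under a composite key (hash reference).
sub store_match {
    my ($ref_matching_term, $sentence_id, $ref_results) = @_;
    my @fields = ($sentence_id, @$ref_matching_term);
    if (ref($ref_results) eq 'ARRAY') {
        push @$ref_results, join("\t", @fields);
    }
    elsif (ref($ref_results) eq 'HASH') {
        # Key form assumed to be "sentenceid_selectedterm".
        my $key = $sentence_id . '_' . $ref_matching_term->[0];
        $ref_results->{$key} = \@fields;
    }
}

my (@tab_results, %tabh_results);
store_match(['T cell', 't cell'], 1, \@tab_results);
store_match(['T cell', 't cell'], 1, \%tabh_results);
print "$tab_results[0]\n";
print join(',', keys %tabh_results), "\n";
```

The array form suits direct printing, while the hash form makes it easy to look up or deduplicate a given term in a given sentence.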

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.

