NAME
NATools - A framework for Parallel Corpora processing
ABSTRACT
NATools is a package of tools to process parallel corpora. It
includes a sentence aligner, a probabilistic translation dictionary
extraction tool, a terminology extraction tool and some other
functionalities.
DESCRIPTION
This is a collection of functions used by the NATools tools. Some of them can be used independently. Check the documentation below.
init
Use this function to initialize a parallel corpora repository. You must supply a directory
where the repository will reside, and its name:
my $pcorpus = Lingua::NATools->init("/var/corpora", "myPCorpus");
This creates a directory named /var/corpora/myPCorpus with a configuration file, and returns a blessed object.
To add texts to this empty repository use the codify
method.
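A minimal sketch of this workflow (the corpora file names are just placeholders):
# create the repository, then add a pair of sentence-aligned files
my $pcorpus = Lingua::NATools->init("/var/corpora", "myPCorpus");
$pcorpus->codify({ignore_case => 1},
                 "/var/corpora/myPCorpus.PT",
                 "/var/corpora/myPCorpus.EN");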
load
This function loads information from a NAT repository. Call it with the directory where the repository was created.
my $pcorpus = Lingua::NATools->load("/var/corpora/EuroParl-PT.EN");
codify
This method is used to add a pair of NATools-style texts to a parallel corpora repository. The files should be sentence-aligned, with sentences separated by a $ on a line by itself.
The method is called on a repository object with two mandatory arguments: the file names for the two chosen languages. Note that this method does not verify the corpora languages, so you must be consistent when calling it. The third, optional argument verbose should be true if you want this function to print progress details to STDOUT.
The method dies if the files do not exist or if the number of sentences in the two files differs.
Example of invocation:
$pcorpus->codify({ignore_case => 1},
"/var/corpora/Europarl.PT",
"/var/corpora/Europarl.EN");
count_sentences
This auxiliary function is used to count sentences in two NATools sentence-aligned files. If the two files have the same number of sentences, that number is returned. Otherwise, undef is returned.
An optional third argument can be given: a boolean value stating whether verbose output should be printed to STDERR.
my $nr = count_sentences("/var/corpora/EuroParl.PT",
"/var/corpora/EuroParl.EN", 1);
user_conf
Returns a hash reference with the .natrc contents. You may pass the home directory as a parameter, or the configuration file directly.
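A minimal sketch (the fully qualified call form is an assumption; adapt it to how you import the module):
# pass the home directory (where .natrc lives)...
my $conf = Lingua::NATools::user_conf($ENV{HOME});
# ...or pass the configuration file directly
my $conf_from_file = Lingua::NATools::user_conf("$ENV{HOME}/.natrc");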
calc_chunks
This auxiliary method receives the number of sentences in a corpus and returns the number of chunks to be created.
my $nrchunks = $nat->calc_chunks($nrsentences);
index_invindexes
Each chunk-encoding process creates an inverted search index. This method should be called to merge all these indexes into a common one.
Just call it on the repository object. If needed, you can supply a true argument to make the method verbose.
$pcorpus->index_invindexes;
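To get verbose output, supply a true argument:
$pcorpus->index_invindexes(1);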
index_ngrams
This method calculates ngrams (bigrams, trigrams and tetragrams) for both languages and ALL chunks.
$pcorpus->index_ngrams;
split_corpus_simple
This method is called by the codify
method to split the corpora into chunks. Note that this method should be called for any number of chunks, even if there is only one.
The method receives a hash reference with configuration values and the two text files with the text to be tokenized. The hash should include, at least, the number of chunks and the chunk currently being processed.
$pcorpus->split_corpus_simple({tokenize => 0,
verbose => 1,
chunk => 1, nrchunks => 16},
"/var/corpora/EuroParl.PT",
"/var/corpora/EuroParl.EN");
run_initmat
This method invokes the C program nat-initmat for a specific chunk. You must supply the chunk number, and that chunk must exist. It returns the time used to run the command.
$pcorpus->run_initmat(3);
run_mat2dic
This method invokes the C program nat-mat2dic for a specific chunk. You must supply the chunk number, and that chunk must exist. It returns the time used to run the command.
$pcorpus->run_mat2dic(4);
run_post
This method invokes the C program nat-postbin for a specific chunk. You must supply the chunk number, and that chunk must exist. It returns the time used to run the command.
$pcorpus->run_post(5);
run_generic_EM
This method invokes one of the three algorithms for Entropy Maximization of the alignment matrix: nat-sampleA, nat-sampleB and nat-ipfp.
You should call the method with the name of the algorithm ("sampleA", "sampleB" or "ipfp"), the number of iterations to be done, and the chunk to be processed.
Returns the time used to run the command.
$pcorpus->run_generic_EM("ipfp", 5, 3);
align_all
This method will re-align all chunks in the corpora repository. It will not re-encode them, just re-align.
$pcorpus->align_all;
align_chunk
This method will re-align a specific chunk in the corpora repository. It will not re-encode it, just re-align.
The first argument is the chunk number to be aligned, and an optional second argument states whether you want verbose output.
$pcorpus->align_chunk(3,0);
run_dict_add
This method appends a chunk to both languages' dictionaries (not NATDicts). You must supply a chunk number (and that chunk must exist). The method should not be called directly; if really needed, call it for all chunks, one at a time, starting with the first.
for (1..10) {
$pcorpus->run_dict_add($_)
}
make_dict
This method creates the corpora dictionaries (not NATDicts). The method is called directly on the object, with an optional argument to force verbose output if needed. This method will call run_dict_add
for each chunk.
$pcorpus->make_dict;
pre_chunk
This function does the encoding for each created chunk. It is called internally by the codify
method. You should call it with the home directory for the parallel corpora repository and the chunk identifier.
pre_chunk({ ignore_case => 1}, "/var/corpora/EuroParl", 4);
dump_ptd
This function calls the nat-dumpDicts command to dump a PTD for the current corpus.
$pcorpus->dump_ptd();
time_command
This is a system-like function. Pass a command and it gets executed, and the time of the execution is returned.
my $time = time_command("nat-pre... ");
Aligning corpora files
The align
constructor is used to align two parallel, sentence-aligned corpora. Use it with
use NAT;
Lingua::NATools->align("EN", "PT");
where EN
and PT
are parallel corpora files. These files consist of a sequence of sentences, separated by lines containing a dollar sign.
First sentence
$
Second sentence
The last (optional) argument is a hash reference with alignment options. For example, you can pass a reference to a processing function to be applied to each sentence in the source or target corpus:
use NAT;
Lingua::NATools->align("EN", "PT", { filter1 => sub{ ... },
filter2 => sub{ ... } });
Note that you can use just one of the filters.
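As a minimal sketch, assuming each filter receives the sentence as a string and returns the processed sentence (this callback contract is an assumption):
Lingua::NATools->align("EN", "PT",
    { filter1 => sub { lc shift },                            # lower-case each source sentence
      filter2 => sub { my $s = shift; $s =~ s/\s+/ /g; $s }   # normalize whitespace in each target sentence
    });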
Checking translation probability
The check_bidirectional_sentence_similarity
function is used to get a probability of translation between two sentences. The algorithm uses the probable translations obtained from the word alignment.
The first argument is a reference to a configuration hash. The other two are the sentences to be compared, in the source and in the target languages, respectively.
$prob = NAT::check_bidirectional_sentence_similarity( +{
sourceDB => 'dic1.db', targetDB => 'dic2.db',
}, "first sentence", "primeira frase");
The line above defines the DB File dictionaries to be used (created with the createDB tool, that is, the merge_dict_lex function), and the two sentences to be compared.
In some cases, it is desirable to ignore small words in the sentences. In that case, you can pass the ignore_size option in the hash, with the minimum size required for a word to be considered.
In other cases, you do not want to ignore small words, but some specific ones. In that case, you can define two arrays, named sourceStopWrds
and targetStopWrds
with the words to be ignored.
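A minimal sketch combining these options (the stop-word lists and the ignore_size value are illustrative; passing the stop words as array references under these keys is an assumption):
$prob = NAT::check_bidirectional_sentence_similarity( +{
    sourceDB       => 'dic1.db',
    targetDB       => 'dic2.db',
    ignore_size    => 3,                   # minimum word length to be considered
    sourceStopWrds => [ 'the', 'of' ],     # illustrative stop words
    targetStopWrds => [ 'o', 'a', 'de' ],
}, "first sentence", "primeira frase");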
Loading Dictionary Files
NATools creates files containing a Perl Data::Dumper dictionary with translation probabilities. The load_dict function reads such a file and returns the dictionary.
$dic = NAT::load_dict('dic-ipfp.1.pl');
In some cases it could be useful to write a DB file with the hash information. In these cases use:
NAT::load_dict( { dbfile => 'DB' }, 'dic-ipfp.1.pl')
and the file 'DB' is created.
Loading Lexical Files
NATools creates files containing the corpora lexicon. These files are stored in the created directory, with the name of the corpus file and the extension .lex. While small, these files are gzipped binary files, which are easy to read from C, but sometimes tricky to read from Perl.
So, you can use the load_lex
function from this module to read these files. If you use it simply with:
$lex = NAT::load_lex($file);
it will return a hash reference with the file information, where the keys are the lexicon words. Each value is another hash reference with the following structure:
{ count => word_occurrence, id => word_identifier }
You can pass as the first argument to the function a reference to a hash with configuration options. For example,
$lex = NAT::load_lex( { id => 1 }, $file )
will return a hash reference where the keys are the word identifiers, and each value is a hash reference with the structure:
{ count => word_occurrence, word => the_word }
Additionally, you can supply as an option the key dbfile
pointing to a filename. In this case, the structure is still returned, but the file is also created as an MLDBM (Storable + DB_File) database holding the data structure.
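A minimal sketch of both uses (the word and the DB file name are illustrative):
my $lex   = NAT::load_lex("corpus1.lex");
my $count = $lex->{"casa"}{count};   # occurrences of the (illustrative) word "casa"

# additionally write the structure to an MLDBM (Storable + DB_File) file
NAT::load_lex( { dbfile => "corpus1.lex.db" }, "corpus1.lex" );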
Merging Dictionary and Lexicon Files
This function is especially useful to create an MLDBM file or a Storable file with the information of the lexicon and the terminological dictionary put together.
To use it, you must supply the dictionary file name (created with one of the three alignment methods) and the lexicon file. The function returns a Perl structure with the created dictionary.
$dict = NAT::merge_dict_lex("dict.ipfp.1.pl", "corpus1.lex");
If you want the output on a DB file, use:
NAT::merge_dict_lex( +{dbfile => "filename.db"},
"dict.ipfp.1.pl", "corpus1.lex");
In the same way, use the following line for Storable output:
NAT::merge_dict_lex( +{store => "filename.db"},
"dict.ipfp.1.pl", "corpus1.lex");
AUTHOR
Alberto Manuel Brandão Simões, <ambs@cpan.org>
COPYRIGHT AND LICENSE
Copyright 2002-2014 Alberto Simões
This library is free software; you can redistribute it and/or modify it under the GNU General Public License 2, which you should find in the parent directory. Distribution of this module should include the whole NATools package, with the respective copyright notice.