NAME

NATools - A framework for Parallel Corpora processing

ABSTRACT

NATools is a package of tools to process parallel corpora. It
includes a sentence aligner, a probabilistic translation dictionary
extraction tool, a terminology extraction tool, and other
utilities.

DESCRIPTION

This is a collection of functions used by the NATools tools. Some of them can be used independently. Check the documentation below.

init

Use this function to initialize a parallel corpora repository. You must supply a directory where the repository will reside, and its name:

my $pcorpus = Lingua::NATools->init("/var/corpora", "myPCorpus")

This creates a directory named /var/corpora/myPCorpus with a configuration file, and returns a blessed object.

To add texts to this empty repository use the codify method.

load

This function loads information from a NAT repository. Call it with the directory where the repository was created.

my $pcorpus = Lingua::NATools->load("/var/corpora/EuroParl-PT.EN");

codify

This method is used to add a pair of NATools-style texts to a parallel corpora repository. The files should be sentence-aligned, with each sentence separated by a $ on a line by itself.
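
As an illustration of this input format, the following sketch writes a tiny pair of sentence-aligned files in plain Perl. The helper write_nat_file, the sample sentences and the file names are hypothetical and not part of NATools. It also assumes that no separator is needed after the last sentence.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of NATools): write one sentence per block,
# with blocks separated by a line holding only "$".
sub write_nat_file {
    my ($path, @sentences) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$fh} join("\n\$\n", @sentences), "\n";
    close $fh or die "Cannot close $path: $!";
    return scalar @sentences;
}

my $dir = tempdir(CLEANUP => 1);
my @pt  = ('Bom dia.', 'Obrigado.');
my @en  = ('Good morning.', 'Thank you.');

# Both files must carry the same number of sentences, in the same order.
die "Sentence counts differ\n" unless @pt == @en;

write_nat_file("$dir/sample.PT", @pt);
write_nat_file("$dir/sample.EN", @en);
```

Files written this way could then be passed to codify as the two language files.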

The method is called on a repository object with two mandatory arguments: the names of the files for the two chosen languages. Note that this method does not verify the languages of the corpora, so you must be consistent when calling it. The optional third argument, verbose, should be true if you want this method to print progress details to STDOUT.

The method dies if either file does not exist or if the two files contain different numbers of sentences.

Example of invocation:

$pcorpus -> codify({ignore_case => 1},
                   "/var/corpora/Europarl.PT",
                   "/var/corpora/Europarl.EN");

count_sentences

This auxiliary function counts the sentences in two NATools sentence-aligned files. If the two files have the same number of sentences, that number is returned. If not, undef is returned.

An optional third argument can be given. It is a boolean value stating whether verbose output should be printed to STDERR.
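
The return contract described above can be sketched in plain Perl. This is not the module's implementation: the helpers count_in_text and same_count are hypothetical, they work on strings rather than files, and they assume that a sentence is any non-empty block between "$" separator lines.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools: count the sentences in a
# string of NATools-style text (sentences separated by "$" lines).
sub count_in_text {
    my ($text) = @_;
    my @blocks = grep { /\S/ } split /^[ \t]*\$[ \t]*$/m, $text;
    return scalar @blocks;
}

# Mirrors the documented contract: the shared count, or undef on a mismatch.
sub same_count {
    my ($source, $target) = @_;
    my ($n, $m) = (count_in_text($source), count_in_text($target));
    return $n == $m ? $n : undef;
}

my $pt = "Bom dia.\n\$\nObrigado.\n";
my $en = "Good morning.\n\$\nThank you.\n";
my $nr = same_count($pt, $en);
print defined $nr
    ? "Both texts have $nr sentences\n"
    : "Sentence counts differ\n";
```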

my $nr = count_sentences("/var/corpora/EuroParl.PT",
                         "/var/corpora/EuroParl.EN", 1);

user_conf

Returns a hash reference with the contents of .natrc. You may pass either the home directory or the configuration file itself as a parameter.

calc_chunks

This auxiliary method receives the number of sentences in a corpus and returns the number of chunks to be created.

my $nrchunks = $nat->calc_chunks($nrsentences);

index_invindexes

Each chunk-encoding process creates an inverted search index. This method should be called to re-index all these indexes into a single common one.

Just call it on the repository object. If needed, you can supply a true argument to make the method verbose.

$pcorpus -> index_invindexes;

index_ngrams

This method calculates ngrams (bigrams, trigrams and tetragrams) for both languages and ALL chunks.

$pcorpus -> index_ngrams;

split_corpus_simple

This method is called by the codify method to split the corpora into chunks. Note that this method should be called for any number of chunks, including a single chunk.

The method receives a hash reference with configuration values, and the two files containing the text to be tokenized. The hash should include, at least, the number of chunks and the chunk currently being processed.

$pcorpus -> split_corpus_simple({tokenize => 0,
                                 verbose => 1,
                                 chunk => 1, nrchunks => 16},
                                "/var/corpora/EuroParl.PT",
                                "/var/corpora/EuroParl.EN");

run_initmat

This method invokes the C program nat-initmat for a specific chunk. You must supply the chunk number, and the chunk must exist. It returns the time used to run the command.

$pcorpus->run_initmat(3);

run_mat2dic

This method invokes the C program nat-mat2dic for a specific chunk. You must supply the chunk number, and the chunk must exist. It returns the time used to run the command.

$pcorpus->run_mat2dic(4);

run_post

This method invokes the C program nat-postbin for a specific chunk. You must supply the chunk number, and the chunk must exist. It returns the time used to run the command.

$pcorpus->run_post(5);

run_generic_EM

This method invokes one of the three algorithms for Entropy Maximization of the alignment matrix: nat-sampleA, nat-sampleB or nat-ipfp.

You should call the method with the name of the algorithm ("sampleA", "sampleB" or "ipfp"), the number of iterations to be done, and the chunk to be processed.

Returns the time used to run the command.

$pcorpus->run_generic_EM("ipfp", 5, 3);

align_all

This method will re-align all chunks in the corpora repository. It will not re-encode them, just re-align.

$pcorpus -> align_all;

align_chunk

This method will re-align a specific chunk in the corpora repository. It will not re-encode it, just re-align.

The first argument is the number of the chunk to be aligned. An optional second argument states whether you want verbose output.

$pcorpus -> align_chunk(3,0);

run_dict_add

This method appends a chunk to the dictionaries of both languages (not NATDicts). You must supply a chunk number, and the chunk must exist. The method should not normally be called directly. If it is really needed, call it for all chunks, one at a time, starting with the first.

for (1..10) {
  $pcorpus -> run_dict_add($_)
}

make_dict

This method creates the corpora dictionaries (not NATDicts). It is called directly on the object, with an optional argument to force verbose output if needed. This method calls run_dict_add for each chunk.

$pcorpus -> make_dict;

pre_chunk

This function does the encoding for each created chunk. It is called internally by the codify method. You should call it with the home directory for the parallel corpora repository and the chunk identifier.

pre_chunk({ ignore_case => 1}, "/var/corpora/EuroParl", 4);

dump_ptd

This function calls the nat-dumpDicts command to dump a PTD for the current corpus.

$self -> dump_ptd( );

time_command

This is a system-like function: pass it a command and the command is executed. The execution time is returned.

my $time = time_command("nat-pre... ");
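
The idea can be sketched with the core Time::HiRes module. This is a hypothetical stand-in named timed_system, not the module's own code. It assumes the time of interest is wall-clock time in seconds, which the description above does not specify.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical stand-in for time_command: run a command through system()
# and return the elapsed wall-clock time in seconds.
sub timed_system {
    my (@command) = @_;
    my $start = [gettimeofday];
    system(@command) == 0
        or warn "Command failed: @command\n";
    return tv_interval($start);
}

# $^X is the running perl binary, used here as a harmless command.
my $elapsed = timed_system($^X, '-e', '1');
printf "Command took %.3f seconds\n", $elapsed;
```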

Aligning corpora files

The align constructor is used to align two parallel, sentence-aligned corpora. Use it with

use NAT;
Lingua::NATools->align("EN", "PT");

where EN and PT are parallel corpora files. The syntax of these files is a sequence of sentences, separated by lines containing only the dollar sign.

First sentence
$
Second sentence

The last (optional) argument is a reference to a hash of alignment options. For example, you can pass a reference to a processing function to be applied to each sentence in the source or target corpus:

use NAT;
Lingua::NATools->align("EN", "PT", { filter1 => sub{ ... },
                                     filter2 => sub{ ... } });

Note that you can use just one filter.
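
A sketch of what such filters might look like follows. It assumes that each filter receives one sentence as its only argument and returns the processed sentence. That calling convention is not stated in this documentation, so check it against the module before relying on it. The filters are tried locally and the align call is left commented out.

```perl
use strict;
use warnings;

# Assumption: each filter receives one sentence and returns the processed one.
my $lowercase   = sub { my ($s) = @_; return lc $s };
my $strip_punct = sub { my ($s) = @_; $s =~ s/[[:punct:]]+//g; return $s };

my %options = ( filter1 => $lowercase, filter2 => $strip_punct );

# Try the filters locally before handing them to align.
print $options{filter1}->('First Sentence'), "\n";
print $options{filter2}->('Primeira frase!'), "\n";

# Lingua::NATools->align("EN", "PT", \%options);
```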

Checking translation probability

The check_bidirectional_sentence_similarity function is used to get a probability of translation between two sentences. The algorithm uses the probable translations obtained from the word alignment.

The first argument is a reference to a configuration hash. The other two are the sentences to be compared, in the source and target languages, respectively.

$prob = NAT::check_bidirectional_sentence_similarity( +{
                            sourceDB => 'dic1.db', targetDB => 'dic2.db',
                            }, "first sentence", "primeira frase");

The line above defines the DB File dictionaries (created with the createDB tool --- merge_dict_lex function) to be used, and the two sentences to be compared.

In some cases, it is desirable to ignore small words in sentences. In that case, you can pass the ignore_size option in the hash, with the minimum size a word must have to be considered.

In other cases, you do not want to ignore small words, but only some specific ones. In that case, you can define two arrays, named sourceStopWrds and targetStopWrds, with the words to be ignored.
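
The effect of these two options can be illustrated with a small plain-Perl sketch. The helper content_words is hypothetical and is not how the module is implemented. Splitting on whitespace and comparing in lower case are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop words shorter than $min_size and words listed
# in @$stop_words, comparing everything in lower case.
sub content_words {
    my ($sentence, $min_size, $stop_words) = @_;
    my %stop = map { ( lc($_) => 1 ) } @{ $stop_words || [] };
    return grep { length($_) >= $min_size && !$stop{$_} }
           map  { lc($_) } split ' ', $sentence;
}

my @kept = content_words('The cat is on the mat', 3, ['the']);
print "@kept\n";
```

Here the size threshold removes "is" and "on", and the stop-word list removes both occurrences of "the".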

Loading Dictionary Files

NATools creates files containing a Perl Data::Dumper dictionary with translation probabilities. The load_dict function reads such a file and returns its contents.

$dic = NAT::load_dict('dic-ipfp.1.pl');

In some cases it may be useful to write a DB file with the hash information. In these cases use:

NAT::load_dict( { dbfile => 'DB' }, 'dic-ipfp.1.pl')

and the file 'DB' is created.
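
For readers unfamiliar with Data::Dumper files, the sketch below shows the general round trip in plain Perl: a structure is dumped to a file and read back with do. The dictionary layout used here is made up for illustration and may not match the files NATools writes. Use load_dict for real dictionary files.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

# Made-up dictionary layout, for illustration only; the real files
# written by NATools may be organised differently.
my $dict = { house => { casa => 0.8, lar => 0.1 } };

my ($fh, $file) = tempfile(SUFFIX => '.pl', UNLINK => 1);
print {$fh} Dumper($dict);
close $fh or die "Cannot close $file: $!";

# "do FILE" evaluates the dump and returns the structure it defines.
my $loaded = do $file or die "Cannot load $file: ", $@ || $!;
print "P(casa | house) = $loaded->{house}{casa}\n";
```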

Loading Lexical Files

NATools creates files containing the corpora lexicon. These files are stored in the created directory, with the name of the corpus file and the extension .lex. Although small, these are gzipped binary files, which are easy to read from C but sometimes tricky to read from Perl.

So, you can use the load_lex function from this module to do this. If you use it simply with:

$lex = NAT::load_lex($file);

it will return a hash reference for the file information, where the keys are the lexicon words. Each value is another hash reference with the following structure:

{ count => word_occurrence, id => word_identifier }

You can pass a reference to a hash of configuration options as the first argument to the function. For example,

$lex = NAT::load_lex( { id => 1 }, $file )

will return a hash reference where the keys are the word identifiers, and each value is a hash reference with the structure:

{ count => word_occurrence, word => the_word }
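
The relation between the two shapes can be shown without reading a real .lex file. The sketch below starts from a hand-built, word-keyed structure (the words, counts and identifiers are made up) and re-keys it by identifier.

```perl
use strict;
use warnings;

# Hand-built sample in the documented word-keyed shape;
# the words, counts and identifiers are made up for illustration.
my $by_word = {
    casa  => { count => 12, id => 3 },
    house => { count => 9,  id => 7 },
};

# Re-key by identifier, giving the shape described for the id option.
my %by_id = map {
    ( $by_word->{$_}{id} => { count => $by_word->{$_}{count}, word => $_ } )
} keys %$by_word;

for my $id (sort { $a <=> $b } keys %by_id) {
    printf "%d\t%s\t%d\n", $id, $by_id{$id}{word}, $by_id{$id}{count};
}
```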

Additionally, you can supply the option dbfile, pointing to a filename. In this case, the structure is returned, and the file is also created as an MLDBM (Storable + DB_File) database holding the data structure.

Merging Dictionary and Lexicon Files

This function is especially useful for creating an MLDBM file or a Storable file that combines the information from the lexicon and the terminology dictionary.

To use it you must supply the dictionary file name (created with one of the three alignment methods) and the lexicon file. The function returns a Perl structure with the created dictionary.

$dict = NAT::merge_dict_lex("dict.ipfp.1.pl", "corpus1.lex");

If you want the output on a DB file, use:

NAT::merge_dict_lex( +{dbfile => "filename.db"},
                     "dict.ipfp.1.pl", "corpus1.lex");

In the same way, use the following line for Storable output:

NAT::merge_dict_lex( +{store => "filename.db"},
                     "dict.ipfp.1.pl", "corpus1.lex");

AUTHOR

Alberto Manuel Brandão Simões, <ambs@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2002-2012 Alberto Simões

This library is free software; you can redistribute it and/or modify it under the GNU General Public License version 2, a copy of which you should find in the parent directory. This module should be distributed together with the complete NATools package, including the respective copyright notice.