NATools understands two file formats for corpora: TMX and a NATools-specific format. TMX is a standard; its specification is published by LISA.
The NATools-specific format uses two files, one per language. Translation units are separated by a line containing just a dollar sign ($). A translation unit may span more than one line; that is not a problem.
Here is a simple example. The file for the first language (here, English) might contain:

I saw a cat .
$
The cat was
fat .
$

and the corresponding file for the second language (Portuguese):

Eu vi um
gato .
$
O gato era gordo .
$
Note that both files must contain the same number of translation units, and that the texts should already be tokenized.
If you have a TMX file (with just two languages), you can bootstrap the NATools alignment process using the nat-create script:
[foo@bar]$ nat-create -tmx file.tmx
The script will ask you for a name for the corpus. Supply a name without spaces. The script will create a directory with that name, where it will place the files for the encoded corpus, the encoded lexicons, and the probabilistic translation dictionaries.
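For reference, a minimal two-language TMX file has roughly the following shape (the header attributes and segment contents are purely illustrative; consult the TMX specification for the complete set of required attributes):

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="plaintext" adminlang="en"
          srclang="pt" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="pt"><seg>Eu vi um gato .</seg></tuv>
      <tuv xml:lang="en"><seg>I saw a cat .</seg></tuv>
    </tu>
  </body>
</tmx>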
To use this method, you need a pair of files aligned at the sentence level, in the NATools-specific format described above. In the following command examples, we will call these files lang1 and lang2.
You can align them directly, letting the built-in language identification process detect the languages:
[foo@bar]$ nat-create lang1 lang2
You can also specify the languages explicitly, either for speed or because the language identification process does not guess the languages involved correctly. For that, use:
[foo@bar]$ nat-create -langs=PT..EN lang1 lang2
where the -langs switch specifies the languages involved, in the same order as the supplied files (so lang1 should be in Portuguese and lang2 in English).
Both methods will ask you for a corpus name. Supply a name without spaces. The script will create a directory with that name, where it will place the files for the encoded corpus, the encoded lexicons, and the probabilistic translation dictionaries.
In some cases it is useful to inspect the Probabilistic Translation Dictionary (PTD) extracted from the parallel corpus without using the NATools server. For this, we can dump the PTD to a text file in Perl Data::Dumper format, which is legible both to humans and to the computer.
Use the nat-dumpDicts command for that. First, change the current directory to the directory created by the corpus encoding process, and then execute:
[foo@bar]$ nat-dumpDicts source.lex source-target.bin \
           target.lex target-source.bin > dict.txt
The file dict.txt will be created, containing the PTD.
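If you want to process the dumped dictionary programmatically, a minimal Perl sketch along the following lines can load it back. It assumes the dump is a single Data::Dumper structure mapping each source word to a hash of candidate translations and their probabilities; the actual structure may differ, so inspect the (human-readable) file first and adjust accordingly.

#!/usr/bin/perl
use strict;
use warnings;

# Data::Dumper output assigns to $VAR1; the 'do' operator returns the
# value of the last expression in the file, i.e. the dumped structure.
my $dict = do "./dict.txt" or die "Could not load dict.txt: $@ $!";

# ASSUMPTION: each source word maps to a hash of target words and
# translation probabilities. Adapt this to the real layout of the dump.
for my $word (sort keys %$dict) {
    my $trans = $dict->{$word};
    print "$word:\n";
    for my $t (sort { $trans->{$b} <=> $trans->{$a} } keys %$trans) {
        printf "    %-20s %.4f\n", $t, $trans->{$t};
    }
}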
If you have read the installation section, you know that the CGIs rely on a server running on your machine. Other tools need this server as well, so that they can access the corpus more quickly.
The server needs a configuration file, and its format is simple. Lines starting with a sharp sign (#) are treated as comments and ignored. Every other line should contain the absolute path to a directory created by the nat-create command (or nat-shell). For instance, if, when running nat-create, you created a corpus named EuroParl inside the directory /corpora/parallel, you should add the following line to your configuration file:
/corpora/parallel/EuroParl
The server will then configure each corpus based on the nat.cnf configuration file present in each of those corpus directories.
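Putting this together, a complete configuration file might look as follows (the corpus paths are, of course, only examples):

# NATools server configuration: one corpus directory per line
/corpora/parallel/EuroParl
/corpora/parallel/JRC-Acquis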
To start the server, use:
[foo@bar]$ nat-server /path/to/the/config/file.cfg