NAME

nat-create - Command line tool to create NATools Corpora Objects

SYNOPSIS

nat-create <file1.nat> <file2.nat>

nat-create -tmx <file.tmx>

DESCRIPTION

This is the basic command used to create a NATools Corpora Object from the command line.

A NATools Corpora Object is a directory containing:

  • the configuration file ("nat.cnf" - metadata information)

  • the corpus

  • the corpus indexes

  • the probabilistic translation dictionaries ("source-target.dmp", "target-source.dmp")

  • the n-gram databases for bigrams, trigrams and tetragrams ("source.ngrams", "target.ngrams")
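The layout above can be checked with a small shell helper. This is only a sketch: the function name check_corpus_dir is hypothetical and not part of NATools, and treating the n-gram databases as optional is an assumption based on the existence of the -ngrams switch. The file names are the ones listed above; the corpus and index files are not checked because their names are not given here.

```shell
#!/bin/sh
# Hypothetical helper (not part of NATools): report which of the documented
# files are present in a corpus directory.
check_corpus_dir() {
    dir=$1
    missing=0
    # Configuration file and both translation dictionaries.
    for f in nat.cnf source-target.dmp target-source.dmp; do
        if [ -f "$dir/$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Assumption: n-gram databases may be absent, so they are only reported
    # and never counted as missing.
    for f in source.ngrams target.ngrams; do
        if [ -f "$dir/$f" ]; then
            echo "found: $f"
        else
            echo "absent: $f"
        fi
    done
    return "$missing"
}
```

Calling check_corpus_dir with the path of a corpus directory prints one line per file and returns a non-zero status if the configuration file or a dictionary is missing.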

Known Switches

tokenize

The -tokenize flag can be used to force NATools to tokenize the texts. Note that at the moment a Portuguese tokenizer is used for all languages. This might change in the future.

id

The -id=name flag sets the name of the NATools Corpora Object. By default, the name is read interactively.

q

The -q flag can be used to force quiet mode. In this case, the name is extracted from the file names.

lang

The -lang=PT..EN flag can be used to force the corpus languages (in this example, PT and EN).

ngrams

The -ngrams flag can be set to force NATools to create the n-gram indexes.

noEM

The -noEM flag bypasses the EM algorithm (mainly useful for debugging).

ipfp

The -ipfp flag selects IPFP as the EM algorithm. It is mutually exclusive with -noEM, -samplea and -sampleb. An optional numeric argument sets the number of iterations (default: 5).

samplea

The -samplea flag selects Sample A as the EM algorithm. It is mutually exclusive with -noEM, -ipfp and -sampleb. An optional numeric argument sets the number of iterations (default: 10).

sampleb

The -sampleb flag selects Sample B as the EM algorithm. It is mutually exclusive with -noEM, -ipfp and -samplea. An optional numeric argument sets the number of iterations (default: 10).
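Since -noEM, -ipfp, -samplea and -sampleb exclude one another, a wrapper script may want to reject conflicting combinations before calling nat-create. The following is a minimal sketch of such a check. The function em_flags_ok is hypothetical and not part of NATools. The "=N" form for the iteration count is an assumption made by analogy with -id=name, as is placing the switches before the file names. The example only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of NATools): succeed only if at
# most one of the mutually exclusive EM switches is in the argument list.
em_flags_ok() {
    count=0
    for arg in "$@"; do
        case $arg in
            # Assumption: an iteration count, if given, is written "=N".
            -noEM|-ipfp|-ipfp=*|-samplea|-samplea=*|-sampleb|-sampleb=*)
                count=$((count + 1)) ;;
        esac
    done
    [ "$count" -le 1 ]
}

# Dry run: build an argument list and print the command it would produce.
set -- -id=mycorpus -lang=PT..EN -tokenize -ipfp file1.nat file2.nat
if em_flags_ok "$@"; then
    echo "would run: nat-create $*"
else
    echo "error: choose only one of -noEM, -ipfp, -samplea, -sampleb" >&2
fi
```

Replacing the echo with an actual call to nat-create turns the dry run into a guarded invocation.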

SEE ALSO

NATools documentation, perl(1)

AUTHOR

Alberto Manuel Brandão Simões, <ambs@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2006-2011 by Alberto Manuel Brandão Simões
