NAME

Lingua::NATools::PTD - Module to handle PTD files in Dumper Format

SYNOPSIS

use Lingua::NATools::PTD;

$ptd = Lingua::NATools::PTD->new( $ptd_file );

DESCRIPTION

PTD files in Perl Dumper format are simple hash references. They follow a specific structure, however, and this module provides a simple interface to manipulate it.

`new`

The new constructor returns a new Lingua::NATools::PTD object. This constructor receives a PTD file in dumper format.

my $ptd = Lingua::NATools::PTD->new( $ptd_file );

If the filename matches /dmp.bz2$/ (that is, ends in dmp.bz2), it is considered to be a bzip2 file and is decompressed on the fly.

If it ends in .sqlite, it is expected to be an SQLite file containing the dictionary (using the standard Lingua::NATools schema!).

Extra arguments are a flattened hash of configuration variables. The following options are recognized:

verbose

Sets verbosity.

my $ptd = Lingua::NATools::PTD->new( $ptd_file, verbose => 1 );

`verbose`

With no arguments, returns whether the methods are configured to use verbose mode. If an argument is supplied, it is interpreted as a boolean value and sets the verbosity of the methods.

$ptd->verbose(1);

`dump`

The dump method is used to write the PTD in its own format, but taking care to sort words lexicographically, and sorting translations by their probability (starting with higher probabilities).

The format is Perl code and can therefore be used independently of this module.

$ptd->dump;

Note that the dump method writes to the Standard Output stream.
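
The ordering described above can be sketched in plain Perl. This is only an illustration, not the module's implementation: the in-memory layout (a count key and a trans key per word) and the one-line output format are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory dictionary; the count/trans layout is an assumption.
my %dict = (
    dog => { count => 4, trans => { cao  => 0.7, cachorro => 0.3 } },
    cat => { count => 2, trans => { gato => 0.9, gata     => 0.1 } },
);

my @lines;
for my $word (sort keys %dict) {                  # words in lexicographic order
    my $trans  = $dict{$word}{trans};
    # translations by descending probability, ties broken alphabetically
    my @sorted = sort { $trans->{$b} <=> $trans->{$a} || $a cmp $b } keys %$trans;
    push @lines, join(' ', $word, map { "$_=$trans->{$_}" } @sorted);
}
print "$_\n" for @lines;
```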

`words`

The words method returns a list (not a reference) of the words in the dictionary, that is, its domain. Pass a true value as argument to get the list sorted.

my @words = $ptd->words;

`trans`

The trans method receives a word, and returns the list of its possible translations.

my @translations = $ptd->trans( "dog" );

`exists`

Checks whether a word is in the dictionary.

`transHash`

The transHash method receives a word and returns a hash whose keys are its possible translations and whose values are the corresponding translation probabilities.

my %trans = $ptd->transHash( "dog" );

Returns the empty hash if the word does not exist.

`prob`

The prob method receives a word and a translation, and returns the probability of that word being translated that way.

my $probability = $ptd->prob("cat", "gato");

`size`

Returns the total number of words from the source-corpus that originated the PTD. Basically, the sum of the count attribute for all words.

my $size = $ptd->size;

`count`

The count method receives a word and returns the occurrence count for that word.

my $count = $ptd->count("cat");

If no argument is supplied, returns the total dictionary count (sum of all words).
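
As a rough illustration of the total described above, the sum of the per-word counts can be computed as follows. The hash layout is an assumption made for the example, not a statement about the module's internals.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical in-memory dictionary; the count/trans layout is an assumption.
my %dict = (
    cat => { count => 2, trans => { gato => 1.0 } },
    dog => { count => 4, trans => { cao  => 1.0 } },
);

# Total count: the sum of the count attribute over all words.
my $total = sum(0, map { $_->{count} } values %dict);
print "total: $total\n";
```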

`stats`

Computes a set of statistics about the PTD and returns them in a hash reference.

`subtractDomain`

This method subtracts a set of elements from the domain of a PTD. The set can be defined as another PTD (its domain is used), as a Perl array reference, as a Perl hash reference (its keys are used) or as a Perl array (not a reference). Returns the dictionary after the domain subtraction takes place.

# removes portuguese articles from the dictionary
$ptd->subtractDomain( qw.o a os as. );

# removes a set of stop words from the dictionary
$ptd->subtractDomain( \@stopWords );

# removes the words present on other_ptd from ptd
$ptd->subtractDomain( $other_ptd );
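
A minimal sketch of domain subtraction over a plain hash, assuming the argument shapes listed above. The helper domain_of is hypothetical and not part of the module; the PTD-object case is left out because it depends on the module itself.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten the accepted argument shapes
# (list, array reference, hash reference) into a plain list of words.
sub domain_of {
    my @args = @_;
    if (@args == 1 && ref($args[0]) eq 'ARRAY') { return @{ $args[0] } }
    if (@args == 1 && ref($args[0]) eq 'HASH')  { return keys %{ $args[0] } }
    return @args;
}

# Hypothetical in-memory dictionary; the count/trans layout is an assumption.
my %dict = (
    o    => { count => 9, trans => { the   => 1.0 } },
    as   => { count => 5, trans => { the   => 1.0 } },
    casa => { count => 3, trans => { house => 1.0 } },
    gato => { count => 2, trans => { cat   => 1.0 } },
);

# Remove every word of the set from the domain; missing words are ignored.
delete @dict{ domain_of(qw(o a os as)) };
print join(',', sort keys %dict), "\n";
```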

`restrictDomain`

Domain restriction method; its interface is similar to that of subtractDomain.

This method restricts the domain of a PTD to a set of elements. The set can be defined as another PTD (its domain is used), as a Perl array reference, as a Perl hash reference (its keys are used) or as a Perl array (not a reference). Returns the dictionary after the domain restriction takes place.

# keeps only the Portuguese articles in the dictionary
$ptd->restrictDomain( qw.o a os as. );

# keeps only the words from a given list
$ptd->restrictDomain( \@words );

# keeps only the words that are also present in other_ptd
$ptd->restrictDomain( $other_ptd );
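
Domain restriction is the complement of subtraction: every word outside the given set is dropped. The sketch below works on a plain hash whose layout is an assumption made for the example; it does not reproduce the module's code.

```perl
use strict;
use warnings;

# Hypothetical in-memory dictionary; the count/trans layout is an assumption.
my %dict = (
    o    => { count => 9, trans => { the   => 1.0 } },
    casa => { count => 3, trans => { house => 1.0 } },
    gato => { count => 2, trans => { cat   => 1.0 } },
);

# The set to restrict to, as a lookup hash.
my %keep = map { ($_ => 1) } qw(casa gato cao);

# Drop every word that is not in the set.
delete @dict{ grep { !$keep{$_} } keys %dict };
print join(',', sort keys %dict), "\n";
```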

`reprob`

This method recalculates all probabilities according to the number of translations available.

For instance, if you have

home => casa => 25%
     => lar  => 25%

The resulting dictionary will have

home => casa => 50%
     => lar  => 50%

Note that this method replaces the object.
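
One reading consistent with the example above is that the probabilities of each word are rescaled so that they sum to 1. The following sketch applies that rescaling to a plain hash; both the hash layout and the rescaling rule are assumptions, not the module's confirmed algorithm.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical in-memory dictionary; the count/trans layout is an assumption.
my %dict = (
    home => { count => 8, trans => { casa => 0.25, lar => 0.25 } },
);

for my $entry (values %dict) {
    my $total = sum(0, values %{ $entry->{trans} });
    next unless $total > 0;                         # nothing to rescale
    # values() aliases the hash values, so this updates them in place
    $_ /= $total for values %{ $entry->{trans} };
}

printf "%s => %.0f%%\n", $_, 100 * $dict{home}{trans}{$_}
    for sort keys %{ $dict{home}{trans} };
```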

`intersect`

This method intersects the current object with the supplied PTD. Note that this method replaces the object values.

The occurrence count in the final dictionary is the minimum of the occurrence counts in the two dictionaries.

Only translations present in both dictionaries are kept. The probability will be the minimum of the two dictionaries' probabilities.
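
The rules above can be sketched over two plain hashes. The layout is an assumption made for the example, as is the choice to keep only words present in both dictionaries.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical in-memory dictionaries; the count/trans layout is an assumption.
my %left = (
    home => { count => 10, trans => { casa => 0.6, lar => 0.3 } },
    dog  => { count => 3,  trans => { cao  => 0.9 } },
);
my %right = (
    home => { count => 7,  trans => { casa => 0.8, moradia => 0.1 } },
);

my %result;
for my $word (grep { exists $right{$_} } keys %left) {
    my ($l, $r) = ($left{$word}, $right{$word});
    # keep translations found on both sides, with the smaller probability
    my %trans = map  { ($_ => min($l->{trans}{$_}, $r->{trans}{$_})) }
                grep { exists $r->{trans}{$_} } keys %{ $l->{trans} };
    $result{$word} = {
        count => min($l->{count}, $r->{count}),   # smaller occurrence count
        trans => \%trans,
    };
}

print "$_: count=$result{$_}{count}\n" for sort keys %result;
```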

`add`

This method adds the supplied PTD (first argument) to the current one. Note that this method replaces the object values.

`downtr`

This method iterates over a dictionary and calls the function supplied as argument. This function will receive, in each call, the word in the source language, the number of occurrences, and the hash of translations.

$ptd->downtr( sub { my ($w,$c,%t) = @_;
                    if ($w =~ /[^A-Za-z0-9]/) {
                        return undef;
                    } else {
                        return toentry($w,$c,%t);
                    }
            },
           filter => 1);

Set the filter flag if your downtr function is replacing the original dictionary.

`toentry`

This function is exported by default and creates a dictionary entry given the word, word count, and hash of translations. Check downtr for an example.

`saveAs`

Method to save a PTD in another format. The first argument is the name of the format, the second is the filename to be used. Supported formats are dmp for Perl Dump format, bz2 for bzipped Perl Dump format, xz for LZMA (xz) compressed Perl Dump format and sqlite for an SQLite database file.

Returns undef if the format is not known, 0 if the save failed, and a true value on success.

`lowercase`

This method replaces the dictionary, in place, lowercasing all entries. This is especially useful for processing translation dictionaries obtained with the -utf8 flag, which (at the moment) performs case-sensitive alignment.

$ptd->lowercase(verbose => 1);

`ucts`

Creates unambiguous-concept translation sets.

my $result = ucts($ptd1, $ptd2, p=>0.1, P=>0.8);

Available options are:

m: Minimum number of occurrences of each token. Must be an integer (default: 10).
M: Maximum number of occurrences of each token. Must be an integer (default: 100).
p: Minimum probability for translation. Must be a probability in the interval [0,1] (default: 0.2).
P: Minimum probability for the inverse translations. Must be a probability in the interval [0,1] (default: 0.8).
r=0|1: Print rank (default: 0).
pp=0|1: Pretty print output (default: 0).
output=filename: Pretty print output to file filename.

`bws`

Creates bi-word sets given a PTD pair.

my $result = bws($ptd1, $ptd2, m=>10, p=>0.4);

$ptd1 and $ptd2 can be filenames for the PTDs or already created PTD objects.

The following options are available:

m: Minimum number of occurrences of each token. Must be an integer (default: 10).
p: Minimum probability for translation. Must be a probability in the interval [0,1] (default: 0.4).
r=0|1: Print rank (default: 0).
pp=0|1: Pretty print output (default: 0).
output=filename: Pretty print output to file filename.

AUTHOR

Alberto Manuel Brandão Simões, <ambs@cpan.org>

COPYRIGHT AND LICENSE
