NAME

Lingua::PT::PLN - Perl extension for NLP of the Portuguese Language

SYNOPSIS

use Lingua::PT::PLN;

# occurrence counter
%o = oco("file");
oco({num=>1,output=>"outfile"},"file");

$st = syllable($phrase);
$s = accent($phrase);
$s = wordaccent($word);

$s = xmlsentences($textstring);
$s = xmlsentences({st=>"frase"},$textstring);
@s = sentences($textstring);


perl -MLingua::PT::PLN -e 'cqptokens("file")' > out

DESCRIPTION

This module provides Natural Language Processing tools for the Portuguese language.

Because the module processes Portuguese text, it must be run under an appropriate (Portuguese) locale.
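How a locale is selected depends on the system. The following is a minimal sketch, assuming the standard POSIX module; the helper name set_portuguese_locale and the candidate locale names are illustrative assumptions, not something this module prescribes:

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: try candidate locale names in order and return
# the first one the system accepts. The default names are assumptions;
# check `locale -a` for what is actually installed.
sub set_portuguese_locale {
    my @candidates = @_ ? @_ : qw(pt_PT.ISO-8859-1 pt_PT pt_BR.ISO-8859-1 pt_BR);
    for my $name (@candidates) {
        my $got = setlocale(LC_CTYPE, $name);
        return $got if defined $got && length $got;
    }
    return;    # no candidate is available on this system
}

my $locale = set_portuguese_locale();
warn "No Portuguese locale found; results may be unreliable\n"
    unless defined $locale;
```

Trying several names keeps the script portable across systems that ship different Portuguese locales.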

Occurrence counting: oco

Counts word occurrences in a string or a set of files. Returns a hash with the counts or writes a sorted file with the results.

This function optionally takes a hash reference of options as its first argument, where you can specify:

num => 1

means the output should be sorted by occurrence count;

alpha => 1

means the output should be sorted lexicographically;

output => "f"

means the output will be written to the file "f";

from => "string"

means that the next argument (after the option hash) is a string to be used as input;

from => "file"

means that the remaining arguments are filenames to be used as input. This is the default.

Examples:

oco({num=>1,output=>"f"}, "f1","f2")
# sort by occurrence
# store output in file "f"
# process files "f1" and "f2"

oco({alpha=>1,output=>"f"}, "f1","f2")
# sort lexicographically
# store output in file "f"
# process files "f1" and "f2"

%oc = oco("f1","f2")
# return a hash with the occurrences
# use "f1" and "f2" as input files

%oc = oco( {from=>"string"},"text in a string")
# use a string as input
# return a hash with the occurrences
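The returned hash can be post-processed with ordinary Perl. A minimal sketch, assuming the hash maps each word to its count; top_words is a hypothetical helper, and a hand-written hash stands in for the result of oco so the snippet runs without the module:

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n most frequent words from a
# word => count hash, breaking ties alphabetically for a stable order.
sub top_words {
    my ($n, %count) = @_;
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# With the module installed, the hash would come from something like:
#   my %oc = oco({from => "string"}, $text);
# The literal below is an assumed stand-in for that result.
my %oc = (de => 3, casa => 2, a => 3, gato => 1);
print "$_\t$oc{$_}\n" for top_words(3, %oc);
```

This mirrors what the num option does when writing to a file, but keeps the data in memory for further processing.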

syllable

my $sylls = syllable( $phrase )

Returns the phrase with the syllables separated by "|".
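Because the result is a plain string, it can be split again on "|". A minimal sketch, assuming words in the returned string remain separated by whitespace; syllable_counts is a hypothetical helper and the sample string is hand-written rather than actual module output:

```perl
use strict;
use warnings;

# Hypothetical helper: count the syllables of each word in a string
# whose syllables are separated by "|".
sub syllable_counts {
    my ($marked) = @_;
    my @counts;
    for my $word (split ' ', $marked) {
        # Drop empty pieces in case of stray separators.
        my @syll = grep { length } split /\|/, $word;
        push @counts, scalar @syll;
    }
    return @counts;
}

# Assumed sample in the documented "|"-separated shape; with the module
# installed it would come from: my $marked = syllable($phrase);
my $marked = "ca|sa a|ma|re|la";
print join(",", syllable_counts($marked)), "\n";
```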

accent

my $accent = accent( $phrase )

Returns the phrase with the syllables separated by "|" and accents marked with the double-quote character (").

wordaccent

Returns the word split into syllables, with the accent marked.

setabrev

compacta

compara

tokenize

This function is a tokenizer for Portuguese text.

cqptokens()

cqptokens - encodes a text from STDIN for CQP (one token per line)

sentences()

sentences - ....

xmlsentences()

xmlsentences - ....

By default, sentences are marked with "s". To change this, use the optional st parameter. Example:

xmlsentences({st=> "tag"}, text)

to mark sentences with tag "tag".

AUTHOR

Projecto Natura (http://natura.di.uminho.pt)

Alberto Simoes (albie@alfarrabio.di.uminho.pt)

José João Almeida (jj@di.uminho.pt)

Paulo Rocha (paulo.rocha@di.uminho.pt)

SEE ALSO

perl(1). cqp(1).
