NAME
Text::Perfide::PartialAlign - Split large bitexts into smaller files.
VERSION
Version 0.01_03
SYNOPSIS
use Text::Perfide::PartialAlign;
my $foo = Text::Perfide::PartialAlign->new();
...
SUBROUTINES/METHODS
build_chain
calc_common_tokens
calc_pairs
subcorpora2files
Writes subcorpora to files.
usage
Prints a short description and usage details.
tokenFreq
Receives an array of lines of a text (each line is an array of words). Calculates the frequency of each word.
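A minimal sketch of the counting described above, assuming the input is a reference to an array of lines, each line being a reference to an array of words. The name sketch_token_freq is hypothetical and this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): counts how often each
# word occurs in a text given as an array of lines of words.
sub sketch_token_freq {
    my ($lines) = @_;
    my %freq;
    for my $line (@$lines) {
        $freq{$_}++ for @$line;
    }
    return \%freq;
}

my $freq = sketch_token_freq([ [qw(the cat sat)], [qw(the dog)] ]);
print "$_ => $freq->{$_}\n" for sort keys %$freq;
```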
hapaxes
Receives a hash of token => freq. Returns a hash containing only the elements with freq == 1.
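The filtering step can be sketched as follows, assuming the frequencies arrive as a hash reference. The helper name sketch_hapaxes and the choice of 1 as the stored value are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the tokens whose frequency is exactly 1.
sub sketch_hapaxes {
    my ($freq) = @_;
    my %hapax;
    for my $token (keys %$freq) {
        $hapax{$token} = 1 if $freq->{$token} == 1;
    }
    return \%hapax;
}

my $hapax = sketch_hapaxes({ the => 2, cat => 1, dog => 1 });
print join(', ', sort keys %$hapax), "\n";    # prints "cat, dog"
```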
hapaxPositions
Builds a hash of term => positions, where a position is the number of the sentence in which the term occurs.
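One way such a position index might be built is sketched below. It assumes 0-based sentence numbers and restricts the index to hapaxes, so each term has a single position; both choices, and the name sketch_hapax_positions, are assumptions rather than the module's documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: map each hapax to the (0-based) index of the
# sentence in which it occurs.
sub sketch_hapax_positions {
    my ($lines, $hapaxes) = @_;
    my %pos;
    for my $i (0 .. $#$lines) {
        for my $word (@{ $lines->[$i] }) {
            $pos{$word} = $i if exists $hapaxes->{$word};
        }
    }
    return \%pos;
}

my $pos = sketch_hapax_positions(
    [ [qw(the cat sat)], [qw(the dog)] ],
    { cat => 1, sat => 1, dog => 1 },
);
print "$_ => $pos->{$_}\n" for sort keys %$pos;
```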
bagSort
...
uniqSort
Sorts an array of pairs and removes duplicated pairs.
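A short sketch of sorting and de-duplicating pairs, assuming each pair is a two-element array reference of numbers and that pairs are ordered by first coordinate, then second. The name sketch_uniq_sort and the ordering are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: drop duplicated pairs, then sort the rest
# numerically by first coordinate and then by second coordinate.
sub sketch_uniq_sort {
    my ($pairs) = @_;
    my %seen;
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
                 grep { !$seen{"$_->[0],$_->[1]"}++ } @$pairs;
    return \@sorted;
}

my $sorted = sketch_uniq_sort([ [3, 1], [1, 2], [3, 1], [1, 1] ]);
print join(' ', map { "($_->[0],$_->[1])" } @$sorted), "\n";
```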
less
Receives two pairs. Checks if both coordinates of the first pair are lower than the corresponding coordinates of the second pair.
less_relaxed
Receives two pairs...
less_or_equal
Receives two pairs. Checks if both coordinates of the first pair are lower than or equal to the corresponding coordinates of the second pair.
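The strict and non-strict comparisons can be sketched as below, assuming pairs are two-element array references. The names sketch_less and sketch_less_or_equal are hypothetical stand-ins, not the module's functions.

```perl
use strict;
use warnings;

# Hypothetical helpers: coordinate-wise comparison of two pairs.
sub sketch_less {
    my ($p, $q) = @_;
    return $p->[0] < $q->[0] && $p->[1] < $q->[1];
}

sub sketch_less_or_equal {
    my ($p, $q) = @_;
    return $p->[0] <= $q->[0] && $p->[1] <= $q->[1];
}

# (1,4) versus (3,4): the second coordinates tie, so only the
# non-strict comparison holds.
print "less\n"          if sketch_less([1, 4], [3, 4]);
print "less_or_equal\n" if sketch_less_or_equal([1, 4], [3, 4]);
```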
maximalChain
Receives an array of pairs. Using dynamic programming, selects the maximal chain.
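A rough sketch of a dynamic-programming chain selection follows. It assumes "maximal" means the longest sequence of pairs in which each pair is strictly lower than the next in both coordinates, and it uses a simple quadratic recurrence; the module may define or compute the chain differently. The name sketch_maximal_chain is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: longest chain of pairs that increase strictly in
# both coordinates, found with an O(n^2) dynamic-programming pass.
sub sketch_maximal_chain {
    my ($pairs) = @_;
    my @p = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @$pairs;
    return [] unless @p;

    my (@len, @prev);
    my $best = 0;
    for my $i (0 .. $#p) {
        $len[$i]  = 1;     # a chain consisting of the pair alone
        $prev[$i] = -1;    # no predecessor yet
        for my $j (0 .. $i - 1) {
            next unless $p[$j][0] < $p[$i][0] && $p[$j][1] < $p[$i][1];
            if ($len[$j] + 1 > $len[$i]) {
                $len[$i]  = $len[$j] + 1;
                $prev[$i] = $j;
            }
        }
        $best = $i if $len[$i] > $len[$best];
    }

    # Walk the predecessor links backwards to rebuild the chain.
    my @chain;
    for (my $k = $best; $k != -1; $k = $prev[$k]) {
        unshift @chain, $p[$k];
    }
    return \@chain;
}

my $chain = sketch_maximal_chain([ [1, 1], [2, 5], [3, 2], [4, 3], [5, 4] ]);
print join(' ', map { "($_->[0],$_->[1])" } @$chain), "\n";
```

In this example the pair (2,5) is left out, since keeping it would block every later pair.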
findCommonHap
Finds unique terms common to both corpora. The notion of equality can be extended with two lists of correspondences.
findCommonHap($l1Hap,$l2Hap)
Returns a reference to a hash containing the elements common to the hashes pointed to by the references $l1Hap and $l2Hap.
findCommonHap($l1Hap,$l2Hap,$l1_to_l2,$l2_to_l1)
$l1_to_l2 and $l2_to_l1 are references to hashes containing correspondences between words in language1 and language2 and vice-versa.
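The matching idea can be sketched as a hash intersection with an optional dictionary lookup. This sketch assumes the correspondence hash maps one word to one word, uses only the language1 to language2 direction, and returns a hash from each language1 term to the language2 term it matched; all of these, and the name sketch_common_hapaxes, are assumptions about details the text above does not specify.

```perl
use strict;
use warnings;

# Hypothetical helper: terms present in both hapax hashes, either
# identically or through a one-word-to-one-word correspondence.
sub sketch_common_hapaxes {
    my ($l1, $l2, $l1_to_l2) = @_;
    $l1_to_l2 ||= {};
    my %common;
    for my $word (keys %$l1) {
        if (exists $l2->{$word}) {
            $common{$word} = $word;
        }
        elsif (exists $l1_to_l2->{$word}
            && exists $l2->{ $l1_to_l2->{$word} }) {
            $common{$word} = $l1_to_l2->{$word};
        }
    }
    return \%common;
}

my $l1 = { London => 1, cat => 1, tree => 1 };
my $l2 = { London => 1, gato => 1, casa => 1 };
my $plain    = sketch_common_hapaxes($l1, $l2);
my $extended = sketch_common_hapaxes($l1, $l2, { cat => 'gato' });
print "$_ => $extended->{$_}\n" for sort keys %$extended;
```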
selectFromChain
Selects a chain trying to obey the maximalChunkSize constraint.
get_corpus
Given a file name, splits the segments and words into an array of arrays.
Returns: a reference to the array of arrays, a reference to an array of pairs with the start and end offsets of each segment, and a reference to the full text.
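The shape of the three return values can be illustrated with the sketch below. It assumes one segment per non-empty line, whitespace-separated words, and character offsets with an exclusive end; the module's own splitting rules (see seg_split and token_split) may differ. The name sketch_get_corpus is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file and return the corpus as an array of
# arrays, the [start, end) offsets of each segment, and the full text.
sub sketch_get_corpus {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$in> };
    close $in;
    $text = '' unless defined $text;

    my (@corpus, @offsets);
    while ($text =~ /([^\n]+)/g) {
        my $end   = pos($text);          # offset just past the segment
        my $start = $end - length $1;
        push @corpus,  [ split ' ', $1 ];
        push @offsets, [ $start, $end ];
    }
    return (\@corpus, \@offsets, \$text);
}

# Demonstration on a temporary two-line file.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} "the cat sat\nthe dog\n";
close $tmp;

my ($corpus, $offsets, $text_ref) = sketch_get_corpus($name);
print scalar(@$corpus), " segments\n";
```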
strInterval
Given a corpus and start and end positions, returns a string with the contents within the given range.
strInterval($corpus,$first,$last)
Concatenates all the words in the lines of the corpus within the $first..$last-1 range.
strInterval($corpus,$first,$last,$offsets)
Retrieves from the original text the substring from the beginning of the segment $first to the end of the segment $last.
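The three-argument form can be sketched as below, assuming words are joined with a single space (the separator is an assumption). The four-argument form is not sketched because the text above does not say how the original text reaches the function. The name sketch_str_interval is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: join the words of lines $first .. $last-1.
sub sketch_str_interval {
    my ($corpus, $first, $last) = @_;
    return join ' ', map { @$_ } @{$corpus}[ $first .. $last - 1 ];
}

my $corpus = [ [qw(a b)], [qw(c)], [qw(d e)] ];
print sketch_str_interval($corpus, 0, 2), "\n";    # prints "a b c"
```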
parseCorrespFile
Parses a given file with correspondences between two given languages. File must follow the following DSL:
    file           : header correspondence*
    header         : 'langs:' L1, L2
    correspondence : term (',' term)* '=' term (',' term)*
    term           : word (\s word)*
Does not yet support multi-word terms nor multi-term correspondences!
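A parser for the restricted case (single-word terms, one term on each side of '=') might look like the sketch below. It reads from a string rather than a file, accepts either a comma or whitespace between the two language names, and returns the language names plus two lookup hashes; these choices and the name sketch_parse_corresp are assumptions, not the module's interface.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "langs: L1, L2" followed by "word = word"
# lines into two lookup hashes, one per direction.
sub sketch_parse_corresp {
    my ($text) = @_;
    my @lines  = grep { /\S/ } split /\n/, $text;
    my $header = shift @lines;
    die "missing header\n" unless defined $header;

    my ($l1, $l2) = $header =~ /^\s*langs:\s*(\w+)(?:\s*,\s*|\s+)(\w+)\s*$/
        or die "bad header: $header\n";

    my (%l1_to_l2, %l2_to_l1);
    for my $line (@lines) {
        my ($left, $right) = $line =~ /^\s*([^\s=,]+)\s*=\s*([^\s=,]+)\s*$/
            or die "unsupported line: $line\n";
        $l1_to_l2{$left}  = $right;
        $l2_to_l1{$right} = $left;
    }
    return ($l1, $l2, \%l1_to_l2, \%l2_to_l1);
}

my ($l1, $l2, $fwd, $bwd) =
    sketch_parse_corresp("langs: en, pt\ncat = gato\nhouse=casa\n");
print "$l1 -> $l2: $_ = $fwd->{$_}\n" for sort keys %$fwd;
```

Lines with several terms or multi-word terms are rejected here, which mirrors the limitation stated above.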
seg_split
token_split
AUTHOR
Andre Santos, <andrefs at cpan.org>
BUGS
Please report any bugs or feature requests to bug-text-perfide-partialalign at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Perfide-PartialAlign. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Text::Perfide::PartialAlign
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-Perfide-PartialAlign
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
Based on the original script partialAlign.py bundled with hunalign -- http://mokk.bme.hu/resources/hunalign/ .
Thanks to Daniel Varga for helping us to understand how partialAlign.py works.
LICENSE AND COPYRIGHT
Copyright 2012 Andre Santos.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.