NAME

Lingua::TT::TextAlignment - TT Utils: alignment raw text <-> tokenized text

SYNOPSIS

##========================================================================
## PRELIMINARIES

use Lingua::TT::TextAlignment;

##========================================================================
## Constructors etc.

$ta = $CLASS_OR_OBJECT->new(%opts);
undef = $ta->clear();

##========================================================================
## Methods: I/O: RTT ("RAW \t TEXT \t ...", with %%$c= comments)

$str_escaped = escape_rtt($str);
$str = unescape_rtt($str_escaped);
$ta = $ta->toRttFile($filename_or_fh,%opts);
$ta = $ta->fromRttFile($filename_or_fh,%opts);

##========================================================================
## Methods: I/O: TT (+ offsets)

$ta = $ta->parseOffsetLines();
$ta = $CLASS_OR_OBJECT->fromTTFile($filename_or_fh,%opts);
$ta = $ta->toTTFile($filename_or_fh,%opts);

##========================================================================
## Methods: I/O: text-buffer

$ta = $ta->loadTextFile($filename_or_fh,%opts);
$ta = $ta->saveTextFile($filename_or_fh,%opts);

DESCRIPTION

The RTT "raw + text + tags" format is a "vertical" text format for combined storage of explicit token boundaries together with the raw original (untokenized) text. It is a line-based format with lines of the following forms:

"%%$RTT:COMPACT=" BOOL

RTT processing instruction declaring that this file is (or is not) in "compact" RTT format. BOOL is either 0 (zero) or 1 (one). If unspecified or false, file is assumed to be in "prolix" (non-compact) format.

"%%" COMMENT

Comments begin with "%%" and extend to the end of the line.

"%%$c=" STRING

Indicates a text string STRING in the raw text with no corresponding string in the tt-tokenization; STRING is typically whitespace, and may contain escaped newlines ("\n") or TABs ("\t").

TEXT "\t" TOK...

In "prolix" format, TAB-separated lines indicate aligned raw text and tokenized material. The first field is the raw text the token covers, and subsequent fields are the associated token attributes.

WHITESPACE? TOKTEXT...

In "compact" mode, token lines may be prefixed by optional whitespace, which is assumed to be present only in the raw text representation, and TOKTEXT is assumed to be identical to the raw text covered by the token. Equivalent to the "prolix" lines:

%%$c=WHITESPACE
TOKTEXT	TOKTEXT...

WHITESPACE? RAWTEXT " $= " TOKTEXT...

"Compact" format for non-identity tokenizations, equivalent to the "prolix" lines:

%%$c=WHITESPACE
RAWTEXT	TOKTEXT...

"\n"

A blank line indicates a sentence boundary.
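
The "prolix" line types above can be read back with a few lines of Perl. The following is only a minimal sketch, not the module's own parser: the helper names sketch_unescape() and parse_prolix_rtt() are hypothetical, the escape set (backslash, "\n", "\t") is an assumption, and blank lines are assumed to contribute nothing to the raw text.

```perl
use strict;
use warnings;

# Hypothetical helper: undo the backslash escapes assumed for this sketch.
sub sketch_unescape {
  my $s = shift;
  my %plain = (n => "\n", t => "\t", '\\' => '\\');
  $s =~ s/\\([nt\\])/$plain{$1}/g;
  return $s;
}

# Hypothetical helper: parse prolix RTT text into the raw text it encodes
# plus a list of sentences, each a list of [RAWTEXT, TOKFIELDS...] rows.
sub parse_prolix_rtt {
  my $rtt   = shift;
  my $raw   = '';
  my @sents = ([]);
  foreach my $line (split /\n/, $rtt) {
    if ($line =~ /^%%\$c=(.*)\z/) {
      $raw .= sketch_unescape($1);        # raw-only material, typically whitespace
    } elsif ($line =~ /^%%/) {
      next;                               # comment or processing instruction
    } elsif ($line eq '') {
      push @sents, [] if @{$sents[-1]};   # blank line: sentence boundary
    } else {
      my ($text, @tok) = split /\t/, $line, -1;
      $text = sketch_unescape($text);
      $raw .= $text;                      # first field is the covered raw text
      push @{$sents[-1]}, [$text, @tok];
    }
  }
  pop @sents if !@{$sents[-1]};           # drop a trailing empty sentence
  return ($raw, \@sents);
}

my $rtt = join("\n",
  '%%$RTT:COMPACT=0',
  "Hello\tHello",
  '%%$c= ',
  "world\tworld",
  "!\t!",
  '%%$c=\n',
  '',
  "Bye\tBye",
  ".\t.",
) . "\n";

my ($raw, $sents) = parse_prolix_rtt($rtt);
printf "%d sentence(s) recovered\n", scalar(@$sents);
```

Concatenating the first field of every token line with the "%%$c=" strings, in file order, is what lets the raw text be rebuilt without a separate text file.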

Globals & Constants

Variable: @ISA

inherits from Lingua::TT::Persistent and Exporter.

Variable: @EXPORT

No default exports.

Variable: %EXPORT_TAGS

Exported tags:

escape => [qw(escape_rtt unescape_rtt)]

Constructors etc.

new
$ta = $CLASS_OR_OBJECT->new(%opts);

%opts, %$ta:

buf=>$buf,		##-- raw text buffer
lines=>\@lines,	##-- raw tt-lines loaded with Lingua::TT::IO->getLines
off=>$off, len=>$len,	##-- byte offsets and lengths in $buf of lines in \@lines

clear
undef = $ta->clear();

Clears the object.

Methods: I/O: RTT

escape_rtt
$str_escaped = escape_rtt($str);

Escape a raw string $str for inclusion as RTT text.

unescape_rtt
$str = unescape_rtt($str_escaped);

Un-escape an RTT string, returns raw text.
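
The round trip between the two functions might look like the sketch below. The functions sketch_escape_rtt() and sketch_unescape_rtt() are hypothetical stand-ins written for illustration, and the escape set (backslash, TAB, newline) is an assumption; the module's own escape_rtt() and unescape_rtt() may handle further characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in: make a raw string safe for a single RTT line.
sub sketch_escape_rtt {
  my $s = shift;
  $s =~ s/\\/\\\\/g;   # backslashes first, so later escapes stay unambiguous
  $s =~ s/\t/\\t/g;    # TAB would otherwise split the field
  $s =~ s/\n/\\n/g;    # newline would otherwise split the line
  return $s;
}

# Hypothetical stand-in: invert sketch_escape_rtt() in one left-to-right pass.
sub sketch_unescape_rtt {
  my $s = shift;
  my %plain = (n => "\n", t => "\t", '\\' => '\\');
  $s =~ s/\\([nt\\])/$plain{$1}/g;
  return $s;
}

my $raw     = "a\tb\\n\nc";   # contains a TAB, a literal backslash-n, and a newline
my $escaped = sketch_escape_rtt($raw);
my $back    = sketch_unescape_rtt($escaped);
print "round trip ok\n" if $back eq $raw;
```

Escaping the backslash before the other characters matters: otherwise a literal backslash followed by "n" in the raw text would be indistinguishable from an escaped newline.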

toRttFile
$ta = $ta->toRttFile($filename_or_fh,%opts);

Saves $ta to an RTT file.

fromRttFile
$ta = $ta->fromRttFile($filename_or_fh,%opts);

Parses @$ta{qw(buf lines off len)} from $filename_or_fh.

Methods: I/O: TT (+ offsets)

parseOffsetLines
$ta = $ta->parseOffsetLines();

Parses @$ta{qw(off len)} from $ta->{lines}; destructively alters $ta->{lines}.
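
As an illustration of what such a destructive parse could involve, the sketch below strips an offset field out of each token line and stores the values in parallel arrays. Everything here is an assumption made for the example: the function name sketch_parse_offset_lines() is hypothetical, the line layout "TEXT\tOFFSET LEN\tATTRS..." is assumed, and off/len are modeled as plain array references, which may differ from the module's internal representation.

```perl
use strict;
use warnings;

# Hypothetical stand-in; assumes token lines of the form "TEXT\tOFFSET LEN\tATTRS...".
sub sketch_parse_offset_lines {
  my $ta = shift;
  my (@off, @len);
  foreach my $line (@{$ta->{lines}}) {      # $line aliases the array element
    my ($off, $len);
    if ($line !~ /^%%/ && $line ne '') {
      my ($text, $offlen, @rest) = split /\t/, $line, -1;
      if (defined $offlen && $offlen =~ /^(\d+) (\d+)\z/) {
        ($off, $len) = ($1, $2);
        $line = join("\t", $text, @rest);   # drop the offset field in place
      }
    }
    push @off, $off;                        # undef for comments and blank lines
    push @len, $len;
  }
  @$ta{qw(off len)} = (\@off, \@len);
  return $ta;
}

my $ta = {
  buf   => "Hello world!",                  # ASCII, so bytes equal characters here
  lines => ["Hello\t0 5\tITJ", "world\t6 5\tNN", "!\t11 1\tPUNCT", ""],
};
sketch_parse_offset_lines($ta);

foreach my $i (0 .. $#{$ta->{lines}}) {
  next unless defined $ta->{off}[$i];
  my $raw = substr($ta->{buf}, $ta->{off}[$i], $ta->{len}[$i]);
  print "$raw\t$ta->{lines}[$i]\n";
}
```

Keeping off and len index-aligned with lines means the raw text for any token line can be recovered from buf with a single substr() call.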

fromTTFile
$ta = $CLASS_OR_OBJECT->fromTTFile($filename_or_fh,%opts);

Parses $ta->{doc} from file.

toTTFile
$ta = $ta->toTTFile($filename_or_fh,%opts);

Saves $ta to file (with offset+len pairs).

Methods: I/O: text-buffer

loadTextFile
$ta = $ta->loadTextFile($filename_or_fh,%opts);

%opts:

raw => $bool,	##-- set to avoid utf8 flag on buf

saveTextFile
$ta = $ta->saveTextFile($filename_or_fh,%opts);

%opts:

raw => $bool,	##-- set to avoid utf8 flag on buf

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2013-2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

Lingua::TT(3pm), perl(1), ...