NAME
Lingua::TT::TextAlignment - TT Utils: alignment raw text <-> tokenized text
SYNOPSIS
##========================================================================
## PRELIMINARIES
use Lingua::TT::TextAlignment;
##========================================================================
## Constructors etc.
$ta = $CLASS_OR_OBJECT->new(%opts);
undef = $ta->clear();
##========================================================================
## Methods: I/O: RTT ("RAW \t TEXT \t ...", with %%$c= comments)
$str_escaped = escape_rtt($str);
$str = unescape_rtt($str_escaped);
$ta = $ta->toRttFile($filename_or_fh,%opts);
$ta = $ta->fromRttFile($filename_or_fh,%opts);
##========================================================================
## Methods: I/O: TT (+ offsets)
$ta = $ta->parseOffsetLines();
$ta = $CLASS_OR_OBJECT->fromTTFile($filename_or_fh,%opts);
$ta = $ta->toTTFile($filename_or_fh,%opts);
##========================================================================
## Methods: I/O: text-buffer
$ta = $ta->loadTextFile($filename_or_fh,%opts);
$ta = $ta->saveTextFile($filename_or_fh,%opts);
DESCRIPTION
The RTT "raw + text + tags" format is a "vertical" text format for combined storage of explicit token boundaries together with raw original (untokenized) text. It is a line-based format with lines of the following forms:
- "%%$RTT:COMPACT=" BOOL
-
RTT processing instruction declaring whether this file is in "compact" RTT format. BOOL is either 0 (zero) or 1 (one). If unspecified or false, the file is assumed to be in "prolix" (non-compact) format.
- "%%" COMMENT
-
Comments begin with "%%" and extend to the end of the line.
- "%%$c=" STRING
-
Indicates a text string STRING in the raw text with no corresponding string in the tt-tokenization; STRING is typically whitespace, and may contain escaped newlines ("\n") or TABs ("\t").
- TEXT "\t" TOK...
-
In "prolix" format, TAB-separated lines indicate aligned raw text and tokenized material. The first field is the raw text the token covers, and subsequent fields are the associated token attributes.
- WHITESPACE? TOKTEXT...
-
In "compact" mode, token lines may be prefixed by optional whitespace, which is assumed to be present only in the raw text representation, and TOKTEXT is assumed to be identical to the raw text covered by the token. Equivalent to the "prolix" lines:
%%$c=WHITESPACE
TOKTEXT "\t" TOKTEXT...
- WHITESPACE? RAWTEXT " $= " TOKTEXT...
-
"Compact" format for non-identity tokenizations, equivalent to the "prolix" lines:
%%$c=WHITESPACE
RAWTEXT "\t" TOKTEXT...
- "\n"
-
A blank line indicates a sentence boundary.
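The "prolix" line forms listed above can be told apart with a few pattern matches. The following is a minimal sketch, not the module's own parser: classify_rtt_line() is a hypothetical helper, the record layout it returns is an assumption made for illustration, and "compact" lines are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::TT::TextAlignment): classify one
# line of "prolix" RTT input according to the line forms described above.
sub classify_rtt_line {
  my ($line) = @_;
  $line =~ s/\r?\n\z//;                        # strip the line terminator

  # Blank line: sentence boundary
  return { type => 'eos' } if $line eq '';

  # Processing instruction: %%$RTT:COMPACT=BOOL
  if ($line =~ /^%%\$RTT:COMPACT=([01])\z/) {
    return { type => 'pi', compact => $1 + 0 };
  }

  # Raw text with no corresponding token: %%$c=STRING (STRING is left escaped)
  if ($line =~ /^%%\$c=(.*)\z/s) {
    return { type => 'untokenized', text => $1 };
  }

  # Any other line beginning with %% is an ordinary comment
  return { type => 'comment', text => substr($line, 2) } if $line =~ /^%%/;

  # Otherwise: TEXT "\t" TOK... (raw text first, then token attributes)
  my ($raw, @attrs) = split /\t/, $line, -1;
  return { type => 'token', raw => $raw, attrs => \@attrs };
}

# Small demonstration on an in-memory sample
my @sample = (
  "%%\$RTT:COMPACT=0\n",
  "%% a free-form comment\n",
  "Hello\tHello\tUH\n",
  "%%\$c= \n",
  "world\tworld\tNN\n",
  "\n",
);
print classify_rtt_line($_)->{type}, "\n" for @sample;
```

The more specific "%%$" patterns are tested before the generic comment pattern, since all three begin with "%%".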
Globals & Constants
- Variable: @ISA
-
Inherits from Lingua::TT::Persistent and Exporter.
- Variable: @EXPORT
-
No default exports.
- Variable: %EXPORT_TAGS
-
Exported tags:
escape => [qw(escape_rtt unescape_rtt)]
Constructors etc.
- new
-
$ta = $CLASS_OR_OBJECT->new(%opts);
%opts, %$ta:
buf=>$buf,              ##-- raw text buffer
lines=>\@lines,         ##-- raw tt-lines loaded with Lingua::TT::IO->getLines
off=>$off, len=>$len,   ##-- byte offsets and lengths in $buf of lines in \@lines
- clear
-
undef = $ta->clear();
Clears the object.
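To illustrate how the buf, off, and len fields described under new relate to one another, the sketch below recovers the raw text covered by each tt-line from a byte buffer. It uses a plain hash rather than a real Lingua::TT::TextAlignment object, and it assumes that off and len are array references parallel to lines; the module's internal representation may differ. raw_text_for() is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# A plain hash standing in for the object fields (assumed layout: off and
# len are array-refs parallel to lines, holding byte offsets and lengths).
my %ta = (
  buf   => encode_utf8("H\x{e4}llo world"),   # UTF-8 encoded byte buffer
  lines => ["H\x{e4}llo\tUH", "world\tNN"],   # tt-lines: text TAB tag
  off   => [0, 7],                             # byte offsets into buf
  len   => [6, 5],                             # byte lengths ("\x{e4}" takes 2 bytes)
);

# Hypothetical helper: return the decoded raw text aligned to tt-line $i.
sub raw_text_for {
  my ($ta, $i) = @_;
  my $octets = substr($ta->{buf}, $ta->{off}[$i], $ta->{len}[$i]);
  return decode_utf8($octets);
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $i (0 .. $#{ $ta{lines} }) {
  printf "%d+%d\t%s\n", $ta{off}[$i], $ta{len}[$i], raw_text_for(\%ta, $i);
}
```

Because the offsets count bytes rather than characters, the first token has length 6 even though it is 5 characters long.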
Methods: I/O: RTT
- escape_rtt
-
$str_escaped = escape_rtt($str);
Escape a raw string $str for inclusion as RTT text.
- unescape_rtt
-
$str = unescape_rtt($str_escaped);
Un-escapes an RTT string and returns the raw text.
- toRttFile
-
$ta = $ta->toRttFile($filename_or_fh,%opts);
Saves $ta to an RTT file.
- fromRttFile
-
$ta = $ta->fromRttFile($filename_or_fh,%opts);
Parses @$ta{qw(buf lines off len)} from $filename_or_fh.
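Since RTT is line-based and TAB-separated, raw text containing newlines or TABs has to be escaped before it can be stored on a single line. The sketch below shows one way such a round trip could work. It is not the implementation of escape_rtt() or unescape_rtt(): the function names are hypothetical, and the exact rules (in particular the escaping of the backslash itself) are assumptions.

```perl
use strict;
use warnings;

# Hypothetical escaping scheme (assumption): backslash, newline and TAB are
# written as two-character backslash sequences.
sub sketch_escape {
  my ($s) = @_;
  $s =~ s/\\/\\\\/g;   # escape backslashes first, so later escapes stay unambiguous
  $s =~ s/\n/\\n/g;    # newline -> backslash + "n"
  $s =~ s/\t/\\t/g;    # TAB     -> backslash + "t"
  return $s;
}

my %UNESCAPE = (n => "\n", t => "\t", '\\' => '\\');

sub sketch_unescape {
  my ($s) = @_;
  # A single left-to-right pass keeps an escaped backslash followed by "n"
  # from being misread as an escaped newline.
  $s =~ s/\\([nt\\])/$UNESCAPE{$1}/g;
  return $s;
}

my $raw     = "two\nlines\tand a \\ backslash";
my $escaped = sketch_escape($raw);
print "%%\$c=$escaped\n";
print "round trip ok\n" if sketch_unescape($escaped) eq $raw;
```

The escaped form contains no literal newline or TAB, so it fits on a single "%%$c=" line.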
Methods: I/O: TT (+ offsets)
- parseOffsetLines
-
$ta = $ta->parseOffsetLines();
Parses @$ta{qw(off len)} from $ta->{lines}; destructively alters $ta->{lines}.
- fromTTFile
-
$ta = $CLASS_OR_OBJECT->fromTTFile($filename_or_fh,%opts);
Parses $ta->{doc} from a TT file.
- toTTFile
-
$ta = $ta->toTTFile($filename_or_fh,%opts);
Saves $ta to a TT file (with offset+length pairs).
Methods: I/O: text-buffer
- loadTextFile
-
$ta = $ta->loadTextFile($filename_or_fh,%opts);
%opts:
raw => $bool, ##-- set to avoid utf8 flag on buf
- saveTextFile
-
$ta = $ta->saveTextFile($filename_or_fh,%opts);
%opts:
raw => $bool, ##-- set to avoid utf8 flag on buf
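The effect of a raw-style option can be illustrated with Perl's standard I/O layers. The sketch below is not the module's loadTextFile(): slurp_text() is a hypothetical helper, and mapping raw to the ":raw" layer versus ":encoding(UTF-8)" is an assumption about what "avoid utf8 flag on buf" means in practice.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file either as raw bytes (raw => 1)
# or as decoded UTF-8 characters (default).
sub slurp_text {
  my ($filename, %opts) = @_;
  my $layer = $opts{raw} ? ':raw' : ':encoding(UTF-8)';
  open(my $fh, "<$layer", $filename)
    or die "open failed for '$filename': $!";
  local $/ = undef;        # slurp mode: read everything at once
  my $buf = <$fh>;
  close($fh);
  return $buf;
}

# Write a small UTF-8 encoded sample file
my ($tmpfh, $tmpname) = tempfile(UNLINK => 1);
binmode($tmpfh, ':raw');
print {$tmpfh} "H\xc3\xa4llo\n";   # UTF-8 bytes for "H", a-umlaut, "llo", newline
close($tmpfh);

my $bytes = slurp_text($tmpname, raw => 1);
my $chars = slurp_text($tmpname);

printf "raw:     length=%d utf8-flag=%d\n", length($bytes), utf8::is_utf8($bytes) ? 1 : 0;
printf "decoded: length=%d utf8-flag=%d\n", length($chars), utf8::is_utf8($chars) ? 1 : 0;
```

Byte offsets and lengths only index correctly into a buffer that holds bytes, which is one plausible reason for reading the text buffer without the utf8 flag.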
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2013-2016 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.