NAME

DTA::TokWrap::Processor::tok2xml::perl - DTA tokenizer wrappers: t -> t.xml, pure-perl (slow, obsolete)

SYNOPSIS

use DTA::TokWrap::Processor::tok2xml::perl;

$t2x = DTA::TokWrap::Processor::tok2xml::perl->new(%opts);
$doc_or_undef = $t2x->tok2xml($doc);

DESCRIPTION

This module is deprecated; prefer DTA::TokWrap::Processor::tok2xml.

DTA::TokWrap::Processor::tok2xml::perl provides a pure-perl object-oriented DTA::TokWrap::Processor wrapper for converting "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format, for use with DTA::TokWrap::Document objects.

Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.

Constants

@ISA

DTA::TokWrap::Processor::tok2xml::perl inherits from DTA::TokWrap::Processor, and supports essentially the same API as DTA::TokWrap::Processor::tok2xml.

$NOC

Integer indicating a missing or implicit 'c' record; should be equivalent in value to the C code:

unsigned int NOC = ((unsigned int)-1)

for 32-bit "unsigned int"s.
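As a quick illustration of the value described above, the following sketch shows two ways to obtain the 32-bit all-ones integer in Perl. It is not taken from the module source; the local variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# Assumption: the sentinel is the all-ones 32-bit value,
# mirroring ((unsigned int)-1) in C with 32-bit unsigned ints.
my $NOC = 0xffffffff;

# Reinterpret a signed 32-bit -1 as an unsigned 32-bit integer;
# pack 'l' / unpack 'L' are exactly 32 bits wide on every platform.
my $from_c_cast = unpack('L', pack('l', -1));

printf "%u %u\n", $NOC, $from_c_cast;   # prints "4294967295 4294967295"
```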

Constructors etc.

new

$t2x = $CLASS_OR_OBJECT->new(%args);

Constructor.

%args, %$t2x:

##-- output document structure
docElt   => $elt,  ##-- output document element
sElt     => $elt,  ##-- output sentence element
wElt     => $elt,  ##-- output token element
aElt     => $elt,  ##-- output token-analysis element
posAttr  => $attr, ##-- output byte-position attribute
textAttr => $attr, ##-- output token-text attribute

You probably should NOT change any of the default output document structure options (unless this is the final module in your processing pipeline), since their values have ramifications beyond this module.

defaults

%defaults = CLASS->defaults();

Static class-dependent defaults.

Methods: Document Processing

tok2xml

$doc_or_undef = $CLASS_OR_OBJECT->tok2xml($doc);

Converts "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format in the DTA::TokWrap::Document object $doc.

Relevant %$doc keys:

bxdata   => \@bxdata,  ##-- (input) block index data
tokdata  => $tokdata,  ##-- (input) tokenizer output data (string)
cxdata   => \@cxchrs,  ##-- (input) character index data (array of arrays)
cxfile   => $cxfile,   ##-- (input) character index file
xtokdata => $xtokdata, ##-- (output) tokenizer output as XML
nchrs    => $nchrs,    ##-- (output) number of character index records
ntoks    => $ntoks,    ##-- (output) number of tokens parsed
##
tok2xml_stamp0 => $f,  ##-- (output) timestamp of operation begin
tok2xml_stamp  => $f,  ##-- (output) timestamp of operation end
xtokdata_stamp => $f,  ##-- (output) timestamp of operation end

%$t2x keys (temporary, for debugging):

tb2ci   => $tb2ci,     ##-- (temp) s.t. vec($tb2ci, $txbyte, 32) = $char_index_of_txbyte
ntb     => $ntb,       ##-- (temp) number of text bytes

May implicitly call $doc->mkbx(), $doc->loadCxFile(), and $doc->tokenize(), but should not have to.

txbyte_to_ci

\$tb2ci = $t2x->txbyte_to_ci(\@cxdata);

Low-level utility method.

Sets %$t2x keys: tb2ci, ntb, nchr
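The tb2ci key is a packed string of 32-bit integers accessed with Perl's built-in vec(). The sketch below shows how such a byte-offset to character-index map could be built and queried. It is a minimal illustration, not the module's implementation: the @char_byte_lengths input is hypothetical and does not reflect the actual layout of @cxdata records.

```perl
use strict;
use warnings;

# Hypothetical input: UTF-8 byte lengths of three characters,
# e.g. "a" (1 byte), "ä" (2 bytes), "b" (1 byte).
my @char_byte_lengths = (1, 2, 1);

my $tb2ci = '';   # packed vector of 32-bit unsigned integers
my $ntb   = 0;    # running count of text bytes

for my $ci (0 .. $#char_byte_lengths) {
    for (1 .. $char_byte_lengths[$ci]) {
        vec($tb2ci, $ntb, 32) = $ci;   # byte offset -> character index
        $ntb++;
    }
}

# Look up the character index for every byte offset.
my @lookup = map { vec($tb2ci, $_, 32) } 0 .. $ntb - 1;
print "ntb=$ntb map=@lookup\n";   # prints "ntb=4 map=0 1 1 2"
```

A packed vector keeps the map compact (4 bytes per text byte) compared with a Perl array of scalars, which is presumably why the module stores it this way.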

txtbyte_to_ci

\$ob2ci = $t2x->txtbyte_to_ci(\@cxdata,\@bxdata,\$tb2ci);

Low-level utility method.

Sets %$t2x keys: ob2ci

process_tt_data

\$tokxmlr = $t2x->process_tt_data($doc);

Low-level utility method.

Actually populates $doc->{xtokdata} by parsing $doc->{tokdata}, referring to $t2x->{ob2ci} for character-index lookup.

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
