NAME

DTA::TokWrap::Processor::mkbx - DTA tokenizer wrappers: (bx0doc,tx) -> bxdata

SYNOPSIS

use DTA::TokWrap::Processor::mkbx;

$mbx = DTA::TokWrap::Processor::mkbx->new(%opts);
$doc_or_undef = $mbx->mkbx($doc);

DESCRIPTION

DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.

Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.

Constants

@ISA

DTA::TokWrap::Processor::mkbx inherits from DTA::TokWrap::Processor.

Constructors etc.

new
$obj = $CLASS_OR_OBJECT->new(%args);

Constructor.

%args, %$obj:

##-- Block-sorting: hints
wbStr => $wbStr,                   ##-- word-break hint text
sbStr => $sbStr,                   ##-- sentence-break hint text
sortkey_attr => $attr,             ##-- sort-key attribute (default='dta.tw.key'; should agree with mkbx0)

##-- Block-sorting: low-level data
xp    => $xml_parser,              ##-- XML::Parser object for parsing $doc->{bx0doc}

defaults
%defaults = CLASS->defaults();

Static class-dependent defaults.

init
$mbx = $mbx->init();

Dynamic object-dependent defaults.

initXmlParser
$xp = $mbx->initXmlParser();

Creates and initializes $mbx->{xp}, an XML::Parser object used to parse $doc->{bx0doc}.

Methods: mkbx (bx0doc, txfile) => bxdata

mkbx
$doc_or_undef = $CLASS_OR_OBJECT->mkbx($doc);

Creates the serialized text-block-index $doc->{bxdata} for the DTA::TokWrap::Document object $doc.

Relevant %$doc keys:

bx0doc  => $bx0doc,  ##-- (input) preliminary block-index data (XML::LibXML::Document)
txfile  => $txfile,  ##-- (input) raw text index filename
bxdata  => \@blocks, ##-- (output) serialized block index
##
mkbx_stamp0 => $f,   ##-- (output) timestamp of operation begin
mkbx_stamp  => $f,   ##-- (output) timestamp of operation end
bxdata_stamp => $f,  ##-- (output) timestamp of operation end

Block data: @{$doc->{bxdata}} = @blocks = ($blk0, ..., $blkN), where each block %$blk has the following keys:

key    => $sortkey, ##-- (inherited) sort key
elt    => $eltname, ##-- element name which created this block
xoff   => $xoff,    ##-- XML byte offset where this block run begins
xlen   => $xlen,    ##-- XML byte length of this block (0 for hints)
toff   => $toff,    ##-- raw-text (.tx) byte offset where this block run begins
tlen   => $tlen,    ##-- raw-text (.tx) byte length of this block (0 for hints)
otext  => $otext,   ##-- output text (.txt) for this block
otoff  => $otoff,   ##-- output text (.txt) byte offset where this block run begins
otlen  => $otlen,   ##-- output text (.txt) length (bytes)

prune_empty_blocks
\@blocks = $mbx->prune_empty_blocks(\@blocks);
\@blocks = $mbx->prune_empty_blocks();

Low-level utility.

Removes empty 'c'-type blocks from @blocks (default=$mbx->{blocks}).
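The pruning step can be illustrated with a minimal stand-alone sketch. It does not call the module: the helper name prune_empty_c_blocks, the sample blocks, the 's' element name, and the emptiness test (a zero tlen) are all assumptions made for illustration, and the actual method may use a different criterion.

```perl
use strict;
use warnings;

## Hypothetical stand-in for prune_empty_blocks(): drops blocks whose
## element name is 'c' and whose raw-text length is zero (assumed test).
sub prune_empty_c_blocks {
  my ($blocks) = @_;
  @$blocks = grep { !($_->{elt} eq 'c' && $_->{tlen} == 0) } @$blocks;
  return $blocks;
}

my @blocks = (
  { key=>'k0', elt=>'c', toff=>0, tlen=>3 },  ##-- non-empty 'c' block: kept
  { key=>'k0', elt=>'c', toff=>3, tlen=>0 },  ##-- empty 'c' block: pruned
  { key=>'k0', elt=>'s', toff=>3, tlen=>0 },  ##-- zero-length non-'c' block: kept
);
prune_empty_c_blocks(\@blocks);
print scalar(@blocks), "\n";  ##-- prints 2
```

Filtering in place keeps the array reference valid for the caller, which matches the \@blocks return value shown in the signatures above.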

sort_blocks
\@blocks = $mbx->sort_blocks(\@blocks);

Low-level utility.

Sorts \@blocks (default=$mbx->{blocks}) using $mbx->{key2i}.
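As a rough sketch of key-based block sorting, the following stand-alone code assumes that key2i maps each sort key to an integer rank and that blocks sharing a key are ordered by xoff. Both the map contents and the tie-breaking rule are assumptions for illustration, not taken from the module.

```perl
use strict;
use warnings;

my %key2i = ('main'=>0, 'note.1'=>1);  ##-- hypothetical sort-key => rank map
my @blocks = (
  { key=>'note.1', xoff=>10 },
  { key=>'main',   xoff=>40 },
  { key=>'main',   xoff=>0  },
);

## Primary order: rank of the block's sort key; secondary (assumed): XML offset.
my @sorted = sort {
  $key2i{$a->{key}} <=> $key2i{$b->{key}}
    || $a->{xoff} <=> $b->{xoff}
} @blocks;

print join(' ', map { join('@', $_->{key}, $_->{xoff}) } @sorted), "\n";
##-- prints: main@0 main@40 note.1@10
```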

compute_block_text
\@blocks = $mbx->compute_block_text(\@blocks, \$txbuf);
\@blocks = $mbx->compute_block_text(\@blocks);
\@blocks = $mbx->compute_block_text();

Low-level utility.

Sets $blk->{otoff}, $blk->{otlen}, $blk->{otext} for each block $blk in @blocks (default=$mbx->{blocks}) by extracting raw-text (.tx) substrings from \$txbuf (default=$mbx->{txbufr}).

\@blocks should already have been sorted before this method is called.
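The substring extraction described above might look roughly like the following stand-alone sketch. The buffer contents and blocks are invented, the buffer is assumed to hold raw bytes so that substr() offsets are byte offsets, and the sketch does not model hint blocks or any extra text the actual method may insert.

```perl
use strict;
use warnings;

my $txbuf  = "One (note) two.";  ##-- hypothetical raw-text (.tx) buffer
my @blocks = (                   ##-- assumed to be sorted already
  { toff=>0,  tlen=>4 },         ##-- "One "
  { toff=>11, tlen=>4 },         ##-- "two."
  { toff=>4,  tlen=>6 },         ##-- "(note)"
);

my $otoff = 0;                   ##-- running offset in the output text
for my $blk (@blocks) {
  $blk->{otext} = substr($txbuf, $blk->{toff}, $blk->{tlen});
  $blk->{otoff} = $otoff;
  $blk->{otlen} = length($blk->{otext});
  $otoff       += $blk->{otlen};
}

print join('', map { $_->{otext} } @blocks), "\n";  ##-- prints: One two.(note)
```

Because otoff is accumulated in iteration order, the result is only meaningful when the blocks are processed in their final sorted order.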

SEE ALSO

DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2018 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.