NAME
DTA::TokWrap::Processor::mkbx - DTA tokenizer wrappers: (bx0doc,tx) -> bxdata
SYNOPSIS
use DTA::TokWrap::Processor::mkbx;
$mbx = DTA::TokWrap::Processor::mkbx->new(%opts);
$doc_or_undef = $mbx->mkbx($doc);
DESCRIPTION
DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.
Most users should use the high-level DTA::TokWrap wrapper class instead of using this module directly.
Constants
- @ISA
DTA::TokWrap::Processor::mkbx inherits from DTA::TokWrap::Processor.
Constructors etc.
- new
$obj = $CLASS_OR_OBJECT->new(%args);
Constructor.
%args, %$obj:
##-- Block-sorting: hints
wbStr        => $wbStr,      ##-- word-break hint text
sbStr        => $sbStr,      ##-- sentence-break hint text
sortkey_attr => $attr,       ##-- sort-key attribute (default='dta.tw.key'; should jive with mkbx0)
##-- Block-sorting: low-level data
xp           => $xml_parser, ##-- XML::Parser object for parsing $doc->{bx0doc}
- defaults
%defaults = CLASS->defaults();
Static class-dependent defaults.
- init
$mbx = $mbx->init();
Dynamic object-dependent defaults.
- initXmlParser
$xp = $mbx->initXmlParser();
Creates and initializes $mbx->{xp}, an XML::Parser object used to parse $doc->{bx0doc}.
Methods: mkbx (bx0doc, txfile) => bxdata
- mkbx
$doc_or_undef = $CLASS_OR_OBJECT->mkbx($doc);
Creates the serialized text-block-index $doc->{bxdata} for the DTA::TokWrap::Document object $doc.
Relevant %$doc keys:
bx0doc       => $bx0doc,  ##-- (input) preliminary block-index data (XML::LibXML::Document)
txfile       => $txfile,  ##-- (input) raw text index filename
bxdata       => \@blocks, ##-- (output) serialized block index
##
mkbx_stamp0  => $f,       ##-- (output) timestamp of operation begin
mkbx_stamp   => $f,       ##-- (output) timestamp of operation end
bxdata_stamp => $f,       ##-- (output) timestamp of operation end
Block data:
@{$doc->{bxdata}} = @blocks = ($blk0, ..., $blkN);
%$blk =
key   => $sortkey, ##-- (inherited) sort key
elt   => $eltname, ##-- element name which created this block
xoff  => $xoff,    ##-- XML byte offset where this block run begins
xlen  => $xlen,    ##-- XML byte length of this block (0 for hints)
toff  => $toff,    ##-- raw-text (.tx) byte offset where this block run begins
tlen  => $tlen,    ##-- raw-text (.tx) byte length of this block (0 for hints)
otext => $otext,   ##-- output text (.txt) for this block
otoff => $otoff,   ##-- output text (.txt) byte offset where this block run begins
otlen => $otlen,   ##-- output text (.txt) length (bytes)
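To make the record layout concrete, the following is a minimal sketch that builds two blocks shaped like the %$blk structure above and prints a one-line summary of each. All values, the sort key 'k1', the element name 's', and the helper block_summary() are hypothetical and serve illustration only; they are not taken from the module.

```perl
use strict;
use warnings;

## Hypothetical sample data shaped like the %$blk records described above.
## All values are invented for illustration.
my @blocks = (
  { key=>'k1', elt=>'c', xoff=>120, xlen=>5, toff=>0, tlen=>5,
    otext=>'Hello', otoff=>0, otlen=>5 },
  { key=>'k1', elt=>'s', xoff=>125, xlen=>0, toff=>5, tlen=>0,
    otext=>"\n", otoff=>5, otlen=>1 },
);

## block_summary(\%blk): hypothetical helper returning a one-line description
sub block_summary {
  my $blk = shift;
  return sprintf("%-4s key=%s xml=[%d,+%d] tx=[%d,+%d] out=[%d,+%d]",
                 @$blk{qw(elt key xoff xlen toff tlen otoff otlen)});
}

print block_summary($_), "\n" foreach @blocks;
```

Each summary line shows the three coordinate systems a block carries: XML bytes, raw-text (.tx) bytes, and output-text (.txt) bytes.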
- prune_empty_blocks
\@blocks = $mbx->prune_empty_blocks(\@blocks);
\@blocks = $mbx->prune_empty_blocks();
Low-level utility.
Removes empty 'c'-type blocks from @blocks (default=$mbx->{blocks}).
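The pruning step can be sketched in plain Perl as follows. This is not the module's implementation: the helper prune_empty_c_blocks() is hypothetical, the element name 's' is invented, and the sketch assumes that "empty" means a raw-text length {tlen} of zero.

```perl
use strict;
use warnings;

## prune_empty_c_blocks(\@blocks): hypothetical stand-in, NOT the module's code.
## Assumption: a 'c' block counts as "empty" when its {tlen} is zero.
sub prune_empty_c_blocks {
  my $blocks = shift;
  @$blocks = grep { !($_->{elt} eq 'c' && !$_->{tlen}) } @$blocks;
  return $blocks;
}

my @blocks = (
  { elt=>'c', tlen=>3 },
  { elt=>'c', tlen=>0 },   ##-- empty 'c' block: dropped
  { elt=>'s', tlen=>0 },   ##-- zero-length, but not a 'c' block: kept
);
prune_empty_c_blocks(\@blocks);
print scalar(@blocks), " blocks remain\n";
```

The array is modified in place and its reference returned, mirroring the \@blocks return convention documented above.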
- sort_blocks
\@blocks = $mbx->sort_blocks(\@blocks);
Low-level utility.
Sorts \@blocks (default=$mbx->{blocks}) using $mbx->{key2i}.
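As a rough illustration of key-based block sorting, the sketch below orders blocks by a key-to-rank map. It rests on assumptions not confirmed by this documentation: that key2i maps each sort key to an integer rank, and that ties are broken by XML offset {xoff}. The helper sort_by_key2i() and the sample keys are hypothetical.

```perl
use strict;
use warnings;

## Assumption: key2i maps each sort key to its rank in the desired output order.
my %key2i = (front => 0, body => 1, notes => 2);

my @blocks = (
  { key=>'notes', xoff=>10 },
  { key=>'body',  xoff=>40 },
  { key=>'front', xoff=>90 },
  { key=>'body',  xoff=>20 },
);

## sort_by_key2i(\@blocks,\%key2i): hypothetical helper; sorts in place.
## Primary criterion: rank of the block's key; tie-break (assumed): XML offset.
sub sort_by_key2i {
  my ($blocks, $key2i) = @_;
  @$blocks = sort {
    $key2i->{$a->{key}} <=> $key2i->{$b->{key}} || $a->{xoff} <=> $b->{xoff}
  } @$blocks;
  return $blocks;
}

sort_by_key2i(\@blocks, \%key2i);
print join(' ', map { "$_->{key}\@$_->{xoff}" } @blocks), "\n";
## prints: front@90 body@20 body@40 notes@10
```

Note how the 'front' block moves ahead of everything else although it occurs last in the XML source; blocks sharing a key keep their source order.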
- compute_block_text
\@blocks = $mbx->compute_block_text(\@blocks, \$txbuf);
\@blocks = $mbx->compute_block_text(\@blocks);
\@blocks = $mbx->compute_block_text();
Low-level utility.
Sets $blk->{otoff}, $blk->{otlen}, $blk->{otext} for each block $blk in @blocks (default=$mbx->{blocks}) by extracting raw-text (.tx) substrings from \$txbuf (default=$mbx->{txbufr}).
\@blocks should already have been sorted before this method is called.
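The substring extraction described above can be sketched as follows. This is a simplified stand-in, not the module's code: the helper fill_block_text() is hypothetical, each block's output text is assumed to be exactly its raw-text substring (any special handling of hint blocks is omitted), and output offsets are assumed to accumulate in list order.

```perl
use strict;
use warnings;

## fill_block_text(\@blocks,\$txbuf): hypothetical stand-in, NOT the module's code.
## Assumptions: $$txbufr holds raw bytes; output text is the raw-text substring;
## output offsets are assigned cumulatively in (already sorted) list order.
sub fill_block_text {
  my ($blocks, $txbufr) = @_;
  my $otoff = 0;
  foreach my $blk (@$blocks) {
    $blk->{otext} = substr($$txbufr, $blk->{toff}, $blk->{tlen});
    $blk->{otoff} = $otoff;
    $blk->{otlen} = length($blk->{otext});
    $otoff += $blk->{otlen};
  }
  return $blocks;
}

my $txbuf  = "worldhello ";
my @blocks = (
  { toff=>5, tlen=>6 },  ##-- "hello "
  { toff=>0, tlen=>5 },  ##-- "world"
);
fill_block_text(\@blocks, \$txbuf);
print join('', map { $_->{otext} } @blocks), "\n";
## prints: hello world
```

Because output offsets follow list order rather than raw-text order, the result depends on the blocks having been sorted beforehand, as the note above requires.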
SEE ALSO
DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.