NAME
DTA::TokWrap::Document - DTA tokenizer wrappers: document wrapper
SYNOPSIS
##========================================================================
## Constructors etc.
$doc = $CLASS_OR_OBJECT->new(%args);
%defaults = $CLASS->defaults();
$doc = $doc->init();
$doc->DESTROY();
##========================================================================
## Methods: Pseudo-I/O
$newdoc = $CLASS_OR_OBJECT->open($xmlfile,%docNewOptions);
$bool = $doc->close();
@notempkeys = $doc->notempkeys();
@tempfiles = $doc->tempfiles();
##========================================================================
## Methods: pseudo-pseudo-make
$bool = $doc->genKey($key);
$keyval_or_undef = $doc->makeKey($key);
##========================================================================
## Methods: Low-Level: generator-subclass wrappers
$doc_or_undef = $doc->mkindex();
$doc_or_undef = $doc->mkbx0();
$doc_or_undef = $doc->mkbx();
$doc_or_undef = $doc->tokenize();
$doc_or_undef = $doc->tok2xml();
$doc_or_undef = $doc->txmlanno();
##========================================================================
## Methods: Member I/O
$bx0doc_or_undef = $doc->loadBx0File();
$bxdata_or_undef = $doc->loadBxFile();
$cxdata_or_undef = $doc->loadCxFile();
\$tokdata_or_undef = $doc->loadTokFile();
\$xtokdata_or_undef = $doc->loadXtokFile();
$xtokDoc = $doc->xtokDoc();
\$xmlbuf_or_undef = $doc->loadXmlData();
\$txtbuf_or_undef = $doc->loadTxtData();
$file_or_undef = $doc->saveBx0File();
$file_or_undef = $doc->saveBxFile();
$file_or_undef = $doc->saveTxtFile();
$file_or_undef = $doc->saveTokFile();
$file_or_undef = $doc->saveXtokFile();
$file_or_undef = $doc->saveTcfFile();
##========================================================================
## Methods: Profiling
$ntoks_or_undef = $doc->nTokens();
$nxbytes_or_undef = $doc->nXmlBytes();
DESCRIPTION
DTA::TokWrap::Document provides a Perl class for representing a single DTA base-format XML file and its associated indices. Together with the DTA::TokWrap module, this class comprises the top-level API of the DTA::TokWrap distribution.
Globals
- @ISA
-
DTA::TokWrap::Document inherits from DTA::TokWrap::Base.
- $TOKENIZE_CLASS
-
$TOKENIZE_CLASS
Default tokenizer sub-processor class (default='DTA::TokWrap::Processor::tokenize').
- Variables: ($CX_ID,$CX_XOFF,$CX_XLEN,$CX_TOFF,$CX_TLEN,$CX_TEXT,$CX_ATTRS)
-
Field indices in .cx files generated by the mkindex() method.
Constructors etc.
- new
-
$doc = $CLASS_OR_OBJECT->new(%args);
Low-level constructor for document wrapper object. You should probably use either DTA::TokWrap->open() or DTA::TokWrap::Document->open() instead of calling this constructor directly.
%args, %$doc:
##-- Document class
class     => $class,   ##-- delegate call to $class->new(%args)
##
##-- Source data
xmlfile   => $xmlfile, ##-- source filename
xmlbase   => $xmlbase, ##-- xml:base for generated files (default=basename($xmlfile))
xmldata   => $xmldata, ##-- source buffer (for addws, tcfencode)
##
##-- pseudo-make options
traceMake => $level,   ##-- log-level for makeKey() trace (e.g. 'debug'; default=undef (none))
traceGen  => $level,   ##-- log-level for genKey() trace (e.g. 'trace'; default=undef (none))
traceProc => $level,   ##-- log-level for document-called processor calls (default=none)
traceLoad => $level,   ##-- log-level for load* trace (default=none)
traceSave => $level,   ##-- log-level for save* trace (default=none)
genDummy  => $bool,    ##-- if true, generator will not actually run (a la `make -n`)
##
##-- generator data (optional)
tw         => $tw,     ##-- a DTA::TokWrap object storing individual generators
traceOpen  => $level,  ##-- log-level for open() trace (e.g. 'info'; default=undef (none))
traceClose => $level,  ##-- log-level for close() trace (e.g. 'trace'; default=undef (none))
##
##-- generated data (common)
outdir    => $outdir,   ##-- output directory for generated data (default=.)
tmpdir    => $tmpdir,   ##-- temporary directory for generated data (default=$ENV{DTATW_TMP}||$outdir)
keeptmp   => $bool,     ##-- if true, temporary document-local files will be kept on $doc->close()
notmpre   => $regex,    ##-- non-temporary filename regex
notmpkeys => $keys,     ##-- non-temporary keys, space-separated list
outbase   => $filebase, ##-- output basename (default=`basename $xmlbase .xml`)
format    => $level,    ##-- default formatting level for XML output
##
##-- mkindex data (see DTA::TokWrap::Processor::mkindex)
cxfile    => $cxfile,   ##-- character index file (default="$tmpdir/$outbase.cx")
cxdata    => $cxdata,   ##-- character index data (see loadCxFile() method)
sxfile    => $sxfile,   ##-- structure index file (default="$tmpdir/$outbase.sx")
txfile    => $txfile,   ##-- raw text index file (default="$tmpdir/$outbase.tx")
##
##-- mkbx0 data (see DTA::TokWrap::Processor::mkbx0)
bx0doc    => $bx0doc,   ##-- pre-serialized block-index XML::LibXML::Document
bx0file   => $bx0file,  ##-- pre-serialized block-index XML file (default="$outbase.bx0"; optional)
##
##-- mkbx data (see DTA::TokWrap::Processor::mkbx)
bxdata    => \@bxdata,    ##-- block-list, see DTA::TokWrap::mkbx::mkbx() for details
bxfile    => $bxfile,     ##-- serialized block-index CSV file (default="$tmpdir/$outbase.bx"; optional)
txtfile   => $txtfile,    ##-- serialized & hinted text file (default="$tmpdir/$outbase.txt"; optional)
txtdata   => $txtdata,    ##-- serialized & hinted text file (used by tcfencode, must be loaded explicitly with loadTxtData())
##
##-- tokenize data (see DTA::TokWrap::Processor::tokenize, DTA::TokWrap::Processor::tokenize::dummy)
tokdata0  => $tokdata0,   ##-- tokenizer output data (slurped string)
tokfile0  => $tokfile0,   ##-- tokenizer output file (default="$tmpdir/$outbase.t0"; optional)
##
##-- post-tokenize data (see DTA::TokWrap::Processor::tokenize1)
tokdata1  => $tokdata1,   ##-- post-tokenizer output data (slurped string)
tokfile1  => $tokfile1,   ##-- post-tokenizer output file (default="$tmpdir/$outbase.t1"; optional)
##
##-- tokenizer xml data (see DTA::TokWrap::Processor::tok2xml)
xtokdata  => $xtokdata,   ##-- XML-ified tokenizer output data
xtokfile  => $xtokfile,   ##-- XML-ified tokenizer output file (default="$outdir/$outbase.t.xml")
xtokdoc   => $xtokdoc,    ##-- XML::LibXML::Document for $xtokdata (parsed from string)
##
##-- tokenizer xml annotations (see DTA::TokWrap::Processor::txmlanno)
axtokdata => $axtokdata,  ##-- optional external XML annotation data (for splicing into $xtokdata)
axtokfile => $axtokfile,  ##-- optional external XML annotation file (for splicing into $xtokfile; default="$outdir/$outbase.ta.xml")
xtokfile0 => $xtokfile0,  ##-- XML-ified tokenizer output file (default=none or "$outdir/$outbase.t0.xml" if {keeptmp} is true)
##
##-- ws-splice (see DTA::TokWrap::Processor::addws)
#cwsdata => $cwsdata, ##-- ws-spliced output data (xmlfile with <s> and <w> elements)
cwsfile => $cwsfile, ##-- ws-spliced output file (default="$outdir/$outbase.cws.xml")
##
##-- property-splice (see DTA::TokWrap::Processor::idsplice)
## cwstbasebufr => \$bdata, ##-- base data-ref for idsplice (xml with //*/@id) [default=\$cwsdata if defined]
## cwstbasefile => $bfile, ##-- source file for $bdata [default=$cwsfile]
## cwstsobufr => \$sodata, ##-- standoff data-ref for idsplice (xml with //*/@id, additional attributes and content) [default=\$xtokdata]
## cwstsofile => $sofile, ##-- source file for $sodata [default=$xtokfile]
## cwstbufr => $wstbufr, ##-- idsplice output buffer (base + id-spliced attributes, content) -- available for override, not used by default
## cwstfile => $wstfile, ##-- idsplice output file [default="$outdir/$outbase.cwst.xml"]
##
##-- tcfencode data (see DTA::TokWrap::Processor::tcfencode)
tcfdoc     => $tcfdoc,     ##-- XML::LibXML::Document representing TCF-encoded data
tcffile    => $tcffile,    ##-- TCF file
tcflang    => $lang,       ##-- TCF language attribute (default: 'de')
##
##-- tcftokenize data (see DTA::TokWrap::Processor::tcftokenize)
tcftokdoc  => $tcftokdoc,  ##-- XML::LibXML::Document representing tokenized TCF data (== $tcfdoc)
tcftokfile => $tcftokfile, ##-- tcf-tokenized file
##
##-- tcfdecode0 data (see DTA::TokWrap::Processor::tcfdecode0)
tcfxfile   => $tcfxfile,   ##-- tcf-decoded base xml file [default="$tmpdir/$outbase.tcfx"]
tcfxdata   => $tcfxdata,   ##-- tcf-decoded base xml data
tcftfile   => $tcftfile,   ##-- tcf-decoded serial text file [default="$tmpdir/$outbase.tcft"]
tcftdata   => $tcftdata,   ##-- tcf-decoded serial txt data
tcfwdata   => $tcfwdata,   ##-- tcf-decoded token data, tt-format: "TEXT\tSID/WID\n"
tcfwfile   => $tcfwfile,   ##-- tcf-decoded token file, tt-format [default="$tmpdir/$outbase.tcfw"]
tcfadata   => $tcfadata,   ##-- tcf-decoded token attributes for idsplice, data
tcfafile   => $tcfafile,   ##-- tcf-decoded token attributes for idsplice, file [default="$tmpdir/$outbase.tcfa"]
##
##-- tcfalign data (PROXIED, see DTA::TokWrap::Processor::tcfalign : uses tokdata1,tokfile1)
##-- tcf2txml data (PROXIED, see DTA::TokWrap::Processor::tok2xml : uses tokfile1,cxfile,bxfile,xtokdata)
##-- tcfdecode data
tcfcwsfile => $tcfcwsfile, ##-- tcf-decoded+aligned+ws-spliced output file (default="$outdir/$outbase.tcfws.xml")
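To illustrate how the path defaults listed above fit together, here is a minimal stand-alone sketch. It does not load DTA::TokWrap: sketch_defaults() is a hypothetical helper, it covers only a handful of the documented keys, and the real init() method may well compute its defaults differently.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

## Hypothetical helper (not part of DTA::TokWrap): fills in a few of the
## documented path defaults for keys which the caller left unset.
sub sketch_defaults {
  my %doc = @_;
  $doc{outdir}   //= '.';
  $doc{tmpdir}   //= $ENV{DTATW_TMP} || $doc{outdir};
  $doc{xmlbase}  //= basename($doc{xmlfile});
  $doc{outbase}  //= basename($doc{xmlbase}, '.xml');
  $doc{cxfile}   //= "$doc{tmpdir}/$doc{outbase}.cx";
  $doc{xtokfile} //= "$doc{outdir}/$doc{outbase}.t.xml";
  return \%doc;
}

## Made-up file and directory names, for illustration only.
my $d = sketch_defaults(xmlfile => '/data/book.xml', outdir => 'out', tmpdir => 'tmp');
print "$_ = $d->{$_}\n" for qw(xmlbase outbase cxfile xtokfile);
```

With these inputs, temporary index files land under tmp/ while the XML-ified tokenizer output lands under out/, which mirrors the $tmpdir / $outdir split in the option list.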
- defaults
-
%defaults = $CLASS->defaults();
Static object defaults.
- init
-
$doc = $doc->init();
Set computed object defaults.
- DESTROY
-
$doc->DESTROY();
Destructor. Implicitly calls close().
Methods: Pseudo-I/O
- open
-
$newdoc = $CLASS_OR_OBJECT->open($xmlfile,%docNewOptions);
Wrapper for $CLASS_OR_OBJECT->new(), with some additional sanity checks.
- close
-
$bool = $doc->close();
$bool = $doc->close($is_destructor);
"Closes" document $doc, adding profiling information to $doc->{tw} if present.
Unlinks any temporary files in $doc unless $doc->{keeptmp} is true. All %$doc keys ending in 'file' are considered 'temporary' files, except: xmlfile, xtokfile, sosfile, sowfile, soafile.
If $is_destructor is false (default), resets all keys in %$doc to default values (thus making $doc essentially unusable).
- notempkeys
-
@notempkeys
=
$doc
->notempkeys();
Returns the list of document keys ending in 'file' which are not considered "temporary". Used by $doc->tempfiles().
- tempfiles
-
@tempfiles
=
$doc
->tempfiles();
Returns list of temporary filenames which have been generated by $doc, or an empty list if $doc->{keeptmp} is true. Used by $doc->close().
Checks $doc->{"${filekey}_stamp"} to determine whether this document generated the file named by $doc->{"$filekey"}.
Implementation: returns the values of all %$doc keys ending with 'file', except for those returned by $doc->notempkeys().
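The selection rule described above can be sketched in a few lines of stand-alone Perl. This is only an approximation under stated assumptions: sketch_tempfiles() is a hypothetical stand-in which takes the non-temporary keys as arguments, %doc is a plain hash with made-up values, and a "_stamp" entry is assumed to be simply true for files the document generated.

```perl
use strict;
use warnings;

## Hypothetical stand-in for $doc->tempfiles(); $doc is a plain HASH-ref here.
sub sketch_tempfiles {
  my ($doc, @notempkeys) = @_;
  return () if $doc->{keeptmp};              ##-- nothing is temporary if keeptmp is set
  my %keep = map { ($_ => 1) } @notempkeys;
  return map  { $doc->{$_} }
         grep { /file$/ && !$keep{$_} && defined($doc->{$_}) && $doc->{"${_}_stamp"} }
         sort keys %$doc;
}

my %doc = (
  xmlfile  => 'book.xml',                               ##-- source file: never temporary
  cxfile   => 'tmp/book.cx',    cxfile_stamp   => 1,    ##-- generated by this document
  bxfile   => 'tmp/book.bx',                            ##-- no stamp: not generated here
  xtokfile => 'out/book.t.xml', xtokfile_stamp => 1,    ##-- generated, but non-temporary
);
my @tmp = sketch_tempfiles(\%doc, qw(xmlfile xtokfile));
print "temporary: @tmp\n";
```

Only cxfile survives both filters here: it ends in 'file', is not listed as non-temporary, and carries a stamp.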
Methods: pseudo-pseudo-make
- %KEYGEN
-
%KEYGEN = ($dataKey => $generatorSpec, ...)
Low-level hash mapping data keys to the generating processes (subroutines, classes, ...).
$generatorSpec is one of:
$key      : calls $doc->can($key)->($doc)
\&coderef : calls &coderef($doc)
\@array   : array of atomic $generatorSpecs (keys or CODE-refs)
- genKey
-
$bool = $doc->genKey($key);
$bool = $doc->genKey($key,\%KEYGEN);
(Re-)generate a data key (single step only, ignoring dependencies). An argument $key without a value $KEYGEN{$key} triggers an error.
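The three kinds of $generatorSpec can be illustrated with a toy dispatcher. The sketch below is not the module's implementation: SketchDoc, runSpec(), and the key names in %KEYGEN are hypothetical, and stopping at the first false return value in an array spec is an assumption.

```perl
use strict;
use warnings;

package SketchDoc;   ##-- hypothetical toy class, not DTA::TokWrap::Document

sub new     { my ($class, %args) = @_; return bless { log => [], %args }, $class; }
sub mkindex { my $doc = shift; push @{$doc->{log}}, 'mkindex'; return $doc; }
sub mkbx0   { my $doc = shift; push @{$doc->{log}}, 'mkbx0';   return $doc; }

## Run one atomic generator spec: a method name or a CODE-ref.
sub runSpec {
  my ($doc, $spec) = @_;
  return $spec->($doc) if ref($spec) eq 'CODE';
  my $method = $doc->can($spec) or die "no method '$spec'";
  return $method->($doc);
}

## Single-step generation of $key; dies if %$keygen has no entry for $key.
sub genKey {
  my ($doc, $key, $keygen) = @_;
  my $spec = $keygen->{$key};
  die "genKey(): no generator for key '$key'" unless defined $spec;
  for my $atom (ref($spec) eq 'ARRAY' ? @$spec : $spec) {
    $doc->runSpec($atom) or return 0;     ##-- assumption: abort on first failure
  }
  return 1;
}

package main;

my %KEYGEN = (
  cxfile => 'mkindex',                          ##-- method name
  bx0doc => sub { $_[0]->mkbx0 },               ##-- CODE-ref
  all    => ['mkindex', sub { $_[0]->mkbx0 }],  ##-- array of atomic specs
);
my $doc = SketchDoc->new;
$doc->genKey($_, \%KEYGEN) for qw(cxfile bx0doc all);
print join(' ', @{$doc->{log}}), "\n";
```

Note that nothing here tracks dependencies: each call runs exactly the generators listed for the requested key, which is the "single step only" behavior described above.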
- makeKey
-
$keyval_or_undef = $doc->makeKey($key);
Just an alias for $doc->genKey($key) here, but see DTA::TokWrap::Document::Maker for a more sophisticated implementation.
Methods: Low-Level: generator-subclass wrappers
- mkindex
-
$doc_or_undef = $doc->mkindex($mkindex);
$doc_or_undef = $doc->mkindex();
- mkbx0
-
$doc_or_undef = $doc->mkbx0($mkbx0);
$doc_or_undef = $doc->mkbx0();
- mkbx
-
$doc_or_undef = $doc->mkbx($mkbx);
$doc_or_undef = $doc->mkbx();
- tokenize
-
$doc_or_undef = $doc->tokenize($tokenize);
$doc_or_undef = $doc->tokenize();
see DTA::TokWrap::Processor::tokenize::tokenize(), DTA::TokWrap::Processor::tokenize::http::tokenize(), DTA::TokWrap::Processor::tokenize::tomasotath::tokenize(), DTA::TokWrap::Processor::tokenize::dummy::tokenize().
Default tokenizer subclass is given by package-global $TOKENIZE_CLASS.
- tokenize1
-
$doc_or_undef = $doc->tokenize1($tokenize1);
$doc_or_undef = $doc->tokenize1();
- tok2xml
-
$doc_or_undef = $doc->tok2xml($tok2xml);
$doc_or_undef = $doc->tok2xml();
- txmlanno
-
$doc_or_undef = $doc->txmlanno($txmlanno);
$doc_or_undef = $doc->txmlanno();
- addws
-
$doc_or_undef = $doc->addws($addws);
$doc_or_undef = $doc->addws();
- idsplice
-
$doc_or_undef = $doc->idsplice($idsplice);
$doc_or_undef = $doc->idsplice();
- tcfencode
-
$doc_or_undef = $doc->tcfencode($tcfencode);
$doc_or_undef = $doc->tcfencode();
Methods: Member I/O
- loadBx0File
-
$bx0doc_or_undef = $doc->loadBx0File($filename_or_fh);
$bx0doc_or_undef = $doc->loadBx0File();
loads $doc->{bx0doc} from $filename_or_fh (default=$doc->{bx0file})
- loadBxFile
-
$bxdata_or_undef = $doc->loadBxFile($bxfile_or_fh,$txtfile_or_fh);
$bxdata_or_undef = $doc->loadBxFile();
loads $doc->{bxdata} from @$doc{qw(bxfile txtfile)}
requires $doc->{txfile}
- loadCxFile
-
$cxdata_or_undef = $doc->loadCxFile($filename_or_fh);
$cxdata_or_undef = $doc->loadCxFile();
loads $doc->{cxdata} from $filename_or_fh (default=$doc->{cxfile}).
$doc->{cxdata} = [ $cx0, ... ], where:
each $cx = [ $id, $xoff,$xlen, $toff,$tlen, $text, @attrs ]
package globals $CX_ID, $CX_XOFF, etc. are indices for $cx arrays
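As a rough illustration of how such records are addressed, the sketch below builds a tiny $cxdata structure by hand. The numeric index values (0..6) are an assumption derived from the documented field order, the records are made up, and treating $xoff/$xlen as offsets and lengths into the source XML is likewise an assumption.

```perl
use strict;
use warnings;

## Assumed values: positions follow the documented field order.
my ($CX_ID, $CX_XOFF, $CX_XLEN, $CX_TOFF, $CX_TLEN, $CX_TEXT, $CX_ATTRS) = (0 .. 6);

## Made-up records: [ $id, $xoff,$xlen, $toff,$tlen, $text, @attrs ]
my $cxdata = [
  [ 'c1', 120, 1, 0, 1, 'D' ],
  [ 'c2', 121, 1, 1, 1, 'a' ],
  [ 'c3', 122, 1, 2, 1, 's' ],
];

## Concatenate the text of all character records.
my $text = join '', map { $_->[$CX_TEXT] } @$cxdata;

## Span covered in the source XML, from the first record to the end of the last.
my ($first, $last) = @$cxdata[0, -1];
my $xspan = $last->[$CX_XOFF] + $last->[$CX_XLEN] - $first->[$CX_XOFF];

printf "%d records, text '%s', XML span %d\n", scalar(@$cxdata), $text, $xspan;
```

Using the named indices rather than literal numbers keeps such code readable and insulates it from any change in field order.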
- loadTokFileN
-
\$tokdata_or_undef = $doc->loadTokFileN($n,$filename_or_fh);
\$tokdata_or_undef = $doc->loadTokFileN($n);
loads $doc->{"tokdata${n}"} from $filename_or_fh (default=$doc->{"tokfile${n}"})
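The calling convention (a buffer stored under a key derived from $n, and a reference to that buffer returned on success) might look roughly like the following. This is a sketch only: sketch_loadTokFileN() is a hypothetical stand-in which accepts filenames but not filehandles, and the file content is made up rather than real tokenizer output.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

## Hypothetical stand-in for loadTokFileN(): slurps a file into
## $doc->{"tokdata$n"} and returns a reference to that buffer, or undef.
sub sketch_loadTokFileN {
  my ($doc, $n, $file) = @_;
  $file //= $doc->{"tokfile$n"};          ##-- default: the document's own file key
  return undef unless defined $file;
  open(my $fh, '<', $file) or return undef;
  local $/ = undef;                       ##-- slurp mode
  $doc->{"tokdata$n"} = <$fh>;
  close($fh);
  return \$doc->{"tokdata$n"};
}

## Write a small made-up file to load.
my ($tmpfh, $tmpname) = tempfile(UNLINK => 1);
print {$tmpfh} "Das\t0 3\nist\t4 3\n";
close($tmpfh);

my %doc = (tokfile0 => $tmpname);
my $ref = sketch_loadTokFileN(\%doc, 0);
print length($$ref), " characters loaded into tokdata0\n";
```

Returning a reference avoids copying a potentially large buffer, and lets the caller distinguish failure (undef) from an empty but successfully loaded file.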
- loadTokFile0
-
\$tokdata0_or_undef = $doc->loadTokFile0(@args)
Wrapper for $doc->loadTokFileN(0,@args)
- loadTokFile1
-
\$tokdata1_or_undef = $doc->loadTokFile1(@args)
Wrapper for $doc->loadTokFileN(1,@args)
- loadXtokFile
-
\$xtokdata_or_undef = $doc->loadXtokFile($filename_or_fh);
\$xtokdata_or_undef = $doc->loadXtokFile();
loads $doc->{xtokdata} from $filename_or_fh (default=$doc->{xtokfile})
see also $doc->xtokDoc().
- xtokDoc
-
$xtokDoc = $doc->xtokDoc(\$xtokdata);
$xtokDoc = $doc->xtokDoc();
parse \$xtokdata (default: \$doc->{xtokdata}) string into $doc->{xtokdoc}
warning: may call $doc->tok2xml()
- loadXmlData
-
$xmlbuf_or_undef = $doc->loadXmlData($filename_or_fh)
$xmlbuf_or_undef = $doc->loadXmlData()
loads $doc->{xmldata} from $filename_or_fh (default=$doc->{xmlfile}).
- loadCwsData
-
\$xmlbuf_or_undef = $doc->loadCwsData($filename_or_fh)
\$xmlbuf_or_undef = $doc->loadCwsData()
DEPRECATED
loads $doc->{cwsdata} from $filename_or_fh (default=$doc->{cwsfile}).
- loadTxtData
-
\$txtbuf_or_undef = $doc->loadTxtData($filename_or_fh)
\$txtbuf_or_undef = $doc->loadTxtData()
loads $doc->{txtdata} from $filename_or_fh (default=$doc->{txtfile})
- saveBx0File
-
$file_or_undef = $doc->saveBx0File($filename_or_fh,$bx0doc,%opts);
$file_or_undef = $doc->saveBx0File($filename_or_fh);
$file_or_undef = $doc->saveBx0File();
Saves $bx0doc (default=$doc->{bx0doc}) to $filename_or_fh (default=$doc->{bx0file}="$doc->{outdir}/$doc->{outbase}.bx0"), and sets both $doc->{bx0file} and $doc->{bx0file_stamp}.
%opts:
format => $level, ##-- output format (default=$doc->{format})
- saveBxFile
-
$file_or_undef = $doc->saveBxFile($filename_or_fh,\@blocks);
$file_or_undef = $doc->saveBxFile($filename_or_fh);
$file_or_undef = $doc->saveBxFile();
Saves text-block data \@blocks (default=$doc->{bxdata}) to $filename_or_fh (default=$doc->{bxfile}), and sets both $doc->{bxfile} and $doc->{bxfile_stamp}.
- saveTxtFile
-
$file_or_undef = $doc->saveTxtFile($filename_or_fh,\@blocks,%opts);
$file_or_undef = $doc->saveTxtFile($filename_or_fh);
$file_or_undef = $doc->saveTxtFile();
Saves serialized text extracted from \@blocks (default=$doc->{bxdata}) to $filename_or_fh (default=$doc->{txtfile}="$doc->{outdir}/$doc->{outbase}.txt"), and sets both $doc->{txtfile} and $doc->{txtfile_stamp}.
%opts:
debug => $bool, ##-- if true, debugging text will be printed (and saveBxFile() offsets will be wrong)
- saveTokFileN
-
$file_or_undef = $doc->saveTokFileN($n,$filename_or_fh,\$tokdata);
$file_or_undef = $doc->saveTokFileN($n,$filename_or_fh);
$file_or_undef = $doc->saveTokFileN($n);
Saves tokenizer output data string $tokdata (default=$doc->{"tokdata${n}"}) to $filename_or_fh (default=$doc->{"tokfile${n}"}="$doc->{outdir}/$doc->{outbase}.t${n}"), and sets both $doc->{"tokfile${n}"} and $doc->{"tokfile_stamp${n}"}.
- saveTokFile0
-
$file_or_undef = $doc->saveTokFile0(@args)
Wrapper for $doc->saveTokFileN(0,@args)
- saveTokFile1
-
$file_or_undef = $doc->saveTokFile1(@args)
Wrapper for $doc->saveTokFileN(1,@args)
- saveXtokFile
-
$file_or_undef = $doc->saveXtokFile($filename_or_fh,\$xtokdata,%opts);
$file_or_undef = $doc->saveXtokFile($filename_or_fh);
$file_or_undef = $doc->saveXtokFile();
Saves XML-ified master tokenizer data string $xtokdata (default=$doc->{xtokdata}) to $filename_or_fh (default=$doc->{xtokfile}="$doc->{outdir}/$doc->{outbase}.t.xml"), and sets both $doc->{xtokfile} and $doc->{xtokfile_stamp}.
- saveTcfFile
-
$file_or_undef = $doc->saveTcfFile($filename_or_fh,$tcfdoc,%opts)
$file_or_undef = $doc->saveTcfFile($filename_or_fh)
$file_or_undef = $doc->saveTcfFile()
known %opts:
format => $level, ##-- formatting level (default=1)
Saves TCF-encoded document $tcfdoc (default=$doc->{tcfdoc}) to $filename_or_fh (default=$doc->{tcffile}="$doc->{outdir}/$doc->{outbase}.t.xml"), and sets $doc->{tcffile_stamp}.
Methods: Profiling
- nTokens
-
$ntoks_or_undef = $doc->nTokens();
Returns number of tokens in the currently opened document, if known.
- nXmlBytes
-
$nxbytes_or_undef = $doc->nXmlBytes();
Returns the number of bytes in the base-format XML file, if known (and it should always be known!).
SEE ALSO
DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.