NAME
DTA::TokWrap::Processor::tcfdecode0 - DTA tokenizer wrappers: TCF[tei,text,tokens,sentences]->TEI,text extraction
SYNOPSIS
use DTA::TokWrap::Processor::tcfdecode0;
$dec = DTA::TokWrap::Processor::tcfdecode0->new(%opts);
$doc_or_undef = $dec->tcfdecode0($doc);
DESCRIPTION
DTA::TokWrap::Processor::tcfdecode0 provides an object-oriented DTA::TokWrap::Processor wrapper for extracting the tei, text, tokens, and sentences layers from a tokenized TCF ("Text Corpus Format", cf. http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format) document as originally encoded by a DTA::TokWrap::Processor::tcfencode ("tcfencoder") object. The encoded TCF document should have the following layers:
- textSource[@type="application/tei+xml"]
-
Source TEI-XML encoded as an XML text node; should be identical to the source XML {xmlfile} or {xmldata} passed to the tcfencoder. Also accepts type "text/tei+xml".
- text
-
Serialized text encoded as an XML text node; should be identical to the serialized text {txtfile} or {txtdata} passed to the tcfencoder.
- tokens
-
Tokens returned by the tokenizer for the text layer. Document order of tokens should correspond exactly to the serial order of the associated text in the text layer.
- sentences
-
Sentences returned by the tokenizer for the tokens in the tokens layer. Document order of sentences must correspond exactly to the serial order of the associated text in the text layer.
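As a rough illustration of the layer requirements above, the following sketch checks a TCF string for the four required layers. The sample document, its attribute values, and the helper name missing_layers() are assumptions made for this example and are not part of DTA::TokWrap; the check is a crude regex match, not real XML parsing.

```perl
use strict;
use warnings;

# Hypothetical minimal TCF document; all content is illustrative only.
my $tcf = <<'EOF';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <textSource type="application/tei+xml">&lt;TEI&gt;...&lt;/TEI&gt;</textSource>
    <text>Hallo Welt .</text>
    <tokens>
      <token ID="w1">Hallo</token>
      <token ID="w2">Welt</token>
      <token ID="w3">.</token>
    </tokens>
    <sentences>
      <sentence ID="s1" tokenIDs="w1 w2 w3"/>
    </sentences>
  </TextCorpus>
</D-Spin>
EOF

# missing_layers($xml): hypothetical helper (not a DTA::TokWrap function).
# Returns the names of required layers for which no opening tag was found.
sub missing_layers {
  my ($xml) = @_;
  return grep { $xml !~ /<\Q$_\E(?:\s|>)/ } qw(textSource text tokens sentences);
}

my @missing = missing_layers($tcf);
print @missing ? "missing layers: @missing\n" : "all required layers present\n";
```

A production check would more plausibly use a namespace-aware XML parser; the regex version only keeps the sketch free of non-core dependencies.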
The following additional layers will be decoded if the decode_tcfa option is set to a true value:
- lemmas
-
TCF lemmata identified by tokenIDs in 1:1 correspondence with the tokens in the tokens layer.
- POStags
-
TCF part-of-speech tags identified by tokenIDs in 1:1 correspondence with the tokens in the tokens layer.
- orthography
-
TCF orthographic normalizations (replace operations only) identified by tokenIDs in 1:1 correspondence with the tokens in the tokens layer.
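The 1:1 correspondence requirement can be pictured with a small check like the one below. It is only a sketch: the TCF fragment is made up, is_one_to_one() is a hypothetical helper rather than a DTA::TokWrap function, and it assumes that "1:1 correspondence" means each annotation element references exactly one token, in token order.

```perl
use strict;
use warnings;

# Hypothetical TCF fragment: three tokens plus a lemmas layer (illustrative only).
my $tcf = <<'EOF';
<tokens>
  <token ID="w1">Die</token>
  <token ID="w2">Welten</token>
  <token ID="w3">.</token>
</tokens>
<lemmas>
  <lemma ID="l1" tokenIDs="w1">die</lemma>
  <lemma ID="l2" tokenIDs="w2">Welt</lemma>
  <lemma ID="l3" tokenIDs="w3">.</lemma>
</lemmas>
EOF

# is_one_to_one($xml, $elt): hypothetical helper.
# True if the i-th <$elt> element references exactly the i-th <token> ID.
sub is_one_to_one {
  my ($xml, $elt) = @_;
  my @tokens = $xml =~ /<token\s+ID="([^"]+)"/g;
  my @refs   = $xml =~ /<\Q$elt\E\s[^>]*\btokenIDs="([^"]+)"/g;
  return 0 unless @tokens == @refs;
  for my $i (0 .. $#tokens) {
    return 0 if $tokens[$i] ne $refs[$i];
  }
  return 1;
}

print is_one_to_one($tcf, 'lemma') ? "lemmas are 1:1 with tokens\n" : "mismatch\n";
```

The same kind of check could be applied to the other annotation layers by passing a different element name.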
Constants
- @ISA
-
DTA::TokWrap::Processor::tcfdecode0 inherits from DTA::TokWrap::Processor.
Constructors etc.
- new
-
$obj = $CLASS_OR_OBJECT->new(%args);
Constructor. Default %args:
  decode_tcfx => $bool,  ##-- whether to decode $tcfxdata (default=1)
  decode_tcft => $bool,  ##-- whether to decode $tcftdata (default=1)
  decode_tcfw => $bool,  ##-- whether to decode $tcfwdata (default=1)
  decode_tcfa => $bool,  ##-- whether to decode $tcfadata (default=1)
- defaults
-
%defaults = $CLASS->defaults();
Static class-dependent defaults.
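A minimal sketch of passing constructor options, here with annotation decoding switched off. Only new() and the decode_* option names come from this page; the guarded eval is an addition so that the snippet also runs where DTA::TokWrap is not installed.

```perl
use strict;
use warnings;

# Options as documented for new(); decode_tcfa is disabled in this example.
my %opts = (
  decode_tcfx => 1,
  decode_tcft => 1,
  decode_tcfw => 1,
  decode_tcfa => 0,
);

# Guarded load: yields undef instead of dying if the module is unavailable.
my $dec = eval {
  require DTA::TokWrap::Processor::tcfdecode0;
  DTA::TokWrap::Processor::tcfdecode0->new(%opts);
};

print defined($dec)
  ? "decoder created\n"
  : "DTA::TokWrap::Processor::tcfdecode0 not available: $@";
```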
Methods
- tcfdecode0
-
$doc_or_undef = $CLASS_OR_OBJECT->tcfdecode0($doc);
Decodes the {tcfdoc} key of the DTA::TokWrap::Document object $doc from TCF, storing the results in $doc->{tcfxdata}, $doc->{tcftdata}, and $doc->{tcfwdata} (and in $doc->{tcfadata} if the decode_tcfa option is set).
Relevant %$doc keys:
  tcfdoc   => $tcfdoc,       ##-- (input) TCF input document
  ##
  tcfxdata => $tcfxdata,     ##-- (output) TEI-XML decoded from TCF
  tcftdata => $tcftdata,     ##-- (output) text data decoded from TCF
  tcfwdata => $tcfwdata,     ##-- (output) tokenized data decoded from TCF, without byte-offsets, with "SID/WID" attributes
  tcfadata => $tcfadata,     ##-- (output) annotation data decoded from TCF
  ##
  tcfdecode0_stamp0 => $f,   ##-- (output) timestamp of operation begin
  tcfdecode0_stamp  => $f,   ##-- (output) timestamp of operation end
  tcfxdata_stamp    => $f,   ##-- (output) timestamp of operation end
  tcftdata_stamp    => $f,   ##-- (output) timestamp of operation end
  tcfwdata_stamp    => $f,   ##-- (output) timestamp of operation end
  tcfadata_stamp    => $f,   ##-- (output) timestamp of operation end
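The output keys listed above could be inspected after decoding roughly as follows. This is a sketch under assumptions: summarize_decoded() is a hypothetical helper, the plain hash stands in for a real DTA::TokWrap::Document, the decoded data are assumed to be strings, and the timestamps are assumed to be numeric seconds.

```perl
use strict;
use warnings;

# summarize_decoded($doc): hypothetical helper, not part of DTA::TokWrap.
# Reports the length of each decoded buffer (undef if absent) and, when both
# timestamps are present, the elapsed time of the decode operation.
sub summarize_decoded {
  my ($doc) = @_;
  my %summary;
  for my $key (qw(tcfxdata tcftdata tcfwdata tcfadata)) {
    $summary{$key} = defined($doc->{$key}) ? length($doc->{$key}) : undef;
  }
  if (defined $doc->{tcfdecode0_stamp0} && defined $doc->{tcfdecode0_stamp}) {
    $summary{elapsed} = $doc->{tcfdecode0_stamp} - $doc->{tcfdecode0_stamp0};
  }
  return \%summary;
}

# Stand-in for a decoded document; every value here is made up.
my $doc = {
  tcfxdata          => '<TEI>...</TEI>',
  tcftdata          => "Hallo Welt .\n",
  tcfwdata          => "Hallo\nWelt\n.\n",
  tcfdecode0_stamp0 => 100.0,
  tcfdecode0_stamp  => 100.25,
};

my $summary = summarize_decoded($doc);
for my $key (sort keys %$summary) {
  printf "%-10s %s\n", $key, defined($summary->{$key}) ? $summary->{$key} : '(not decoded)';
}
```

With a real decoder object, $doc would instead be the value returned by tcfdecode0(), which is undefined on failure according to the synopsis.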
SEE ALSO
DTA::TokWrap::Intro(3pm), dta-tokwrap.perl(1), ...
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2014-2018 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.