NAME
DTA::CAB::Format::TCF - Datum parser|formatter: CLARIN-D TCF (selected features only)
SYNOPSIS
##========================================================================
## PRELIMINARIES
##========================================================================
## Constructors etc.
$fmt = CLASS_OR_OBJ->new(%args);
##========================================================================
## Methods: Input: Generic API
$doc = $fmt->parseDocument();
##========================================================================
## Methods: Output: MIME & HTTP stuff
$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();
##========================================================================
## Methods: Output: output selection
$fmt = $fmt->flush();
##========================================================================
## Methods: Output: Generic API
$fmt = $fmt->putDocument($doc);
DESCRIPTION
Globals
- Variable: @ISA
  DTA::CAB::Format::TCF inherits from DTA::CAB::Format::XmlCommon.
Constructors etc.
- new
  $fmt = CLASS_OR_OBJ->new(%args);
  object structure: HASH ref
  {
   ##-- new in TCF
   tcfbufr    => \$buf,             ##-- raw TCF buffer, for spliceback mode
   textbufr   => \$text,            ##-- raw text buffer, for spliceback mode
   tcflog     => $level,            ##-- debugging log-level (default: 'off')
   spliceback => $bool,             ##-- (output) if true (default), splice data back into 'tcfbufr' if available; otherwise create new TCF doc
   tcflayers  => $tcf_layer_names,  ##-- layer names to include, space-separated list; known='tei text tokens sentences postags lemmas orthography'
   tcftagset  => $tagset,           ##-- tagset name for POStags element (default='stts')
   logsplice  => $level,            ##-- log level for spliceback messages (default: 'none')
   trimtext   => $bool,             ##-- if true (default), waste tokenizer hints will be trimmed from 'text' layer
   ##-- input: inherited from XmlCommon
   xdoc       => $xdoc,             ##-- XML::LibXML::Document
   xprs       => $xprs,             ##-- XML::LibXML parser
   ##-- output: inherited from XmlCommon
   level      => $level,            ##-- output formatting level (OVERRIDE: default=1)
   output     => [$how,$arg]        ##-- either ['fh',$fh], ['file',$filename], or ['str',\$buf]
  }
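As a minimal sketch of how the constructor options above might be combined, the following builds a formatter that always creates a fresh TCF document and writes only a subset of the known layers. The option names come from the object structure above; the chosen values are illustrative assumptions, and the module load is guarded so the script still runs where DTA::CAB is not installed.

```perl
use strict;
use warnings;

# Option names are taken from the object structure above; the values are
# illustrative assumptions, not recommended settings.
my %args = (
  tcflayers  => 'text tokens sentences postags lemmas',  # subset of the known layer names
  tcftagset  => 'stts',  # documented default, spelled out for clarity
  spliceback => 0,       # false: create a new TCF doc instead of splicing into 'tcfbufr'
  trimtext   => 1,       # documented default
);

# Guarded load and construction: $fmt stays undefined if the module is missing
# or if construction fails for any reason.
my $fmt = eval {
  require DTA::CAB::Format::TCF;
  DTA::CAB::Format::TCF->new(%args);
};

if ($fmt) {
  # shortName(), defaultExtension() and mimeType() are the overrides documented below.
  printf "%s -> %s (%s)\n", $fmt->shortName, $fmt->defaultExtension, $fmt->mimeType;
} else {
  print "DTA::CAB::Format::TCF not available; options would be: ",
    join(', ', map { "$_=$args{$_}" } sort keys %args), "\n";
}
```

Setting spliceback to a false value is only one possible choice here; with the default (true) the formatter splices its output back into 'tcfbufr' when that buffer is available.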
Methods: Input: Generic API
Methods: Output: MIME & HTTP stuff
- shortName
  $short = $fmt->shortName();
  Returns the "official" short name for this format; this override returns "tcf".
- mimeType
  $type = $fmt->mimeType();
  This override returns "text/xml".
- defaultExtension
  $ext = $fmt->defaultExtension();
  Returns the default filename extension for this format; this override returns ".tcf.xml".
Methods: Output: output selection
Methods: Output: Generic API
- putDocument
  $fmt = $fmt->putDocument($doc);
  This override respects the local 'spliceback' and 'tcflayers' flags.
EXAMPLE
An example file in the format accepted/generated by this module is:
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin>
  <TextCorpus>
    <text>wie oede!</text>
    <tokens>
      <token ID="w1">wie</token>
      <token ID="w2">oede</token>
      <token ID="w3">!</token>
    </tokens>
    <sentences>
      <sentence ID="s1" tokenIDs="w1 w2 w3"/>
    </sentences>
    <lemmas>
      <lemma tokenIDs="w1">wie</lemma>
      <lemma tokenIDs="w2">öde</lemma>
      <lemma tokenIDs="w3">!</lemma>
    </lemmas>
    <POStags tagset="stts">
      <tag tokenIDs="w1">PWAV</tag>
      <tag tokenIDs="w2">ADJD</tag>
      <tag tokenIDs="w3">$.</tag>
    </POStags>
    <orthography>
      <correction tokenIDs="w2" operation="replace">öde</correction>
    </orthography>
  </TextCorpus>
</D-Spin>
If the input contains a 'text' layer but no 'tokens' or 'sentences' layers, the 'text' layer will be tokenized using the DTA::CAB::Format::Raw class.
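To illustrate how the layers in such a file refer to one another, here is a small sketch that reads a trimmed copy of the example with XML::LibXML (the parser class named in the object structure) and pairs each POS tag with the token it points to via tokenIDs. It does not use DTA::CAB itself, and it is not a description of how the module parses TCF internally. The D-Spin and TextCorpus wrapper elements are written without namespace declarations here, which is an assumption made for brevity; the XPath expressions use local-name() so they would also match namespaced elements.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the example (tokens and POStags layers only).
# Single-quoted heredoc so that '$.' is not interpolated by Perl.
my $tcf = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin>
  <TextCorpus>
    <text>wie oede!</text>
    <tokens>
      <token ID="w1">wie</token>
      <token ID="w2">oede</token>
      <token ID="w3">!</token>
    </tokens>
    <POStags tagset="stts">
      <tag tokenIDs="w1">PWAV</tag>
      <tag tokenIDs="w2">ADJD</tag>
      <tag tokenIDs="w3">$.</tag>
    </POStags>
  </TextCorpus>
</D-Spin>
XML

my $xdoc = XML::LibXML->load_xml(string => $tcf);

# Index token text by ID; local-name() keeps the XPath namespace-independent.
my %text_of = map { ($_->getAttribute('ID') => $_->textContent) }
  $xdoc->findnodes('//*[local-name()="token"]');

# Resolve each tag's tokenIDs attribute (a space-separated ID list) to token text.
my @tagged;
for my $tag ($xdoc->findnodes('//*[local-name()="POStags"]/*[local-name()="tag"]')) {
  for my $id (split ' ', $tag->getAttribute('tokenIDs')) {
    push @tagged, "$text_of{$id}/" . $tag->textContent;
  }
}

print join(' ', @tagged), "\n";  # prints: wie/PWAV oede/ADJD !/$.
```

The same ID-based lookup applies to the sentences, lemmas and orthography layers, since each of them addresses tokens through a tokenIDs attribute.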
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
dta-cab-analyze.perl(1), dta-cab-convert.perl(1), dta-cab-http-server.perl(1), dta-cab-http-client.perl(1), dta-cab-xmlrpc-server.perl(1), dta-cab-xmlrpc-client.perl(1), DTA::CAB::Server(3pm), DTA::CAB::Client(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...