
NAME

DTA::CAB::Format::TCF - Datum parser|formatter: CLARIN-D TCF (selected features only)

SYNOPSIS

##========================================================================
## PRELIMINARIES
##========================================================================
## Constructors etc.
$fmt = CLASS_OR_OBJ->new(%args);
##========================================================================
## Methods: Input: Generic API
$doc = $fmt->parseDocument();
##========================================================================
## Methods: Output: MIME & HTTP stuff
$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();
##========================================================================
## Methods: Output: output selection
$fmt = $fmt->flush();
##========================================================================
## Methods: Output: Generic API
$fmt = $fmt->putDocument($doc);

DESCRIPTION

Globals

Variable: @ISA

DTA::CAB::Format::TCF inherits from DTA::CAB::Format::XmlCommon.

Constructors etc.

new
$fmt = CLASS_OR_OBJ->new(%args);

object structure: HASH ref

{
##-- new in TCF
tcfbufr => \$buf, ##-- raw TCF buffer, for spliceback mode
textbufr => \$text, ##-- raw text buffer, for spliceback mode
tcflog => $level, ##-- debugging log-level (default: 'off')
spliceback => $bool, ##-- (output) if true (default), splice data back into 'tcfbufr' if available; otherwise create new TCF doc
tcflayers => $tcf_layer_names, ##-- layer names to include, space-separated list; known='tei text tokens sentences postags lemmas orthography'
tcftagset => $tagset, ##-- tagset name for POStags element (default='stts')
logsplice => $level, ##-- log level for spliceback messages (default:'none')
trimtext => $bool, ##-- if true (default), waste tokenizer hints will be trimmed from 'text' layer
##-- input: inherited from XmlCommon
xdoc => $xdoc, ##-- XML::LibXML::Document
xprs => $xprs, ##-- XML::LibXML parser
##-- output: inherited from XmlCommon
level => $level, ##-- output formatting level (OVERRIDE: default=1)
output => [$how,$arg] ##-- either ['fh',$fh], ['file',$filename], or ['str',\$buf]
}
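
As a minimal sketch of how these options might be passed, the following snippet builds an argument hash from the keys listed above and hands it to new(). The helper unknown_layers() is hypothetical and not part of DTA::CAB; it only checks a 'tcflayers' string against the layer names listed as known above. The option values are illustrative, and the constructor call is guarded so that the script still runs where DTA::CAB is not installed.

```perl
use strict;
use warnings;

# Layer names listed as "known" in the 'tcflayers' description above.
my %KNOWN_LAYER = map { $_ => 1 }
    qw(tei text tokens sentences postags lemmas orthography);

# Hypothetical helper (NOT part of DTA::CAB): returns the names in a
# space-separated layer list that are not among the known layer names.
sub unknown_layers {
    my ($layers) = @_;
    return grep { !$KNOWN_LAYER{$_} } split ' ', $layers;
}

# Illustrative option values; the keys come from the object structure above.
my %args = (
    spliceback => 0,                          # always create a new TCF document
    tcflayers  => 'tokens sentences lemmas',  # subset of the known layers
    tcftagset  => 'stts',                     # documented default, made explicit
);

my @bad = unknown_layers($args{tcflayers});
die "unknown TCF layer(s): @bad\n" if @bad;

# Guarded load: only call the constructor if the module is available.
if (eval { require DTA::CAB::Format::TCF; 1 }) {
    my $fmt = DTA::CAB::Format::TCF->new(%args);
    printf "%s\t%s\t%s\n",
        $fmt->shortName, $fmt->mimeType, $fmt->defaultExtension;
}
else {
    print "DTA::CAB::Format::TCF is not installed; skipping constructor call\n";
}
```

Validating the layer list up front is a design choice of this sketch; how the module itself treats unrecognized layer names is not stated here.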

Methods: Input: Generic API

parseDocument
$doc = $fmt->parseDocument();

Parses the buffered XML::LibXML::Document in $fmt->{xdoc} and returns the resulting document $doc.

Methods: Output: MIME & HTTP stuff

shortName
$short = $fmt->shortName();

Returns the "official" short name for this format; this override returns "tcf".

mimeType
$type = $fmt->mimeType();

Returns the MIME type for this format; this override returns "text/xml".

defaultExtension
$ext = $fmt->defaultExtension();

Returns the default filename extension for this format; this override returns ".tcf.xml".

Methods: Output: output selection

flush
$fmt = $fmt->flush();

Flushes any buffered output to the selected output destination and returns the format object.

Methods: Output: Generic API

putDocument
$fmt = $fmt->putDocument($doc);

Writes $doc to the selected output; this override respects the object's 'spliceback' and 'tcflayers' options.

EXAMPLE

An example file in the format accepted/generated by this module is:

<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>wie oede!</text>
    <tokens>
      <token ID="w1">wie</token>
      <token ID="w2">oede</token>
      <token ID="w3">!</token>
    </tokens>
    <sentences>
      <sentence ID="s1" tokenIDs="w1 w2 w3"/>
    </sentences>
    <lemmas>
      <lemma tokenIDs="w1">wie</lemma>
      <lemma tokenIDs="w2">öde</lemma>
      <lemma tokenIDs="w3">!</lemma>
    </lemmas>
    <POStags tagset="stts">
      <tag tokenIDs="w1">PWAV</tag>
      <tag tokenIDs="w2">ADJD</tag>
      <tag tokenIDs="w3">$.</tag>
    </POStags>
    <orthography>
      <correction tokenIDs="w2" operation="replace">öde</correction>
    </orthography>
  </TextCorpus>
</D-Spin>

If the input contains a 'text' layer but no 'tokens' or 'sentences' layers, the 'text' layer will be tokenized using the DTA::CAB::Format::Raw class.
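
As a rough illustration of how the layers in the example above refer to one another through token IDs, the following sketch reads a copy of the example with XML::LibXML directly, without going through DTA::CAB. It is not how DTA::CAB::Format::TCF is implemented. The variable names and the "tc" namespace prefix are choices made for this sketch, and the "ö" is written as the character reference &#xF6; so that the snippet does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Copy of the example document; single-quoted heredoc so that '$.' stays literal.
my $tcf = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>wie oede!</text>
    <tokens>
      <token ID="w1">wie</token>
      <token ID="w2">oede</token>
      <token ID="w3">!</token>
    </tokens>
    <sentences>
      <sentence ID="s1" tokenIDs="w1 w2 w3"/>
    </sentences>
    <lemmas>
      <lemma tokenIDs="w1">wie</lemma>
      <lemma tokenIDs="w2">&#xF6;de</lemma>
      <lemma tokenIDs="w3">!</lemma>
    </lemmas>
    <POStags tagset="stts">
      <tag tokenIDs="w1">PWAV</tag>
      <tag tokenIDs="w2">ADJD</tag>
      <tag tokenIDs="w3">$.</tag>
    </POStags>
  </TextCorpus>
</D-Spin>
XML

my $doc = XML::LibXML->load_xml(string => $tcf);

# The layers live in the TextCorpus namespace, so XPath needs a bound prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(tc => 'http://www.dspin.de/data/textcorpus');

# Index the lemma and POS-tag layers by the token IDs they point to.
my (%lemma, %tag);
for my $node ($xpc->findnodes('//tc:lemmas/tc:lemma')) {
    $lemma{$_} = $node->textContent for split ' ', $node->getAttribute('tokenIDs');
}
for my $node ($xpc->findnodes('//tc:POStags/tc:tag')) {
    $tag{$_} = $node->textContent for split ' ', $node->getAttribute('tokenIDs');
}

# Join each token with its annotations: [ID, surface form, lemma, POS tag].
my @rows;
for my $tok ($xpc->findnodes('//tc:tokens/tc:token')) {
    my $id = $tok->getAttribute('ID');
    push @rows, [$id, $tok->textContent, $lemma{$id} // '', $tag{$id} // ''];
}

binmode(STDOUT, ':encoding(UTF-8)');
print join("\t", @$_), "\n" for @rows;
```

Each annotation layer is stand-off: it carries no text position of its own and attaches to tokens only through the tokenIDs attribute, which is why the sketch first indexes the layers and then walks the tokens.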

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2019 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dta-cab-analyze.perl(1), dta-cab-convert.perl(1), dta-cab-http-server.perl(1), dta-cab-http-client.perl(1), dta-cab-xmlrpc-server.perl(1), dta-cab-xmlrpc-client.perl(1), DTA::CAB::Server(3pm), DTA::CAB::Client(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...