NAME
DTA::CAB::Format::TEI - Datum parser|formatter: TEI-XML using DTA::TokWrap
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DTA::CAB::Format::TEI;
##========================================================================
## Constructors etc.
$fmt = CLASS_OR_OBJ->new(%args);
$fmt->DESTROY();
##========================================================================
## Methods: Generic
$dir = $fmt->tmpdir();
$tmpdir = $fmt->mktmpdir();
$fmt = $fmt->rmtmpdir();
$txmlfmt = $fmt->txmlfmt();
$class = $fmt->txmlclass();
$tw = $fmt->tw();
##========================================================================
## Methods: Input: Generic API
$fmt = $fmt->close();
$fmt = $fmt->fromString(\$string);
$fmt = $fmt->fromFile($filename_or_handle);
$fmt = $fmt->fromFh($handle);
$doc = $fmt->parseDocument();
##========================================================================
## Methods: Output: MIME & HTTP stuff
$short = $fmt->shortName();
$ext = $fmt->defaultExtension();
##========================================================================
## Methods: Output: output selection
$fmt = $fmt->flush();
$fmt = $fmt->toString(\$str);
$fmt_or_undef = $fmt->toFile($filename, $formatLevel);
$fmt_or_undef = $fmt->toFh($fh,$formatLevel);
##========================================================================
## Methods: Output: Generic API
$fmt = $fmt->putDocument($doc);
DESCRIPTION
Globals
- Variable: @ISA
-
DTA::CAB::Format::TEI inherits from DTA::CAB::Format::XmlTokWrap.
- Variable: $TXML_CLASS_DEFAULT
-
Default parser/formatter class for *.t.xml files; by default DTA::CAB::Format::XmlTokWrap. The alternative DTA::CAB::Format::XmlTokWrapFast is ca. 2x faster, but doesn't support all token attributes.
Constructors etc.
- new
-
$fmt = CLASS_OR_OBJ->new(%args);
object structure: HASH ref
{
 ##-- new in TEI
 tmpdir     => $dir,            ##-- temporary directory for this object (default: new)
 keeptmp    => $bool,           ##-- keep temporary directory open
 teilog     => 'off',           ##-- tei format debug log level
 twlog      => 'off',           ##-- DTA::TokWrap debug log level (also consider specifying e.g. -lo=twLevel=TRACE on the command-line)
 addc       => $bool_or_guess,  ##-- (input) whether to add //c elements (slow no-op if already present; default=0)
 spliceback => $bool,           ##-- (output) if true (default), return .cws.cab.xml ; otherwise just .cab.t.xml [requires doc 'teibufr' attribute]
 keeptext   => $bool,           ##-- (input) if true (default), include 'textbufr' element for extract TEI text
 keepc      => $bool,           ##-- (output) whether to include //c elements in spliceback-mode output (default=0)
 tw         => $tw,             ##-- underlying DTA::TokWrap object
 twopen     => \%opts,          ##-- options for $tw->open()
 teibufr    => \$buf,           ##-- raw tei+c buffer, for spliceback mode
 textbufr   => \$buf,           ##-- raw text buffer, for keeptext mode
 txmlfmt    => $fmt,            ##-- classname or object for parsing tokwrap *.t.xml files (default: DTA::CAB::Format::TokWrap)
 txmlopts   => \%opts,          ##-- options for *.t.xml sub-formatter (clobbers %$fmt options)
 'att.linguistic' => $bool,     ##-- use TEI att.linguistic features? (forces txmlfmt, txmlopts, twopts)
 ##
 ##-- input: inherited from XmlNative
 xdoc => $xdoc,                 ##-- XML::LibXML::Document
 xprs => $xprs,                 ##-- XML::LibXML parser
 ##
 ##-- output: new
 #outfile => $filename,         ##-- final output file (flushed with File::Copy::copy)
 ##
 ##-- output: inherited from XmlTokWrap
 arrayEltKeys      => \%akey2ekey,   ##-- maps array keys to element keys for output
 arrayImplicitKeys => \%akey2undef,  ##-- pseudo-hash of array keys NOT mapped to explicit elements
 key2xml           => \%key2xml,     ##-- maps keys to XML-safe names
 xml2key           => \%xml2key,     ##-- maps xml keys to internal keys
 ##
 ##-- output: inherited from XmlNative
 #encoding => $inputEncoding,   ##-- default: UTF-8; applies to output only!
 level => $level,               ##-- output formatting level (default=0)
 ##
 ##-- common: safety
 safe => $bool,                 ##-- if true (default), no "unsafe" token data will be generated (_xmlnod,etc.)
}
- DESTROY
-
$fmt->DESTROY();
destructor implicitly calls $fmt->rmtmpdir()
Methods: Generic
- tmpdir
-
$dir = $fmt->tmpdir();
get/generate name of temporary directory, ensures $fmt->{tmpdir} is set
- mktmpdir
-
$tmpdir = $fmt->mktmpdir();
ensures $fmt->tmpdir() exists
- rmtmpdir
-
$fmt = $fmt->rmtmpdir();
removes $fmt->{tmpdir} unless $fmt->{keeptmp} is true
- txmlfmt
-
$txmlfmt = $fmt->txmlfmt();
gets cached $fmt->{txmlfmt} or creates it
- txmlclass
-
$class = $fmt->txmlclass();
(undocumented)
- tw
-
$tw = $fmt->tw();
returns DTA::TokWrap object for $fmt; calls $fmt->tmpdir()
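The tmpdir(), mktmpdir(), rmtmpdir() and DESTROY() methods together describe a lazy temporary-directory lifecycle. The following is a minimal sketch of that pattern using only core Perl modules. The class TmpdirDemo and the directory naming scheme are hypothetical and are not taken from the DTA::CAB sources; only the method names and the keeptmp flag mirror the descriptions above.

```perl
use strict;
use warnings;
use File::Spec ();
use File::Path ();

## TmpdirDemo: hypothetical stand-in class, NOT the DTA::CAB::Format::TEI implementation
package TmpdirDemo;

sub new {
  my ($that, %args) = @_;
  return bless({ tmpdir => undef, keeptmp => 0, %args }, ref($that) || $that);
}

## $dir = $obj->tmpdir(): get or generate the directory NAME; ensures $obj->{tmpdir} is set
sub tmpdir {
  my $obj = shift;
  ## naming scheme is an assumption made for this sketch
  $obj->{tmpdir} //= File::Spec->catdir(
    File::Spec->tmpdir(),
    sprintf('tmpdir_demo_%d_%06d', $$, int(rand(1_000_000)))
  );
  return $obj->{tmpdir};
}

## $dir = $obj->mktmpdir(): ensures that the directory named by tmpdir() exists
sub mktmpdir {
  my $obj = shift;
  my $dir = $obj->tmpdir();
  File::Path::make_path($dir) if !-d $dir;
  return $dir;
}

## $obj = $obj->rmtmpdir(): removes the directory unless keeptmp is true
sub rmtmpdir {
  my $obj = shift;
  if (defined($obj->{tmpdir}) && -d $obj->{tmpdir} && !$obj->{keeptmp}) {
    File::Path::remove_tree($obj->{tmpdir});
  }
  return $obj;
}

## destructor implicitly calls rmtmpdir()
sub DESTROY {
  my $obj = shift;
  $obj->rmtmpdir();
  return;
}

package main;

my $dir;
{
  my $obj = TmpdirDemo->new();
  $dir = $obj->mktmpdir();
  print "in scope: ", (-d $dir ? "exists" : "missing"), "\n";
}
## $obj has gone out of scope here, so DESTROY() has already run
print "after destruction: ", (-d $dir ? "exists" : "missing"), "\n";
```

Constructing the demo object with keeptmp => 1 leaves the directory in place after destruction, which can be handy when inspecting intermediate files.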
Methods: Input: Generic API
- close
-
$fmt = $fmt->close();
close current input source, if any
- fromString
-
$fmt = $fmt->fromString(\$string);
select input from string $string
- fromFile
-
$fmt = $fmt->fromFile($filename_or_handle);
calls $fmt->fromFh()
- fromFh
-
$fmt = $fmt->fromFh($handle);
just calls $fmt->fromString()
- parseDocument
-
$doc = $fmt->parseDocument();
parses buffered XML::LibXML::Document; local override inserts $doc->{teibufr}, $doc->{textbufr} attributes for spliceback mode
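The input methods above form a delegation chain: fromFile() calls fromFh(), which in turn slurps its handle and calls fromString(). A minimal sketch of such a chain, assuming a hypothetical class InputChainDemo with a single 'buf' field (neither name comes from the DTA::CAB sources), might look like this:

```perl
use strict;
use warnings;

## InputChainDemo: hypothetical stand-in class, NOT the DTA::CAB::Format::TEI implementation
package InputChainDemo;

sub new {
  my $that = shift;
  return bless({ buf => undef }, ref($that) || $that);
}

## $obj = $obj->fromString(\$string): select input from a string (reference)
sub fromString {
  my ($obj, $ref) = @_;
  $obj->{buf} = ref($ref) ? $$ref : $ref;
  return $obj;
}

## $obj = $obj->fromFh($handle): slurps the whole handle, then delegates to fromString()
sub fromFh {
  my ($obj, $fh) = @_;
  local $/ = undef;    ## slurp mode
  my $buf = <$fh>;
  return $obj->fromString(\$buf);
}

## $obj = $obj->fromFile($filename_or_handle): opens the file if needed, then delegates to fromFh()
sub fromFile {
  my ($obj, $file) = @_;
  return $obj->fromFh($file) if ref($file);    ## assume references are already handles
  open(my $fh, '<:raw', $file) or die "open failed for '$file': $!";
  $obj->fromFh($fh);
  close($fh);
  return $obj;
}

package main;

my $xml = '<TEI><text>Wie oede!<lb/></text></TEI>';
open(my $fh, '<', \$xml) or die "in-memory open failed: $!";
my $obj = InputChainDemo->new()->fromFile($fh);
print "buffered ", length($obj->{buf}), " bytes\n";
```

Funnelling every input source into one string buffer keeps the parsing code in a single place, at the cost of holding the whole document in memory.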
Methods: Output: MIME & HTTP stuff
- shortName
-
$short = $fmt->shortName();
returns "official" short name for this format; override returns "tei".
- defaultExtension
-
$ext = $fmt->defaultExtension();
returns default filename extension for this format; override returns ".tei.xml".
Methods: Output: output selection
- flush
-
$fmt = $fmt->flush();
flush any buffered output to selected output source; override calls $fmt->buf2fh(\$fmt->{outbuf}, $fmt->{fh})
- toString
-
$fmt = $fmt->toString(\$str);
$fmt = $fmt->toString(\$str,$formatLevel);
select output to byte-string; override reverts to DTA::CAB::Format::toString()
- toFile
-
$fmt_or_undef = $fmt->toFile($filename, $formatLevel);
select output to $filename; override reverts to DTA::CAB::Format::toFile().
- toFh
-
$fmt_or_undef = $fmt->toFh($fh,$formatLevel);
select output to filehandle $fh; override reverts to DTA::CAB::Format::toFh()
Methods: Output: Generic API
- putDocument
-
$fmt = $fmt->putDocument($doc);
override respects local 'keepc' and 'spliceback' flags
EXAMPLE
An example input file in the format as accepted by this module is:
<?xml version="1.0" encoding="UTF-8"?>
<TEI>
  <text>
    <fw>Running headers are ignored</fw>
    Wie oede!<lb/>
  </text>
</TEI>
An example output file in the format returned by this module is:
<?xml version="1.0" encoding="UTF-8"?>
<TEI>
  <text>
    <fw>Running headers are ignored</fw>
    <s lang="de">
      <w msafe="1" t="wie" errid="ec" hasmorph="1" exlex="wie" lang="de">
        <moot word="wie" lemma="wie" tag="PWAV"/>
        <xlit isLatinExt="1" isLatin1="1" latin1Text="wie"/>
      </w>
      <w msafe="0" t="oede">
        <moot tag="ADJD" lemma="öde" word="öde"/>
        <xlit isLatinExt="1" isLatin1="1" latin1Text="oede"/>
      </w>
      <w exlex="!" errid="ec" t="!" msafe="1">
        <xlit latin1Text="!" isLatin1="1" isLatinExt="1"/>
        <moot word="!" tag="$." lemma="!"/>
      </w>
    </s>
    <lb/>
  </text>
</TEI>
Any //s or //w elements in the input will be IGNORED and the input will be (re-)tokenized. Output files are themselves parseable by DTA::CAB::Format::TEIws.
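A conversion like the one shown above can be sketched using only the methods listed in the SYNOPSIS. The helper name tei_roundtrip() and the file names are hypothetical, and the sketch assumes that DTA::CAB and DTA::TokWrap (with their external dependencies) are installed; where they are not, the helper just warns and returns false. No CAB analysis chain is run here, so the result would be expected to contain tokenization only, without the moot or xlit analyses shown in the example output.

```perl
use strict;
use warnings;

## $bool = tei_roundtrip($infile, $outfile)
##  + hypothetical helper: parse a TEI file and write it back in spliceback mode
##  + returns true on success; warns and returns false otherwise
sub tei_roundtrip {
  my ($infile, $outfile) = @_;

  ## load lazily so that this sketch still compiles where DTA::CAB is not installed
  if (!eval { require DTA::CAB::Format::TEI; 1 }) {
    warn "DTA::CAB::Format::TEI not available: $@";
    return;
  }

  my $ok = eval {
    ## input: parseDocument() stores the 'teibufr' attribute needed for spliceback
    my $ifmt = DTA::CAB::Format::TEI->new();
    $ifmt->fromFile($infile);
    my $doc = $ifmt->parseDocument() or die "parseDocument() failed\n";
    $ifmt->close();

    ## output: splice analyses back into the original TEI, dropping //c elements
    my $ofmt = DTA::CAB::Format::TEI->new(spliceback => 1, keepc => 0);
    $ofmt->toFile($outfile, 0) or die "toFile() failed\n";
    $ofmt->putDocument($doc);
    $ofmt->flush();
    1;
  };
  warn "TEI round-trip failed: $@" if !$ok;
  return $ok;
}

## usage (file names are placeholders):
## tei_roundtrip('in.tei.xml', 'out.tei.xml') or die "conversion failed";
```

In practice, documents are usually passed through an analyzer between parseDocument() and putDocument(), for example via dta-cab-analyze.perl(1).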
att.linguistic Example
An example output file in the format returned by this module with the att.linguistic
option set to a true value is:
<?xml version="1.0" encoding="UTF-8"?>
<TEI>
  <text>
    <fw>Running headers are ignored</fw>
    <s xml:id="s1">
      <w xml:id="w1" lemma="wie" pos="PWAV" norm="Wie">Wie</w>
      <w xml:id="w2" lemma="öde" pos="ADJD" norm="öde" join="right">oede</w>
      <w xml:id="w3" lemma="!" pos="$." norm="!" join="left">!</w>
    </s>
    <lb/>
  </text>
</TEI>
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2011-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
dta-cab-analyze.perl(1), dta-cab-convert.perl(1), dta-cab-http-server.perl(1), dta-cab-http-client.perl(1), dta-cab-xmlrpc-server.perl(1), dta-cab-xmlrpc-client.perl(1), DTA::CAB::Server(3pm), DTA::CAB::Client(3pm), DTA::CAB::Format(3pm), DTA::CAB(3pm), perl(1), ...