DTA::CAB::Format - Base class for DTA::CAB::Datum I/O
use DTA::CAB::Format;
## Constructors etc.
$fmt = $CLASS_OR_OBJ->new(%args);
$fmt = $CLASS->newFormat($class_or_class_suffix, %opts);
$fmt = $CLASS->newReader(%opts);
$fmt = $CLASS->newWriter(%opts);
## Methods: Global Format Registry
\%classReg_or_undef = $CLASS_OR_OBJ->registerFormat(%classRegOptions);
\%classReg_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);
$class_or_undef = $CLASS_OR_OBJ->shortReaderClass($shortname);
$class_or_undef = $CLASS_OR_OBJ->shortWriterClass($shortname);
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);
## Methods: Persistence
@keys = $class_or_obj->noSaveKeys();
## Methods: MIME
$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();
## Methods: Input
$fmt = $fmt->close();
$fmt = $fmt->fromString(\$string);
$fmt = $fmt->fromFile($filename);
$fmt = $fmt->fromFh($fh);
$doc = $fmt->parseDocument();
$doc = $fmt->parseString(\$str);
$doc = $fmt->parseFile($filename);
$doc = $fmt->parseFh($fh);
$doc = $fmt->forceDocument($reference);
## Methods: Output
$lvl = $fmt->formatLevel();
$fmt = $fmt->flush();
$fmt_or_undef = $fmt->toString(\$str, $formatLevel);
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
$fmt_or_undef = $fmt->toFh($fh, $formatLevel);
$fmt = $fmt->putDocument($doc);
$fmt = $fmt->putDocumentRaw($doc);
DTA::CAB::Format is an abstract base class and API specification for objects implementing an I/O format for the DTA::CAB::Datum subhierarchy in general, and for DTA::CAB::Document objects in particular.
Each I/O format (subclass) has a characteristic abstract `base class' as well as optional `reader' and `writer' subclasses which perform the actual I/O (although in the current implementation, all reader/writer classes are identical with their respective base classes). Individual formats may be invoked either directly by their respective classes (SUBCLASS->new(), etc.), or by means of the global DTA::CAB::Format::Registry object $REG ("registerFormat", "newFormat", "newReader", "newWriter", etc.).
See "SUBCLASSES" for a list of common built-in formats and their registry data.
- @ISA
DTA::CAB::Format inherits from DTA::CAB::Persistent and DTA::CAB::Logger.
Default class returned by "newFormat"() if no known class is specified.
- Variable: $REG
Default global format registry: a DTA::CAB::Format::Registry object used by "registerFormat", "newFormat", etc.
Constructors etc.
- new
$fmt = $CLASS_OR_OBJ->new(%args);
%args, %$fmt:
##-- DTA::CAB::Format: common
##
##-- DTA::CAB::Format: input parsing
#(none)
##
##-- DTA::CAB::Format: output formatting
level => $formatLevel, ##-- formatting level, where applicable
outbuf => $stringBuffer, ##-- output buffer, where applicable
- newFormat
$fmt = $CLASS->newFormat($class_or_class_suffix, %opts);
Wrapper for "new"() which allows short class suffixes to be passed in as format names.
- newReader
$fmt = $CLASS->newReader(%opts);
Wrapper for DTA::CAB::Format::Registry::newReader which accepts %opts:
class => $class, ##-- classname or DTA::CAB::Format:: suffix
file => $filename, ##-- attempt to guess format from filename
- newWriter
$fmt = $CLASS->newWriter(%opts);
Wrapper for DTA::CAB::Format::Registry::newWriter which accepts %opts:
class => $class, ##-- classname or DTA::CAB::Format:: suffix
file => $filename, ##-- attempt to guess format from filename
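The class option described above accepts either a full class name or a DTA::CAB::Format:: suffix. The following is a minimal sketch of how such a suffix could be expanded. It does not use DTA::CAB; the helper resolve_class() is hypothetical, and the rule that already-prefixed names pass through unchanged is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DTA::CAB): expand a class suffix such as
# "JSON" into a full class name under the DTA::CAB::Format:: namespace.
my $PREFIX = 'DTA::CAB::Format::';

sub resolve_class {
    my ($class_or_suffix) = @_;
    return undef unless defined($class_or_suffix) && length($class_or_suffix);
    # Assumption: names that already carry the prefix are used unchanged.
    return $class_or_suffix if index($class_or_suffix, $PREFIX) == 0;
    return $PREFIX . $class_or_suffix;
}

print scalar(resolve_class('JSON')), "\n";
print scalar(resolve_class('DTA::CAB::Format::TT')), "\n";
```

In this sketch both a bare suffix and a fully qualified name end up as a class name that could then be passed to a constructor.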
Methods: Global Format Registry
The global format registry lives in the package variable $REG. The following methods are backwards-compatible wrappers for method calls to this registry object.
- registerFormat
\%registered = $CLASS_OR_OBJ->registerFormat(%opts);
Registers a new format subclass; wrapper for DTA::CAB::Format::Registry::register().
- guessFilenameFormat
\%registered_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);
Returns the registration record for the most recently registered format subclass whose filenameRegex matches $filename. Wrapper for DTA::CAB::Format::Registry::guessFilenameFormat().
- fileReaderClass
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
Attempts to guess reader class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileReaderClass().
- fileWriterClass
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);
Attempts to guess writer class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileWriterClass().
- short2reg
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
Gets the most recent subclass registry HASH ref for the short class name $shortname. Wrapper for DTA::CAB::Format::Registry::short2reg().
- base2reg
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);
Gets the most recent subclass registry HASH ref for the class basename $basename. Wrapper for DTA::CAB::Format::Registry::base2reg().
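The lookup rule used by the registry methods above (the most recently registered matching record wins) can be illustrated with a small stand-alone sketch. It is not the DTA::CAB::Format::Registry implementation: the functions register_format(), guess_filename_format() and lookup_short(), and the My::Format::* class names, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a format registry (not DTA::CAB::Format::Registry).
my @REG;    # registration records, oldest first

# Add a record; each record is a hash ref with name, short, filenameRegex.
sub register_format {
    my (%opts) = @_;
    my $reg = {%opts};
    push @REG, $reg;
    return $reg;
}

# Return the most recently registered record whose filenameRegex matches.
sub guess_filename_format {
    my ($filename) = @_;
    for my $reg (reverse @REG) {
        next unless defined $reg->{filenameRegex};
        return $reg if $filename =~ $reg->{filenameRegex};
    }
    return undef;
}

# Return the most recently registered record with the given short name.
sub lookup_short {
    my ($short) = @_;
    for my $reg (reverse @REG) {
        return $reg if defined($reg->{short}) && $reg->{short} eq $short;
    }
    return undef;
}

register_format(name => 'My::Format::JSON', short => 'json',
                filenameRegex => qr/\.(?i:json|jsn)$/);
register_format(name => 'My::Format::XmlA', short => 'xml-a',
                filenameRegex => qr/\.(?i:xml)$/);
register_format(name => 'My::Format::XmlB', short => 'xml-b',
                filenameRegex => qr/\.(?i:t\.?xml|xml)$/);

# Both XML records match this filename; the later registration is returned.
my $guessed = guess_filename_format('corpus.t.xml');
print $guessed->{name}, "\n";
print lookup_short('json')->{name}, "\n";
```

Scanning the records in reverse registration order is one simple way to get "most recent wins" behaviour when several regexes match the same filename.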
Methods: Persistence
- noSaveKeys
@keys = $class_or_obj->noSaveKeys();
Returns list of keys not to be saved. This implementation ignores the key outbuf, which is used by many writer subclasses.
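As an illustration of how a list of no-save keys might be applied, the sketch below filters a plain hash before it would be written out. It does not use DTA::CAB::Persistent; the functions no_save_keys() and saveable() are hypothetical, and the key name outbuf is used only as an assumed example of a transient buffer.

```perl
use strict;
use warnings;

# Hypothetical object data; 'outbuf' stands for a transient output buffer.
my %fmt = (level => 1, outbuf => "buffered text", encoding => 'UTF-8');

# Keys to skip when saving, as a noSaveKeys()-style method might return them.
sub no_save_keys { return qw(outbuf) }

# Build the hash that would actually be saved, leaving the object untouched.
sub saveable {
    my ($obj) = @_;
    my %skip = map { ($_ => 1) } no_save_keys();
    my %saved;
    for my $key (keys %$obj) {
        next if $skip{$key};
        $saved{$key} = $obj->{$key};
    }
    return \%saved;
}

my $saved = saveable(\%fmt);
print join(',', sort keys %$saved), "\n";
```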
Methods: MIME
- shortName
$short = $fmt->shortName();
Get short name for $fmt. Default just returns lower-cased DTA::CAB::Format:: class suffix. Short names are all lower-case by default.
- mimeType
$type = $fmt->mimeType();
Returns MIME type for $fmt. Default returns 'text/plain'.
- defaultExtension
$ext = $fmt->defaultExtension();
Returns default filename extension for $fmt (default='.cab').
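The default-and-override pattern behind these three methods can be sketched with two small packages. The classes My::Format and My::Format::JSON are hypothetical stand-ins, not DTA::CAB classes, and the overridden values in the subclass are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical base class (not DTA::CAB::Format itself) with overridable defaults.
package My::Format;

sub new { my ($class, %args) = @_; return bless {%args}, ref($class) || $class }

# Short name: lower-cased part of the class name after the My::Format:: prefix.
sub shortName {
    my $self  = shift;
    my $class = ref($self) || $self;
    (my $short = $class) =~ s/^My::Format:://;
    return lc($short);
}
sub mimeType         { return 'text/plain' }
sub defaultExtension { return '.cab' }

# A subclass overrides only what differs from the defaults.
package My::Format::JSON;
our @ISA = ('My::Format');
sub mimeType         { return 'application/json' }
sub defaultExtension { return '.json' }

package main;

my $fmt = My::Format::JSON->new();
printf "%s %s %s\n", $fmt->shortName(), $fmt->mimeType(), $fmt->defaultExtension();
```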
Methods: Input
- close
$fmt = $fmt->close(); $fmt = $fmt->close($savetmp);
Close current input source, if any. Default implementation calls $fmt->{tmpfh}->close() iff available and $savetmp is false (default). Always deletes @$fmt{qw(fh doc)}.
- fromString
$fmt = $fmt->fromString(\$string);
Select input from the string $string. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
- fromFile
$fmt = $fmt->fromFile($filename);
Select input from file $filename. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
- fromFh
$fmt = $fmt->fromFh($fh);
Select input from open filehandle $fh. Default implementation just calls $fmt->close(1) and sets $fmt->{fh}=$fh.
- fromFh_str
$fmt = $fmt->fromFh_str($fh);
Alternate fromFh() implementation which slurps the contents of $fh and calls $fmt->fromString(\$str).
- parseDocument
$doc = $fmt->parseDocument();
Parse document from currently selected input source.
- parseString
$doc = $fmt->parseString(\$str);
Wrapper for $fmt->fromString(\$str)->parseDocument().
- parseFile
$doc = $fmt->parseFile($filename_or_fh);
Wrapper for $fmt->fromFile($filename_or_fh)->parseDocument().
- parseFh
$doc = $fmt->parseFh($fh);
Wrapper for $fmt->fromFh($fh)->parseDocument().
- forceDocument
$doc = $fmt->forceDocument($reference);
Attempt to tweak $reference into a DTA::CAB::Document. This is a slightly more in-depth version of DTA::CAB::Datum::toDocument(). Currently supported $reference forms are:
- DTA::CAB::Document object
returned literally
- DTA::CAB::Sentence object
returns a new document with a single sentence $reference.
- DTA::CAB::Token object
returns a new document with a single token $reference.
- non-reference
returns a new document with a single token whose 'text' key is $reference.
- HASH reference with 'body' key
returns a bless()ed $reference as a DTA::CAB::Document.
- HASH reference with 'tokens' key
returns a new document with the single sentence $reference
- HASH reference with 'text' key
returns a new document with the single token $reference
- ARRAY reference with non-reference initial element
returns a new document with a single sentence whose 'tokens' field is set to $reference.
- ... anything else
will cause a warning to be emitted and $reference to be returned as-is.
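The dispatch rules listed for forceDocument() can be sketched as a single function over plain Perl data. This is not the DTA::CAB implementation: the class names My::Document, My::Sentence and My::Token are hypothetical stand-ins, and the layout assumed here (a document holds its sentences in 'body', a sentence holds its tokens in 'tokens') is inferred from the key names in the list.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Build a hypothetical document object around a list of sentences.
sub new_doc {
    my (@sents) = @_;
    return bless { body => [@sents] }, 'My::Document';
}

# Coerce the argument into a My::Document, following the listed cases in order.
sub force_document {
    my ($ref) = @_;
    if (!ref($ref)) {
        # non-reference: a single token whose 'text' is the argument
        return new_doc({ tokens => [ { text => $ref } ] });
    }
    if (blessed($ref)) {
        return $ref                          if $ref->isa('My::Document');
        return new_doc($ref)                 if $ref->isa('My::Sentence');
        return new_doc({ tokens => [$ref] }) if $ref->isa('My::Token');
    }
    elsif (ref($ref) eq 'HASH') {
        return bless($ref, 'My::Document')   if exists $ref->{body};
        return new_doc($ref)                 if exists $ref->{tokens};
        return new_doc({ tokens => [$ref] }) if exists $ref->{text};
    }
    elsif (ref($ref) eq 'ARRAY' && @$ref && !ref($ref->[0])) {
        # array of plain strings: becomes the 'tokens' field of one sentence
        return new_doc({ tokens => $ref });
    }
    warn "force_document: cannot convert argument\n";
    return $ref;
}

my $doc = force_document('Hello');
print ref($doc), ' ', $doc->{body}[0]{tokens}[0]{text}, "\n";
```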
Methods: Output
- formatLevel
$lvl = $fmt->formatLevel(); $fmt = $fmt->formatLevel($level)
Get/set output formatting level.
- flush
$fmt = $fmt->flush();
Flush any buffered output to selected output source. Default implementation deletes $fmt->{outbuf} and calls $fmt->{fh}->flush() if available.
- toString
$fmt = $fmt->toString(\$str); $fmt = $fmt->toString(\$str,$formatLevel)
Select output to byte-string $str. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).
- toString_buf
$fmt_or_undef = $fmt->toString_buf(\$str)
Alternate toString() implementation which sets $str=$fmt->{outbuf}.
- toFile
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
Select output to named file $filename. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).
- toFh
$fmt_or_undef = $fmt->toFh($fh,$formatLevel);
Select output to an open filehandle $fh. Default implementation just calls $fmt->formatLevel($level) and sets $fmt->{fh}=$fh.
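The relationship between string output and filehandle output described above can be sketched with Perl's in-memory filehandles. This is a stand-alone illustration, not the DTA::CAB::Format code: the functions to_fh(), to_string() and flush_fmt() are hypothetical, and the writer object is reduced to a plain hash.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical writer state as a plain hash: level plus current output handle.
my %fmt = (level => 0);

# Select an open filehandle for output and remember the formatting level.
sub to_fh {
    my ($fmt, $fh, $level) = @_;
    $fmt->{level} = $level if defined $level;
    $fmt->{fh}    = $fh;
    return $fmt;
}

# Select a string for output by opening an in-memory handle on it.
sub to_string {
    my ($fmt, $strref, $level) = @_;
    open(my $fh, '>', $strref) or return undef;
    $fmt->{tmpfh} = $fh;
    return to_fh($fmt, $fh, $level);
}

# Flush pending output on the selected handle, if there is one.
sub flush_fmt {
    my ($fmt) = @_;
    $fmt->{fh}->flush() if $fmt->{fh};
    return $fmt;
}

my $out = '';
to_string(\%fmt, \$out, 1) or die "cannot open in-memory handle";
print { $fmt{fh} } "hello\tworld\n";
flush_fmt(\%fmt);
print $out;
```

Routing string output through a temporary handle means the actual writing code only ever has to deal with filehandles.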
Methods: Output: Recommended API
- putToken
$fmt = $fmt->putToken($tok);
Append a token to the selected output sink.
Should be non-destructive for $tok.
No default implementation, but default implementations of other methods assume output is concatenated onto $fmt->{outbuf}.
- putTokenRaw
$fmt = $fmt->putTokenRaw($tok)
Copy-by-reference version of "putToken". Default implementation just calls $fmt->putToken($tok).
- putSentence
$fmt = $fmt->putSentence($sent)
Append a sentence to the selected output sink.
Should be non-destructive for $sent.
Default implementation just calls $fmt->putToken() for each token and appends one additional "\n" to $fmt->{outbuf}.
- putSentenceRaw
$fmt = $fmt->putSentenceRaw($sent)
Copy-by-reference version of "putSentence". Default implementation just calls "putSentence".
- putDocument
$fmt = $fmt->putDocument($doc);
Append document contents to the selected output sink.
Should be non-destructive for $doc.
Default implementation just calls $fmt->putSentence() for each sentence.
- putDocumentRaw
$fmt = $fmt->putDocumentRaw($doc);
Copy-by-reference version of "putDocument".
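The layering of the put* methods (document over sentences over tokens, with output concatenated onto an outbuf string) can be sketched with a tiny writer class. My::Writer is a hypothetical class, not part of DTA::CAB, and the one-token-per-line output and the 'body', 'tokens' and 'text' keys are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical one-token-per-line writer (not a DTA::CAB class).
package My::Writer;

sub new { my ($class) = @_; return bless { outbuf => '' }, $class }

# Append one token's text as a line; the token itself is not modified.
sub putToken {
    my ($self, $tok) = @_;
    $self->{outbuf} .= $tok->{text} . "\n";
    return $self;
}

# Append each token, then one extra newline to end the sentence.
sub putSentence {
    my ($self, $sent) = @_;
    $self->putToken($_) for @{ $sent->{tokens} };
    $self->{outbuf} .= "\n";
    return $self;
}

# Append each sentence of the document body.
sub putDocument {
    my ($self, $doc) = @_;
    $self->putSentence($_) for @{ $doc->{body} };
    return $self;
}

package main;

my $doc = { body => [
    { tokens => [ { text => 'Hello' }, { text => 'world' } ] },
    { tokens => [ { text => 'Bye' } ] },
] };
my $w = My::Writer->new()->putDocument($doc);
print $w->{outbuf};
```

Because each method returns the writer object, calls can be chained, which mirrors the `$fmt = $fmt->putDocument($doc)` convention used in the signatures above.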
The following formats are provided by the default distribution. In some cases, external dependencies are also required which may not be available on all systems.
- DTA::CAB::Format::Builtin
Just a convenience package: load all built-in DTA::CAB::Format subclasses.
- DTA::CAB::Format::ExpandList
Formatter for runtime term expansion, for use e.g. with DDC Cab Expander. Registered as:
name=>__PACKAGE__, short=>'xl', filenameRegex=>qr/\.(?i:xl|xlist|l|lst)$/
- DTA::CAB::Format::JSON
Abstract datum parser|formatter for JSON I/O. Transparently wraps one of the DTA::CAB::Format::JSON::XS or DTA::CAB::Format::JSON::Syck classes, depending on the availability of the underlying Perl modules (JSON::XS and JSON::Syck, respectively). If you have the JSON::XS module installed, this module provides the fastest I/O of all available human-readable format classes. Registered as:
name=>__PACKAGE__, short=>'json', filenameRegex=>qr/\.(?i:json|jsn)$/
- DTA::CAB::Format::LemmaList
Formatter for runtime term lemmatization, for use e.g. with DDC Cab Expander. By default, returns all lemmata for function word input tokens (whose tag matches the regex in the cctagre option), otherwise only the "best" lemma. Registered as:
(name=>__PACKAGE__, short=>$_, filenameRegex=>qr/\.(?i:ll|llist|lemmas|lemmata)/) foreach (qw(LemmaList llist ll lemma))
A variant which returns all known lemmata for each input token is registered as:
(name=>__PACKAGE__, short=>$_, opts=>{cctagre=>''}) foreach (qw(LemmaListAll LemmasAll llist-all ll-all lla lemmas lemmata))
- DTA::CAB::Format::Null
Null-op parser/formatter for debugging and testing purposes. Registered as:
- DTA::CAB::Format::Perl
Datum parser|formatter: perl code via Data::Dumper, eval(). Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:prl|pl|perl|dump)$/
- DTA::CAB::Format::Raw
Abstract input-only format for reading raw untokenized text, wraps DTA::CAB::Format::Raw::HTTP by default.
- DTA::CAB::Format::Raw::HTTP
Input-only format for reading raw untokenized text and analyzing it over HTTP using a remote WASTE FastCGI interface, registered as:
name=>__PACKAGE__, short=>'raw-http', filenameRegex=>qr/\.(?i:raw-http|txt-http)$/
- DTA::CAB::Format::Raw::Perl
Input-only format for reading raw untokenized text and analyzing it using simple pure-perl heuristics. Registered as:
name=>__PACKAGE__, short=>'raw-perl', filenameRegex=>qr/\.(?i:raw-perl|txt-perl)$/
- DTA::CAB::Format::Raw::Waste
Input-only format for reading raw untokenized text and analyzing it using the Moot::Waste module, registered as:
name=>__PACKAGE__, short=>'raw-waste', filenameRegex=>qr/\.(?i:raw-waste|txt-waste)$/
- DTA::CAB::Format::Storable
Binary datum parser|formatter using the Storable module. Very fast, but neither human-readable nor easily portable beyond Perl. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:sto|bin)$/
- DTA::CAB::Format::SynCoPe::CSV
Datum parser|formatter for SynCoPe named entity recognizer mode. Registered as:
name=>__PACKAGE__, short=>'syncope-csv', filenameRegex=>qr/\.(?i:syn(?:cope)?[-\.](?:csv|tsv|tab)|)$/
- DTA::CAB::Format::TCF
Datum parser|formatter for CLARIN-D TCF XML. Handles annotation layers tokens, sentences, orthography, postags, and lemmas. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:tcf[\.\-_]?xml)|(?:tcf))$/)
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography'}) foreach (qw(tcf-orth tcf-web))
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography postags lemmas'}) foreach (qw(tcf tcf-xml tcfxml full-tcf xtcf))
- DTA::CAB::Format::TEI
Datum parser|formatter: for raw un-tokenized TEI XML (with or without //c elements) using DTA::TokWrap. Any //s or //w elements in the input will be IGNORED and input will be (re-)tokenized. Output files are themselves parseable by DTA::CAB::Format::TEIws. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:c|chr|txt|tei(?:[\.\-_]?p[45])?)[\.\-_]xml|xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(chr-xml c-xml cxml tei-xml teixml tei xml))
By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following additional aliases are provided for using the DTA::CAB::Format::XmlTokWrapFast module to format the low-level flat token data (faster but not as flexible as the default):
(name=>__PACKAGE__, short=>$_, opts=>{txmlfmt=>'DTA::CAB::Format::XmlTokWrapFast'}) foreach (qw(fast-tei-xml ftei-xml fteixml ftei))
Additionally, the following aliases are provided for using the DTA::CAB::Format::XmlLing module to format the low-level flat token data using TEI att.linguistic conventions:
(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1}) foreach (qw(ling-tei-xml ltei-xml lteixml ltei tei-ling tei+ling teiling))
- DTA::CAB::Format::TEIws
Datum parser|formatter: for TEI XML pre-tokenized into (possibly fragmented) //w and //s elements, as output by DTA::TokWrap. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:spliced|tei[\.\-\+]?ws?|wst?)[\.\-]xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(tei-ws tei+ws tei+w tei-w teiw wst-xml wstxml teiws-xml));
By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following aliases are provided for using the DTA::CAB::Format::XmlLing module to format the low-level flat token data using TEI att.linguistic conventions:
(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1}) foreach (qw(lteiws teilws teiwsl ltei-ws ltei+ws tei+w ltei-w lteiw lwst-xml lwstxml lteiws-xml), qw(ling-tei-ws tei+ling+ws tei+ws+ling teiws-ling-xml teiws+ling-xml))
- DTA::CAB::Format::Text
Datum parser|formatter: verbose human-readable text. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:txt|text|cab\-txt|cab\-text)$/
- DTA::CAB::Format::TJ
Datum parser|formatter: "vertical" text, one token per line, with a single TAB-separated attribute field encoding token data as JSON. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:tj|tjson|cab\-tj|cab\-tjson)$/);
- DTA::CAB::Format::TT
Datum parser|formatter: "vertical" text, one token per line, TAB-separated attribute fields with conventional attribute-name prefixes. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:t|tt|ttt|cab\-t|cab\-tt|cab\-ttt)$/
- DTA::CAB::Format::YAML
Abstract datum parser|formatter for YAML I/O. Transparently wraps one of the DTA::CAB::Format::YAML::XS, DTA::CAB::Format::YAML::Syck, or DTA::CAB::Format::YAML::Lite classes, depending on the availability of the underlying Perl modules (YAML::XS, YAML::Syck, and YAML::Lite, respectively). Registered as:
name=>__PACKAGE__, short=>'yaml', filenameRegex=>qr/\.(?i:yaml|yml)$/
- DTA::CAB::Format::XmlCommon
Datum parser|formatter: XML: abstract base class.
- DTA::CAB::Format::XmlLing
Datum parser|formatter: minimalistic flat TokWrap-like XML using only TEI att.linguistic attributes. Based on DTA::CAB::Format::XmlTokWrapFast, the XmlLing parser reads and writes only IDs and the TEI att.linguistic attributes. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:(?:ling|l[tuws])(?:\.?)xml))$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(ltxml lxml ling-xml lt-xml ltwxml ltw-xml))
- DTA::CAB::Format::XmlNative
Datum parser|formatter: XML (native). Nearly compatible with files as created by dta-tokwrap.perl(1). Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml\-native|xml\-dta\-cab|(?:dta[\-\._]cab[\-\._]xml)|xml)$/
and aliased as:
name=>__PACKAGE__, short=>'xml'
- DTA::CAB::Format::XmlPerl
Datum parser|formatter: XML (perl-like). Not really recommended. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)perl|perl(?:[\-\.]?)xml)$/
- DTA::CAB::Format::XmlRpc
Datum parser|formatter: XML-RPC data structures using RPC::XML. Much too bloated to be of any real practical use. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)rpc|rpc(?:[\-\.]?)xml)$/
- DTA::CAB::Format::XmlTokWrap
Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:[tuws]\.?xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(txml t-xml twxml tw-xml))
- DTA::CAB::Format::XmlTokWrapFast
Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Unlike the XmlTokWrap format, the XmlTokWrapFast class does not read and/or write the full document structure, but rather restricts itself to a finite hard-coded subset of the most commonly used document-, sentence-, and token-level attributes. The input parser uses the expat-based XML::Parser module, which usually results in much faster and memory-friendlier document parsing than offered by the XmlTokWrap class. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:f[tuws](?:\.?)xml))$/);
(name=>__PACKAGE__, short=>$_) foreach (qw(ftxml ft-xml ftwxml ftw-xml))
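To see how the filenameRegex values in this list behave, a few of them can be tried against sample filenames in a stand-alone script. The regexes below are copied from the entries above; the helper matching_classes() and the sample filenames are hypothetical, and the script does not load DTA::CAB or consult its registry.

```perl
use strict;
use warnings;

# A few filenameRegex values copied from the list above, keyed by class name.
my %regex = (
    'DTA::CAB::Format::JSON'           => qr/\.(?i:json|jsn)$/,
    'DTA::CAB::Format::YAML'           => qr/\.(?i:yaml|yml)$/,
    'DTA::CAB::Format::TT'             => qr/\.(?i:t|tt|ttt|cab\-t|cab\-tt|cab\-ttt)$/,
    'DTA::CAB::Format::XmlTokWrap'     => qr/\.(?i:[tuws]\.?xml)$/,
    'DTA::CAB::Format::XmlTokWrapFast' => qr/(?:\.(?i:f[tuws](?:\.?)xml))$/,
);

# Return the sorted names of all listed classes whose regex matches $filename.
sub matching_classes {
    my ($filename) = @_;
    my @hits = grep { $filename =~ $regex{$_} } keys %regex;
    return sort @hits;
}

for my $file (qw(doc.json doc.TT doc.t.xml doc.ft.xml doc.unknown)) {
    my @classes = matching_classes($file);
    printf "%-12s %s\n", $file, @classes ? join(', ', @classes) : '(no match)';
}
```

Several of the regexes in the full list also accept a plain .xml suffix, so more than one format can match the same filename; in that case the registry's most-recently-registered rule decides.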
Bryan Jurish
Copyright (C) 2009-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.