NAME
DTA::CAB::Format - Base class for DTA::CAB::Datum I/O
SYNOPSIS
use DTA::CAB::Format;
##========================================================================
## Constructors etc.
$fmt = $CLASS_OR_OBJ->new(%args);
$fmt = $CLASS->newFormat($class_or_class_suffix, %opts);
$fmt = $CLASS->newReader(%opts);
$fmt = $CLASS->newWriter(%opts);
##========================================================================
## Methods: Global Format Registry
\%classReg_or_undef = $CLASS_OR_OBJ->registerFormat(%classRegOptions);
\%classReg_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);
$class_or_undef = $CLASS_OR_OBJ->shortReaderClass($shortname);
$class_or_undef = $CLASS_OR_OBJ->shortWriterClass($shortname);
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);
##========================================================================
## Methods: Persistence
@keys = $class_or_obj->noSaveKeys();
##========================================================================
## Methods: MIME
$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();
##========================================================================
## Methods: Input
$fmt = $fmt->close();
$fmt = $fmt->fromString(\$string);
$fmt = $fmt->fromFile($filename);
$fmt = $fmt->fromFh($fh);
$doc = $fmt->parseDocument();
$doc = $fmt->parseString(\$str);
$doc = $fmt->parseFile($filename);
$doc = $fmt->parseFh($fh);
$doc = $fmt->forceDocument($reference);
##========================================================================
## Methods: Output
$lvl = $fmt->formatLevel();
$fmt = $fmt->flush();
$fmt_or_undef = $fmt->toString(\$str, $formatLevel);
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
$fmt_or_undef = $fmt->toFh($fh, $formatLevel);
$fmt = $fmt->putDocument($doc);
$fmt = $fmt->putDocumentRaw($doc);
DESCRIPTION
DTA::CAB::Format is an abstract base class and API specification for objects implementing an I/O format for the DTA::CAB::Datum subhierarchy in general, and for DTA::CAB::Document objects in particular.
Each I/O format (subclass) has a characteristic abstract `base class' as well as optional `reader' and `writer' subclasses which perform the actual I/O (although in the current implementation, all reader/writer classes are identical with their respective base classes). Individual formats may be invoked either directly by their respective classes (SUBCLASS->new(), etc.), or by means of the global DTA::CAB::Format::Registry object $REG ("registerFormat", "newFormat", "newReader", "newWriter", etc.).
See "SUBCLASSES" for a list of common built-in formats and their registry data.
Globals
- @ISA
-
DTA::CAB::Format inherits from DTA::CAB::Persistent and DTA::CAB::Logger.
- $CLASS_DEFAULT
-
Default class returned by "newFormat"() if no known class is specified.
- Variable: $REG
-
Default global format registry: a DTA::CAB::Format::Registry object used by "registerFormat", "newFormat", etc.
Constructors etc.
- new
-
$fmt = CLASS_OR_OBJ->new(%args);
Constructor.
%args, %$fmt:
##-- DTA::CAB::Format: common
##
##-- DTA::CAB::Format: input parsing
#(none)
##
##-- DTA::CAB::Format: output formatting
level  => $formatLevel,  ##-- formatting level, where applicable
outbuf => $stringBuffer, ##-- output buffer, where applicable
- newFormat
-
$fmt = CLASS->newFormat($class_or_class_suffix, %opts);
Wrapper for "new"() which allows short class suffixes to be passed in as format names.
- newReader
-
$fmt = CLASS->newReader(%opts);
Wrapper for DTA::CAB::Format::Registry::newReader which accepts %opts:
class => $class,    ##-- classname or DTA::CAB::Format:: suffix
file  => $filename, ##-- attempt to guess format from filename
- newWriter
-
$fmt = CLASS->newWriter(%opts);
Wrapper for DTA::CAB::Format::Registry::newWriter which accepts %opts:
class => $class,    ##-- classname or DTA::CAB::Format:: suffix
file  => $filename, ##-- attempt to guess format from filename
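The suffix handling described for "newFormat" can be pictured with a minimal, self-contained sketch. The helper name expand_format_class() and its rule (prepend the DTA::CAB::Format:: prefix unless it is already present) are assumptions made for illustration only; the actual resolution is performed by the registry and may also consult registered short names.

```perl
use strict;
use warnings;

## Hypothetical helper (not part of DTA::CAB): expand a class suffix
## such as 'TT' to a full class name under DTA::CAB::Format::
sub expand_format_class {
  my ($class_or_suffix) = @_;
  my $prefix = 'DTA::CAB::Format::';
  ## already a full class name: return unchanged
  return $class_or_suffix if (index($class_or_suffix, $prefix) == 0);
  return $prefix . $class_or_suffix;
}

print expand_format_class('TT'), "\n";
print expand_format_class('DTA::CAB::Format::JSON'), "\n";
```

Both calls yield a full class name, so a caller could pass either form where a format class is expected.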
Methods: Global Format Registry
The global format registry lives in the package variable $REG. The following methods are backwards-compatible wrappers for method calls to this registry object.
- registerFormat
-
\%registered = $CLASS_OR_OBJ->registerFormat(%opts);
Registers a new format subclass; wrapper for DTA::CAB::Format::Registry::register().
- guessFilenameFormat
-
\%registered_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);
Returns registration record for most recently registered format subclass whose filenameRegex matches $filename. Wrapper for DTA::CAB::Format::Registry::guessFilenameFormat().
- fileReaderClass
-
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
Attempts to guess reader class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileReaderClass().
- fileWriterClass
-
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);
Attempts to guess writer class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileWriterClass().
- short2reg
-
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
Gets the most recent subclass registry HASH ref for the short class name $shortname. Wrapper for DTA::CAB::Format::Registry::short2reg().
- base2reg
-
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);
Gets the most recent subclass registry HASH ref for the class basename $basename. Wrapper for DTA::CAB::Format::Registry::base2reg().
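The "most recently registered wins" lookup described in this section can be sketched without the module itself. The following stand-in registry is an assumption for illustration: it is a plain array of records, the function names guess_by_filename() and lookup_short() are hypothetical, and the third record (My::Format::JSON) is invented to show a later registration taking precedence. The two filenameRegex values for JSON and YAML are the ones quoted under "SUBCLASSES".

```perl
use strict;
use warnings;

## Stand-in registry (assumption): records kept in registration order
my @reg = (
  { name => 'DTA::CAB::Format::JSON', short => 'json',
    filenameRegex => qr/\.(?i:json|jsn)$/ },
  { name => 'DTA::CAB::Format::YAML', short => 'yaml',
    filenameRegex => qr/\.(?i:yaml|yml)$/ },
  ## hypothetical later registration which also claims *.json
  { name => 'My::Format::JSON', short => 'myjson',
    filenameRegex => qr/\.json$/ },
);

## Return the most recently registered record matching $filename, or undef
sub guess_by_filename {
  my ($filename) = @_;
  foreach my $rec (reverse @reg) {
    return $rec
      if (defined($rec->{filenameRegex}) && $filename =~ $rec->{filenameRegex});
  }
  return undef;
}

## Return the most recently registered record for a short name, or undef
sub lookup_short {
  my ($short) = @_;
  return (grep { ($_->{short} // '') eq $short } reverse @reg)[0];
}

foreach my $file (qw(doc.json doc.JSN doc.yml doc.bin)) {
  my $rec = guess_by_filename($file);
  printf("%-8s => %s\n", $file, ($rec ? $rec->{name} : '(none)'));
}
```

Scanning the records in reverse order is one simple way to give later registrations priority; the real registry may organize its data differently.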
Methods: Persistence
- noSaveKeys
-
@keys = $class_or_obj->noSaveKeys();
Returns list of keys not to be saved. This implementation excludes the key outbuf, which is used by some writer subclasses.
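As a rough illustration of what such a key list is for, the sketch below filters a hash before it would be saved. The hash contents (including the encoding key) are hypothetical, and the filtering shown is an assumption about how a persistence layer might use the list, not the DTA::CAB::Persistent implementation.

```perl
use strict;
use warnings;

## Hypothetical format object as a plain hash; 'outbuf' is runtime-only state
my %fmt = (level => 1, outbuf => "buffered output ...", encoding => 'UTF-8');

## Keys to skip when saving, mirroring the documented noSaveKeys() result
my %nosave = map { ($_ => 1) } qw(outbuf);

## Shallow copy of %fmt without the excluded keys
my %saved = map { ($_ => $fmt{$_}) } grep { !$nosave{$_} } keys %fmt;

print join(',', sort keys %saved), "\n";
```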
Methods: MIME
- shortName
-
$short = $fmt->shortName();
Get short name for $fmt. Default just returns lower-cased DTA::CAB::Format:: class suffix. Short names are all lower-case by default.
- mimeType
-
$type = $fmt->mimeType();
Returns MIME type for $fmt. Default returns 'text/plain'.
- defaultExtension
-
$ext = $fmt->defaultExtension();
Returns default filename extension for $fmt (default='.cab').
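The documented default for shortName() (the lower-cased DTA::CAB::Format:: class suffix) can be approximated as follows. The helper name default_short_name() is hypothetical, and the sketch only covers single-level suffixes such as JSON or TT.

```perl
use strict;
use warnings;

## Hypothetical helper: lower-cased class suffix after DTA::CAB::Format::
sub default_short_name {
  my ($class_or_obj) = @_;
  ## accept either a class name or a blessed object
  my $class = ref($class_or_obj) || $class_or_obj;
  (my $short = $class) =~ s/^DTA::CAB::Format:://;
  return lc($short);
}

print default_short_name('DTA::CAB::Format::JSON'), "\n";
print default_short_name(bless({}, 'DTA::CAB::Format::TT')), "\n";
```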
Methods: Input
- close
-
$fmt = $fmt->close();
$fmt = $fmt->close($savetmp);
Close current input source, if any. Default implementation calls $fmt->{tmpfh}->close() iff available and $savetmp is false (default). Always deletes @$fmt{qw(fh doc)}.
- fromString
-
$fmt = $fmt->fromString(\$string);
Select input from the string $string. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
- fromFile
-
$fmt = $fmt->fromFile($filename);
Select input from file $filename. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
- fromFh
-
$fmt = $fmt->fromFh($fh);
Select input from open filehandle $fh. Default implementation just calls $fmt->close(1) and sets $fmt->{fh}=$fh.
- fromFh_str
-
$fmt = $fmt->fromFh_str($handle);
Alternate fromFh() implementation which slurps contents of $handle and calls $fmt->fromString(\$str).
- parseDocument
-
$doc = $fmt->parseDocument();
Parse document from currently selected input source.
- parseString
-
$doc = $fmt->parseString($str);
Wrapper for $fmt->fromString($str)->parseDocument().
- parseFile
-
$doc = $fmt->parseFile($filename_or_fh);
Wrapper for $fmt->fromFile($filename_or_fh)->parseDocument().
- parseFh
-
$doc = $fmt->parseFh($fh);
Wrapper for $fmt->fromFh($fh)->parseDocument().
- forceDocument
-
$doc = $fmt->forceDocument($reference);
Attempt to tweak $reference into a DTA::CAB::Document. This is a slightly more in-depth version of DTA::CAB::Datum::toDocument(). Currently supported $reference forms are:
- DTA::CAB::Document object
-
returned literally
- DTA::CAB::Sentence object
-
returns a new document with a single sentence $reference.
- DTA::CAB::Token object
-
returns a new document with a single token $reference.
- non-reference
-
returns a new document with a single token whose 'text' key is $reference.
- HASH reference with 'body' key
-
returns a bless()ed $reference as a DTA::CAB::Document.
- HASH reference with 'tokens' key
-
returns a new document with the single sentence $reference
- HASH reference with 'text' key
-
returns a new document with the single token $reference
- ARRAY reference with non-reference initial element
-
returns a new document with a single sentence whose 'tokens' field is set to $reference.
- ... anything else
-
will cause a warning to be emitted and $reference to be returned as-is.
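The case analysis above can be written out as a small dispatch routine. This is a sketch under stated assumptions, not the module's code: force_document_sketch() and the two stand-in constructors are hypothetical, the classes are only used as bless() targets (DTA::CAB is not loaded), and a single token is assumed to be wrapped in a one-token sentence inside the new document.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

## Stand-in constructors (assumption): a document is {body=>\@sentences},
## a sentence is {tokens=>\@tokens}, a token is {text=>$string}
sub doc_of_sentences { return bless({ body => [@_] }, 'DTA::CAB::Document'); }
sub sent_of_tokens   { return bless({ tokens => [@_] }, 'DTA::CAB::Sentence'); }

sub force_document_sketch {
  my ($ref) = @_;
  if (!ref($ref)) {
    ## non-reference: single token whose 'text' key is the string itself
    my $tok = bless({ text => $ref }, 'DTA::CAB::Token');
    return doc_of_sentences(sent_of_tokens($tok));
  }
  if (blessed($ref)) {
    return $ref                                   if $ref->isa('DTA::CAB::Document');
    return doc_of_sentences($ref)                 if $ref->isa('DTA::CAB::Sentence');
    return doc_of_sentences(sent_of_tokens($ref)) if $ref->isa('DTA::CAB::Token');
  }
  my $type = reftype($ref) // '';
  if ($type eq 'HASH') {
    return bless($ref, 'DTA::CAB::Document')      if exists $ref->{body};
    return doc_of_sentences($ref)                 if exists $ref->{tokens};
    return doc_of_sentences(sent_of_tokens($ref)) if exists $ref->{text};
  }
  if ($type eq 'ARRAY' && @$ref && !ref($ref->[0])) {
    ## list of plain strings: one sentence whose 'tokens' field is the list
    return doc_of_sentences({ tokens => $ref });
  }
  warn "force_document_sketch: cannot convert argument; returning it as-is\n";
  return $ref;
}

my $doc = force_document_sketch('Hallo');
print ref($doc), ": ", $doc->{body}[0]{tokens}[0]{text}, "\n";
```

Checking blessed objects before plain HASH references keeps the object cases from being shadowed by the key-based checks.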
Methods: Output
- formatLevel
-
$lvl = $fmt->formatLevel();
$fmt = $fmt->formatLevel($level);
Get/set output formatting level.
- flush
-
$fmt = $fmt->flush();
Flush any buffered output to selected output source. Default implementation deletes $fmt->{outbuf} and calls $fmt->{fh}->flush() if available.
- toString
-
$fmt = $fmt->toString(\$str);
$fmt = $fmt->toString(\$str,$formatLevel);
Select output to byte-string $str. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).
- toString_buf
-
$fmt_or_undef = $fmt->toString_buf(\$str)
Alternate toString() implementation which sets $str=$fmt->{outbuf}.
- toFile
-
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
Select output to named file $filename. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).
- toFh
-
$fmt_or_undef = $fmt->toFh($fh,$formatLevel);
Select output to an open filehandle $fh. Default implementation just calls $fmt->formatLevel($level) and sets $fmt->{fh}=$fh.
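The string-oriented methods are described as wrappers around their filehandle counterparts. One standard way to obtain a suitable handle in Perl is an in-memory filehandle opened on a scalar reference; whether the module does exactly this internally is an assumption. The sketch uses core Perl only, and the variable names are illustrative.

```perl
use strict;
use warnings;

## Writing: an in-memory filehandle stores output in the referenced scalar
my $out = '';
open(my $wfh, '>', \$out) or die "cannot open in-memory output handle: $!";
print {$wfh} "Hallo\n", "Welt\n";
close($wfh) or die "close failed: $!";

## Reading: the same mechanism can serve a fromString()-style input selection
open(my $rfh, '<', \$out) or die "cannot open in-memory input handle: $!";
my @lines = <$rfh>;
close($rfh);

print scalar(@lines), " lines, ", length($out), " bytes\n";
```

With this approach a format class only needs to implement the filehandle variants, and the string variants follow for free.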
Methods: Output: Recommended API
- putToken
-
$fmt = $fmt->putToken($tok);
Append a token to the selected output sink.
Should be non-destructive for $tok.
No default implementation, but default implementations of other methods assume output is concatenated onto $fmt->{outbuf}.
- putTokenRaw
-
$fmt = $fmt->putTokenRaw($tok)
Copy-by-reference version of "putToken". Default implementation just calls $fmt->putToken($tok).
- putSentence
-
$fmt = $fmt->putSentence($sent)
Append a sentence to the selected output sink.
Should be non-destructive for $sent.
Default implementation just iterates $fmt->putToken() & appends 1 additional "\n" to $fmt->{outbuf}.
- putSentenceRaw
-
$fmt = $fmt->putSentenceRaw($sent)
Copy-by-reference version of "putSentence". Default implementation just calls "putSentence".
- putDocument
-
$fmt = $fmt->putDocument($doc);
Append document contents to the selected output sink.
Should be non-destructive for $doc.
Default implementation just iterates $fmt->putSentence().
- putDocumentRaw
-
$fmt = $fmt->putDocumentRaw($doc);
Copy-by-reference version of "putDocument".
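The layering of the recommended output API (putDocument() iterates putSentence(), which iterates putToken() and appends one additional newline to the output buffer) can be sketched with a toy writer. My::Sketch::Writer is a hypothetical class that does not inherit from DTA::CAB::Format, and its one-token-text-per-line output is an assumption chosen for brevity.

```perl
use strict;
use warnings;

## Hypothetical minimal writer (not a DTA::CAB class)
package My::Sketch::Writer;

sub new {
  my ($class, %args) = @_;
  return bless({ outbuf => '', %args }, $class);
}

## putToken: append one token to the output buffer; does not modify $tok
sub putToken {
  my ($fmt, $tok) = @_;
  $fmt->{outbuf} .= $tok->{text} . "\n";
  return $fmt;
}

## putSentence: iterate putToken(), then append one additional "\n"
sub putSentence {
  my ($fmt, $sent) = @_;
  $fmt->putToken($_) foreach (@{ $sent->{tokens} });
  $fmt->{outbuf} .= "\n";
  return $fmt;
}

## putDocument: iterate putSentence()
sub putDocument {
  my ($fmt, $doc) = @_;
  $fmt->putSentence($_) foreach (@{ $doc->{body} });
  return $fmt;
}

package main;

my $doc = { body => [
  { tokens => [ { text => 'Hallo' }, { text => 'Welt' } ] },
  { tokens => [ { text => 'Ende' } ] },
] };
my $fmt = My::Sketch::Writer->new->putDocument($doc);
print $fmt->{outbuf};
```

Because each method returns the format object, calls can be chained; a subclass following this pattern would only need to supply putToken().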
SUBCLASSES
The following formats are provided by the default distribution. In some cases, external dependencies are also required which may not be available on all systems.
- DTA::CAB::Format::Builtin
-
Just a convenience package: load all built-in DTA::CAB::Format subclasses.
- DTA::CAB::Format::ExpandList
-
Formatter for runtime term expansion, for use e.g. with DDC Cab Expander, registered as:
name=>__PACKAGE__, short=>'xl', filenameRegex=>qr/\.(?i:xl|xlist|l|lst)$/
- DTA::CAB::Format::CONLLU
-
Datum parser|formatter for "vertical" text conforming to the CONLL-U format, with optional special handling for additional MISC fields, including json=JSON for embedding DTA::CAB::Format::TJ CAB-token structure. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:conllu|conll[_-]u|cab[\.-]connlu|cab[\.-]conll[\.-]u)$/
Aliases:
conllu conll-u cab-conllu cab-conll-u
- DTA::CAB::Format::JSON
-
Abstract datum parser|formatter for JSON I/O. Transparently wraps one of the DTA::CAB::Format::JSON::XS or DTA::CAB::Format::JSON::Syck classes, depending on the availability of the underlying Perl modules (JSON::XS and JSON::Syck, respectively). If you have the JSON::XS module installed, this module provides the fastest I/O of all available human-readable format classes. Registered as:
name=>__PACKAGE__, short=>'json', filenameRegex=>qr/\.(?i:json|jsn)$/
- DTA::CAB::Format::LemmaList
-
Formatter for runtime term lemmatization, for use e.g. with DDC Cab Expander. By default, returns all lemmata for function word input tokens (whose tag matches the regex /^(?:[CKP\$]|A[PR]|V[AM])/), otherwise only the "best" lemma. Registered as:
(name=>__PACKAGE__, short=>$_, filenameRegex=>qr/\.(?i:ll|llist|lemmas|lemmata)/) foreach (qw(LemmaList llist ll lemma))
A variant which returns all known lemmata for each input token is registered as:
(name=>__PACKAGE__, short=>$_, opts=>{cctagre=>''}) foreach (qw(LemmaListAll LemmasAll llist-all ll-all lla lemmas lemmata))
- DTA::CAB::Format::Null
-
Null-op parser/formatter for debugging and testing purposes. Registered as:
name=>__PACKAGE__
- DTA::CAB::Format::Perl
-
Datum parser|formatter: perl code via Data::Dumper, eval(). Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:prl|pl|perl|dump)$/
- DTA::CAB::Format::Raw
-
Abstract only format for reading raw untokenized text and writing simple flat list of canonical forms; wraps DTA::CAB::Format::Raw::Waste by default. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:raw)$/
- DTA::CAB::Format::Raw::HTTP
-
Input-only format for reading raw untokenized text and analyzing it over HTTP using a remote WASTE FastCGI interface, registered as:
name=>__PACKAGE__, short=>'raw-http', filenameRegex=>qr/\.(?i:raw-http|txt-http)$/
- DTA::CAB::Format::Raw::Perl
-
Input-only format for reading raw untokenized text and analyzing it using simple pure-perl heuristics. Registered as:
name=>__PACKAGE__, short=>'raw-perl', filenameRegex=>qr/\.(?i:raw-perl|txt-perl)$/
- DTA::CAB::Format::Raw::Waste
-
Input-only format for reading raw untokenized text and analyzing it using the Moot::Waste module, registered as:
name=>__PACKAGE__, short=>'raw-waste', filenameRegex=>qr/\.(?i:raw-waste|txt-waste)$/
- DTA::CAB::Format::Storable
-
Binary datum parser|formatter using the Storable module. Very fast, but neither human-readable nor easily portable beyond Perl. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:sto|bin)$/
- DTA::CAB::Format::SynCoPe::CSV
-
Datum parser|formatter for SynCoPe named entity recognizer -tab_input mode. Registered as:
name=>__PACKAGE__, short=>'syncope-csv', filenameRegex=>qr/\.(?i:syn(?:cope)?[-\.](?:csv|tsv|tab)|)$/
- DTA::CAB::Format::TCF
-
Datum parser|formatter for CLARIN-D TCF XML. Handles annotation layers tokens, sentences, orthography, postags, and lemmas. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:tcf[\.\-_]?xml)|(?:tcf))$/)
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography'}) foreach (qw(tcf-orth tcf-web))
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography postags lemmas'}) foreach (qw(tcf tcf-xml tcfxml full-tcf xtcf))
- DTA::CAB::Format::TEI
-
Datum parser|formatter: for raw un-tokenized TEI XML (with or without //c elements) using DTA::TokWrap. Any //s or //w elements in the input will be IGNORED and input will be (re-)tokenized. Output files are themselves parseable by DTA::CAB::Format::TEIws. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:c|chr|txt|tei(?:[\.\-_]?p[45])?)[\.\-_]xml|xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(chr-xml c-xml cxml tei-xml teixml tei xml))
By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following additional aliases are provided for using the DTA::CAB::Format::XmlTokWrapFast module to format the low-level flat token data (faster but not as flexible as the default):
(name=>__PACKAGE__, short=>$_, opts=>{txmlfmt=>'DTA::CAB::Format::XmlTokWrapFast'}) foreach (qw(fast-tei-xml ftei-xml fteixml ftei))
Additionally, the following aliases are provided for using the DTA::CAB::Format::XmlLing to format the low-level flat token data using TEI att.linguistic conventions:
(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1}) foreach (qw(ling-tei-xml ltei-xml lteixml ltei tei-ling tei+ling teiling))
- DTA::CAB::Format::TEIws
-
Datum parser|formatter: for TEI XML pre-tokenized into (possibly fragmented) //w and //s elements, as output by DTA::TokWrap. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:spliced|tei[\.\-\+]?ws?|wst?)[\.\-]xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(tei-ws tei+ws tei+w tei-w teiw wst-xml wstxml teiws-xml));
By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following aliases are provided for using the DTA::CAB::Format::XmlLing to format the low-level flat token data using TEI att.linguistic conventions:
(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1}) foreach (qw(lteiws teilws teiwsl ltei-ws ltei+ws tei+w ltei-w lteiw lwst-xml lwstxml lteiws-xml), qw(ling-tei-ws tei+ling+ws tei+ws+ling teiws-ling-xml teiws+ling-xml))
- DTA::CAB::Format::Text
-
Datum parser|formatter: verbose human-readable text. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:txt|text|cab\-txt|cab\-text)$/
- DTA::CAB::Format::TJ
-
Datum parser|formatter: "vertical" text, one token per line, with a single TAB-separated attribute field encoding token data as JSON. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:tj|tjson|cab\-tj|cab\-tjson)$/);
- DTA::CAB::Format::TT
-
Datum parser|formatter: "vertical" text, one token per line, TAB-separated attribute fields with conventional attribute-name prefixes. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:t|tt|ttt|cab\-t|cab\-tt|cab\-ttt)$/
- DTA::CAB::Format::YAML
-
Abstract datum parser|formatter for YAML I/O. Transparently wraps one of the DTA::CAB::Format::YAML::XS, DTA::CAB::Format::YAML::Syck, or DTA::CAB::Format::YAML::Lite classes, depending on the availability of the underlying Perl modules (YAML::XS, YAML::Syck, and YAML::Lite, respectively). Registered as:
name=>__PACKAGE__, short=>'yaml', filenameRegex=>qr/\.(?i:yaml|yml)$/
- DTA::CAB::Format::XmlCommon
-
Datum parser|formatter: XML: abstract base class.
- DTA::CAB::Format::XmlLing
-
Datum parser|formatter: minimalistic flat TokWrap-like XML using only TEI att.linguistic attributes. Based on DTA::CAB::Format::XmlTokWrapFast, the XmlLing parser reads and writes only IDs and the TEI att.linguistic attributes (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html). Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:(?:ling|l[tuws])(?:\.?)xml))$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(ltxml lxml ling-xml lt-xml ltwxml ltw-xml))
- DTA::CAB::Format::XmlNative
-
Datum parser|formatter: XML (native). Nearly compatible with .t.xml files as created by dta-tokwrap.perl(1). Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml\-native|xml\-dta\-cab|(?:dta[\-\._]cab[\-\._]xml)|xml)$/
and aliased as:
name=>__PACKAGE__, short=>'xml'
- DTA::CAB::Format::XmlPerl
-
Datum parser|formatter: XML (perl-like). Not really recommended. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)perl|perl(?:[\-\.]?)xml)$/
- DTA::CAB::Format::XmlRpc
-
Datum parser|formatter: XML-RPC data structures using RPC::XML. Much too bloated to be of any real practical use. Registered as:
name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)rpc|rpc(?:[\-\.]?)xml)$/
- DTA::CAB::Format::XmlTokWrap
-
Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:[tuws]\.?xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(txml t-xml twxml tw-xml))
- DTA::CAB::Format::XmlTokWrapFast
-
Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Unlike the XmlTokWrap format, the XmlTokWrapFast class does not read and/or write the full document structure, but rather restricts itself to a finite hard-coded subset of the most commonly used document-, sentence-, and token-level attributes. The input parser uses the expat-based XML::Parser module, which usually results in much faster and memory-friendlier document parsing than offered by the XmlTokWrap class. Registered as:
(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:f[tuws](?:\.?)xml))$/);
(name=>__PACKAGE__, short=>$_) foreach (qw(ftxml ft-xml ftwxml ftw-xml))
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2009-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.