NAME

DTA::CAB::Format - Base class for DTA::CAB::Datum I/O

SYNOPSIS

use DTA::CAB::Format;

##========================================================================
## Constructors etc.

$fmt = $CLASS_OR_OBJ->new(%args);
$fmt = $CLASS->newFormat($class_or_class_suffix, %opts);
$fmt = $CLASS->newReader(%opts);
$fmt = $CLASS->newWriter(%opts);

##========================================================================
## Methods: Global Format Registry

\%classReg_or_undef = $CLASS_OR_OBJ->registerFormat(%classRegOptions);
\%classReg_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);

$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);

$class_or_undef = $CLASS_OR_OBJ->shortReaderClass($shortname);
$class_or_undef = $CLASS_OR_OBJ->shortWriterClass($shortname);

$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);

##========================================================================
## Methods: Persistence

@keys = $class_or_obj->noSaveKeys();

##========================================================================
## Methods: MIME

$short = $fmt->shortName();
$type = $fmt->mimeType();
$ext = $fmt->defaultExtension();

##========================================================================
## Methods: Input

$fmt = $fmt->close();
$fmt = $fmt->fromString(\$string);
$fmt = $fmt->fromFile($filename);
$fmt = $fmt->fromFh($fh);
$doc = $fmt->parseDocument();
$doc = $fmt->parseString(\$str);
$doc = $fmt->parseFile($filename);
$doc = $fmt->parseFh($fh);
$doc = $fmt->forceDocument($reference);

##========================================================================
## Methods: Output

$lvl = $fmt->formatLevel();
$fmt = $fmt->flush();
$fmt_or_undef = $fmt->toString(\$str, $formatLevel);
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);
$fmt_or_undef = $fmt->toFh($fh, $formatLevel);
$fmt = $fmt->putDocument($doc);
$fmt = $fmt->putDocumentRaw($doc);

DESCRIPTION

DTA::CAB::Format is an abstract base class and API specification for objects implementing an I/O format for the DTA::CAB::Datum subhierarchy in general, and for DTA::CAB::Document objects in particular.

Each I/O format (subclass) has a characteristic abstract `base class' as well as optional `reader' and `writer' subclasses which perform the actual I/O (although in the current implementation, all reader/writer classes are identical to their respective base classes). Individual formats may be invoked either directly by their respective classes (SUBCLASS->new(), etc.), or by means of the global DTA::CAB::Format::Registry object $REG ("registerFormat", "newFormat", "newReader", "newWriter", etc.).

See "SUBCLASSES" for a list of common built-in formats and their registry data.

Globals

@ISA

DTA::CAB::Format inherits from DTA::CAB::Persistent and DTA::CAB::Logger.

$CLASS_DEFAULT

Default class returned by "newFormat"() if no known class is specified.

Variable: $REG

Default global format registry: a DTA::CAB::Format::Registry object used by "registerFormat", "newFormat", etc.

Constructors etc.

new
$fmt = $CLASS_OR_OBJ->new(%args);

Constructor.

%args, %$fmt:

##-- DTA::CAB::Format: common
##
##-- DTA::CAB::Format: input parsing
#(none)
##
##-- DTA::CAB::Format: output formatting
level    => $formatLevel,      ##-- formatting level, where applicable
outbuf   => $stringBuffer,     ##-- output buffer, where applicable

newFormat
$fmt = $CLASS->newFormat($class_or_class_suffix, %opts);

Wrapper for "new"() which allows short class suffixes to be passed in as format names.

newReader
$fmt = $CLASS->newReader(%opts);

Wrapper for DTA::CAB::Format::Registry::newReader which accepts %opts:

class => $class,    ##-- classname or DTA::CAB::Format:: suffix
file  => $filename, ##-- attempt to guess format from filename

newWriter
$fmt = $CLASS->newWriter(%opts);

Wrapper for DTA::CAB::Format::Registry::newWriter which accepts %opts:

class => $class,    ##-- classname or DTA::CAB::Format:: suffix
file  => $filename, ##-- attempt to guess format from filename
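
As a usage sketch of the wrappers above, the following converts one file to another format by letting the registry guess both formats from the filenames. It assumes the DTA::CAB distribution is installed; the file names in.json and out.tt are hypothetical. The load is guarded so that the script still runs (and merely reports a skip) where the distribution or the input file is missing.

```perl
use strict;
use warnings;

## Guarded load: the DTA::CAB distribution may not be installed everywhere.
my $have_cab = eval {
  require DTA::CAB::Format;
  require DTA::CAB::Format::Builtin;
  1;
} ? 1 : 0;

## Hypothetical file names, for illustration only.
my ($infile, $outfile) = ('in.json', 'out.tt');

if ($have_cab && -r $infile) {
  ## Guess reader and writer formats from the filename extensions.
  my $ifmt = DTA::CAB::Format->newReader(file => $infile);
  my $ofmt = DTA::CAB::Format->newWriter(file => $outfile);

  ## Parse the input, then select the output file, write, and flush.
  my $doc = $ifmt->parseFile($infile);
  $ofmt->toFile($outfile) or die "cannot select output file $outfile";
  $ofmt->putDocument($doc);
  $ofmt->flush();
}
else {
  print "skipped: DTA::CAB::Format or $infile is not available\n";
}
```

Passing class => $class instead of file => $filename selects the format explicitly rather than by filename.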

Methods: Global Format Registry

The global format registry lives in the package variable $REG. The following methods are backwards-compatible wrappers for method calls to this registry object.

registerFormat
\%registered = $CLASS_OR_OBJ->registerFormat(%opts);

Registers a new format subclass; wrapper for DTA::CAB::Format::Registry::register().

guessFilenameFormat
\%registered_or_undef = $CLASS_OR_OBJ->guessFilenameFormat($filename);

Returns registration record for most recently registered format subclass whose filenameRegex matches $filename. Wrapper for DTA::CAB::Format::Registry::guessFilenameFormat().
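
The lookup rule described above (latest registration wins) can be sketched in plain Perl. This is not the registry's actual code: @registry and guess_filename_format() are hypothetical stand-ins, and only the filenameRegex values are taken from the "SUBCLASSES" section.

```perl
use strict;
use warnings;

## Hypothetical stand-in for the registry: records in registration order.
## The regexes are copied from the SUBCLASSES section of this document.
my @registry = (
  { name => 'DTA::CAB::Format::JSON', short => 'json',
    filenameRegex => qr/\.(?i:json|jsn)$/ },
  { name => 'DTA::CAB::Format::YAML', short => 'yaml',
    filenameRegex => qr/\.(?i:yaml|yml)$/ },
  { name => 'DTA::CAB::Format::TT',
    filenameRegex => qr/\.(?i:t|tt|ttt|cab\-t|cab\-tt|cab\-ttt)$/ },
);

## Return the most recently registered record whose filenameRegex
## matches $filename, or undef if no record matches.
sub guess_filename_format {
  my ($filename) = @_;
  foreach my $reg (reverse @registry) {
    next if !defined $reg->{filenameRegex};
    return $reg if $filename =~ $reg->{filenameRegex};
  }
  return undef;
}

my $reg = guess_filename_format('corpus.JSN');
print defined($reg) ? "$reg->{name}\n" : "no match\n";
```

Walking the list in reverse is one simple way to give later registrations priority over earlier ones.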

fileReaderClass
$readerClass_or_undef = $CLASS_OR_OBJ->fileReaderClass($filename);

Attempts to guess reader class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileReaderClass().

fileWriterClass
$writerClass_or_undef = $CLASS_OR_OBJ->fileWriterClass($filename);

Attempts to guess writer class name from $filename. Wrapper for DTA::CAB::Format::Registry::fileWriterClass().

short2reg
$registered_or_undef = $CLASS_OR_OBJ->short2reg($shortname);

Gets the most recent subclass registry HASH ref for the short class name $shortname. Wrapper for DTA::CAB::Format::Registry::short2reg().

base2reg
$registered_or_undef = $CLASS_OR_OBJ->base2reg($basename);

Gets the most recent subclass registry HASH ref for the class basename $basename. Wrapper for DTA::CAB::Format::Registry::base2reg().

Methods: Persistence

noSaveKeys
@keys = $class_or_obj->noSaveKeys();

Returns list of keys not to be saved. This implementation ignores the key outbuf, which is used by many writer subclasses.

Methods: MIME

shortName
$short = $fmt->shortName();

Get short name for $fmt. Default just returns lower-cased DTA::CAB::Format:: class suffix. Short names are all lower-case by default.

mimeType
$type = $fmt->mimeType();

Returns MIME type for $fmt. Default returns 'text/plain'.

defaultExtension
$ext = $fmt->defaultExtension();

Returns default filename extension for $fmt (default='.cab').

Methods: Input

close
$fmt = $fmt->close();
$fmt = $fmt->close($savetmp);

Close current input source, if any. Default implementation calls $fmt->{tmpfh}->close() iff available and $savetmp is false (default). Always deletes @$fmt{qw(fh doc)}.

fromString
$fmt = $fmt->fromString(\$string);

Select input from the string $string. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).
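
Reading from a string through a filehandle is usually done in Perl with an in-memory filehandle opened on a scalar reference. The sketch below shows only that core-Perl mechanism; whether the default implementation opens its handle in exactly this way is an assumption, and string_to_fh() is a hypothetical helper.

```perl
use strict;
use warnings;

## Open a read-only in-memory filehandle on a string reference.
## Assumption: the string already holds the raw input data.
sub string_to_fh {
  my ($strref) = @_;
  open(my $fh, '<', $strref) or die "cannot open in-memory handle: $!";
  return $fh;
}

my $string = "line one\nline two\n";
my $fh     = string_to_fh(\$string);
my @lines  = <$fh>;
close($fh);

print scalar(@lines), " lines read\n";
```

Because the handle behaves like any other filehandle, a parser written against fromFh() needs no special case for string input.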

fromFile
$fmt = $fmt->fromFile($filename);

Select input from file $filename. Default implementation calls $fmt->fromFh($fmt->{tmpfh}=$new_fh).

fromFh
$fmt = $fmt->fromFh($fh);

Select input from open filehandle $fh. Default implementation just calls $fmt->close(1) and sets $fmt->{fh}=$fh.

fromFh_str
$fmt = $fmt->fromFh_str($fh);

Alternate fromFh() implementation which slurps contents of $fh and calls $fmt->fromString(\$str).

parseDocument
$doc = $fmt->parseDocument();

Parse document from currently selected input source.

parseString
$doc = $fmt->parseString(\$str);

Wrapper for $fmt->fromString(\$str)->parseDocument().

parseFile
$doc = $fmt->parseFile($filename_or_fh);

Wrapper for $fmt->fromFile($filename_or_fh)->parseDocument().

parseFh
$doc = $fmt->parseFh($fh);

Wrapper for $fmt->fromFh($fh)->parseDocument().

forceDocument
$doc = $fmt->forceDocument($reference);

Attempt to tweak $reference into a DTA::CAB::Document. This is a slightly more in-depth version of DTA::CAB::Datum::toDocument(). Current supported $reference forms are:

DTA::CAB::Document object

returned literally

DTA::CAB::Sentence object

returns a new document with a single sentence $reference.

DTA::CAB::Token object

returns a new document with a single token $reference.

non-reference

returns a new document with a single token whose 'text' key is $reference.

HASH reference with 'body' key

returns a bless()ed $reference as a DTA::CAB::Document.

HASH reference with 'tokens' key

returns a new document with the single sentence $reference

HASH reference with 'text' key

returns a new document with the single token $reference

ARRAY reference with non-reference initial element

returns a new document with a single sentence whose 'tokens' field is set to $reference.

... anything else

will cause a warning to be emitted and $reference to be returned as-is.
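
The dispatch above can be sketched for the plain-data cases. The function below is purely illustrative: force_document_sketch() is a hypothetical name, it returns unblessed HASH refs instead of DTA::CAB::Document objects, the blessed-object cases are omitted, and the layout (a document holding a list of sentences under 'body', a sentence holding a list of tokens under 'tokens') is an assumption inferred from the key names listed above.

```perl
use strict;
use warnings;

## Illustrative dispatch on plain data only; blessed objects are not handled.
## Returns an unblessed HASH ref shaped like { body => [ sentences ] }.
sub force_document_sketch {
  my ($ref) = @_;
  if (!ref($ref)) {
    ## non-reference: a single token whose 'text' is $ref
    return { body => [ { tokens => [ { text => $ref } ] } ] };
  }
  if (ref($ref) eq 'HASH') {
    ## document-like, sentence-like, and token-like hashes, in that order
    return $ref                                   if exists $ref->{body};
    return { body => [ $ref ] }                   if exists $ref->{tokens};
    return { body => [ { tokens => [ $ref ] } ] } if exists $ref->{text};
  }
  if (ref($ref) eq 'ARRAY' && @$ref && !ref($ref->[0])) {
    ## array of non-references: becomes the 'tokens' field of one sentence
    return { body => [ { tokens => $ref } ] };
  }
  ## anything else: warn and return the argument unchanged
  warn "force_document_sketch: cannot handle input\n";
  return $ref;
}

my $doc = force_document_sketch('Hello');
print $doc->{body}[0]{tokens}[0]{text}, "\n";
```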

Methods: Output

formatLevel
$lvl = $fmt->formatLevel();
$fmt = $fmt->formatLevel($level);

Get/set output formatting level.

flush
$fmt = $fmt->flush();

Flush any buffered output to selected output source. Default implementation deletes $fmt->{outbuf} and calls $fmt->{fh}->flush() if available.

toString
$fmt = $fmt->toString(\$str);
$fmt = $fmt->toString(\$str,$formatLevel);

Select output to byte-string $str. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).

toString_buf
$fmt_or_undef = $fmt->toString_buf(\$str);

Alternate toString() implementation which sets $str=$fmt->{outbuf}.
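
The two string-output strategies (writing through a temporary filehandle versus accumulating in a buffer) can be contrasted in core Perl. This sketch illustrates the general techniques only, not the module's code; %fmt is a hypothetical stand-in for the format object's hash.

```perl
use strict;
use warnings;

## Strategy 1: in-memory filehandle; writes land directly in $str1.
my $str1 = '';
open(my $fh, '>', \$str1) or die "open failed: $!";
print {$fh} "token1\n", "token2\n";
close($fh) or die "close failed: $!";

## Strategy 2: accumulate in a buffer, then copy the buffer to the target.
my %fmt = (outbuf => '');   ## hypothetical stand-in for %$fmt
$fmt{outbuf} .= "token1\n";
$fmt{outbuf} .= "token2\n";
my $str2 = $fmt{outbuf};

print $str1 eq $str2 ? "same\n" : "different\n";
```

Both strategies yield the same string here; the buffer variant avoids the filehandle layer, which suits writers that already append to $fmt->{outbuf}.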

toFile
$fmt_or_undef = $fmt->toFile($filename_or_handle, $formatLevel);

Select output to named file $filename. Default implementation just wraps $fmt->toFh($fmt->{tmpfh}=$new_fh, $level).

toFh
$fmt_or_undef = $fmt->toFh($fh,$formatLevel);

Select output to an open filehandle $fh. Default implementation just calls $fmt->formatLevel($level) and sets $fmt->{fh}=$fh.

putToken
$fmt = $fmt->putToken($tok);

Append a token to the selected output sink.

Should be non-destructive for $tok.

No default implementation, but default implementations of other methods assume output is concatenated onto $fmt->{outbuf}.

putTokenRaw
$fmt = $fmt->putTokenRaw($tok);

Copy-by-reference version of "putToken". Default implementation just calls $fmt->putToken($tok).

putSentence
$fmt = $fmt->putSentence($sent);

Append a sentence to the selected output sink.

Should be non-destructive for $sent.

Default implementation just iterates $fmt->putToken() over the tokens of $sent and appends one additional "\n" to $fmt->{outbuf}.

putSentenceRaw
$fmt = $fmt->putSentenceRaw($sent);

Copy-by-reference version of "putSentence". Default implementation just calls "putSentence".

putDocument
$fmt = $fmt->putDocument($doc);

Append document contents to the selected output sink.

Should be non-destructive for $doc.

Default implementation just iterates $fmt->putSentence().
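
The way the put*() defaults build on one another can be sketched with a small stand-alone class. My::SketchWriter is hypothetical and deliberately does not inherit from DTA::CAB::Format; the plain-hash document layout ('body', 'tokens', 'text') is likewise an assumption made for illustration.

```perl
use strict;
use warnings;

## Hypothetical stand-alone writer, NOT derived from DTA::CAB::Format.
package My::SketchWriter;

sub new { my $class = shift; return bless({ outbuf => '', @_ }, $class); }

## Append one token per line; does not modify $tok.
sub putToken {
  my ($fmt, $tok) = @_;
  $fmt->{outbuf} .= $tok->{text} . "\n";
  return $fmt;
}

## Iterate putToken() and append one additional newline.
sub putSentence {
  my ($fmt, $sent) = @_;
  $fmt->putToken($_) foreach (@{ $sent->{tokens} });
  $fmt->{outbuf} .= "\n";
  return $fmt;
}

## Iterate putSentence() over the document body.
sub putDocument {
  my ($fmt, $doc) = @_;
  $fmt->putSentence($_) foreach (@{ $doc->{body} });
  return $fmt;
}

package main;

my $doc = { body => [ { tokens => [ { text => 'Hello' }, { text => 'world' } ] } ] };
my $out = My::SketchWriter->new->putDocument($doc)->{outbuf};
print $out;
```

With this layering, a subclass only has to supply putToken() to obtain sentence- and document-level output.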

putDocumentRaw
$fmt = $fmt->putDocumentRaw($doc);

Copy-by-reference version of "putDocument".

SUBCLASSES

The following formats are provided by the default distribution. In some cases, external dependencies are also required which may not be available on all systems.

DTA::CAB::Format::Builtin

Just a convenience package: load all built-in DTA::CAB::Format subclasses.

DTA::CAB::Format::ExpandList

Formatter for runtime term expansion, for use e.g. with DDC Cab Expander. Registered as:

name=>__PACKAGE__, short=>'xl', filenameRegex=>qr/\.(?i:xl|xlist|l|lst)$/

DTA::CAB::Format::CONLLU

Datum parser|formatter for "vertical" text conforming to the CONLL-U format, with optional special handling for additional MISC fields, including json=JSON for embedding DTA::CAB::Format::TJ CAB-token structure. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:conllu|conll[_-]u|cab[\.-]connlu|cab[\.-]conll[\.-]u)$/

Aliases: conllu conll-u cab-conllu cab-conll-u

DTA::CAB::Format::JSON

Abstract datum parser|formatter for JSON I/O. Transparently wraps one of the DTA::CAB::Format::JSON::XS or DTA::CAB::Format::JSON::Syck classes, depending on the availability of the underlying Perl modules (JSON::XS and JSON::Syck, respectively). If you have the JSON::XS module installed, this module provides the fastest I/O of all available human-readable format classes. Registered as:

name=>__PACKAGE__, short=>'json', filenameRegex=>qr/\.(?i:json|jsn)$/

DTA::CAB::Format::LemmaList

Formatter for runtime term lemmatization, for use e.g. with DDC Cab Expander. By default, returns all lemmata for function word input tokens (whose tag matches the regex /^(?:[CKP\$]|A[PR]|V[AM])/), otherwise only the "best" lemma. Registered as:

(name=>__PACKAGE__, short=>$_, filenameRegex=>qr/\.(?i:ll|llist|lemmas|lemmata)/)
 foreach (qw(LemmaList llist ll lemma))

A variant which returns all known lemmata for each input token is registered as:

(name=>__PACKAGE__, short=>$_, opts=>{cctagre=>''})
 foreach (qw(LemmaListAll LemmasAll llist-all ll-all lla lemmas lemmata))

DTA::CAB::Format::Null

Null-op parser/formatter for debugging and testing purposes. Registered as:

name=>__PACKAGE__

DTA::CAB::Format::Perl

Datum parser|formatter: perl code via Data::Dumper, eval(). Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:prl|pl|perl|dump)$/

DTA::CAB::Format::Raw

Abstract format for reading raw untokenized text and writing a simple flat list of canonical forms; wraps DTA::CAB::Format::Raw::Waste by default. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:raw)$/

DTA::CAB::Format::Raw::HTTP

Input-only format for reading raw untokenized text and analyzing it over HTTP using a remote WASTE FastCGI interface. Registered as:

name=>__PACKAGE__, short=>'raw-http', filenameRegex=>qr/\.(?i:raw-http|txt-http)$/

DTA::CAB::Format::Raw::Perl

Input-only format for reading raw untokenized text and analyzing it using simple pure-perl heuristics. Registered as:

name=>__PACKAGE__, short=>'raw-perl', filenameRegex=>qr/\.(?i:raw-perl|txt-perl)$/

DTA::CAB::Format::Raw::Waste

Input-only format for reading raw untokenized text and analyzing it using the Moot::Waste module. Registered as:

name=>__PACKAGE__, short=>'raw-waste', filenameRegex=>qr/\.(?i:raw-waste|txt-waste)$/

DTA::CAB::Format::Storable

Binary datum parser|formatter using the Storable module. Very fast, but neither human-readable nor easily portable beyond Perl. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:sto|bin)$/

DTA::CAB::Format::SynCoPe::CSV

Datum parser|formatter for SynCoPe named entity recognizer -tab_input mode. Registered as:

name=>__PACKAGE__, short=>'syncope-csv', filenameRegex=>qr/\.(?i:syn(?:cope)?[-\.](?:csv|tsv|tab)|)$/

DTA::CAB::Format::TCF

Datum parser|formatter for CLARIN-D TCF XML. Handles annotation layers tokens, sentences, orthography, postags, and lemmas. Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:tcf[\.\-_]?xml)|(?:tcf))$/)
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography'}) foreach (qw(tcf-orth tcf-web))
(name=>__PACKAGE__, short=>$_, opts=>{tcflayers=>'tokens sentences orthography postags lemmas'}) foreach (qw(tcf tcf-xml tcfxml full-tcf xtcf))

DTA::CAB::Format::TEI

Datum parser|formatter: for raw un-tokenized TEI XML (with or without //c elements) using DTA::TokWrap. Any //s or //w elements in the input will be IGNORED and input will be (re-)tokenized. Output files are themselves parseable by DTA::CAB::Format::TEIws. Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:c|chr|txt|tei(?:[\.\-_]?p[45])?)[\.\-_]xml|xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(chr-xml c-xml cxml tei-xml teixml tei xml))

By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following additional aliases are provided for using the DTA::CAB::Format::XmlTokWrapFast module to format the low-level flat token data (faster but not as flexible as the default):

(name=>__PACKAGE__, short=>$_, opts=>{txmlfmt=>'DTA::CAB::Format::XmlTokWrapFast'})
    foreach (qw(fast-tei-xml ftei-xml fteixml ftei))

Additionally, the following aliases are provided for using the DTA::CAB::Format::XmlLing module to format the low-level flat token data using TEI att.linguistic conventions:

(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1})
  foreach (qw(ling-tei-xml ltei-xml lteixml ltei tei-ling tei+ling teiling))

DTA::CAB::Format::TEIws

Datum parser|formatter: for TEI XML pre-tokenized into (possibly fragmented) //w and //s elements, as output by DTA::TokWrap. Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:(?:spliced|tei[\.\-\+]?ws?|wst?)[\.\-]xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(tei-ws tei+ws tei+w tei-w teiw wst-xml wstxml teiws-xml));

By default, this module uses DTA::CAB::Format::XmlTokWrap to format the low-level document data, and splices the result back into the original TEI document. The following aliases are provided for using the DTA::CAB::Format::XmlLing module to format the low-level flat token data using TEI att.linguistic conventions:

(name=>__PACKAGE__, short=>$_, opts=>{'att.linguistic'=>1})
  foreach (qw(lteiws teilws teiwsl ltei-ws ltei+ws tei+w ltei-w lteiw lwst-xml lwstxml lteiws-xml),
           qw(ling-tei-ws tei+ling+ws tei+ws+ling teiws-ling-xml teiws+ling-xml))

DTA::CAB::Format::Text

Datum parser|formatter: verbose human-readable text. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:txt|text|cab\-txt|cab\-text)$/

DTA::CAB::Format::TJ

Datum parser|formatter: "vertical" text, one token per line, with a single TAB-separated attribute field encoding token data as JSON. Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:tj|tjson|cab\-tj|cab\-tjson)$/);

DTA::CAB::Format::TT

Datum parser|formatter: "vertical" text, one token per line, TAB-separated attribute fields with conventional attribute-name prefixes. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:t|tt|ttt|cab\-t|cab\-tt|cab\-ttt)$/

DTA::CAB::Format::YAML

Abstract datum parser|formatter for YAML I/O. Transparently wraps one of the DTA::CAB::Format::YAML::XS, DTA::CAB::Format::YAML::Syck, or DTA::CAB::Format::YAML::Lite classes, depending on the availability of the underlying Perl modules (YAML::XS, YAML::Syck, and YAML::Lite, respectively). Registered as:

name=>__PACKAGE__, short=>'yaml', filenameRegex=>qr/\.(?i:yaml|yml)$/

DTA::CAB::Format::XmlCommon

Datum parser|formatter: XML: abstract base class.

DTA::CAB::Format::XmlLing

Datum parser|formatter: minimalistic flat TokWrap-like XML using only TEI att.linguistic attributes. Based on DTA::CAB::Format::XmlTokWrapFast, the XmlLing parser reads and writes only IDs and the TEI att.linguistic attributes (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html). Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:(?:ling|l[tuws])(?:\.?)xml))$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(ltxml lxml ling-xml lt-xml ltwxml ltw-xml))

DTA::CAB::Format::XmlNative

Datum parser|formatter: XML (native). Nearly compatible with .t.xml files as created by dta-tokwrap.perl(1). Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml\-native|xml\-dta\-cab|(?:dta[\-\._]cab[\-\._]xml)|xml)$/

and aliased as:

name=>__PACKAGE__, short=>'xml'

DTA::CAB::Format::XmlPerl

Datum parser|formatter: XML (perl-like). Not really recommended. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)perl|perl(?:[\-\.]?)xml)$/

DTA::CAB::Format::XmlRpc

Datum parser|formatter: XML-RPC data structures using RPC::XML. Much too bloated to be of any real practical use. Registered as:

name=>__PACKAGE__, filenameRegex=>qr/\.(?i:xml(?:\-?)rpc|rpc(?:[\-\.]?)xml)$/

DTA::CAB::Format::XmlTokWrap

Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/\.(?i:[tuws]\.?xml)$/)
(name=>__PACKAGE__, short=>$_) foreach (qw(txml t-xml twxml tw-xml))

DTA::CAB::Format::XmlTokWrapFast

Datum parser|formatter(s): XML as read/written by DTA::TokWrap. Unlike the XmlTokWrap format, the XmlTokWrapFast class does not read and/or write the full document structure, but rather restricts itself to a finite hard-coded subset of the most commonly used document-, sentence-, and token-level attributes. The input parser uses the expat-based XML::Parser module, which usually results in much faster and memory-friendlier document parsing than offered by the XmlTokWrap class. Registered as:

(name=>__PACKAGE__, filenameRegex=>qr/(?:\.(?i:f[tuws](?:\.?)xml))$/);
(name=>__PACKAGE__, short=>$_) foreach (qw(ftxml ft-xml ftwxml ftw-xml))

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.