NAME

DTA::TokWrap::Intro - a gentle introduction to the DTA::TokWrap distribution

DESCRIPTION

The DTA::TokWrap Perl distribution contains various modules for operations associated with the tokenization of DTA "base-format" XML documents. The distribution is divided into one program and three main modules:

dta-tokwrap.perl

Top-level command-line interface. Use this if you can. See dta-tokwrap.perl(1) for details.

Module DTA::TokWrap

Top-level wrappers for persistent data associated with document tokenization. Encapsulates all default DTA::TokWrap::Processor sub-processor objects. See the DTA::TokWrap section for more details.

Module DTA::TokWrap::Document

Top-level wrappers for per-document data, including temporary files, indices, and in-memory data structures. See the DTA::TokWrap::Document section for more details.

Module DTA::TokWrap::Processor

Abstraction level for processing operations on document data. See the DTA::TokWrap::Processor section for more details.

These and other included modules are briefly described in the "MODULES" section, below.

MODULES

The following sections are intended to give a brief overview of the modules included with this distribution.

DTA::TokWrap

The DTA::TokWrap module provides top-level object-oriented wrappers for (batch) tokenization of DTA "base-format" XML documents. DTA::TokWrap objects encapsulate all default DTA::TokWrap::Processor objects under a single object. Most document processing should proceed via a DTA::TokWrap object.

See DTA::TokWrap(3pm) for more details.

DTA::TokWrap::Document

DTA::TokWrap::Document provides a Perl class for representing a single DTA base-format XML file and its associated indices, temporary files, and stand-off files. Together with the DTA::TokWrap module, this class comprises the top-level API of the DTA::TokWrap distribution.

See DTA::TokWrap::Document(3pm) for more details.

DTA::TokWrap::Document::Maker

***DEPRECATED***

DTA::TokWrap::Document::Maker provides an experimental DTA::TokWrap::Document subclass which attempts to perform make-like dependency tracking on document data keys.

See DTA::TokWrap::Document::Maker(3pm) for more details.
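The idea of make-like dependency tracking on data keys can be illustrated in isolation. The following is only a minimal sketch and does not use the DTA::TokWrap::Document::Maker API: the key names, the %deps and %stamp tables, and the make_key() helper are all hypothetical, and a counter stands in for the real processing step.

```perl
use strict;
use warnings;

# Hypothetical dependency table: each data key lists the keys it is built from.
my %deps = (
    xmlfile => [],
    sxfile  => ['xmlfile'],
    txtfile => ['sxfile'],
    tokfile => ['txtfile'],
);

# Hypothetical "last built" stamps; a missing entry means "never built".
my %stamp = (xmlfile => 10, sxfile => 5);
my $clock = 100;

# Bring $key up to date, handling its dependencies first (as make would).
# Returns the list of keys that had to be rebuilt, in build order.
sub make_key {
    my ($key) = @_;
    my @rebuilt = map { make_key($_) } @{ $deps{$key} };
    my $stale = !defined $stamp{$key}
        || grep { $stamp{$_} > $stamp{$key} } @{ $deps{$key} };
    if ($stale) {
        $stamp{$key} = ++$clock;    # stand-in for the real processing step
        push @rebuilt, $key;
    }
    return @rebuilt;
}

print join(' ', make_key('tokfile')), "\n";
```

With the stamps above, the sketch rebuilds sxfile (older than xmlfile), then txtfile and tokfile (never built), and a second call for the same key rebuilds nothing.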

DTA::TokWrap::Processor

The DTA::TokWrap::Processor package provides an abstract base class which subsumes document-processing modules included in the DTA::TokWrap distribution.

See DTA::TokWrap::Processor(3pm) for details on the API.

DTA::TokWrap::Processor::mkindex

DTA::TokWrap::Processor::mkindex provides an object-oriented DTA::TokWrap::Processor wrapper around the dtatw-mkindex C program for DTA::TokWrap::Document objects.

See DTA::TokWrap::Processor::mkindex(3pm) for details.

DTA::TokWrap::Processor::mkbx0

DTA::TokWrap::Processor::mkbx0 provides an object-oriented DTA::TokWrap::Processor wrapper for hint insertion and serialization sort-key generation on a text-free "structure index" (.sx) XML file.

See DTA::TokWrap::Processor::mkbx0(3pm) for details.

DTA::TokWrap::Processor::mkbx

DTA::TokWrap::Processor::mkbx provides an object-oriented DTA::TokWrap::Processor wrapper for the creation of in-memory serialized text-block-indices.

See DTA::TokWrap::Processor::mkbx(3pm) for details.

DTA::TokWrap::Processor::tokenize

This class is just an abstract placeholder for a low-level tokenizer. By default, it attempts to automatically detect a supported tokenizer on your system (preferably moot/WASTE). Depending on your needs, you may wish to use e.g. DTA::TokWrap::Processor::tokenize::waste or DTA::TokWrap::Processor::tokenize::http directly, or to set the package variable DTA::TokWrap::Processor::tokenize to the default tokenizer subclass name for your system.

DTA::TokWrap::Processor::tokenize provides an object-oriented DTA::TokWrap::Processor wrapper for the tokenization of serialized text files for DTA::TokWrap::Document objects.

See DTA::TokWrap::Processor::tokenize(3pm) for details.

DTA::TokWrap::Processor::tokenize::dummy

DTA::TokWrap::Processor::tokenize::dummy provides a package-local alternative to the "official" low-level tokenizer class DTA::TokWrap::Processor::tokenize.

See DTA::TokWrap::Processor::tokenize::dummy(3pm) for details.

DTA::TokWrap::Processor::tokenize1

DTA::TokWrap::Processor::tokenize1 provides an object-oriented DTA::TokWrap::Processor wrapper for some required and/or optional post-processing of tokenized files used by DTA::TokWrap::Document objects.

See DTA::TokWrap::Processor::tokenize1(3pm) for details.

DTA::TokWrap::Processor::tok2xml

DTA::TokWrap::Processor::tok2xml provides an object-oriented DTA::TokWrap::Processor wrapper for converting "raw" CSV-format (.t) low-level tokenizer output to a "master" tokenized XML (.t.xml) format, for use with DTA::TokWrap::Document objects.

See DTA::TokWrap::Processor::tok2xml(3pm) for details.

DTA::TokWrap::Processor::standoff

***OBSOLETE***

DTA::TokWrap::Processor::standoff provides an object-oriented DTA::TokWrap::Processor wrapper for generation of various standoff XML formats for DTA::TokWrap::Document objects.

See DTA::TokWrap::Processor::standoff(3pm) for details.

DTA::TokWrap::Processor::addws

DTA::TokWrap::Processor::addws provides an object-oriented DTA::TokWrap::Processor wrapper for splicing tokenization data (word and sentence boundaries) back into a source TEI-XML file, potentially fragmenting words and/or sentences in the process. Each segment is assigned a unique id, and the fragments of a split word or sentence are linked to one another using the TEI prev and next attributes.

See DTA::TokWrap::Processor::addws(3pm) for details.
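To make the prev/next linking of fragmented segments concrete, here is a minimal stand-alone sketch. It does not call DTA::TokWrap::Processor::addws: the link_fragments() helper, the "w" element name, the use of xml:id, and the id scheme ("w1", "w1_1", ...) are assumptions made for illustration, and the fragment text is assumed to be XML-escaped already.

```perl
use strict;
use warnings;

# Wrap each fragment of one logical word in a <w> element and chain the
# fragments with TEI prev/next pointers. The id scheme is hypothetical.
sub link_fragments {
    my ($base_id, @fragments) = @_;
    my @ids = map { $_ == 0 ? $base_id : "${base_id}_$_" } 0 .. $#fragments;
    my @elts;
    for my $i (0 .. $#fragments) {
        my $attrs = qq{xml:id="$ids[$i]"};
        $attrs .= qq{ prev="#$ids[$i-1]"} if $i > 0;            # not the first fragment
        $attrs .= qq{ next="#$ids[$i+1]"} if $i < $#fragments;  # not the last fragment
        push @elts, qq{<w $attrs>$fragments[$i]</w>};
    }
    return @elts;
}

# A word interrupted by a line break in the source yields two fragments.
my @w = link_fragments('w1', 'Tokeni', 'sierung');
print join('<lb/>', @w), "\n";
```

An unfragmented word passes through the same helper with a single fragment and receives neither a prev nor a next attribute.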

DTA::TokWrap::Processor::idsplice

DTA::TokWrap::Processor::idsplice provides an object-oriented DTA::TokWrap::Processor wrapper for splicing stand-off data into a base XML file by matching ids.

See DTA::TokWrap::Processor::idsplice(3pm) for details.

DTA::TokWrap::Base

DTA::TokWrap::Base provides an abstract base class for all object classes in the DTA::TokWrap distribution.

See DTA::TokWrap::Base(3pm) for details.

DTA::TokWrap::CxData

DTA::TokWrap::CxData provides utilities for binary I/O on dta-tokwrap *.cx files.

See DTA::TokWrap::CxData(3pm) for details.

DTA::TokWrap::Logger

DTA::TokWrap::Logger provides an abstract base class for object-oriented access to the Log::Log4perl logging facility.

See DTA::TokWrap::Logger(3pm) for details.

DTA::TokWrap::Utils

DTA::TokWrap::Utils provides miscellaneous utilities which don't fit well anywhere else and which don't on their own justify the creation of a new package.

See DTA::TokWrap::Utils(3pm) for details.

DTA::TokWrap::Version

Version constants for DTA::TokWrap. Intended for (direct) use only by DTA::TokWrap sub-modules.

SEE ALSO

dta-tokwrap.perl(1), DTA::TokWrap(3pm), DTA::TokWrap::Document(3pm), DTA::TokWrap::Processor(3pm), ...

AUTHOR

Bryan Jurish <moocow@cpan.org>