DTA::CAB::WebServiceHowto - User documentation for DTA::CAB web-service
This document describes the use of the DTA::CAB web-service. The CAB web-service provides error-tolerant linguistic analysis for historical German text, including normalization of historical orthographic variants to "canonical" modern forms using the method described in Jurish (2012).
Due to legal restrictions on some of the underlying resources, not all available analysis layers can be returned by the publicly accessible "demo" web-service, but it is hoped that the available layers (linguistically salient TEI-XML serialization, sentence- and word-level tokenization, orthographic normalization, part-of-speech tags, and (normalized) lemmata) will suffice for most purposes.
Interface Elements
Upon accessing the top-level web-service URL in a web browser, the user is presented with a graphical interface in which CAB queries can be constructed and submitted to the underlying server. This section describes the various elements of that interface.

- Query Form
At the top of the web-service interface is a query form on a gray background including input fields for the CAB query parameters ("Query", "Analyzer", "Format", etc.). Each query form input element should display a tooltip briefly describing its function if you hover your mouse pointer over it in a browser. See "query Requests" in DTA::CAB::HttpProtocol for more details on supported query request parameters.
- Status Line
Immediately beneath the query form is a status display line ("URL line") on a white background with no border, which contains a link to the raw response data for the current query, if any. In the case of singleton (1-word) queries, the status line also contains a simple heuristic "traffic-light" indicator of the query word's morphological security status, where green indicates a "safe" known modern form, red indicates an unknown (assumedly historical) form, and yellow indicates a known modern form which is judged unsafe for identity canonicalization (typically a proper name).
- Response Data
Immediately below the URL line is the response data area ("data area") on a white background with a gray border, which displays the results for the current query, if any.
- Link Buttons
Below the data area are a number of static link buttons for the file upload ("File Upload") or the live user-input interface ("Live Query"), the list of analyzers supported by the underlying CAB instance ("Analyzers"), the list of I/O formats supported by the underlying CAB instance ("Formats"), administrative data for the CAB server instance ("Status"), and the CAB documentation ("Documentation").
Below the link button area is a short footer on a gray background containing administrative information about the underlying CAB server.
Basic Usage
This section briefly describes the basic usage offered by the DTA::CAB web-service by reference to some simple examples.
A Simple Query
Most CAB parameters in the query form are initialized with sensible default values, with the exception of the "Query" parameter itself, which should contain the text string to be analyzed. Say we wish to analyze the text string "Elephanten": simply entering this string (or copying and pasting it) into the text input box associated with the "Query" parameter and then pressing the Enter key or clicking the "submit" button will cause the query to be submitted to the underlying CAB server and the response data to be displayed in the data area.
The results are displayed by default in CAB's native "Text" format, in which the first line contains the input surface form (Elephanten), and the remaining lines are the CAB attributes for the query word, where each attribute line is indicated by an initial TAB character, a plus ("+") sign, and the attribute label enclosed in square brackets ("[...]"), followed by the attribute value. Useful attributes include "moot/word" (canonical modern form), "moot/tag" (part-of-speech tag), and "moot/lemma" (modern lemma).
The example query for instance should produce a response such as:
+[moot/word] Elefanten
+[moot/tag] NN
+[moot/lemma] Elefant
... indicating that the query was correctly normalized to the canonical modern form "Elefanten", tagged as a common noun ("NN"), and assigned the correct canonical lemma "Elefant".
A Sentence Query
Suppose we wish to take advantage of context information while normalizing a whole sentence of historical text, as described in Jurish (2012), Chapter 4. Simply enter the entire text of the sentence to be analyzed in the "Query" input, ensure that the checkbox for the "tokenize" flag is checked, and submit the query, e.g.
EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
The output now contains multiple tokens (words), where the analysis for each token begins with a line containing only its surface form (no leading whitespace). Token attribute lines (introduced by a leading TAB character) refer to the token most recently introduced. The output for the example query should look something like the following:
+[moot/word] Ein
+[moot/tag] ART
+[moot/lemma] eine
+[moot/word] zahmer
+[moot/tag] ADJA
+[moot/lemma] zahm
+[moot/word] Elefant
+[moot/tag] NN
+[moot/lemma] Elefant
+[moot/word] gilt
+[moot/tag] VVFIN
+[moot/lemma] gelten
+[moot/word] ungefähr
+[moot/tag] ADJD
+[moot/lemma] ungefähr
+[moot/word] zweihundert
+[moot/tag] CARD
+[moot/lemma] zweihundert
+[moot/word] Taler
+[moot/tag] NN
+[moot/lemma] Taler
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
A Multi-Sentence Query
The CAB web service can analyze multiple sentences as well, for example:
EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
Ceterum censeo Carthaginem esse delendam.
The corresponding output should look something like:
%% $s:lang=de
+[moot/word] Ein
+[moot/tag] ART
+[moot/lemma] eine
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
%% $s:lang=la
+[moot/word] Ceterum
+[moot/lemma] ceterum
+[moot/word] censeo
+[moot/lemma] censeo
+[moot/word] Carthaginem
+[moot/lemma] carthaginem
+[moot/word] esse
+[moot/lemma] esse
+[moot/word] delendam
+[moot/lemma] delendam
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
Here, blank lines indicate sentence boundaries, and comments (non-tokens) are lines introduced by two percent signs ("%%"). The special comments of the form "%% $s:lang=LANG" immediately preceding each sentence indicate the result of CAB's language-guessing module DTA::CAB::Analyzer::LangId::Simple. The blank line between the final "." of the first sentence and the first word of the second sentence indicates that the sentence boundary was correctly detected, and the "%% $s:lang=LANG" comments indicate that the source language of both sentences was correctly guessed ("de" indicating German in the former case, and "la" indicating Latin in the latter). Due to the language-guesser's assignment for the second sentence, all words in that sentence are tagged as foreign material ("FM"), with the suffix ".la" indicating the language-guesser's output. Otherwise, no analysis (normalization or lemmatization) is performed for sentences recognized as non-German.
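The line-oriented structure of the "Text" format described above is easy to read back in a script. The following Python sketch is not part of CAB: the function name parse_cab_text and the layout of the returned data are illustrative assumptions, and only the conventions stated in this section (blank lines, "%%" comments, token lines, and TAB-plus-bracket attribute lines) are relied upon.

```python
import re

# Attribute lines: TAB, "+", label in square brackets, then the value.
ATTR_RE = re.compile(r"^\t\+\[([^\]]+)\]\s?(.*)$")

def parse_cab_text(text):
    """Parse CAB "Text" output into a list of sentences.

    Each sentence is a dict with "comments" (list of str) and "tokens"
    (list of dicts with "text" and "attrs"; attrs maps each attribute
    label to a list of values, since labels such as eqpho may repeat).
    """
    sentences = []
    current = {"comments": [], "tokens": []}
    for line in text.splitlines():
        if line.strip() == "":
            # Blank line: sentence boundary.
            if current["tokens"]:
                sentences.append(current)
                current = {"comments": [], "tokens": []}
        elif line.startswith("%%"):
            # Comment (non-token) line, e.g. "%% $s:lang=de".
            current["comments"].append(line[2:].strip())
        elif line.startswith("\t"):
            # Attribute line: belongs to the most recently introduced token.
            m = ATTR_RE.match(line)
            if m and current["tokens"]:
                label, value = m.groups()
                current["tokens"][-1]["attrs"].setdefault(label, []).append(value)
        else:
            # No leading whitespace: surface form of a new token.
            current["tokens"].append({"text": line, "attrs": {}})
    if current["tokens"]:
        sentences.append(current)
    return sentences

sample = (
    "%% $s:lang=de\n"
    "Elephanten\n"
    "\t+[moot/word] Elefanten\n"
    "\t+[moot/tag] NN\n"
    "\t+[moot/lemma] Elefant\n"
)
for token in parse_cab_text(sample)[0]["tokens"]:
    print(token["text"], token["attrs"]["moot/word"][0])
```

Attribute values are kept as lists so that repeated labels are not silently overwritten; comments that are not followed by any token are dropped by this sketch.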
A File Query
In addition to the default "live" query interface, the CAB web-service interface also lets users upload an entire document file for analysis and save the analysis results to their local machine. The CAB file interface is accessible via the "File Upload" button in the link area. Suppose we have a simple plain-text file elephant.raw containing the document to be analyzed:
EJn zamer Elephant gillt ohngefähr zweyhundert Thaler.
Ceterum censeo Carthaginem esse delendam.
First, save the input file elephant.raw to your local computer. Then, in the CAB file input form, click on the "Choose File" button and select elephant.raw from wherever you saved it. Clicking on the "submit" button will cause the contents of the selected file to be sent to the CAB server and analyzed, and you will be prompted for a location to which the analysis results should be saved (by default elephant.raw.txt). Assuming the default options were active, the result file should be identical to the results displayed in the data area for the multi-sentence query example.
Due to bandwidth limitations, the CAB server currently only accepts input files of size <= 1 MB. If you need to analyze a large amount of data, you will first need to split your input files into chunks of no more than 1 MB each, sending each chunk to the server individually. In this case, please refrain from "hammering" the CAB server with an uninterrupted stream of requests: wait at least 3-5 seconds between requests to avoid blocking the server for other users. Alternatively, if you need to analyze a large corpus, you can contact the Deutsches Textarchiv.
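The chunking and pausing advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the limit is read conservatively as 1,000,000 bytes, chunks are cut at line boundaries only, and send is a hypothetical caller-supplied function which submits one chunk to the server and returns its result.

```python
import time

MAX_BYTES = 1_000_000  # assumption: conservative reading of the 1 MB limit

def split_into_chunks(text, max_bytes=MAX_BYTES):
    """Split text into chunks of at most max_bytes UTF-8 bytes.

    Cuts are made only at line boundaries, so no line is ever split.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        if n > max_bytes:
            raise ValueError("a single line exceeds the chunk size limit")
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks

def process_chunks(chunks, send, delay=5.0, sleep=time.sleep):
    """Call send(chunk) for each chunk, pausing between consecutive requests."""
    results = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            sleep(delay)  # be polite: wait before each follow-up request
        results.append(send(chunk))
    return results
```

Since the default analysis chain uses sentential context, it is preferable to arrange the input so that line breaks do not fall inside sentences before splitting it in this way.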
Analysis Chains
The CAB server supports a number of different analysis modes, corresponding to different sorts of input data and/or different user tasks. The various analysis modes are implemented in terms of different analysis chains (a.k.a. "analyzers" or just "chains") supported by the underlying analysis dispatcher class, DTA::CAB::Chain::DTA. The analysis mode to be used for a particular CAB request is specified by the "analyzer" or "a" parameter, which is initially set to use the "default" analysis chain (which is itself just an alias for the "norm" chain).
This section briefly describes some alternative analysis chains and situations in which they might be useful. For a full list of available analysis chains, see the list returned by the "Analyzers" button in the link area, and see DTA::CAB::Chain::DTA for a list of the available atomic analyzers and aliases for complex analysis chains. For details on individual atomic analyzers, see the appropriate DTA::CAB::Analyzer subclass documentation.
Type-wise Analysis
As noted above, the default "norm" analysis chain uses sentential context to improve the precision of the normalization process as described in Jurish (2012), Chapter 4. This behavior is not always desirable, however. In particular, if your data is not arranged into linguistically meaningful sentence-like units -- e.g. a simple flat list of surface types -- then no real context information is available, and the "sentential" context CAB would use would more likely hinder the normalization than help it. For such cases, the "norm1" analysis chain can be employed instead of the default "norm" chain. The "norm1" chain uses only unigram-based probabilities during normalization, so is less likely to be "confused" by non-sentence-like inputs.
Consider for example the input:
Fliegende Fliegen Fliegen Fliegenden Fliegen Nach.
Passing this list to the "norm1" chain inhibits context-dependent processing and results in the following:
+[moot/word] Fliegende
+[moot/tag] ADJA
+[moot/lemma] fliegend
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
+[moot/word] Fliegenden
+[moot/tag] NN
+[moot/lemma] Fliegende
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
+[moot/word] nach
+[moot/tag] APPR
+[moot/lemma] nach
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
Contrast this with the output of the default "norm" chain:
+[moot/word] Fliegende
+[moot/tag] ADJA
+[moot/lemma] fliegend
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
+[moot/word] Fliegen
+[moot/tag] VVFIN
+[moot/lemma] fliegen
+[moot/word] Fliegenden
+[moot/tag] ADJA
+[moot/lemma] fliegend
+[moot/word] Fliegen
+[moot/tag] NN
+[moot/lemma] Fliege
+[moot/word] Nach
+[moot/tag] PTKVZ
+[moot/lemma] nach
+[moot/word] .
+[moot/tag] $.
+[moot/lemma] .
In the above example, all instances of the surface form "Fliegen" are analyzed as common nouns (NN) with lemma "Fliege" by the unigram-based analyzer "norm1". If sentential context is considered, the second instance of "Fliegen" is correctly analyzed as a finite verb form (VVFIN) of the lemma "fliegen". Similarly, the unigram-based analyzer mis-tags "Fliegenden" as a noun rather than an attributive adjective (NN vs. ADJA) and assigns a corresponding (incorrect) lemma "Fliegende". The final particle "nach" is mis-tagged as a preposition (APPR vs. PTKVZ) by the unigram-based model, but this has no effect on the lemma assigned. Although use of the "norm1" analyzer does not alter any canonical modern forms in this example, such cases are possible.
Term Expansion
It is sometimes useful to have a list of all known orthographic variants of a given input form, e.g. for runtime queries of a database which indexes only surface forms. For such tasks, the analysis chain "expand" can be used. To see all the variants of the surface form "Elephant" in the Deutsches Textarchiv corpus for example, one could submit that form to the "expand" chain and expect a response something like:
+[moot/word] Elefant
+[moot/tag] NN
+[moot/lemma] Elefant
+[eqpho] Elephant <0>
+[eqpho] Elefant <14>
+[eqpho] elephant <17>
+[eqpho] elevant <17>
+[eqpho] Elephand <18>
+[eqpho] Elevant <18>
+[eqpho] elefant <18>
+[eqpho] Elephandt <19>
+[eqpho] Elephanth <19>
+[eqrw] Elefant <0>
+[eqrw] Elephant <0>
+[eqrw] Elephandt <8.44527626037598>
+[eqrw] elefant <8.44683265686035>
+[eqrw] Elephanth <8.70806312561035>
+[eqrw] elephant <9.01417255401611>
+[eqrw] Elephand <18.6624526977539>
+[eqrw] Eliphant <18.7045001983643>
+[eqrw] Elephants <21.1982593536377>
+[eqrw] elevant <21.3945064544678>
+[eqrw] Elphant <23.2134704589844>
+[eqrw] Elesant <27.7278366088867>
+[eqrw] Elephanta <30.2710800170898>
+[eqlemma] Elefannten <0>
+[eqlemma] Elefant <0>
+[eqlemma] Elefanten <0>
+[eqlemma] Elefantin <0>
+[eqlemma] Elefantine <0>
+[eqlemma] Elephandten <0>
+[eqlemma] Elephant <0>
+[eqlemma] Elephanten <0>
+[eqlemma] Elesant <0>
+[eqlemma] elefant <0>
+[eqlemma] elephanten <0>
Here, the "eqpho" attribute contains all surface forms recognized as phonetic variants of the query term, "eqrw" contains those surface forms recognized as variants by the heuristic rewrite cascade, and "eqlemma" contains the surface forms most likely to be mapped to the same modern lemma as the query term. This online expansion strategy is used by the DTA Query Lizard, and was also used by an earlier version of the DTA corpus index as described in Jurish et al. (2014), but has since been replaced there by an online lemmatization query using the "lemma" expander, in conjunction with a direct query of the underlying corpus $Lemma index.
The request includes the tokenize=0 option, which informs the CAB server that the query does not need to be tokenized, effectively forcing use of the qd parameter to the low-level service. This is generally a good idea when using single-token queries or pre-tokenized documents, since it speeds up processing.
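For the database-query use case mentioned above, the expansion attribute values need to be split into a surface form and its bracketed number. The Python sketch below assumes that each value has the shape "FORM <NUMBER>" as in the example response, and that the number behaves like a cost (smaller meaning closer to the query term); that interpretation and the function names are assumptions, not documented CAB behavior.

```python
import re

# "Elephandt <8.44527626037598>" -> form "Elephandt", number 8.44527626037598
VARIANT_RE = re.compile(r"^(.*?)\s*<([^<>]*)>$")

def parse_variant(value):
    """Split an expansion value into (form, cost); cost is None if absent."""
    value = value.strip()
    m = VARIANT_RE.match(value)
    if not m:
        return value, None
    try:
        return m.group(1), float(m.group(2))
    except ValueError:
        return m.group(1), None

def collect_variants(values, max_cost=None):
    """Return unique forms in first-seen order.

    If max_cost is given, forms whose cost is unknown or exceeds it are skipped.
    """
    seen, forms = set(), []
    for value in values:
        form, cost = parse_variant(value)
        if max_cost is not None and (cost is None or cost > max_cost):
            continue
        if form not in seen:
            seen.add(form)
            forms.append(form)
    return forms

eqrw = ["Elefant <0>", "Elephant <0>", "Elephandt <8.44527626037598>",
        "Elephanta <30.2710800170898>"]
print(collect_variants(eqrw, max_cost=10))
```

The resulting list of forms could then be used, for example, to build a disjunctive query against an index of surface forms.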
Date-optimized Analysis
As of DTA::CAB v1.78, the DTA Dispatcher includes specialized rewrite models (rw.1600-1700, rw.1700-1800, rw.1800-1900), and provides a number of high-level convenience chains (norm.1600-1700, norm1.1600-1700, etc.) using these models instead of the default "generic" rewrite cascade (rw) to provide canonicalization hypotheses for unknown words. The weights for the specialized rewrite models were trained on a modest number of manually assigned canonicalization pairs from the period in question extracted from the CabErrorDb error database (4,000-8,000 pair types per model), and may provide a slight improvement in canonicalization accuracy with respect to the generic model, provided that you specify the appropriate analysis chain ("Analyzer") in your request. Input forms such as avf, Auffichten, and Büberchens can be used to compare the outputs of the various chains.
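When processing documents with known publication dates, the choice of chain can be automated. The following Python sketch assumes that the convenience chains follow the naming pattern of the rewrite models for all three periods (only norm.1600-1700 and norm1.1600-1700 are named explicitly above), and that a boundary year such as 1700 belongs to the later period; both points are assumptions to be checked against the "Analyzers" list of your server.

```python
def chain_for_year(year, base="norm"):
    """Pick a date-optimized chain name for a publication year.

    Falls back to the generic chain (base) outside of 1600-1899.
    """
    for start in (1600, 1700, 1800):
        if start <= year < start + 100:
            return f"{base}.{start}-{start + 100}"
    return base

print(chain_for_year(1650))
print(chain_for_year(1780, base="norm1"))
```

The returned name would be passed as the value of the "analyzer" (or "a") request parameter.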
Format Conversion
The CAB server can be used to convert between various supported IO Formats. In this mode, no analysis is performed on the input data (with the exception of tokenization for raw untokenized input), but the input document is parsed and re-formatted according to the selected output format. The analysis chain "null" can be selected for such tasks. To tokenize a simple text string for instance, you can select the "null" analyzer and the "text" format, and expect the tokenized text as output.
This mode of operation is mostly useful in conjunction with file upload queries to convert analyzed files. If you only need to tokenize raw text files, consider using the more efficient WASTE tokenizer web-service directly, or the WebLicht tool-chainer, which offers a number of different tokenizer components.
I/O Formats
The CAB web-service supports a number of different input- and output-formats for document data. This section presents a brief outline of some of the more popular formats. See "SUBCLASSES" in DTA::CAB::Format for a list of currently implemented format subclasses, and see the "Formats" link in the CAB web-service interface link area for a list of format aliases supported by the server. Formatted input documents are passed to the low-level service using the qd query parameter, the use of which is controlled by the tokenize option in the Query Form section of the graphical interface.
Text-based Formats
CAB supports various text-based formats for human consumption and/or further processing. While typically not as flexible or efficient as the "pure" data-oriented formats described below, CAB's native text-based formats offer a reasonable compromise between human- and machine-readability. All text-based CAB formats expect and return data encoded in UTF-8, without a Byte-order mark.
- Text
Simple human-readable text format as described under "Basic Usage", above. Blank lines indicate sentence boundaries, comments are lines beginning with "%%", a line with no leading whitespace contains the surface text of a new token (word), and subsequent token-lines are attribute values beginning with a TAB character and a plus sign ("+"), followed by the attribute label enclosed in square brackets "[...]" and the attribute value as a text-string.
Primarily useful for direct inspection and debugging.
- CSV
Simple fixed-width "vertical" text format containing only selected attribute values. Each line is either a comment introduced by "%%", an empty line indicating a sentence boundary, or a TAB-separated token line. Token lines are of the form
SURFACE_TEXT  XLIT_TEXT  CANON_TEXT  POS_TAG  LEMMA  [DETAILS]
where SURFACE_TEXT is the surface form of the token, XLIT_TEXT is the result of a simple deterministic transliteration using unicruft, CANON_TEXT is the automatically determined canonical modern form for the token, POS_TAG is the part-of-speech tag assigned by the moot tagger, LEMMA is the modern lemma form determined for CANON_TEXT and POS_TAG by the TAGH morphological analyzer, and DETAILS if present are additional details.
This is the most compact of the text-based formats supported by CAB, but lacks flexibility.
- TT
Simple machine-readable "vertical" text format similar to that used by Corpus WorkBench. Each line is either a comment introduced by "%%", an empty line indicating a sentence boundary, or a TAB-separated token line. Token lines' initial column is the token surface text, and subsequent columns are the token's attribute values, where each attribute value column begins with the attribute label enclosed in square brackets "[...]" and is followed by the attribute value as a text string.
Useful for further quick and dirty script-based processing.
- TJ
Simple machine-readable "vertical" text format based on the TT format but using JSON to encode sentence- and token-level attributes rather than an explicit attribute labelling scheme. Each line is either a comment introduced by "%%", an empty line indicating a sentence boundary, a document attribute line, a sentence attribute line, or a TAB-separated token line. Document-attribute lines are comments of the form "%%$TJ:DOC=JSON", where JSON is a JSON object representing auxiliary document attributes. Sentence-attribute lines are analogous comments of the form "%%$TJ:SENT=JSON". Token lines consist of the token surface text, followed by a TAB character, followed by a JSON object representing the internal token structure.
Useful for further script-based processing.
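As an illustration of script-based processing, the following Python sketch reads TJ data using only the standard json module. It assumes the sentence-attribute prefix "%%$TJ:SENT=" given above and, by analogy, a document-attribute prefix "%%$TJ:DOC="; the function name parse_tj, the returned layout, and the token structure in the sample ({"moot": {...}}) are illustrative assumptions rather than a specification of CAB's internal token structure.

```python
import json

DOC_PREFIX = "%%$TJ:DOC="    # assumption: analogous to the SENT prefix
SENT_PREFIX = "%%$TJ:SENT="

def parse_tj(text):
    """Parse TJ-formatted text into {"attrs": {...}, "sentences": [...]}.

    Each sentence is {"attrs": {...}, "tokens": [(surface, data), ...]},
    where data is the decoded JSON object of the token line.
    """
    doc = {"attrs": {}, "sentences": []}
    sent = {"attrs": {}, "tokens": []}
    for line in text.splitlines():
        if line == "":
            # Empty line: sentence boundary.
            if sent["tokens"]:
                doc["sentences"].append(sent)
                sent = {"attrs": {}, "tokens": []}
        elif line.startswith(DOC_PREFIX):
            doc["attrs"] = json.loads(line[len(DOC_PREFIX):])
        elif line.startswith(SENT_PREFIX):
            sent["attrs"] = json.loads(line[len(SENT_PREFIX):])
        elif line.startswith("%%"):
            continue  # ordinary comment
        else:
            # Token line: surface text, TAB, JSON object.
            surface, _, payload = line.partition("\t")
            sent["tokens"].append((surface, json.loads(payload) if payload else {}))
    if sent["tokens"]:
        doc["sentences"].append(sent)
    return doc

sample = (
    '%%$TJ:SENT={"lang":"de"}\n'
    'EJn\t{"moot":{"word":"Ein","tag":"ART"}}\n'
    '\n'
)
doc = parse_tj(sample)
print(doc["sentences"][0]["tokens"][0][0])
```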
XML-based Formats
The CAB web-service supports a number of XML-based formats for data exchange. XML data formats are in general less efficient to parse and/or generate than text-based or data-oriented formats, but they do retain some degree of human-readability and the easy availability of XML processing software packages such as libxml or XMLStarlet makes such formats a reasonable candidate for archiving and cross-platform data sharing.
- XmlTokWrap
Simple serial XML-based format as used by the DTA::TokWrap module. Supports arbitrary token attribute substructure, but fairly slow.
- XmlTokWrapFast
Simple serial XML-based format as used by the DTA::TokWrap module. Faster than the XmlTokWrap formatter, but doesn't support all attributes.
- XmlLing
Flat XML-based format similar to "XmlTokWrapFast" but using only TEI att.linguistic attributes to represent token information. Faster than the XmlTokWrap formatter, but doesn't support all attributes.
- TEI
Parses raw un-tokenized TEI-like input (with or without //c elements) using DTA::TokWrap to reserialize and tokenize the source text, and splices analysis results into the resulting XML document using the XmlTokWrap format. Any existing sentence or token elements in the input will be ignored and the input will be (re-)tokenized. Output data is itself parseable by the TEIws formatter.
Be warned that output sentence- and token-nodes (<s> and <w> elements, respectively) may be fragmented in the final output file. A "fragmented" node in this sense is a logical unit (sentence or token) realized in the output TEI-XML file as multiple elements. Fragmented nodes are encoded using the TEI "linking" attributes @prev and @next, and only the first element of a fragmented node should contain the CAB attribute substructure for that node.
Input to this class need not strictly conform to the TEI Guidelines; in fact, the only structural requirement is that at least one <text> element be present -- any input outside of the scope of a <text> element is ignored. Input files must however be encoded in UTF-8. In particular, XML documents conforming to the DTABf Guidelines should be handled gracefully.
Primarily useful for analyzing native TEI-like XML corpus data without losing structural information encoded in the source XML itself.
- TEI-fast
Wrapper for the "TEI" parser/formatter class using "XmlTokWrapFast" to format the low-level token data. See "TEI" and "XmlTokWrapFast" for caveats.
- TEI-ling
Wrapper for the "TEI" parser/formatter class using "XmlLing" to format the low-level token data using only TEI att.linguistic attributes. See "TEI" and "XmlLing" for caveats.
- TEIws
High-level parser/formatter class for pre-tokenized (and possibly fragmented) TEI-like XML as output by the TEI formatter. Input files should be encoded in UTF-8, and every input sentence and token (//w) must have an @id attribute.
Potentially useful for analyzing pre-tokenized TEI-like XML data, but primarily used for converting to other, script-friendlier formats such as CSV.
- TEIws-ling
Wrapper for the "TEIws" parser/formatter class using "XmlLing" to format the low-level token data using only TEI att.linguistic attributes, which are also parsed from the input document if present. See "TEIws" and "XmlLing" for caveats.
- TCF
Monolithic stand-off XML format used by CLARIN-D, in particular by the WebLicht tool-chainer. See the TCF format documentation for details on the TCF format. CAB currently handles only a subset of the TCF layers, including the orthography layer.
- XML-RPC
Flexible but obscenely inefficient format used for data transfer by the XML-RPC Protocol. Avoid it if you can.
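The fragmented nodes mentioned for the TEI formats can be re-assembled by following the @prev and @next links. The Python sketch below is a simplified illustration and not CAB's own resolution code: it assumes un-namespaced elements carrying a plain id attribute (as in the examples in this document) and link values which point to that id with an optional leading "#".

```python
import xml.etree.ElementTree as ET

def merge_fragments(root, tag="w"):
    """Return one (id, text) pair per logical node of the given tag.

    Elements with a @prev attribute are treated as non-initial fragments
    and are folded into the element which starts their chain.
    """
    elems = list(root.iter(tag))
    by_id = {e.get("id"): e for e in elems if e.get("id")}
    merged = []
    for e in elems:
        if e.get("prev"):
            continue  # non-initial fragment, reached via @next below
        parts, cur, seen = [], e, set()
        while cur is not None and id(cur) not in seen:
            seen.add(id(cur))  # guard against cyclic links
            parts.append("".join(cur.itertext()))
            nxt = cur.get("next")
            cur = by_id.get(nxt.lstrip("#")) if nxt else None
        merged.append((e.get("id"), "".join(parts)))
    return merged

sample = (
    '<text><s id="s1">'
    '<w id="w1" next="#w1_1"><moot word="Elefant"/>Ele</w><lb/>'
    '<w id="w1_1" prev="#w1">phant</w> <w id="w2">gillt</w>'
    '</s></text>'
)
print(merge_fragments(ET.fromstring(sample)))
```

For real corpus files with XML namespaces or xml:id attributes, the tag and attribute names would need to be adjusted accordingly.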
Data-oriented Formats
The following formats provide direct dumps of the underlying DTA::CAB::Document used internally by CAB itself. They are efficient to parse and to produce, but may not be suitable for direct human consumption.
- JSON
Direct JSON dump of the underlying DTA::CAB::Document structure using the Perl JSON::XS module. Very fast and flexible, suitable for further automated processing.
- YAML
Direct dump of the underlying DTA::CAB::Document structure as YAML markup. Fast, flexible, and supports shared substructures, unlike the JSON formatter.
- Perl
Direct dump of the underlying DTA::CAB::Document structure using the Perl Data::Dumper module. Mainly useful for further automated processing with Perl while retaining some degree of human readability.
- Storable
Direct binary dump of the underlying DTA::CAB::Document structure using the Perl Storable module. This is currently the fastest I/O class for both in- and output, mainly useful for further automated processing with Perl.
Advanced Usage
The CAB web-service is a request-oriented service: it accepts a user request as a set of parameter=value pairs and returns the analyzed data as a DTA::CAB::Document object encoded according to the output format specified by the ofmt parameter. Parameters are passed to the CAB web-service RESTfully via the URL query string or HTTP POST request, as for a standard web form. The URL for the low-level request including all user parameters is displayed in the web front-end in the status line. See "query Requests" in DTA::CAB::HttpProtocol for more details on the RESTful CAB request protocol and a list of supported parameters.
Since CAB requests are really nothing more than standard HTTP form requests, a large variety of existing software packages can be used to generate and dispatch CAB requests, e.g. LWP::UserAgent, curl, or wget. When automating CAB requests, please respect the caveats mentioned above in the file query example.
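Because requests are ordinary form requests, a request URL can be assembled with any URL-encoding library. The Python sketch below uses only the standard library; the base URL is a hypothetical placeholder, and the parameter name "q" for the query string is an assumption (the names "a", "fmt" and "tokenize" are taken from this document). Consult DTA::CAB::HttpProtocol for the authoritative parameter list.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://cab.example.org/query"  # hypothetical placeholder URL

def build_query_url(query, analyzer="default", fmt="text", tokenize=True,
                    base_url=BASE_URL):
    """Build a GET request URL for a CAB-style query."""
    params = {
        "q": query,                 # assumed name of the query-string parameter
        "a": analyzer,              # analysis chain
        "fmt": fmt,                 # I/O format
        "tokenize": int(tokenize),  # 1 or 0
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return base_url + "?" + urlencode(params)

url = build_query_url("EJn zamer Elephant", analyzer="norm")
print(url)
# Decoding the query string again recovers the original parameter values.
print(parse_qs(urlsplit(url).query)["q"][0])
```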
Querying CAB with curl
To analyze a TEI-like XML file FILE.tei.xml using curl and save the analysis results to a "spliced" TEI file FILE.teiws.xml, the following command-line ought to suffice:
curl -X POST -sSF "qd=@FILE.tei.xml" -o "FILE.teiws.xml" ""
Alternative IO formats and request parameters can be accommodated by inserting them into the URL query string passed to curl. You can also make use of the inline POSTDATA mechanism (a.k.a. "xpost") described in the DTA::CAB::HttpProtocol manpage in order to save yourself and the CAB server the effort of encoding/decoding the document data. In this case, you need to specify an appropriate "Content-Type" header, e.g.:
curl -X POST -sSH "Content-Type: text/tei+xml; charset=utf8" --data-binary "@FILE.tei.xml" -o "FILE.teiws.xml" ""
You might also be interested in the wrapper scripts which encapsulate some of the common curl command-line options. The preceding two example curl calls should be equivalent to:
bash "?fmt=tei" "FILE.xml" -o "FILE.teiws.xml"
bash "?fmt=tei" "FILE.xml" -o "FILE.teiws.xml"
Note that when accessing the CAB web-service API directly via HTTP in this fashion, auto-detection of input file format is not supported, so you must specify at least the "fmt" parameter in the URL query string if your files are not in the global default format (usually TT).
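The inline POSTDATA ("xpost") request shown above for curl can also be prepared from a script. The following Python sketch uses the standard urllib.request module and only constructs the request object; the URL is a hypothetical placeholder, and actually sending the request (see the final comment) is left to the caller.

```python
import urllib.request

def build_xpost_request(url, document_bytes,
                        content_type="text/tei+xml; charset=utf8"):
    """Build a POST request carrying the raw document as its body."""
    return urllib.request.Request(
        url,
        data=document_bytes,                     # raw, un-encoded document data
        headers={"Content-Type": content_type},  # tells the server how to decode it
        method="POST",
    )

req = build_xpost_request(
    "http://cab.example.org/query?fmt=tei",  # hypothetical placeholder URL
    b"<TEI><text>EJn zamer Elephant</text></TEI>",
)
print(req.get_method(), req.full_url)
# To send the request and read the response body:
#   with urllib.request.urlopen(req) as response:
#       result = response.read()
```

As with curl, the input format must be specified explicitly in the URL query string, since format auto-detection is not available at this level.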
Post-processing TEI XML
The following XSL scripts are provided for post-processing the "spliced" TEI-like output format.
- spliced2ling.xsl
Removes native CAB-markup from TEI-like XML files, encoding the remaining token analysis information using the TEI att.linguistic inventory. All tokens remain encoded as <w> elements (rather than <pc> elements), and only the att.linguistic attributes @lemma, @pos, @norm, and @join are inserted, so that e.g.
<w id="w2" t="EJn"><moot word="Ein" lemma="ein" tag="ART"/>EJn</w> <w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/>Elephant</w><w>!</w>
becomes
<w id="w2" lemma="ein" pos="ART" norm="Ein">Ein</w> <w id="w3" join="right" lemma="Elefant" pos="NN" norm="Elefant">Elephant</w> <w join="left">!</w>
Note that the @join attribute will be correctly generated only if the relevant <w> elements are truly immediately adjacent in the input file: any intervening newlines or other whitespace will prohibit insertion of a @join attribute.
- spliced2norm.xsl
Removes CAB markup and <w> elements, replacing the surface text of each token with its CAB-normalized form, so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes simply the text node
Elefant
- spliced2orig+reg.xsl
Removes CAB markup and <w> elements, replacing non-identity normalizations with a choice element containing a daughter orig with the original surface form and a reg[@resp="#cab"] daughter containing the CAB-normalized form, so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes
<choice><orig>Elephant</orig><reg resp="#cab">Elefant</reg></choice>
- spliced2cab.xsl
Resolves fragmented nodes and converts a "spliced" TEI-like XML file into a serial format as output by the XmlTokWrap formatter class.
- spliced2clean.xsl
Removes some extraneous CAB-markup from TEI-like XML files. Probably not really useful for files returned by the CAB web-service.
- spliced2cleaner.xsl
Removes most CAB-markup from TEI-like XML files, so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/>Elephant</w>
- spliced2clean+cabns.xsl
Removes most CAB-markup from TEI-like XML files and assigns CAB-internal attributes to the XML namespace "cab", so that e.g.
<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/><xlit isLatin1="1" latin1Text="Elephant" isLatinExt="1"/>Elephant</w>
becomes
<w xml:id="w3" cab:t="Elephant" cab:word="Elefant" cab:tag="NN" cab:lemma="Elefant">Elephant</w>
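If no XSLT processor is at hand, the token analyses in "spliced" output can also be read directly from a script. The Python sketch below uses the standard xml.etree.ElementTree module and assumes un-namespaced <w> elements with a <moot> daughter carrying word, tag and lemma attributes, exactly as in the examples in this section; the function name and the returned dictionary layout are illustrative.

```python
import xml.etree.ElementTree as ET

def tokens_from_spliced(xml_text):
    """Extract one record per analyzed <w> element of a spliced document."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("w"):
        moot = w.find("moot")
        if moot is None:
            continue  # e.g. a non-initial fragment without CAB substructure
        rows.append({
            "id": w.get("id"),
            "surface": w.get("t"),
            "word": moot.get("word"),
            "tag": moot.get("tag"),
            "lemma": moot.get("lemma"),
        })
    return rows

sample = (
    '<s><w id="w2" t="EJn"><moot word="Ein" lemma="ein" tag="ART"/>EJn</w> '
    '<w id="w3" t="Elephant"><moot word="Elefant" lemma="Elefant" tag="NN"/>'
    'Elephant</w></s>'
)
for row in tokens_from_spliced(sample):
    print(row["surface"], row["word"], row["tag"], row["lemma"])
```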
Post-processing TCF XML
- tcf-orthswap.xsl
Swaps the text content of 1:1-corresponding token and correction elements, so that e.g.
<tokens> <token ID="w1">Ein</token> <token ID="w2">Elephant</token> </tokens> ... <orthography> <correction tokenIDs="w2" operation="replace">Elefant</correction> </orthography>
becomes
<tokens> <token ID="w1">Ein</token> <token ID="w2">Elefant</token> </tokens> ... <orthography> <correction tokenIDs="w2" operation="replace">Elephant</correction> </orthography>
Potentially useful for preparing CAB-annotated TCF data for submission to other text-sensitive TCF processors which themselves do not respect the orthography layer.
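The effect of the swap can be illustrated without XSLT. The following Python sketch is not the tcf-orthswap.xsl stylesheet itself, only an approximation of its described behavior on a simplified, un-namespaced fragment; real TCF documents typically declare XML namespaces, which this sketch ignores, and the wrapping <TextCorpus> element in the sample is merely a convenient root.

```python
import xml.etree.ElementTree as ET

def orth_swap(root):
    """Swap text between each correction and the single token it refers to."""
    tokens = {t.get("ID"): t for t in root.iter("token")}
    for corr in root.iter("correction"):
        ids = (corr.get("tokenIDs") or "").split()
        if len(ids) != 1 or ids[0] not in tokens:
            continue  # only handle 1:1 token/correction correspondences
        tok = tokens[ids[0]]
        tok.text, corr.text = corr.text, tok.text
    return root

sample = (
    '<TextCorpus>'
    '<tokens><token ID="w1">Ein</token><token ID="w2">Elephant</token></tokens>'
    '<orthography><correction tokenIDs="w2" operation="replace">Elefant'
    '</correction></orthography>'
    '</TextCorpus>'
)
root = orth_swap(ET.fromstring(sample))
print([t.text for t in root.iter("token")])
```

After the swap, downstream tools which read only the token layer see the normalized forms, while the original surface forms remain recoverable from the correction elements.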
The author would appreciate CAB users citing its use in any related publications. As a general CAB-related reference, and for analysis and canonicalization of historical text to modern forms in particular, you can cite:
Jurish, B. Finite-state Canonicalization Techniques for Historical German. PhD thesis, Universität Potsdam, 2012 (defended 2011). URN urn:nbn:de:kobv:517-opus-55789.
For the concrete architecture of the CAB system as used by the Deutsches Textarchiv (DTA) project, you can cite:
Jurish, B. "Canonicalizing the Deutsches Textarchiv." In Proceedings of Perspektiven einer corpusbasierten historischen Linguistik und Philologie (Berlin, Germany, 12th-13th December 2011), volume 4 of Thesaurus Linguae Aegyptiae, Berlin-Brandenburgische Akademie der Wissenschaften, 2013.
For online term expansion with the "expand" analysis chain, you can cite:
Jurish, B., C. Thomas, & F. Wiegand. "Querying the Deutsches Textarchiv." In U. Kruschwitz, F. Hopfgartner, & C. Gurrin (editors), Proceedings of the Workshop MindTheGap 2014: Beyond Single-Shot Text Queries: Bridging the Gap(s) between Research Communities, Berlin, Germany, 4th March, 2014, pages 25-30, 2014.
The CAB software page is the top-level repository for CAB documentation, news, etc.
The DTA::CAB manual page contains a basic introduction to the CAB architecture.
The DTA::CAB::Format manual page describes the abstract CAB I/O Format API, and includes a list of supported format classes.
The DTA::CAB::HttpProtocol manual page describes the conventions used by the CAB web-service API.
The DTA 'Base Format' Guidelines (DTABf) describes the subset of the TEI encoding guidelines which can reasonably be expected to be handled gracefully by the CAB TEI and/or TEIws formatters.
Jurish (2012) describes the abstract method used by CAB for canonicalization of historical text to modern forms.
Jurish (2013) describes the concrete architecture of the CAB system as used by the Deutsches Textarchiv project.
Jurish et al. (2014) describes the use of CAB's online term expansion chain for runtime evaluation of database queries.
Bryan Jurish