NAME
DiaColloDB::Document::TEI - diachronic collocation db, source document, TEI format
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Document::TEI;
##========================================================================
## Constructors etc.
$doc = CLASS_OR_OBJECT->new(%args);
##========================================================================
## API: I/O: parse
$bool = $doc->fromFile($filename_or_fh, %opts);
DESCRIPTION
DiaColloDB::Document::TEI provides a DiaColloDB::Document-compliant API for rudimentary parsing of corpus files in a TEI-like XML format. Input files must be pre-tokenized and represent tokens as <w> elements. Input files may also optionally encode sentence- and/or paragraph-boundaries using <s> and <p> elements, respectively. Fragmented nodes and TEI "linking" attributes are not supported. Although fairly flexible, this document parsing class is very slow and inefficient and is not recommended for production use.
Globals & Constants
- Variable: @ISA
-
DiaColloDB::Document::TEI inherits from DiaColloDB::Document and supports the DiaColloDB::Document API.
Constructors etc.
- new
-
$doc = CLASS_OR_OBJECT->new(%args);
%args, object structure:
##-- parsing options
tei_ns_$NS     => $uri,     ##-- register namespace URI prefix $NS for user-defined XPaths
tei_date       => $xpath,   ##-- XPath for parsing document date, relative to document root
tei_meta_$ATTR => $xpath,   ##-- XPath for parsing meta-attribute $ATTR, relative to document root
tei_word_$ATTR => $xpath,   ##-- XPath for parsing token-attribute $ATTR, relative to //w element
tei_break_$BRK => $xpath,   ##-- XPath for parsing break nodes, relative to document root
tei_eos        => $break,   ##-- default break-level (default: 's')
##
##-- document data
date   => $date,            ##-- year
tokens => \@tokens,         ##-- tokens, including undef for EOS
meta   => \%meta,           ##-- document metadata (e.g. author, title, collection, ...)
Each token in @tokens is a HASH-ref {w=>$word,p=>$pos,l=>$lemma,...}, or undef for EOS.
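The token layout described above can be walked with plain Perl. The following is a minimal sketch, not part of DiaColloDB: the helper split_sentences() and the sample data are hypothetical, and the only assumption taken from this documentation is that undef entries in the token list mark EOS breaks.

```perl
use strict;
use warnings;

## Hypothetical sample data mirroring the documented layout:
## each token is a HASH-ref, and undef marks an end-of-sentence (EOS) break.
my @tokens = (
  { w => 'This', p => 'DT',   l => 'this' },
  { w => 'is',   p => 'VBZ',  l => 'be'   },
  { w => 'a',    p => 'DT',   l => 'a'    },
  { w => 'test', p => 'NN',   l => 'test' },
  { w => '.',    p => 'SENT', l => '.'    },
  undef,
  { w => 'This', p => 'DT',   l => 'this' },
  { w => 'is',   p => 'VBZ',  l => 'be'   },
  { w => 'only', p => 'RB',   l => 'only' },
  { w => 'a',    p => 'DT',   l => 'a'    },
  { w => 'test', p => 'NN',   l => 'test' },
  { w => '.',    p => 'SENT', l => '.'    },
  undef,
);

## split_sentences(\@tokens): hypothetical helper, not part of DiaColloDB.
## Groups tokens into sentences, using undef entries as separators.
## Leading, trailing, and repeated undef entries produce no empty sentences.
sub split_sentences {
  my ($tokens) = @_;
  my (@sents, @cur);
  foreach my $tok (@$tokens) {
    if (defined $tok) {
      push @cur, $tok;
    } elsif (@cur) {
      push @sents, [@cur];
      @cur = ();
    }
  }
  push @sents, [@cur] if @cur;
  return \@sents;
}

## Print one line of lemmata per sentence.
my $sents = split_sentences(\@tokens);
print join(' ', map { $_->{l} } @$_), "\n" foreach @$sents;
```

The helper makes no assumption about whether the parser emits a break before the first token or after the last one; either case is tolerated.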
Default options:
##-- Namespaces
tei_ns_tei => "http://www.tei-c.org/ns/1.0",
##
##-- Metadata XPaths
tei_date           => 'teiHeader/fileDesc/publicationStmt/date',
tei_meta_title     => 'teiHeader/fileDesc/titleStmt/title',
tei_meta_author    => 'teiHeader/fileDesc/titleStmt/author',
tei_meta_textClass => 'teiHeader/fileDesc/profileDesc/textClass/classCode',
##
##-- Token Attribute XPaths
tei_word_w => 'text()',
tei_word_l => '@lemma',
tei_word_p => '@type',
##
##-- Break-Level XPaths
tei_break_s    => '//text//s',
tei_break_p    => '//text//p',
tei_break_div  => '//text//div',
tei_break_page => '//text//pb',
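To illustrate the tei_word_$ATTR naming convention, here is a small sketch of how user options might be layered over a subset of the defaults and grouped by prefix. It uses plain Perl hashes only and is not the module's actual implementation: the helper word_xpaths() is hypothetical, and the attribute names "pos" and "n" are assumptions made up for the example.

```perl
use strict;
use warnings;

## A subset of the documented defaults.
my %defaults = (
  tei_word_w => 'text()',
  tei_word_l => '@lemma',
  tei_word_p => '@type',
  tei_eos    => 's',
);

## Hypothetical user options: read POS tags from a "pos" attribute,
## and add a new token attribute "n" (both attribute names are assumptions).
my %user = (
  tei_word_p => '@pos',
  tei_word_n => '@n',
);

## In a list-assigned hash, later keys win, so user options override defaults.
my %args = (%defaults, %user);

## word_xpaths(\%args): hypothetical helper, not part of DiaColloDB.
## Collects all tei_word_$ATTR keys into a map ($ATTR => $xpath).
sub word_xpaths {
  my ($args) = @_;
  my %xp;
  foreach my $key (keys %$args) {
    $xp{$1} = $args->{$key} if $key =~ /^tei_word_(.+)$/;
  }
  return \%xp;
}

## Print the effective token-attribute XPaths, sorted by attribute name.
my $xp = word_xpaths(\%args);
printf "%-2s => %s\n", $_, $xp->{$_} foreach sort keys %$xp;
```

The merged %args hash has the same shape as the %args accepted by new(), so the same pattern could be used to prepare constructor arguments.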
API: I/O: parse
- fromFile
-
$bool = $doc->fromFile($filename_or_fh, %opts);
Parses tokens from $filename_or_fh. Any %opts given clobber the corresponding keys of %$doc.
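A minimal usage sketch follows, assuming the object structure documented under new() (the date, tokens, and meta keys) and assuming that meta-attributes are stored under their $ATTR name (e.g. "title"). The summarize() helper and the script name are hypothetical; only new() and fromFile() come from this module.

```perl
use strict;
use warnings;

## summarize($doc): hypothetical helper, not part of DiaColloDB.
## Assumes the documented object structure: {date}, {tokens}, {meta}.
sub summarize {
  my ($doc) = @_;
  my @tokens  = @{ $doc->{tokens} // [] };
  my $nwords  = grep { defined $_ } @tokens;   ## real tokens (HASH-refs)
  my $nbreaks = @tokens - $nwords;             ## undef entries (EOS markers)
  my $meta    = $doc->{meta} // {};
  return sprintf("date=%s title=%s words=%d breaks=%d\n",
                 $doc->{date} // '?', $meta->{title} // '?',
                 $nwords, $nbreaks);
}

## Run as: perl summarize.pl FILE.xml   (script name is illustrative)
if (@ARGV) {
  ## Loaded at runtime, so the helper above stays usable without the module.
  require DiaColloDB::Document::TEI;
  my $doc = DiaColloDB::Document::TEI->new();
  $doc->fromFile($ARGV[0])
    or die "$0: failed to parse '$ARGV[0]'\n";
  print summarize($doc);
}
```

Checking the return value of fromFile() relies on the documented $bool result; the exact counts printed for a given file depend on the configured break level (tei_eos).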
EXAMPLE
The following is an example file in the format accepted by this module:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>test document</title>
        <author>Jurish, Bryan</author>
      </titleStmt>
      <publicationStmt>
        <date>2016-02-25</date>
      </publicationStmt>
      <profileDesc>
        <textClass>
          <classCode>dummy</classCode>
          <classCode>test-data</classCode>
        </textClass>
      </profileDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <p>
      <s>
        <w type="DT" lemma="this">This</w>
        <w type="VBZ" lemma="be">is</w>
        <w type="DT" lemma="a">a</w>
        <w type="NN" lemma="test">test</w>
        <w type="SENT" lemma=".">.</w>
      </s>
      <s>
        <w type="DT" lemma="this">This</w>
        <w type="VBZ" lemma="be">is</w>
        <w type="RB" lemma="only">only</w>
        <w type="DT" lemma="a">a</w>
        <w type="NN" lemma="test">test</w>
        <w type="SENT" lemma=".">.</w>
      </s>
    </p>
    <p>
      <s>
        <w type="DT" lemma="this">This</w>
        <w type="VBZ" lemma="be">is</w>
        <w type="RB" lemma="still">still</w>
        <w type="DT" lemma="a">a</w>
        <w type="NN" lemma="test">test</w>
        <w type="SENT" lemma=".">.</w>
      </s>
    </p>
  </text>
</TEI>
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2016 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.