NAME
DTA::CAB::Chain::DTA - Deutsches Textarchiv canonicalization chain class
SYNOPSIS
    use DTA::CAB::Chain::DTA;

    ##========================================================================
    ## Methods

    $obj = CLASS_OR_OBJ->new(%args);
    $ach = $ach->setupChains();
    $bool = $ach->ensureLoaded();
    $bool = $anl->doAnalyze(\%opts, $name);
    $doc = $ach->analyzeClean($doc,\%opts);
DESCRIPTION
DTA::CAB::Chain::DTA is the DTA::CAB::Analyzer subclass implementing the robust orthographic canonicalization cascade used in the Deutsches Textarchiv project. This class inherits from DTA::CAB::Chain::Multi. See the "setupChains" method for a list of supported sub-chains and the corresponding analyzers.
Methods
- new
-
    $obj = CLASS_OR_OBJ->new(%args);

%$obj, %args:

    ##-- paranoia
    autoClean => 0,             ##-- always run 'clean' analyzer regardless of options; checked in both doAnalyze(), analyzeClean()
    defaultChain => 'default',
    ##
    ##-- overrides
    chains => undef,            ##-- see setupChains() method
    chain => undef,             ##-- see setupChains() method

Additionally, the following sub-analyzers are defined as fields of %$obj:
- tokpp
-
Token preprocessor, a DTA::CAB::Analyzer::TokPP object.
- xlit
-
Transliterator, a DTA::CAB::Analyzer::Unicruft object.
- lts
-
Phonetizer (Letter-to-Sound mapper), a DTA::CAB::Analyzer::LTS object.
- morph
-
Morphological analyzer (TAGH), a DTA::CAB::Analyzer::Morph object.
- mlatin
-
Latin pseudo-morphology, a DTA::CAB::Analyzer::Morph::Latin object.
- msafe
-
Morphological security heuristics, a DTA::CAB::Analyzer::MorphSafe object.
- rw
-
Weighted finite-state rewrite cascade, a DTA::CAB::Analyzer::Rewrite object.
Date-optimized variants rw.1600-1700, rw.1700-1800, and rw.1800-1900 may also be included.
- rwsub
-
Post-processing for rewrite cascade, a DTA::CAB::Analyzer::RewriteSub object.
- eqphox
-
Intensional (TAGH-based) phonetic equivalence expander, a DTA::CAB::Analyzer::EqPhoX object.
- eqpho
-
Extensional (corpus-based) phonetic equivalence expander, a DTA::CAB::Analyzer::EqPho object.
- eqrw
-
Extensional rewrite-equivalence expander, a DTA::CAB::Analyzer::EqRW object.
- dmoot
-
Token-level dynamic HMM conflation disambiguator, a DTA::CAB::Analyzer::Moot::DynLex object.
- dmootsub
-
Post-processing for "dmoot" analyzer, a DTA::CAB::Analyzer::DmootSub object.
- moot
-
HMM part-of-speech tagger, a DTA::CAB::Analyzer::Moot object.
- mootsub
-
Post-processing for "moot" tagger, a DTA::CAB::Analyzer::MootSub object.
- eqlemma
-
Extensional (corpus-based) lemma-equivalence class expander, a DTA::CAB::Analyzer::EqLemma object.
- clean
-
Janitor (paranoid removal of internal temporary data), a DTA::CAB::Analyzer::DTAClean object.
- setupChains
-
    $ach = $ach->setupChains();

Set up default named sub-chains in $ach->{chains}. Currently defines a singleton chain sub.NAME for each analyzer key in keys(%$ach), as well as the following non-trivial chains:

    'sub.expand'     => [@$ach{qw(eqpho eqrw eqlemma)}],
    'sub.sent'       => [@$ach{qw(dmoot dmootsub moot mootsub)}],
    'sub.sent1'      => [@$ach{qw(dmoot1 dmootsub moot1 mootsub)}],
    'sub.gn'         => [@$ach{qw(gn-syn gn-isa gn-asi)}],
    'sub.ot'         => [@$ach{qw(ot-syn ot-isa ot-asi)}],
    ##
    'default.static' => [@$ach{qw(static)}],
    'default.exlex'  => [@$ach{qw(exlex)}],
    'default.tokpp'  => [@$ach{qw(tokpp)}],
    'default.xlit'   => [@$ach{qw(xlit)}],
    'default.lts'    => [@$ach{qw(xlit lts)}],
    'default.eqphox' => [@$ach{qw(tokpp xlit lts eqphox)}],
    'default.morph'  => [@$ach{qw(tokpp xlit morph)}],
    'default.mlatin' => [@$ach{qw(tokpp xlit mlatin)}],
    'default.msafe'  => [@$ach{qw(tokpp xlit morph mlatin msafe)}],
    'default.langid' => [@$ach{qw(tokpp xlit morph mlatin msafe langid)}],
    'default.rw'     => [@$ach{qw(tokpp xlit rw)}],
    'default.rw.safe' => [@$ach{qw(tokpp xlit morph mlatin msafe langid rw)}],
    'default.dmoot'  => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot)}],
    'default.dmoot1' => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1)}],
    'default.moot'   => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot dmootsub moot)}],
    'default.moot1'  => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1 dmootsub moot1)}],
    'default.lemma'  => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1 dmootsub moot mootsub)}],
    'default.lemma1' => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot1 dmootsub moot1 mootsub)}],
    'default.ner'    => [@$ach{qw(tokpp xlit lts eqphox morph mlatin msafe langid rw dmoot dmootsub moot mootsub ner)}],
    'default.base'   => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid)}],
    'default.type'   => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw rwsub)}],
    ##
    'expand.old'     => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqpho eqrw)}],
    'expand.ext'     => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqpho eqrw eqphox)}],
    'expand.all'     => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqpho eqrw eqphox dmoot1 dmootsub moot1 mootsub eqlemma)}],
    'expand.eqpho'   => [@$ach{qw(static exlex xlit lts eqpho)}],
    'expand.eqrw'    => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqrw)}],
    'expand.eqlemma' => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub eqlemma)}],
    'expand.gn-syn'  => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-syn)}],
    'expand.gn-isa'  => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-isa)}],
    'expand.gn-asi'  => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-asi)}],
    'expand.gn'      => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub gn-syn gn-isa gn-asi)}],
    'expand.ot-syn'  => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-syn)}],
    'expand.ot-isa'  => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-isa)}],
    'expand.ot-asi'  => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-asi)}],
    'expand.ot'      => [@$ach{qw(static exlex xlit lts morph mlatin msafe rw eqphox dmoot1 dmootsub moot1 mootsub ot-syn ot-isa ot-asi)}],
    ##
    'norm'           => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot dmootsub moot mootsub)}],
    'norm1'          => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot1 dmootsub moot1 mootsub)}],
    'ner'            => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot dmootsub moot mootsub ner)}],
    'caberr'         => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot dmootsub moot mootsub mapclass)}],
    'caberr1'        => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw eqphox dmoot1 dmootsub moot1 mootsub mapclass)}],
    'all'            => [@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw rwsub eqpho eqrw eqphox dmoot dmootsub moot mootsub eqlemma)}],
    'clean'          => [@$ach{qw(clean)}],
    ##
    'null'           => [$ach->{null}],

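Each chain above is built with a Perl hash slice, @$ach{qw(...)}, which looks up several analyzer keys at once and returns their values in the order given. The following is a minimal, self-contained sketch of that idiom only; the placeholder strings stand in for real analyzer objects and are an assumption made for illustration, not part of DTA::CAB.

```perl
use strict;
use warnings;

## Stand-in for a chain object: analyzer keys map to placeholder strings.
## (Hypothetical values; the real fields hold analyzer objects.)
my $ach = { map { ($_ => "analyzer:$_") } qw(tokpp xlit lts morph rw) };

## Hash slice: fetch the values for several keys at once, in the order given.
my @chain = @$ach{qw(tokpp xlit rw)};

## Slicing a key that is not present yields undef in that position.
my @partial = @$ach{qw(tokpp langid)};

print join(' ', @chain), "\n";
print defined($partial[1]) ? "langid defined\n" : "langid missing\n";
```

Because the order of keys inside qw() determines the order of the resulting array, the listing above also documents the order in which the analyzers of each chain are arranged.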
High-level date-optimized chains norm.RNG, norm1.RNG, lemma.RNG, lemma1.RNG, default.RNG, and expand.RNG are also defined using the date-optimized rewrite cascade rw.RNG in place of the default "generic" cascade rw for each range RNG in 1600-1700, 1700-1800, and 1800-1900.
- ensureLoaded
-
    $bool = $ach->ensureLoaded();

Ensures that analysis data is loaded from the default files. The override inherited from DTA::CAB::Chain::Multi calls ensureChain() before the inherited method. As a hack, the chain sub-analyzers (rwsub, dmootsub) are copied only AFTER their own sub-analyzers have been loaded, and 'enabled' is set only then, if appropriate.
- doAnalyze
-
    $bool = $anl->doAnalyze(\%opts, $name);

Alias for $anl->can("analyze${name}") && (!exists($opts{"doAnalyze${name}"}) || $opts{"doAnalyze${name}"}). Override checks $anl->{autoClean} flag.
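The predicate described above can be sketched in isolation as follows. My::Analyzer and doAnalyzeBase are hypothetical names used only for this illustration, and the way the override consults autoClean (forcing the 'Clean' step when the flag is set) is an assumption based on the comment on the autoClean option, not a copy of the module's code.

```perl
use strict;
use warnings;

package My::Analyzer;  ## hypothetical stand-in class, not part of DTA::CAB

sub new { my ($class, %args) = @_; return bless { autoClean => 0, %args }, $class; }
sub analyzeTypes { return $_[1]; }  ## placeholder analysis methods
sub analyzeClean { return $_[1]; }

## Base predicate, following the alias given in the text.
sub doAnalyzeBase {
  my ($anl, $opts, $name) = @_;
  my $key = "doAnalyze${name}";
  return ($anl->can("analyze${name}")
          && (!exists($opts->{$key}) || $opts->{$key})) ? 1 : 0;
}

## Assumed override: always allow the 'Clean' step when autoClean is set.
sub doAnalyze {
  my ($anl, $opts, $name) = @_;
  return 1 if $anl->{autoClean} && $name eq 'Clean';
  return $anl->doAnalyzeBase($opts, $name);
}

package main;

my $plain    = My::Analyzer->new();
my $paranoid = My::Analyzer->new(autoClean => 1);

my $r_default  = $plain->doAnalyze({}, 'Types');                        ## method exists, no option set
my $r_disabled = $plain->doAnalyze({ doAnalyzeTypes => 0 }, 'Types');   ## explicitly disabled
my $r_missing  = $plain->doAnalyze({}, 'Sentences');                    ## no analyzeSentences() method
my $r_clean    = $plain->doAnalyze({ doAnalyzeClean => 0 }, 'Clean');   ## disabled, autoClean off
my $r_forced   = $paranoid->doAnalyze({ doAnalyzeClean => 0 }, 'Clean'); ## disabled, autoClean on

print join(' ', $r_default, $r_disabled, $r_missing, $r_clean, $r_forced), "\n";
```

In this sketch an absent doAnalyzeNAME option counts as "enabled", so callers only need to pass options for the steps they want to switch off.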
- analyzeClean
-
    $doc = $ach->analyzeClean($doc,\%opts);

Cleans up any temporary data associated with $doc. The chain default calls $a->analyzeClean for each analyzer $a in the chain, then the superclass Analyzer->analyzeClean. The local override checks $ach->{autoClean}.
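As a rough illustration of the chain default described above, the sketch below has each member of a chain remove its own temporary data before the chain performs a final cleanup of its own. The packages My::Sub and My::Chain, the underscore-prefixed key names, and the hash-based document are all hypothetical; the autoClean check of the local override is not modelled here.

```perl
use strict;
use warnings;

## Hypothetical sub-analyzer: removes its own temporary key from the document.
package My::Sub;
sub new { my ($class, $key) = @_; return bless { key => $key }, $class; }
sub analyzeClean {
  my ($anl, $doc, $opts) = @_;
  delete $doc->{ $anl->{key} };
  return $doc;
}

## Hypothetical chain: cleans via each member in order, then does its own cleanup.
package My::Chain;
sub new { my ($class, @chain) = @_; return bless { chain => [@chain] }, $class; }
sub analyzeClean {
  my ($ach, $doc, $opts) = @_;
  $_->analyzeClean($doc, $opts) foreach @{ $ach->{chain} };
  delete $doc->{_chain_tmp};  ## stands in for the superclass cleanup step
  return $doc;
}

package main;

my $doc = { text => 'Thuer', _xlit_tmp => 1, _rw_tmp => 1, _chain_tmp => 1 };
my $ach = My::Chain->new(My::Sub->new('_xlit_tmp'), My::Sub->new('_rw_tmp'));
my $cleaned = $ach->analyzeClean($doc, {});

print join(',', sort keys %$cleaned), "\n";
```

The document is modified in place and also returned, mirroring the $doc = $ach->analyzeClean($doc,\%opts) calling convention shown in the signature.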
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2010-2019 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.24.1 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
dta-cab-analyze.perl(1), DTA::CAB::Chain::Multi(3pm), DTA::CAB::Chain(3pm), DTA::CAB::Analyzer(3pm), DTA::CAB(3pm), perl(1), ...