NAME
DiaColloDB::Corpus::Filters - collocation db, source corpus content filters
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Corpus::Filters;
##========================================================================
## Methods
$filters = $CLASS_OR_OBJECT->new(%opts);
$filters = $CLASS_OR_OBJECT->null();
$filters = $filters->clear();
$bool = $filters->isnull();
$bool = $filters1->equal($filters2);
\%name2obj = $filters->compile();
\%line2undef = $coldb->loadListFile($filename_or_undef);
DESCRIPTION
DiaColloDB::Corpus::Filters is a class representing corpus content filters (e.g. stopword lists and regular expressions) used by DiaColloDB::Corpus::Compiled and implicitly by the DiaColloDB::create() method as called by the top-level command-line utility dcdb-corpus-create.perl(1).
Administrivia
- Variable: @ISA
-
DiaColloDB::Corpus::Filters inherits from DiaColloDB::Persistent. It also uses Exporter for compatibility with older versions of the DiaColloDB distribution in which the package-global default variables resided directly in the DiaColloDB package itself.
Defaults
(formerly defined in DiaColloDB.pm)
Don't use qr// for regex defaults, because Storable doesn't like pre-compiled Regexps.
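The point about Storable can be illustrated with a minimal sketch: keep the pattern as a plain string in the persisted structure and compile it with qr// only at the point of use. The hash layout and variable names below are illustrative assumptions, not the module's internals.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

## Hypothetical filter settings: the pattern is a plain string, not a qr// object
my $settings = { pgood => q/^(?:N|TRUNC|VV|ADJ)/ };

## Plain strings survive a Storable round trip unchanged
my $restored = thaw(freeze($settings));

## Compile only when the regex is actually needed
my $pgood_re = qr/$restored->{pgood}/;

for my $pos (qw(NN VVFIN ART)) {
  printf "%-5s %s\n", $pos, ($pos =~ $pgood_re ? 'kept' : 'dropped');
}
```

Keeping the stored form as a string leaves the choice of when (and with which flags) to compile up to the consumer.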
- Variable: $PGOOD_DEFAULT
-
Default positive PoS-regex for document parsing. Default = q/^(?:N|TRUNC|VV|ADJ)/.
- Variable: $PBAD_DEFAULT
-
Default negative PoS-regex for document parsing. Default = undef (none).
- Variable: $WGOOD_DEFAULT
-
Default positive word regex for document parsing. Default = q/[[:alpha:]]/.
- Variable: $WBAD_DEFAULT
-
Default negative word regex for document parsing. Default = q/[\.]/.
- Variable: $LGOOD_DEFAULT
-
Default positive lemma regex for document parsing. Default = undef (none).
- Variable: $LBAD_DEFAULT
-
Default negative lemma regex for document parsing. Default = undef (none).
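As a rough illustration of how these defaults interact, the sketch below applies them to a few (word, PoS, lemma) triples. It assumes that a token is kept only if it matches every defined "good" regex and none of the defined "bad" regexes. That combination rule, the keep_token() helper, and the sample tokens are assumptions made for illustration; the patterns are copied from the list above rather than imported from the module.

```perl
use strict;
use warnings;

## Default patterns copied from the list above, kept as strings (undef = no filter)
my %defaults = (
  pgood => q/^(?:N|TRUNC|VV|ADJ)/,
  pbad  => undef,
  wgood => q/[[:alpha:]]/,
  wbad  => q/[\.]/,
  lgood => undef,
  lbad  => undef,
);

## Compile only the patterns that are actually defined
my %re = map { $_ => qr/$defaults{$_}/ } grep { defined $defaults{$_} } keys %defaults;

## keep_token(): hypothetical helper. ASSUMPTION: a token passes if it matches
## every defined "good" regex and none of the defined "bad" regexes.
sub keep_token {
  my %tok = @_;    ## expects keys: w (word), p (PoS), l (lemma)
  for my $attr (qw(p w l)) {
    my ($good, $bad) = @re{"${attr}good", "${attr}bad"};
    return 0 if defined $good && $tok{$attr} !~ $good;
    return 0 if defined $bad  && $tok{$attr} =~ $bad;
  }
  return 1;
}

for my $tok ([qw(Haus NN Haus)], [qw(der ART d)], [qw(z.B. ADJD z.B.)], [qw(2015 NN 2015)]) {
  my ($w, $p, $l) = @$tok;
  printf "%-6s %-5s %s\n", $w, $p, (keep_token(w => $w, p => $p, l => $l) ? 'keep' : 'drop');
}
```

Under these assumptions, only content-word tokens whose surface form contains a letter and no period pass the filter.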
Methods
- new
-
$filters = $CLASS_OR_OBJECT->new(%opts);
Returns a new DiaColloDB::Corpus::Filters object, which is a simple HASH-ref wrapping %opts:
##-- part-of-speech filters
pgood     => $re,   ##-- PoS whitelist regex
pgoodfile => $file, ##-- PoS whitelist filename
pbad      => $re,   ##-- PoS blacklist regex
pbadfile  => $file, ##-- PoS blacklist filename
##-- word surface text filters
wgood     => $re,   ##-- word whitelist regex
wgoodfile => $file, ##-- word whitelist filename
wbad      => $re,   ##-- word blacklist regex
wbadfile  => $file, ##-- word blacklist filename (= "stopword list")
##-- lemma filters
lgood     => $re,   ##-- lemma whitelist regex
lgoodfile => $file, ##-- lemma whitelist filename
lbad      => $re,   ##-- lemma blacklist regex
lbadfile  => $file, ##-- lemma blacklist filename
See "Defaults" for the default values.
- null
-
$filters = $CLASS_OR_OBJECT->null();
Returns a new DiaColloDB::Corpus::Filters object representing a "null-filter", i.e. with all filter properties undefined.
- clear
-
$filters = $filters->clear();
Deletes all filter properties (white- and blacklist regexes and filenames) from the $filters object.
- isnull
-
$bool = $filters->isnull();
Returns true iff $filters does not define any supported filter properties at all (i.e. application of $filters would be a no-op).
- equal
-
$bool = $filters1->equal($filters2);
$bool = $CLASS->equal($filters1,$filters2);
Returns true iff the filter object operands define all and only the same supported filter properties with identical values.
- compile
-
\%name2obj = $filters->compile();
\%name2obj = $CLASS->compile(\%filters);
Returns a HASH-ref of compiled filter regexes and (stop|go)-hashes of the form
${NAME} => $REGEXP,
${NAME}file => \%HASHREF,
- loadListFile
-
\%line2undef = $coldb->loadListFile($filename_or_undef);
Low-level utility method used to load (stop|go)-list files.
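The result shape described for compile() and the role of loadListFile() can be sketched in plain Perl as follows. This is not the module's implementation: load_list_file() and compile_filters() are hypothetical stand-ins, and the list-file format assumed here (one entry per line, blank lines skipped) is a guess that should be checked against the actual source.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

## load_list_file(): hypothetical stand-in for loadListFile().
## ASSUMPTION: one entry per line; blank lines are skipped.
sub load_list_file {
  my $filename = shift;
  return undef if !defined $filename;
  open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "open failed for '$filename': $!";
  my %line2undef;
  while (defined(my $line = <$fh>)) {
    chomp $line;
    next if $line eq '';
    $line2undef{$line} = undef;    ## only the key matters; test with exists()
  }
  close $fh;
  return \%line2undef;
}

## compile_filters(): hypothetical stand-in for compile().
## Builds ${NAME} => $REGEXP and ${NAME}file => \%HASHREF entries.
sub compile_filters {
  my $filters = shift;
  my %name2obj;
  for my $name (qw(pgood pbad wgood wbad lgood lbad)) {
    $name2obj{$name} = qr/$filters->{$name}/
      if defined $filters->{$name};
    $name2obj{"${name}file"} = load_list_file($filters->{"${name}file"})
      if defined $filters->{"${name}file"};
  }
  return \%name2obj;
}

## Usage: write a tiny stopword list to a temporary file and compile it
my ($tmpfh, $tmpname) = tempfile(UNLINK => 1);
binmode($tmpfh, ':encoding(UTF-8)');
print {$tmpfh} "der\ndie\ndas\n";
close $tmpfh;

my $compiled = compile_filters({ wgood => q/[[:alpha:]]/, wbadfile => $tmpname });
for my $w (qw(Haus das)) {
  my $drop = $w !~ $compiled->{wgood} || exists $compiled->{wbadfile}{$w};
  print "$w: ", ($drop ? 'drop' : 'keep'), "\n";
}
```

Storing list entries as hash keys with undef values, as the \%line2undef name suggests, keeps membership tests cheap via exists().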
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
dcdb-corpus-compile.perl(1), DiaColloDB::Corpus::Compiled(3pm), DiaColloDB(3pm), perl(1), ...