NAME

DiaColloDB::Corpus::Filters - collocation db, source corpus content filters

SYNOPSIS

##========================================================================
## PRELIMINARIES

use DiaColloDB::Corpus::Filters;

##========================================================================
## Methods

$filters = $CLASS_OR_OBJECT->new(%opts);
$filters = $CLASS_OR_OBJECT->null();
$filters = $filters->clear();
$bool = $filters->isnull();
$bool = $filters1->equal($filters2);
\%name2obj = $filters->compile();
\%line2undef = $filters->loadListFile($filename_or_undef);

DESCRIPTION

DiaColloDB::Corpus::Filters is a class representing corpus content filters (e.g. stopword lists and regular expressions) used by DiaColloDB::Corpus::Compiled and implicitly by the DiaColloDB::create() method as called by the top-level command-line utility dcdb-corpus-create.perl(1).

Administrivia

Variable: @ISA

DiaColloDB::Corpus::Filters inherits from DiaColloDB::Persistent. It also uses Exporter for compatibility with older versions of the DiaColloDB distribution in which the package-global default variables resided directly in the DiaColloDB package itself.

Defaults

(formerly defined in DiaColloDB.pm)

The regex defaults are stored as plain strings rather than qr// objects, because Storable doesn't like pre-compiled Regexps.

Variable: $PGOOD_DEFAULT

Default positive PoS-regex for document parsing. Default = q/^(?:N|TRUNC|VV|ADJ)/.

Variable: $PBAD_DEFAULT

Default negative PoS-regex for document parsing. Default = undef (none).

Variable: $WGOOD_DEFAULT

Default positive word regex for document parsing. Default = q/[[:alpha:]]/.

Variable: $WBAD_DEFAULT

Default negative word regex for document parsing. Default = q/[\.]/.

Variable: $LGOOD_DEFAULT

Default positive lemma regex for document parsing. Default = undef (none).

Variable: $LBAD_DEFAULT

Default negative lemma regex for document parsing. Default = undef (none).
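
Because the defaults are plain strings, they have to be compiled into regexes before they can be matched. The following is a minimal sketch of that idea using the documented default values. The lexical variables are illustrative copies of the defaults, not the package globals themselves, and the sample PoS tags and words are made up for the example.

```perl
use strict;
use warnings;

##-- illustrative copies of the documented default strings (assumption: same values)
my $pgood = q/^(?:N|TRUNC|VV|ADJ)/;
my $wgood = q/[[:alpha:]]/;
my $wbad  = q/[\.]/;

##-- compile at use time; the stored values remain plain strings
my ($pgood_re, $wgood_re, $wbad_re) = map { qr/$_/ } ($pgood, $wgood, $wbad);

##-- sample PoS tags
for my $pos (qw(NN VVFIN ADJA ART)) {
  printf "%-6s pgood=%s\n", $pos, ($pos =~ $pgood_re ? 'yes' : 'no');
}

##-- sample word forms
for my $word ('Haus', '2015', 'z.B.') {
  printf "%-6s wgood=%s wbad=%s\n", $word,
    ($word =~ $wgood_re ? 'yes' : 'no'),
    ($word =~ $wbad_re  ? 'yes' : 'no');
}
```

Keeping the stored value a string means the same data can be serialized as-is and recompiled wherever it is loaded.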

Methods

new
$filters = $CLASS_OR_OBJECT->new(%opts);

Returns a new DiaColloDB::Corpus::Filters object, which is a simple HASH-ref wrapping %opts:

##-- part-of-speech filters
pgood     => $re,    ##-- PoS whitelist regex
pgoodfile => $file,  ##-- PoS whitelist filename
pbad      => $re,    ##-- PoS blacklist regex
pbadfile  => $file,  ##-- PoS blacklist filename

##-- word surface text filters
wgood     => $re,    ##-- word whitelist regex
wgoodfile => $file,  ##-- word whitelist filename
wbad      => $re,    ##-- word blacklist regex
wbadfile  => $file,  ##-- word blacklist filename (= "stopword list")

##-- lemma filters
lgood     => $re,    ##-- lemma whitelist regex
lgoodfile => $file,  ##-- lemma whitelist filename
lbad      => $re,    ##-- lemma blacklist regex
lbadfile  => $file,  ##-- lemma blacklist filename

See "Defaults" for the default values.
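
As a rough illustration of how whitelist and blacklist regexes of this kind are typically combined, the sketch below accepts a token only if each attribute matches its "good" regex (when defined) and does not match its "bad" regex (when defined). This semantics is an assumption made for the example, and attr_ok() and token_ok() are hypothetical helpers, not part of DiaColloDB.

```perl
use strict;
use warnings;

##-- hypothetical helper (not part of DiaColloDB):
##   true iff $value passes the optional whitelist and blacklist regexes
sub attr_ok {
  my ($value, $good, $bad) = @_;
  return 0 if (defined($good) && $value !~ /$good/);
  return 0 if (defined($bad)  && $value =~ /$bad/);
  return 1;
}

##-- hypothetical helper: check the word and PoS attributes of one token
sub token_ok {
  my ($opts, $word, $pos) = @_;
  return attr_ok($pos,  $opts->{pgood}, $opts->{pbad})
      && attr_ok($word, $opts->{wgood}, $opts->{wbad});
}

##-- options in the shape accepted by new(), using the documented defaults
my %opts = (
  pgood => q/^(?:N|TRUNC|VV|ADJ)/,
  wgood => q/[[:alpha:]]/,
  wbad  => q/[\.]/,
);

for my $tok (['Haus', 'NN'], ['der', 'ART'], ['z.B.', 'NN']) {
  my ($word, $pos) = @$tok;
  printf "%-5s %-4s %s\n", $word, $pos, (token_ok(\%opts, $word, $pos) ? 'keep' : 'drop');
}
```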

null
$filters = $CLASS_OR_OBJECT->null();

Returns a new DiaColloDB::Corpus::Filters object representing a "null-filter", i.e. with all filter properties undefined.

clear
$filters = $filters->clear();

Deletes all filter properties (white- and blacklist regexes and filenames) from the $filters object.

isnull
$bool = $filters->isnull();

Returns true iff $filters does not define any supported filter properties at all (i.e. application of $filters would be a no-op).

equal
$bool = $filters1->equal($filters2);
$bool = $CLASS->equal($filters1,$filters2)

Returns true iff the filter object operands define all and only the same supported filter properties with identical values.
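
The null and equality tests described above can be pictured with plain hashes. This is only a sketch under stated assumptions: is_null() and filters_equal() are hypothetical stand-ins, the key list is taken from the options documented for new(), and treating a missing key the same as an undefined value is a guess about the intended behavior.

```perl
use strict;
use warnings;

##-- supported filter property names, as listed under new()
my @KEYS;
for my $attr (qw(p w l)) {
  for my $kind (qw(good bad)) {
    push @KEYS, "$attr$kind", "$attr${kind}file";
  }
}

##-- hypothetical stand-in for isnull(): no supported property is defined
sub is_null {
  my ($f) = @_;
  return (grep { defined $f->{$_} } @KEYS) ? 0 : 1;
}

##-- hypothetical stand-in for equal(): same defined properties, same values
sub filters_equal {
  my ($f1, $f2) = @_;
  for my $k (@KEYS) {
    my ($v1, $v2) = ($f1->{$k}, $f2->{$k});
    return 0 if (defined($v1) xor defined($v2));
    return 0 if (defined($v1) && $v1 ne $v2);
  }
  return 1;
}

my %a = (pgood => '^N', wbad => '[\.]');
my %b = (pgood => '^N', wbad => '[\.]', comment => 'unsupported keys are ignored');
print "a is null: ", is_null(\%a), "\n";
print "a equals b: ", filters_equal(\%a, \%b), "\n";
```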

compile
\%name2obj = $filters->compile();
\%name2obj = $CLASS->compile(\%filters);

Returns a HASH-ref of compiled filter regexes and (stop|go)-hashes of the form

${NAME}     => $REGEXP,
${NAME}file => \%HASHREF,

loadListFile
\%line2undef = $filters->loadListFile($filename_or_undef);

Low-level utility method used to load (stop|go)-list files.
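
The return value's name suggests a hash whose keys are the lines of the list file and whose values are all undef. The sketch below shows one way such a loader could look. load_list_file() is a hypothetical stand-in, not the DiaColloDB implementation; the UTF-8 encoding, the skipping of blank lines, and returning undef for an undefined filename are all assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

##-- hypothetical stand-in for loadListFile(); not the DiaColloDB implementation
sub load_list_file {
  my ($file) = @_;
  return undef if (!defined($file) || $file eq '');  ##-- assumption: no file, no list
  open(my $fh, '<:encoding(UTF-8)', $file)
    or die "open failed for '$file': $!";
  my %line2undef;
  while (defined(my $line = <$fh>)) {
    chomp($line);
    next if ($line eq '');          ##-- assumption: blank lines are ignored
    $line2undef{$line} = undef;     ##-- only the keys carry information
  }
  close($fh);
  return \%line2undef;
}

##-- usage: write a tiny stopword list, then test membership
my ($tmpfh, $tmpfile) = tempfile(UNLINK => 1);
print {$tmpfh} "der\ndie\ndas\n";
close($tmpfh);

my $stop = load_list_file($tmpfile);
for my $w (qw(der Haus)) {
  print "$w: ", (exists($stop->{$w}) ? 'stopword' : 'kept'), "\n";
}
```

Since every value is undef, membership has to be tested with exists() rather than by checking the value for truth or definedness.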

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2020 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

dcdb-corpus-compile.perl(1), DiaColloDB::Corpus::Compiled(3pm), DiaColloDB(3pm), perl(1), ...