NAME
DiaColloDB::Relation::TDF - collocation db, profiling relation: (term x document) raw-frequency matrix
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Relation::TDF;
##========================================================================
## Constructors etc.
$rel = CLASS_OR_OBJECT->new(%args);
##========================================================================
## TDF API: Utils
$vtype = $rel->vtype();
$itype = $rel->itype();
$packas = $rel->vpack();
$packas = $rel->ipack();
##========================================================================
## Persistent API: disk usage
@files = $rel->diskFiles();
##========================================================================
## Persistent API: header
@keys = $rel->headerKeys();
$hdr = $rel->headerData();
##========================================================================
## Relation API: open/close
$rel_or_undef = $rel->open($base);
$rel_or_undef = $rel->close();
$bool = $rel->opened();
##========================================================================
## Relation API: creation
$rel = $CLASS_OR_OBJECT->create($coldb,$tokdat_file,%opts);
$rel = CLASS_OR_OBJECT->union($coldb, \@dbargs, %opts);
##========================================================================
## Relation API: info
\%info = $rel->dbinfo($coldb);
##========================================================================
## Relation API: profiling
$mprf = $rel->profile($coldb, %opts);
$mprf = $rel->extend($coldb, %opts);
$mpdiff = $rel->compare($coldb, %opts);
##========================================================================
## Profile: Utils: PDL-based profiling
$mprf = $rel->vprofile($coldb, \%opts);
##========================================================================
## Profile: Utils: domain sizes
$NT = $rel->nTerms();
$ND = $rel->nDocs();
$NC = $rel->nFiles();
$NA = $rel->nAttrs();
$NM = $rel->nMeta();
##========================================================================
## Profile: Utils: attribute positioning
\%tpos = $rel->tpos();
\%mpos = $rel->mpos();
##========================================================================
## Profile: Utils: query parsing & evaluation
$idPdl = $rel->idpdl($idPdl);
$tupleIds = $rel->tupleIds($attrType, $attrName, $valIdsPdl);
$ti = $rel->termIds($tattrName, $valIdsPDL);
$ci = $rel->catIds($mattrName, $valIdsPDL);
$bool = $rel->hasMeta($attr);
$enum_or_undef = $rel->metaEnum($mattr);
$cats = $rel->catSubset($terms);
\%groupby = $rel->groupby($coldb, $groupby_request, %opts);
##========================================================================
## Relation API: default: query info
\%qinfo = $rel->qinfo($coldb, %opts);
DESCRIPTION
DiaColloDB::Relation::TDF is a DiaColloDB::Relation subclass for document-level co-occurrence frequencies using PDL to efficiently store and query a sparse underlying (term x document) frequency matrix via the PDL::CCS package.
Supports Boolean expressions over both term- and document-level conditions (the latter via DDC #has[ATTRIBUTE,VALUE] or #has[ATTRIBUTE,/REGEX/] syntax) as well as grouping via literal indexed term- and/or document-level attributes.
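As a rough illustration of the two document-level condition forms, the following sketch formats them as strings. The helper has_clause() is hypothetical and not part of DiaColloDB; the attribute names and values are made up, and no escaping of special characters is attempted.

```perl
use strict;
use warnings;

## hypothetical helper (NOT part of DiaColloDB): format a document-level condition
## in one of the two forms named above
sub has_clause {
  my ($attr, $value, $is_regex) = @_;
  return $is_regex
    ? sprintf('#has[%s,/%s/]', $attr, $value)   ##-- regex form
    : sprintf('#has[%s,%s]',   $attr, $value);  ##-- literal-value form
}

print has_clause('textClass', 'Zeitung', 1), "\n";  ## #has[textClass,/Zeitung/]
print has_clause('author', 'Goethe'), "\n";         ## #has[author,Goethe]
```

How such a clause combines with term-level conditions in a full query is determined by the query parser and is not shown here.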
An earlier version of this module was implemented as DiaColloDB::Relation::Vsem ("vector-space distributional semantic index").
Globals & Constants
- Variable: @ISA
-
DiaColloDB::Relation::TDF inherits from DiaColloDB::Relation.
Constructors etc.
- new
-
$rel = CLASS_OR_OBJECT->new(%args);
%args, object structure:
##-- user options
base => $basename, ##-- relation basename
flags => $flags, ##-- i/o flags (default: 'r')
mgood => $regex, ##-- positive filter regex for metadata attributes
mbad => $regex, ##-- negative filter regex for metadata attributes
submax => $submax, ##-- choke on requested tdm cross-subsets if dense subset size ($NT_sub * $ND_sub) > $submax; default=2**29 (512M)
mquery => \%mquery, ##-- qinfo templates for meta-fields (default: textClass hack for genre): ($mattr=>$TEMPLATE, ...)
##
##-- logging options
logvprofile => $level, ##-- log-level for vprofile() (default=undef:none)
logio => $level, ##-- log-level for low-level I/O operations (default=undef:none)
##
##-- modelling options (formerly via DocClassify)
minFreq => $fmin, ##-- minimum total term-frequency for model inclusion (default=undef:use $coldb->{tfmin})
minDocFreq => $dfmin, ##-- minimum "doc-frequency" (#/docs per term) for model inclusion (default=4)
minDocSize => $dnmin, ##-- minimum doc size (#/tokens per doc) for model inclusion (default=4; formerly $coldb->{vbnmin})
maxDocSize => $dnmax, ##-- maximum doc size (#/tokens per doc) for model inclusion (default=inf; formerly $coldb->{vbnmax})
vtype => $vtype, ##-- PDL::Type for storing compiled values (default=float; auto-promoted if required)
itype => $itype, ##-- PDL::Type for storing compiled integers (default=long)
##
##-- guts: aux: info
N => $tdm0Total, ##-- total number of (doc,term) frequencies counted
dbreak => $dbreak, ##-- inherited from $coldb on create()
##
##-- guts: aux: term-tuples ($NA:number of term-attributes, $NT:number of term-tuples)
attrs => \@attrs, ##-- known term attributes
tvals => $tvals, ##-- pdl($NA,$NT) : [$apos,$ti] => $avali_at_term_ti
tsorti => $tsorti, ##-- pdl($NT,$NA) : [,($apos)] => $tvals->slice("($apos),")->qsorti
tpos => \%a2pos, ##-- term-attribute positions: $apos=$a2pos{$aname}
##
##-- guts: aux: metadata ($NM:number of meta-attributes, $NC:number of cats (source files))
meta => \@mattrs, ##-- known metadata attributes
meta_e_${ATTR} => $enum, ##-- metadata-attribute enum
mvals => $mvals, ##-- pdl($NM,$NC) : [$mpos,$ci] => $mvali_at_ci
msorti => $msorti, ##-- pdl($NC,$NM) : [,($mpos)] => $mvals->slice("($mpos),")->qsorti
mpos => \%m2pos, ##-- meta-attribute positions: $mpos=$m2pos{$mattr}
##
##-- guts: model (formerly via DocClassify dcmap=>$dcmap)
tdm => $tdm, ##-- term-doc matrix : PDL::CCS::Nd ($NT,$ND): [$ti,$di] -> f($ti,$di)
tym => $tym, ##-- term-year matrix: PDL::CCS::Nd ($NT,$NY): [$ti,$yi] -> f($ti,$yi)
cf => $cf_pdl, ##-- cat-freq pdl: dense: ($NC) : [$ci] -> f($ci)
c2date => $c2date, ##-- cat-dates : dense ($NC) : [$ci] -> $date
c2d => $c2d, ##-- cat->doc map: dense (2,$NC) : [*,$ci] -> [$di_off,$di_len]
d2c => $d2c, ##-- doc->cat map: dense ($ND) : [$di] -> $ci
#...
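The submax condition described above amounts to comparing a product of subset sizes against a threshold. A minimal sketch of that arithmetic follows; the helper name subset_too_large() is hypothetical, and this is not the module's actual check.

```perl
use strict;
use warnings;

## default threshold as documented for the submax option (2**29 = 536870912)
my $SUBMAX_DEFAULT = 2**29;

## hypothetical helper (NOT part of DiaColloDB):
## true if a dense ($nt_sub x $nd_sub) cross-subset would exceed $submax
sub subset_too_large {
  my ($nt_sub, $nd_sub, $submax) = @_;
  $submax //= $SUBMAX_DEFAULT;
  return ($nt_sub * $nd_sub) > $submax ? 1 : 0;
}

printf "%d\n", subset_too_large(1_000,  100_000);  ## 1e8 cells: prints 0
printf "%d\n", subset_too_large(10_000, 100_000);  ## 1e9 cells: prints 1
```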
TDF API: Utils
- vtype
-
$vtype = $rel->vtype();
get PDL::Type value type for storing compiled values.
- itype
-
$itype = $rel->itype();
get PDL::Type integer type for storing compiled indices.
- vpack
-
$packas = $rel->vpack();
pack-template for $rel->vtype(), e.g. "f*"
- ipack
-
$packas = $rel->ipack();
pack-template for $rel->itype(), e.g. "l*"
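To show what templates such as "f*" and "l*" do, here is a small round trip using only Perl's built-in pack() and unpack(). The template values are taken from the examples above; the variable names are illustrative and the data is made up.

```perl
use strict;
use warnings;

my $ipack = 'l*';  ##-- example template for an integer type of long (signed 32-bit)
my $vpack = 'f*';  ##-- example template for a value type of float (native single precision)

## pack lists of numbers into binary strings
my $ibuf = pack($ipack, 3, 1, 4);
my $vbuf = pack($vpack, 0.5, 2.0);

## ... and unpack them again
my @ints = unpack($ipack, $ibuf);
my @vals = unpack($vpack, $vbuf);

printf "%d bytes -> (%s)\n", length($ibuf), join(',', @ints);
printf "%d bytes -> (%s)\n", length($vbuf), join(',', @vals);
```

The values 0.5 and 2.0 are chosen because they are exactly representable in single precision; other values may come back slightly rounded.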
Persistent API: disk usage
- diskFiles
-
@files = $rel->diskFiles();
returns disk storage files, used by du() and timestamp()
Persistent API: header
- headerKeys
-
@keys = $rel->headerKeys();
keys to save as header; override includes qw(meta attrs vtype itype) and excludes logging and i/o keys.
- headerData
-
$hdr = $rel->headerData();
returns reference to object header data; override stringifies {itype} and {vtype} keys.
Relation API: open/close
- open
-
$rel_or_undef = $rel->open($base);
$rel_or_undef = $rel->open($base,$flags);
$rel_or_undef = $rel->open();
Opens underlying index files.
- close
-
$rel_or_undef = $rel->close();
Closes underlying index files.
- opened
-
$bool = $rel->opened();
Returns true iff index is opened. Really just checks for $rel->{tdm}.
Relation API: creation
- create
-
$rel = $CLASS_OR_OBJECT->create($coldb,$tokdat_file,%opts);
Populates relation index for $coldb. Requires:
(temporary, tied) doc-arrays @$coldb{qw(docmeta docoff)}
temp file "$coldb->{dbdir}/vtokens.bin": pack($coldb->{pack_w}, @wattrs)
OR
wdmfile=>$wdmfile option
%opts: clobber %$rel, also:
docmeta =>\@docmeta, ##-- for union(): override $coldb->{docmeta}
## $docmeta[$ci] = {id=>$id, nsigs=>$nsigs, file=>$rawfile, date=>$date, label=>$label, meta=>\%meta}
wdmfile =>$wdmfile, ##-- for union(): txt ~ "$ai0 $ai1 ... $aiN $doci $f"; default is generated from 'vtokens.bin'
ivalmax =>$imax, ##-- for union(): maximum integer value (for auto-promotion)
reusedir=>$bool, ##-- for union(): set to true if we're running in a "clean" directory
logas =>$logas, ##-- log label (default: 'create()')
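The wdmfile line layout "$ai0 $ai1 ... $aiN $doci $f" can be read as: attribute-value IDs first, then a document index, then a frequency. The sketch below parses one such line under that reading. The helper parse_wdm_line() is hypothetical, and whitespace-separated fields are an assumption; the actual file format may differ in detail.

```perl
use strict;
use warnings;

## hypothetical helper (NOT part of DiaColloDB):
## split a line "$ai0 $ai1 ... $aiN $doci $f" into its parts
sub parse_wdm_line {
  my $line = shift;
  my @fields = split(' ', $line);  ##-- assumption: whitespace-separated fields
  my $f    = pop @fields;          ##-- last field: frequency
  my $doci = pop @fields;          ##-- second-to-last field: document index
  return { attrs => \@fields, doc => $doci, f => $f };
}

my $rec = parse_wdm_line("17 3 42 5\n");
printf "attrs=(%s) doc=%d f=%d\n", join(',', @{$rec->{attrs}}), $rec->{doc}, $rec->{f};
## prints: attrs=(17,3) doc=42 f=5
```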
- union
-
$rel = CLASS_OR_OBJECT->union($coldb, \@dbargs, %opts);
merge multiple tdf indices into new object. \@dbargs is an ARRAY-ref of DiaColloDB sub-objects ($coldb,...) containing {tdf} relations to be merged.
%opts: clobber %$rel
Current implementation just creates temp-files utdm0.dat and udocmeta.tmp and then calls create().
Relation API: info
- dbinfo
-
\%info = $rel->dbinfo($coldb);
embedded info-hash for $coldb->dbinfo()
Relation API: profiling
- profile
-
$mprf = $rel->profile($coldb, %opts);
Get a relation profile for selected items as a DiaColloDB::Profile::Multi object. %opts are as for DiaColloDB::Relation::profile(). Really just a wrapper for the vprofile() method.
- extend
-
$mprf = $rel->extend($coldb, %opts);
Get independent f2 frequencies for $opts{slice2keys} as a DiaColloDB::Profile::Multi object.
- compare
-
$mpdiff = $rel->compare($coldb, %opts);
Get a relation comparison profile for selected items as a DiaColloDB::Profile::MultiDiff object. %opts are as for DiaColloDB::Relation::compare(), which this method calls after parsing the groupby option via $rel->groupby($coldb, $opts{groupby}, relax=>0).
Profile: Utils: PDL-based profiling
- vprofile
-
\@pprfs = $rel->vprofile($coldb, \%opts);
Guts for the profile() method. User options in %opts are as for DiaColloDB::Relation::profile(). Additional keys are populated and used in the course of the computation, so callers should not set them:
vq => $vq, ##-- parsed query, DiaColloDB::Relation::TDF::Query object
groupby => \%groupby, ##-- as returned by $rel->groupby($coldb, \%opts)
dlo => $dlo, ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
dhi => $dhi, ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
dslo => $dslo, ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
dshi => $dshi, ##-- as returned by $coldb->parseDateRequest(@opts{qw(date slice fill)},1);
Profile: Utils: domain sizes
- nTerms
-
$NT = $rel->nTerms();
returns number of indexed terms.
- nDocs
-
$ND = $rel->nDocs();
returns number of indexed documents (breaks).
- nFiles
-
$NC = $rel->nFiles();
returns number of indexed categories (original source files).
- nAttrs
-
$NA = $rel->nAttrs();
returns number of indexed term-attributes.
- nMeta
-
$NM = $rel->nMeta();
returns number of indexed meta-attributes.
Profile: Utils: attribute positioning
- tpos
-
\%tpos = $rel->tpos();
$tpos = $rel->tpos($tattr);
In the first form, get or build the term-attribute position lookup hash. In the second form, get the index position along dimension $NA of the term-attribute named $tattr, or undef if $tattr is not a known term attribute.
- mpos
-
\%mpos = $rel->mpos();
$mpos = $rel->mpos($mattr);
In the first form, get or build the meta-attribute position lookup hash. In the second form, get the index position along dimension $NM of the meta-attribute named $mattr, or undef if $mattr is not a known metadata attribute.
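A position lookup hash of this kind maps each attribute name to its index in the attribute list. The sketch below builds one in plain Perl; the attribute names and the helper pos_of() are hypothetical, and the module's own tpos() and mpos() methods may cache or build the hash differently.

```perl
use strict;
use warnings;

## hypothetical attribute list, standing in for @{$rel->{attrs}} or @{$rel->{meta}}
my @attrs = qw(l p);

## name => position lookup, analogous to the hash returned by the first form
my %a2pos = map { ($attrs[$_] => $_) } 0..$#attrs;

## second form: position of a single attribute, or undef if unknown
sub pos_of {
  my $name = shift;
  return $a2pos{$name};
}

print pos_of('p'), "\n";                                     ## prints 1
print defined(pos_of('nosuch')) ? "known\n" : "unknown\n";   ## prints unknown
```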
Profile: Utils: query parsing & evaluation
- idpdl
-
$idPdl = $rel->idpdl($idPdl);
$idPdl = $rel->idpdl(\@ids);
$idPdl = $rel->idpdl($id);
Ensure PDL-ness of a set of integer IDs.
- tupleIds
-
$tupleIds = $rel->tupleIds($attrType, $attrName, $valIds);
Returns a PDL representing the set of index items of type $attrType whose value for the $attrName attribute is contained in the ID-set $valIds, which may be specified in any of the forms accepted by the idpdl() method. $attrType is either 't' for a term-attribute (in which case the returned $tupleIds are term indices), or 'm' for a metadata attribute (in which case the returned $tupleIds are "category" indices). The returned $tupleIds are always sorted in ascending order. Could use some optimization.
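The selection performed by tupleIds() can be pictured with a plain-Perl linear scan over a toy attribute table. This is only a sketch of the semantics: the real method works on PDLs, and the function tuple_ids_sketch() and the data below are made up for illustration.

```perl
use strict;
use warnings;

## toy stand-in for an attribute-value table: $tvals[$ti][$apos] = attribute-value ID
my @tvals = ([10,1], [11,2], [10,2], [12,1]);

## hypothetical helper (NOT the module's implementation):
## indices $ti whose value at position $apos is in the ID-set @$val_ids
sub tuple_ids_sketch {
  my ($vals, $apos, $val_ids) = @_;
  my %want = map { ($_ => 1) } @$val_ids;
  ## scanning indices in increasing order yields an ascending result
  return [ grep { $want{ $vals->[$_][$apos] } } 0..$#$vals ];
}

my $ti = tuple_ids_sketch(\@tvals, 0, [10,12]);
print "@$ti\n";  ## prints: 0 2 3
```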
- termIds
-
$ti = $rel->termIds($tattrName, $valIds);
- catIds
-
$ci = $rel->catIds($mattrName, $valIds);
- hasMeta
-
$bool = $rel->hasMeta($mattr);
returns true iff $rel supports metadata attribute $mattr.
- metaEnum
-
$enum_or_undef = $rel->metaEnum($mattr);
returns metadata attribute enum for $mattr, or undef if $mattr is not supported.
- catSubset
-
$cats = $rel->catSubset($termIds);
$cats = $rel->catSubset($termIds,$catIds);
Get a (sorted) cat-subset for the (sorted) term-set $termIds: the set of all "categories" (original source files) which contain at least one instance of any of the terms in $termIds, optionally restricted to the (sorted and unique) set $catIds. The returned category-IDs are sorted and unique.
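The cat-subset computation can be sketched in plain Perl over a toy list of nonzero (term, document) entries and a document-to-category map like the one described for {d2c}. The function cat_subset_sketch() and all data are hypothetical; the module itself operates on PDL::CCS::Nd structures.

```perl
use strict;
use warnings;

## toy sparse term-doc entries [$ti, $di, $f], and a doc->cat map: $d2c[$di] = $ci
my @tdm = ([0,0,2], [0,3,1], [1,1,4], [2,2,1], [2,3,5]);
my @d2c = (0, 0, 1, 2);

## hypothetical helper (NOT the module's implementation):
## sorted, unique categories containing any term in @$term_ids,
## optionally restricted to the categories in @$cat_ids
sub cat_subset_sketch {
  my ($tdm, $d2c, $term_ids, $cat_ids) = @_;
  my %t     = map { ($_ => 1) } @$term_ids;
  my %allow = $cat_ids ? (map { ($_ => 1) } @$cat_ids) : ();
  my %cats;
  for my $nz (@$tdm) {
    my ($ti, $di) = @$nz;
    next unless $t{$ti};                 ##-- term not requested
    my $ci = $d2c->[$di];
    next if $cat_ids && !$allow{$ci};    ##-- category excluded by restriction
    $cats{$ci} = 1;
  }
  return [ sort { $a <=> $b } keys %cats ];
}

print "@{ cat_subset_sketch(\@tdm, \@d2c, [0]) }\n";            ## prints: 0 2
print "@{ cat_subset_sketch(\@tdm, \@d2c, [0,2], [1,2]) }\n";   ## prints: 1 2
```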
- groupby
-
\%groupby = $rel->groupby($coldb, $groupby_request, %opts);
\%groupby = $rel->groupby($coldb, \%groupby, %opts);
Modified version of DiaColloDB::groupby() suitable for the pdl-ized TDF relation. $groupby_request is as for DiaColloDB::parseRequest(). Returns a HASH-ref:
##-- COMPAT: equivalent to DiaColloDB::groupby() return values
req => $request, ##-- save request
areqs => \@areqs, ##-- parsed attribute requests ([$attr,$ahaving, \%ainfo],...)
## + new: %ainfo = ( aname=>$enum_name, atype=>$t_or_m, apos=>$apos )
attrs => \@attrs, ##-- like $coldb->attrs($groupby_request), modulo "having" parts
titles => \@titles, ##-- like map {$coldb->attrTitle($_)} @attrs
##
##-- NEW: for DiaColloDB::Relation::TDF
how => $ghow, ##-- one of 't':groupby terms-only, 'c':groupby cats-only, 'tc':groupby terms+cats
gatype => $gatype, ##-- pdl ($NG) : attribute types $ai : 0 if $areqs->[$ai] is a term attribute, 1 if meta-attribute
gapos => $gapos, ##-- pdl ($NG) : term- or meta-attribute position indices $ai : $rel->mpos($attrs[$ai]) or $rel->tpos($attrs[$ai])
ghavingt => $ghavingt, ##-- pdl ($NHavingTOk) : term indices $ti s.t. $ti matches groupby "having" requests, or undef
ghavingc => $ghavingc, ##-- pdl ($NHavingCOk) : cat indices $ci s.t. $ci matches groupby "having" requests, or undef
g2s => \&g2s, ##-- stringification object suitable for DiaColloDB::Profile::stringify() [CODE,enum, or undef]
gpack => $packas, ##-- pack template for groupby-keys
%opts:
warn => $level, ##-- log-level for unknown attributes (default: 'warn')
relax => $bool, ##-- allow unsupported attributes (default=0)
Relation API: default: query info
- qinfo
-
\%qinfo = $rel->qinfo($coldb, %opts);
Get query-info hash for profile administrivia (ddc hit links). %opts: as for the profile() method. The returned hash \%qinfo should have keys:
fcoef => $fcoef, ##-- frequency coefficient (constant 1 here)
qtemplate => $qtemplate, ##-- query template with __W1.I1__ resp. __W2.I2__ replacing groupby fields
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2020 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
SEE ALSO
DiaColloDB::Relation(3pm), DiaColloDB::Relation::TDF::Query(3pm), DiaColloDB::Relation::Cofreqs(3pm), DiaColloDB::Relation::Unigrams(3pm), DiaColloDB::Relation::DDC(3pm), DiaColloDB(3pm), perl(1), ...