NAME
DiaColloDB::Client::list - diachronic collocation db: client: distributed
DESCRIPTION
DiaColloDB::Client::list is a subclass of DiaColloDB::Client for accessing a set of distributed DiaColloDB databases via a list:// URL whose path part is a whitespace- or "+"-separated list of sub-URLs supported by DiaColloDB::Client. It implements the DiaColloDB::Client API by calling the relevant methods on each of its sub-clients.
new() options and object structure:
    ##-- DiaColloDB::Client: options
    url      => $url,     ##-- list url (sub-urls, separated by whitespace, "+SCHEME://", or "+://")
    ##
    ##-- DiaColloDB::Client::list
    urls     => \@urls,   ##-- sub-urls
    opts     => \%opts,   ##-- sub-client options
    fudge    => $fudge,   ##-- get ($fudge*$kbest) items from sub-clients (0:all; default=10)
    logFudge => $level,   ##-- log-level for fudge-coefficient debugging (default='debug')
    ##
    ##-- guts
    clis     => \@clis,   ##-- per-url clients
The most important client parameter is the fudge-coefficient option fudge=>$fudge, which requests that up to $fudge*$kbest items be retrieved from each sub-client for each profile() call. If $fudge <= 0, all collocates are retrieved from each sub-client, and trimming is performed exclusively by the superordinate DiaColloDB::Client::list object. The default value of 10 should return reasonable results without too large a performance penalty in most cases, but be aware that the results may not be strictly correct due to sub-client local pruning; see "KNOWN BUGS" for details.
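The effect of sub-client local pruning can be illustrated with a minimal, self-contained sketch. It does not use the DiaColloDB API: the helpers top_n() and merge_profiles() and the toy counts are hypothetical, and the sketch simply assumes that sub-profiles are pruned to $fudge*$kbest items, summed by item, and trimmed to $kbest.

```perl
use strict;
use warnings;

## Hypothetical helper (not part of DiaColloDB): keep the $n highest-valued
## items of a { item => count } hash; $n <= 0 keeps everything.
sub top_n {
  my ($counts, $n) = @_;
  my @keys = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
  splice(@keys, $n) if $n > 0 && @keys > $n;
  return { map { ($_ => $counts->{$_}) } @keys };
}

## Hypothetical helper: prune each sub-profile to $fudge*$kbest items,
## sum the surviving counts by item, and trim the merged result to $kbest.
sub merge_profiles {
  my ($kbest, $fudge, @subprofiles) = @_;
  my %sum;
  foreach my $sub (@subprofiles) {
    my $pruned = top_n($sub, $fudge * $kbest);
    $sum{$_} += $pruned->{$_} foreach keys %$pruned;
  }
  return top_n(\%sum, $kbest);
}

## Invented toy co-occurrence counts for two sub-clients.
my %sub_a = (x => 10, y => 9, z => 8);
my %sub_b = (w => 10, z => 8, y => 1);

my $pruned = merge_profiles(1, 1, \%sub_a, \%sub_b);  ## fudge=1: local top-1 only
my $full   = merge_profiles(1, 0, \%sub_a, \%sub_b);  ## fudge=0: no local pruning

print "fudge=1: ", join(",", sort keys %$pruned), "\n";
print "fudge=0: ", join(",", sort keys %$full), "\n";
```

With these toy counts, the item z has the highest combined count but is not the local best in either sub-profile, so it survives only when local pruning is disabled. A larger fudge coefficient makes such losses less likely at the cost of transferring more items.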
List URLs
List URLs passed as the url option to the constructor can be either ARRAY-refs of sub-URLs or simple strings with an optional list:// scheme. In the latter case, sub-URLs in the argument string are separated by whitespace or by a plus character ("+") followed by the sub-URL scheme, e.g.:
    ["file://a","file://b"]      ##-- ARRAY-ref of explicit file URLs
    ["a" , "b" ]                 ##-- ARRAY-ref of implicit file URLs
    "list://file://a file://b"   ##-- string with space-separated explicit file URLs
    "list://a b"                 ##-- string with space-separated implicit file URLs
    "list://file://a+file://b"   ##-- list with "+"-separated explicit file URLs
    "list://a+://b"              ##-- list with "+"-separated implicit file URLs
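The separator rules for string-valued list URLs can be approximated by the following sketch. The helper split_list_url() is hypothetical and is not the module's actual parser; it only assumes the rules stated above (an optional list:// prefix, and sub-URLs separated by whitespace or by "+" followed by "SCHEME://" or "://").

```perl
use strict;
use warnings;

## Hypothetical helper (not the module's actual parser): split a list-URL
## string into sub-URLs following the separator rules described above.
sub split_list_url {
  my $url = shift;
  $url =~ s{^list://}{};   ## strip the optional list:// scheme
  ## separators: whitespace, or "+" directly followed by "SCHEME://" or "://"
  my @urls = grep { $_ ne '' }
    split(m{\s+|\+(?=[A-Za-z][A-Za-z0-9.+-]*://|://)}, $url);
  s{^://}{} foreach @urls; ## "+://b" marks an implicit file URL "b"
  return @urls;
}

foreach my $url ("list://file://a file://b", "list://a b",
                 "list://file://a+file://b", "list://a+://b") {
  print "$url => ", join(" | ", split_list_url($url)), "\n";
}
```

Requiring a scheme marker after "+" means that a plus character inside a sub-URL path is not treated as a separator in this sketch.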
Options can be passed to the appropriate sub-URLs via those URLs' query strings, as described in "open" in DiaColloDB::Client. Options to the DiaColloDB::Client::list object itself can be passed in by using a sub-URL consisting of a HASH-ref or only a query string, e.g.:
    ["a","b",{fudge=>0}]          ##-- ARRAY-ref with local options as HASH-ref
    ["a","b","?fudge=0"]          ##-- ARRAY-ref with local options as query-string
    "list://a b ?fudge=0"         ##-- space-separated string with local options
    "list://a+://b+://?fudge=0"   ##-- "+"-separated string with local options
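As a rough illustration of how local options might be separated from sub-URLs in the ARRAY-ref forms, consider the following sketch. The helper extract_local_opts() is hypothetical and does not reflect the module's actual implementation; in particular, it performs no URL-decoding of query-string values.

```perl
use strict;
use warnings;

## Hypothetical helper (not the module's actual code): separate list-client
## options from sub-URLs in an ARRAY-ref of the forms shown above.
sub extract_local_opts {
  my @items = @{ shift() };
  my (@urls, %opts);
  foreach my $item (@items) {
    if (ref($item) eq 'HASH') {            ## local options as HASH-ref
      %opts = (%opts, %$item);
    }
    elsif ($item =~ m{^\?(.*)$}) {         ## local options as bare query string
      foreach my $pair (split(/[&;]/, $1)) {
        my ($key, $val) = split(/=/, $pair, 2);
        $opts{$key} = $val if defined $key && $key ne '';
      }
    }
    else {
      push(@urls, $item);                  ## ordinary sub-URL
    }
  }
  return (\@urls, \%opts);
}

my ($urls, $opts) = extract_local_opts(["a", "b", "?fudge=0"]);
print "urls=@$urls fudge=$opts->{fudge}\n";
```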
KNOWN BUGS
Incorrect Independent Collocate Frequencies
The evaluation strategy currently used by this package is not strictly correct, even when $fudge==0. Although the reported joint frequencies f12 ought to be correct in this case, it can easily happen that the independent collocate frequencies f2 get mis-reported, leading to incorrect computations of f2-sensitive association scores such as mi (pointwise mutual information * log-frequency product), ll (log likelihood), or the default ld (log Dice). Such errors occur whenever the list client accesses multiple sub-clients (e.g. $a and $b) and a candidate collocate $v occurs in both of the subcorpora, but only occurs together with the target term $w in one of the sub-clients' indices.
Suppose $v occurs in subcorpus $a with frequency f_a($v) and in subcorpus $b with frequency f_b($v), but only occurs together with $w in subcorpus $a, with frequency f_a($w,$v); i.e. f_b($w,$v)==0. Since only collocates with nonzero co-occurrence frequencies are collected in subcorpus profiles, the sub-profile for $w over subcorpus $b will not contain an entry for $v at all. This is fine if we are only interested in the total co-occurrence frequency f($w,$v) = f_a($w,$v) + f_b($w,$v). If we are using an "interesting" association score, however, we also need the total independent collocate frequency f($v) = f_a($v) + f_b($v). Since f_b($v) will not have been reported by the sub-profile for corpus $b, its value is treated as 0 (zero), so f($v) is underestimated and the resulting association score is incorrect.
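A small numeric sketch shows the size of the effect. It uses the commonly cited logDice formula, 14 + log2(2*f12/(f1+f2)), purely for illustration; DiaColloDB's own ld computation may differ in detail. The frequencies are invented, the helper log_dice() is hypothetical, and the target frequency f1 is assumed to be summed correctly.

```perl
use strict;
use warnings;

## Commonly cited logDice formula: 14 + log2( 2*f12 / (f1 + f2) ).
## Assumption: illustration only; DiaColloDB's 'ld' may be computed differently.
sub log_dice {
  my ($f1, $f2, $f12) = @_;
  return 14 + log(2 * $f12 / ($f1 + $f2)) / log(2);
}

## Invented toy frequencies for target $w (f1) and collocate $v (f2).
my %sub_a = (f1 => 100, f2 => 50,  f12 => 10);  ## $v co-occurs with $w here
my %sub_b = (f1 => 100, f2 => 950, f12 => 0);   ## $v occurs, but never with $w

my $f1  = $sub_a{f1}  + $sub_b{f1};
my $f12 = $sub_a{f12} + $sub_b{f12};

my $f2_true     = $sub_a{f2} + $sub_b{f2};  ## correct total f($v)
my $f2_reported = $sub_a{f2} + 0;           ## sub-profile for $b has no entry for $v

printf("ld with true f2     (%4d): %.3f\n", $f2_true,     log_dice($f1, $f2_true,     $f12));
printf("ld with reported f2 (%4d): %.3f\n", $f2_reported, log_dice($f1, $f2_reported, $f12));
```

Because the missing f_b($v) shrinks the denominator, the score computed from the reported f2 is higher than the score computed from the true f2, so the collocate appears more strongly associated than it is.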
An adequate solution to this problem will probably require extending the DiaColloDB::Client and DiaColloDB::Relation APIs with method(s) for acquiring correct independent collocate frequencies on a relation-dependent basis given a set of candidate collocates (e.g. in the form of a partial profile), and will necessarily involve an additional round-trip for each subcorpus to ensure correct f2 values in list-client profiles. Until these issues are addressed, it is recommended that you avoid using list-clients together with f2-sensitive association scores. In the meantime, you can use the DiaColloDB::union() method via the -union option to dcdb-create.perl to merge multiple local DiaCollo index directories into a single monolithic index.
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2016 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.