NAME

DiaColloDB::Client::list - diachronic collocation db: client: distributed

DESCRIPTION

DiaColloDB::Client::list is a subclass of DiaColloDB::Client for accessing a set of distributed DiaColloDB databases via a list:// URL whose path part is a space- or "+"-separated list of sub-URLs supported by DiaColloDB::Client. It supports the DiaColloDB::Client API by calling the relevant methods on each of its sub-clients.

new() options and object structure:

##-- DiaColloDB::Client: options
url  => $url,       ##-- list url (sub-urls, separated by whitespace, "+SCHEME://", or "+://")
##
##-- DiaColloDB::Client::list
urls  => \@urls,     ##-- sub-urls
opts  => \%opts,     ##-- sub-client options
fudge => $fudge,     ##-- get ($fudge*$kbest) items from sub-clients (0:all; default=10)
logFudge => $level,  ##-- log-level for fudge-coefficient debugging (default='debug')
##
##-- guts
clis => \@clis,      ##-- per-url clients

The most important client parameter is the fudge-coefficient option fudge=>$fudge, which requests that up to $fudge*$kbest items be retrieved from sub-clients for each profile() call. If $fudge <= 0, all collocates are retrieved from each sub-client, and trimming is performed exclusively by the superordinate DiaColloDB::Client::list object. The default value of 10 should return reasonable results without too large a performance penalty in most cases, but be aware that the results may not be strictly correct due to sub-client local pruning; see "KNOWN BUGS" for details.
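
The effect of local pruning can be illustrated with a toy simulation. The following is only a sketch, not the module's code: the counts are invented, the helper names trim() and list_profile() are hypothetical, the raw co-occurrence frequency stands in for the association score, and the limit of $fudge*$kbest is assumed to apply to each sub-client separately.

```perl
use strict;
use warnings;

# Hypothetical per-subcorpus co-occurrence counts (illustrative data only).
my %sub_a = (x => 9, z => 8);
my %sub_b = (q => 9, z => 8);

# Keep the $n highest-valued keys of %$h (all keys if $n <= 0).
sub trim {
  my ($h, $n) = @_;
  my @keys = sort { $h->{$b} <=> $h->{$a} || $a cmp $b } keys %$h;
  @keys = @keys[0 .. $n - 1] if $n > 0 && $n < @keys;
  return { map { $_ => $h->{$_} } @keys };
}

# Simulated list query: prune each sub-profile locally, sum, then trim globally.
sub list_profile {
  my ($kbest, $fudge, @subs) = @_;
  my %sum;
  for my $sub (@subs) {
    my $local = trim($sub, $fudge * $kbest);
    $sum{$_} += $local->{$_} for keys %$local;
  }
  return trim(\%sum, $kbest);
}

for my $fudge (1, 2, 0) {
  my $best = list_profile(1, $fudge, \%sub_a, \%sub_b);
  my ($item) = keys %$best;
  print "fudge=$fudge: best=$item ($best->{$item})\n";
}
```

With kbest=1 and fudge=1, each simulated sub-client returns only its local best item (x and q respectively), so the globally best item z (8+8=16) is never seen; with fudge=2 or fudge=0 it is recovered.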

List URLs

List URLs passed as the url option to the constructor can be either ARRAY-refs of sub-URLs or simple strings with an optional list:// scheme. In the latter case, sub-URLs in the argument string are separated by whitespace or by a plus character ("+") followed by the sub-URL scheme, e.g.:

["file://a","file://b"]        ##-- ARRAY-ref of explicit file URLs
["a"       , "b"      ]        ##-- ARRAY-ref of implicit file URLs

"list://file://a file://b"     ##-- string with space-separated explicit file URLs
"list://a b"                   ##-- string with space-separated implicit file URLs

"list://file://a+file://b"     ##-- list with "+"-separated explicit file URLs
"list://a+://b"                ##-- list with "+"-separated implicit file URLs

Options can be passed to the appropriate sub-URLs via those URLs' query strings, as described in "open" in DiaColloDB::Client. Options to the DiaColloDB::Client::list object itself can be passed in by using a sub-URL consisting of a HASH-ref or only a query string, e.g.:

["a","b",{fudge=>0}]           ##-- ARRAY-ref with local options as HASH-ref
["a","b","?fudge=0"]           ##-- ARRAY-ref with local options as query-string

"list://a b ?fudge=0"          ##-- space-separated string with local options
"list://a+://b+://?fudge=0"    ##-- "+"-separated string with local options

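The string forms above can be decomposed mechanically. The following sketch is not the module's parser; it is a hypothetical helper, split_list_url(), written only to be consistent with the examples in this section. It assumes that "+" before an explicit scheme is dropped while the scheme is kept, that "+://" is dropped entirely, and that query-only items hold simple key=value pairs without URL-encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list-URL string into sub-URLs and local options.
sub split_list_url {
  my ($url) = @_;
  (my $path = $url) =~ s{^list://}{};     # the list:// scheme is optional
  my @parts = grep { length } split m{
      \s+                                 # whitespace
    | \+ (?= [A-Za-z][\w.\-]* :// )       # plus sign before an explicit scheme
    | \+ ://                              # plus sign and empty scheme
  }x, $path;

  my (@urls, %opts);
  for my $part (@parts) {
    if ($part =~ /^\?(.*)$/) {
      # query-only item: options for the list client itself
      for my $pair (split /[&;]/, $1) {
        my ($k, $v) = split /=/, $pair, 2;
        $opts{$k} = $v;
      }
    } else {
      push @urls, $part;
    }
  }
  return (\@urls, \%opts);
}

for my $url ("list://file://a+file://b", "list://a b ?fudge=0", "list://a+://b+://?fudge=0") {
  my ($urls, $opts) = split_list_url($url);
  print "$url => urls=(@$urls) opts=(", join(",", map { "$_=$opts->{$_}" } sort keys %$opts), ")\n";
}
```

Under these assumptions, the second and third strings both yield the sub-URLs a and b together with the local option fudge=0.
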
KNOWN BUGS

Incorrect Independent Collocate Frequencies

The evaluation strategy currently used by this package is not strictly correct, even when $fudge==0. Although the reported joint frequencies f12 ought to be correct in this case, it can easily happen that the independent collocate frequencies f2 are misreported, leading to incorrect computations of f2-sensitive association scores such as mi (pointwise mutual information * log-frequency product), ll (log likelihood), or the default ld (log Dice). Such errors occur whenever the list client accesses multiple sub-clients (e.g. $a and $b) and a candidate collocate $v occurs in both of the subcorpora, but only occurs together with the target term $w in one of the sub-clients' indices.

Suppose $v occurs in subcorpus $a with frequency f_a($v) and in subcorpus $b with frequency f_b($v), but co-occurs with $w only in subcorpus $a, with frequency f_a($w,$v), i.e. f_b($w,$v)==0. Since only collocates with nonzero co-occurrence frequencies are collected in subcorpus profiles, the sub-profile for $w over subcorpus $b will not contain an entry for $v at all. This is fine if we are only interested in the total co-occurrence frequency f($w,$v) = f_a($w,$v) + f_b($w,$v). If we are using an "interesting" association score, however, we also need the total independent collocate frequency f($v) = f_a($v) + f_b($v). Since f_b($v) is not reported by the sub-profile for corpus $b, its value is treated as 0 (zero), leading to an incorrect estimate of the association score.
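
A small worked example may make the size of the error concrete. The sketch below uses invented counts and assumes the commonly cited log Dice formula 14 + log2(2*f12/(f1+f2)); the scoring actually applied by DiaColloDB may differ in detail (e.g. smoothing), and the function name log_dice() is hypothetical.

```perl
use strict;
use warnings;

# Assumed log Dice formula: 14 + log2( 2*f12 / (f1+f2) )
sub log_dice {
  my ($f1, $f2, $f12) = @_;
  return 14 + log(2 * $f12 / ($f1 + $f2)) / log(2);
}

# Hypothetical counts for target $w and collocate $v in subcorpora a and b.
my $f1  = 16;                       # f($w), taken as given to keep the example small
my %f2  = (a => 16, b => 32);       # f_a($v), f_b($v)
my %f12 = (a => 8,  b => 0);        # f_a($w,$v), f_b($w,$v)

my $f12_total = $f12{a} + $f12{b};  # correct, since f_b($w,$v) really is 0
my $f2_true   = $f2{a} + $f2{b};    # what the score should use
my $f2_seen   = $f2{a};             # b's sub-profile has no entry for $v

my $ld_true = log_dice($f1, $f2_true, $f12_total);
my $ld_seen = log_dice($f1, $f2_seen, $f12_total);
printf "ld with true f2=%d: %.2f\n", $f2_true, $ld_true;
printf "ld with seen f2=%d: %.2f\n", $f2_seen, $ld_seen;
```

With these counts the score computed from the incomplete f2 is 13, one full point above the value of 12 obtained from the true f2, so $v would be ranked as a stronger collocate than it is.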

An adequate solution to this problem will probably require extending the DiaColloDB::Client and DiaColloDB::Relation APIs with method(s) for acquiring correct independent collocate frequencies on a relation-dependent basis given a set of candidate collocates (e.g. in the form of a partial profile), and will necessarily involve an additional round-trip for each subcorpus to ensure correct f2 values in list-client profiles. Until these issues are addressed, it is recommended that you avoid using list-clients together with f2-sensitive association scores. In the meantime, you can use the DiaColloDB::union() method via the -union option to dcdb-create.perl to merge multiple local DiaCollo index directories into a single monolithic index.

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

SEE ALSO

DiaColloDB::Client(3pm), DiaColloDB(3pm), perl(1), ...