##========================================================================
## POD DOCUMENTATION, auto-generated by podextract.perl

##========================================================================
## NAME

=pod

=head1 NAME

DiaColloDB::Client::list - diachronic collocation db: client: distributed

=cut

##========================================================================
## DESCRIPTION

=pod

=head1 DESCRIPTION

DiaColloDB::Client::list is a subclass of L<DiaColloDB::Client|DiaColloDB::Client>
for accessing a set of distributed L<DiaColloDB|DiaColloDB> databases via a
C<list://> URL whose path part is a space- or colon-separated list of sub-URLs
supported by L<DiaColloDB::Client|DiaColloDB::Client>.
It supports the L<DiaColloDB::Client|DiaColloDB::Client> API by calling the
relevant methods on each of its sub-clients.

new() options and object structure:

 ##-- DiaColloDB::Client: options
 url      => $url,    ##-- list url (sub-urls, separated by whitespace, "+SCHEME://", or "+://")
 ##
 ##-- DiaColloDB::Client::list
 urls     => \@urls,  ##-- sub-urls
 opts     => \%opts,  ##-- sub-client options
 fudge    => $fudge,  ##-- get ($fudge*$kbest) items from sub-clients (0:all; default=10)
 logFudge => $level,  ##-- log-level for fudge-coefficient debugging (default='debug')
 ##
 ##-- guts
 clis     => \@clis,  ##-- per-url clients

The most important client parameter is the fudge-coefficient option
C<fudge=E<gt>$fudge>, which requests that up to C<$fudge*$kbest> items be
retrieved from sub-clients for each L<profile()|profile> call.
If C<$fudge E<lt>= 0>, all collocates will be retrieved from each sub-client,
and trimming will be performed exclusively by the superordinate
DiaColloDB::Client::list object.
The default value of 10 should return reasonable results without too large a
performance penalty in most cases, but be aware that the results may not be
strictly correct due to sub-client local pruning;
see L</KNOWN BUGS> for details.
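
The effect of the fudge coefficient can be illustrated with a minimal pure-Perl
sketch. It does not use the DiaColloDB API: the helpers C<sub_profile()> and
C<merge_kbest()> and the toy frequency tables are hypothetical, and raw
co-occurrence frequency stands in for a real association score. Each simulated
sub-client returns only its C<$fudge*$kbest> best local items; the list client
sums whatever it receives and trims the result to C<$kbest>.

```perl
use strict;
use warnings;

## hypothetical toy data: collocate => co-occurrence frequency, per sub-client
my @subs = (
  { apple => 9, pear => 8, plum => 7 },
  { fig   => 9, kiwi => 8, plum => 7 },
);

## sub_profile(\%freqs, $limit): locally pruned sub-profile ($limit <= 0: keep all)
sub sub_profile {
  my ($freqs, $limit) = @_;
  my @keys = sort { $freqs->{$b} <=> $freqs->{$a} || $a cmp $b } keys %$freqs;
  splice(@keys, $limit) if $limit > 0 && @keys > $limit;
  return { map { ($_ => $freqs->{$_}) } @keys };
}

## merge_kbest(\@subs, $kbest, $fudge): sum pruned sub-profiles, trim to $kbest
sub merge_kbest {
  my ($subs, $kbest, $fudge) = @_;
  my %sum;
  foreach my $freqs (@$subs) {
    my $prf = sub_profile($freqs, $fudge * $kbest);
    $sum{$_} += $prf->{$_} foreach keys %$prf;
  }
  my @best = sort { $sum{$b} <=> $sum{$a} || $a cmp $b } keys %sum;
  splice(@best, $kbest) if @best > $kbest;
  return [ map { "$_=$sum{$_}" } @best ];
}

print "fudge=1: ", join(" ", @{ merge_kbest(\@subs, 2, 1) }), "\n";
print "fudge=0: ", join(" ", @{ merge_kbest(\@subs, 2, 0) }), "\n";
```

In this toy data the collocate C<plum> is only third-best in each sub-client
but best overall, so it is lost with C<fudge=1> and recovered with C<fudge=0>
(or any fudge value large enough to cover all local items).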

=head2 List URLs

List URLs passed as the C<url> option to the constructor can be either
ARRAY-refs of sub-URLs or simple strings with an optional C<list://> scheme.
In the latter case, sub-URLs in the argument string are separated by
whitespace or by a plus character ("+") followed by the sub-URL scheme, e.g.:

 ["file://a","file://b"]       ##-- ARRAY-ref of explicit file URLs
 ["a"       , "b"      ]       ##-- ARRAY-ref of implicit file URLs
 "list://file://a file://b"    ##-- string with space-separated explicit file URLs
 "list://a b"                  ##-- string with space-separated implicit file URLs
 "list://file://a+file://b"    ##-- list with "+"-separated explicit file URLs
 "list://a+://b"               ##-- list with "+"-separated implicit file URLs

Options can be passed to the appropriate sub-URLs via those URLs' query
strings, as described in L<DiaColloDB::Client/open>.
Options to the DiaColloDB::Client::list object itself can be passed in by
using a sub-URL consisting of a HASH-ref or only a query string, e.g.:

 ["a","b",{fudge=>0}]          ##-- ARRAY-ref with local options as HASH-ref
 ["a","b","?fudge=0"]          ##-- ARRAY-ref with local options as query-string
 "list://a b ?fudge=0"         ##-- space-separated string with local options
 "list://a+://b+://?fudge=0"   ##-- "+"-separated string with local options

=cut

##======================================================================
## Footer
##======================================================================

=pod

=head1 KNOWN BUGS

=head2 Incorrect Independent Collocate Frequencies

The evaluation strategy currently used by this package is not strictly
correct, even when C<$fudge==0>.
Although the reported joint frequencies I<f12> ought to be correct in this
case, it can easily happen that the independent collocate frequencies I<f2>
get mis-reported, leading to incorrect computations of I<f2>-sensitive
association scores such as C<mi> (pointwise mutual information * log-frequency
product), C<ll> (log likelihood), or the default C<ld> (log Dice).

Such errors occur whenever the list client accesses multiple sub-clients
(e.g. C<$a> and C<$b>) and a candidate collocate C<$v> occurs in both of the
subcorpora, but only occurs together with the target term C<$w> in one of the
sub-clients' indices.
Suppose C<$v> occurs in subcorpus C<$a> with frequency C<f_a($v)> and in
subcorpus C<$b> with frequency C<f_b($v)>, but only occurs together with
C<$w> in subcorpus C<$a> with frequency C<f_a($w,$v)> -- i.e.
C<f_b($w,$v)==0>.
Since only collocates with nonzero co-occurrence frequencies are collected in
subcorpus profiles, the sub-profile for C<$w> over subcorpus C<$b> will not
contain an entry for C<$v> at all.
This is fine if we are only interested in the total co-occurrence frequency
C<f($w,$v) = f_a($w,$v) + f_b($w,$v)>.
If we are using an "interesting" association score, however, we also need the
total independent collocate frequency C<f($v) = f_a($v) + f_b($v)>.
Since C<f_b($v)> will not have been reported by the sub-profile for corpus
C<$b>, its value will be treated as 0 (zero), leading to an incorrect
estimate of the association score.

An adequate solution to this problem will probably require extending the
L<DiaColloDB::Client|DiaColloDB::Client> and
L<DiaColloDB::Relation|DiaColloDB::Relation> APIs with method(s) for
acquiring correct independent collocate frequencies on a relation-dependent
basis given a set of candidate collocates (e.g. in the form of a partial
profile), and will necessarily involve an additional round-trip for each
subcorpus to ensure correct I<f2> values in list-client profiles.

Until these issues are addressed, it is recommended that you avoid using
list-clients together with I<f2>-sensitive association scores.
In the meantime, you can use the L<DiaColloDB::union()|DiaColloDB/union>
method via the L<-union|dcdb-create.perl/union> option to
L<dcdb-create.perl|dcdb-create.perl> to merge multiple local DiaCollo index
directories into a single monolithic index.
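
The mis-reported I<f2> scenario can be reproduced with a small self-contained
sketch. All names and counts here are hypothetical, and C<log_dice()> assumes
the commonly cited log Dice definition C<14 + log2(2*f12 / (f1+f2))>, which may
differ in detail (e.g. smoothing) from the score implemented by DiaColloDB.
Summing I<f1> over both subcorpora is likewise an assumption of the sketch.

```perl
use strict;
use warnings;

## hypothetical per-subcorpus counts for target w and candidate collocate v
my %corpus = (
  a => { f1 => 100, f2 => { v => 50  }, f12 => { v => 20 } },
  b => { f1 => 100, f2 => { v => 950 }, f12 => {} },   ##-- f_b(w,v)==0
);

## log_dice($f1,$f2,$f12): assumed textbook definition, not necessarily DiaColloDB's
sub log_dice {
  my ($f1, $f2, $f12) = @_;
  return 14 + log(2 * $f12 / ($f1 + $f2)) / log(2);
}

## merged($all_f2): sum counts over subcorpora; unless $all_f2 is true,
## f2 is only counted where the sub-profile contains v (i.e. f12 > 0)
sub merged {
  my ($all_f2) = @_;
  my ($f1, $f2, $f12) = (0, 0, 0);
  foreach my $c (values %corpus) {
    my $seen = exists $c->{f12}{v};
    $f1  += $c->{f1};
    $f12 += $c->{f12}{v} if $seen;
    $f2  += $c->{f2}{v}  if $seen || $all_f2;
  }
  return ($f1, $f2, $f12);
}

my @reported = merged(0);   ##-- f2 summed from sub-profiles only: f_b(v) is dropped
my @correct  = merged(1);   ##-- f2 summed over both subcorpora, as in a merged index
printf "reported: f2=%d ld=%.2f\n", $reported[1], log_dice(@reported);
printf "correct:  f2=%d ld=%.2f\n", $correct[1],  log_dice(@correct);
```

With these counts the merged I<f12> is identical in both cases, but the
reported I<f2> omits C<f_b($v)>, so the sketch's log Dice score for C<$v> comes
out higher than it would over a single merged index.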

=cut

##======================================================================
## Footer
##======================================================================

=pod

=head1 AUTHOR

Bryan Jurish E<lt>moocow@cpan.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.14.2 or,
at your option, any later version of Perl 5 you may have available.

=head1 SEE ALSO

L<DiaColloDB::Client(3pm)|DiaColloDB::Client>,
L<DiaColloDB(3pm)|DiaColloDB>,
L<perl(1)|perl>,
...

=cut