NAME
DiaColloDB::Profile - diachronic collocation db, (co-)frequency profile
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Profile;
##========================================================================
## Constructors etc.
$prf = CLASS_OR_OBJECT->new(%args);
$prf2 = $prf->clone();
$prf2 = $prf->shadow();
##========================================================================
## Basic Access
$label = $prf->label();
\@titles_or_undef = $prf->titles();
@keys = $prf->scoreKeys();
$bool = $prf->empty();
##========================================================================
## I/O: JSON
*TO_JSON = \&TO_JSON__table;
##========================================================================
## I/O: Text
undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);
$bool = $prf->saveTextFh($fh, %opts);
##========================================================================
## I/O: HTML
$bool = $prf->saveHtmlFile($filename_or_handle, %opts);
##========================================================================
## Compilation
$prf = $prf->compile($func,%opts);
$prf = $prf->uncompile();
$prf = $prf->compile_f();
$prf = $prf->compile_lf();
$prf = $prf->compile_lfm();
$prf = $prf->compile_fm();
$prf = $prf->compile_mi(%opts);
$prf = $prf->compile_mi3(%opts);
$prf = $prf->compile_ld(%opts);
$prf = $prf->compile_ll(%opts);
##========================================================================
## Trimming
\@keys = $prf->which(%opts);
$prf = $prf->trim(%opts);
##========================================================================
## Stringification
$i2s = $prf->stringify_map( $obj);
$prf = $prf->stringify( $obj);
##========================================================================
## Algebraic operations
$prf = $prf->_add($prf2,%opts);
$prf3 = $prf1->add($prf2,%opts);
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);
$diff = $prf1->diff($prf2,%opts);
DESCRIPTION
DiaColloDB::Profile is a class for representing low-level collocate frequency profile data for a single date-slice as retrieved e.g. from a native index or DDC back-end. It includes methods for compiling profile scores via several score functions (e.g. frequency, pointwise mi * log-frequency, log Dice), k-best trimming, stringification, basic algebraic manipulation, and serialization (text, HTML, or JSON).
Globals & Constants
- Variable: @ISA
-
DiaColloDB::Profile inherits from DiaColloDB::Persistent.
Constructors etc.
- new
-
$prf = CLASS_OR_OBJECT->new(%args);
%args, object structure:
label  => $label,    ##-- string label (used by Multi; undef for none (default))
N      => $N,        ##-- total marginal relation frequency
f1     => $f1,       ##-- total marginal frequency of target word(s)
f2     => \%f2,      ##-- total marginal frequency of collocates: ($i2=>$f2, ...)
f12    => \%f12,     ##-- collocation frequencies, %f12 = ($i2=>$f12, ...)
titles => \@titles,  ##-- item group titles (default: undef: unknown)
##
eps    => $eps,      ##-- smoothing constant (default=0.5)
score  => $func,     ##-- selected scoring function ('f12', 'mi', or 'ld')
mi     => \%mi12,    ##-- score: mutual information * logFreq a la Wortprofil; requires compile_mi()
mi3    => \%mi312,   ##-- score: mutual information^3 a la Rychlý (2008); requires compile_mi3()
ld     => \%ld12,    ##-- score: log-dice a la Wortprofil; requires compile_ld()
ll     => \%ll12,    ##-- score: 1-sided log-likelihood a la Evert (2008); requires compile_ll()
fm     => \%fm12,    ##-- frequency per million score; requires compile_fm()
lf     => \%lf12,    ##-- log-frequency; requires compile_lf()
lfm    => \%lfm12,   ##-- log-frequency per million; requires compile_lfm()
- clone
-
$prf2 = $prf->clone();
$prf2 = $prf->clone($keep_compiled);
clones the profile $prf. if $keep_compiled is true, compiled data is cloned too.
- shadow
-
$prf2 = $prf->shadow();
$prf2 = $prf->shadow($keep_compiled);
shadows %$prf. if $keep_compiled is true, compiled data is shadowed too (all zeroes).
Basic Access
- label
-
$label = $prf->label();
get profile label
- titles
-
\@titles_or_undef = $prf->titles();
get item titles
- scoreKeys
-
@keys = $prf->scoreKeys();
returns known score function keys
- empty
-
$bool = $prf->empty();
returns true iff profile is empty
I/O: JSON
- TO_JSON__table
-
$thingy = $obj->TO_JSON__table()
test alternative JSON format (small but slow).
- TO_JSON__flat
-
$thingy = $obj->TO_JSON__flat()
test alternative JSON format (small but slow).
I/O: Text
See also DiaColloDB::Persistent.
- saveTextHeader
-
undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);
prints column titles for text output.
- saveTextFh
-
$bool = $prf->saveTextFh($fh, %opts);
save flat TAB-separated text, format:
N F1 F2 F12 SCORE LABEL ITEM2...
%opts:
label  => $label,   ##-- override $prf->{label} (used by Profile::Multi), no tab-separators required
format => $fmt,     ##-- printf format for scores (default="%f")
header => $bool,    ##-- include header-row? (default=1)
hlabel => $hlabel,  ##-- prefix header item-cells with $hlabel (used by Profile::Multi)
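The row layout above can be illustrated without the module. The following is a minimal sketch, assuming a plain hash with the documented N, f1, f2, f12 and score fields; the helper name profile_rows(), the descending-score row order, and the omission of the LABEL column when no label is set are assumptions for illustration, not the behavior of saveTextFh() itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DiaColloDB): builds one TAB-separated row
# per collocate in the column order N, F1, F2, F12, SCORE, [LABEL], ITEM2.
sub profile_rows {
  my ($prf, %opts) = @_;
  my $fmt   = $opts{format} // '%f';           # printf format for scores
  my $label = $opts{label}  // $prf->{label};  # option overrides $prf->{label}
  my $score = $prf->{ $prf->{score} // 'f12' };
  my @rows;
  # assumed ordering: descending score, ties broken by item key
  foreach my $i2 (sort { $score->{$b} <=> $score->{$a} || $a cmp $b } keys %$score) {
    push @rows, join("\t",
      $prf->{N}, $prf->{f1}, $prf->{f2}{$i2}, $prf->{f12}{$i2},
      sprintf($fmt, $score->{$i2}),
      (defined $label ? $label : ()),   # assumed: LABEL column only if a label is set
      $i2);
  }
  return @rows;
}

my $prf = {
  N => 1000, f1 => 50, score => 'f12',
  f2  => { dog => 20, cat => 10 },
  f12 => { dog => 5,  cat => 2  },
};
print "$_\n" for profile_rows($prf, format => '%.1f');
```

Passing label => 'x' adds an extra cell between the score and the item, which is how a multi-slice caller could mark the rows of each date-slice.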
I/O: HTML
- saveHtmlFile
-
$bool = $prf->saveHtmlFile($filename_or_handle, %opts);
Save flat HTML table data with rows of the form
N F1 F2 F12 SCORE PREFIX? ITEM2...
%opts:
table  => $bool,    ##-- include <table>..</table> ? (default=1)
body   => $bool,    ##-- include <html><body>..</body></html> ? (default=1)
header => $bool,    ##-- include header-row? (default=1)
hlabel => $hlabel,  ##-- prefix header item-cells with $hlabel (used by Profile::Multi), no '<th>..</th>' required
label  => $label,   ##-- prefix item-cells with $label (used by Profile::Multi), no '<td>..</td>' required
format => $fmt,     ##-- printf score formatting (default="%.4f")
Compilation
- compile
-
$prf = $prf->compile($func,%opts);
compile for score-function $func, one of qw(f fm mi ld); default='f'
- uncompile
-
$prf = $prf->uncompile();
un-compiles all scores for $prf
- compile_f
-
$prf = $prf->compile_f();
just sets $prf->{score} = 'f12'
- compile_lf
-
$prf = $prf->compile_lf();
computes log-frequency profile in $prf->{lf}; sets $prf->{score}='lf'.
- compile_fm
-
$prf = $prf->compile_fm();
computes frequency-per-million in $prf->{fm}; sets $prf->{score}='fm'.
- compile_lfm
-
$prf = $prf->compile_lfm(%opts);
computes log-frequency-per-million in $prf->{lfm}; sets $prf->{score}='lfm'.
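The three frequency-based scores can be sketched as follows. This is only an illustration: the helper name freq_scores(), the normalization by N, the base-2 logarithm, and the point at which eps is added are all assumptions, since the text above does not spell out the formulas.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

# Hypothetical helper: derives fm, lf and lfm hashes from a co-frequency hash.
# Assumed definitions (not confirmed by the documentation):
#   fm  = 1e6 * f12 / N
#   lf  = log2(f12 + eps)
#   lfm = log2(fm + eps)
sub freq_scores {
  my ($f12, $N, $eps) = @_;
  my (%fm, %lf, %lfm);
  for my $i2 (keys %$f12) {
    my $f = $f12->{$i2};
    $fm{$i2}  = 1e6 * $f / $N;
    $lf{$i2}  = log2($f + $eps);
    $lfm{$i2} = log2($fm{$i2} + $eps);
  }
  return (\%fm, \%lf, \%lfm);
}

my ($fm, $lf, $lfm) = freq_scores({ dog => 5, cat => 2 }, 1_000_000, 0.5);
printf "%s\tfm=%.1f\tlf=%.3f\tlfm=%.3f\n", $_, $fm->{$_}, $lf->{$_}, $lfm->{$_}
  for sort keys %$fm;
```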
- compile_mi
-
$prf = $prf->compile_mi(%opts);
computes MI*logF-profile in $prf->{mi} a la Wortprofil; sets $prf->{score}='mi'.
%opts:
eps => $eps #-- clobber $prf->{eps}
- compile_mi3
-
$prf = $prf->compile_mi3(%opts);
computes MI^3 profile in $prf->{mi3} a la Rychlý (2008); sets $prf->{score}='mi3'.
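The two MI-based scores can be sketched on raw counts as follows. MI*logF is taken here as pointwise mutual information multiplied by the logarithm of the co-frequency, and MI^3 as log2(f12^3 * N / (f1 * f2)); the function names, the base-2 logarithm for the frequency factor, and the way eps is added to every count are assumptions for illustration, not the module's confirmed implementation.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

# Hypothetical sketch: pointwise MI weighted by log co-frequency.
sub mi_logf {
  my ($f12, $f1, $f2, $N, $eps) = @_;
  $eps //= 0.5;   # default smoothing constant as documented for new()
  my $mi = log2( (($f12 + $eps) * ($N + $eps)) / (($f1 + $eps) * ($f2 + $eps)) );
  return $mi * log2($f12 + $eps);
}

# Hypothetical sketch: MI^3, i.e. MI with the co-frequency cubed.
sub mi3 {
  my ($f12, $f1, $f2, $N, $eps) = @_;
  $eps //= 0.5;
  return log2( (($f12 + $eps)**3 * ($N + $eps)) / (($f1 + $eps) * ($f2 + $eps)) );
}

# unsmoothed example (eps=0): f12=8, f1=16, f2=32, N=1024
printf "mi=%.3f mi3=%.3f\n", mi_logf(8, 16, 32, 1024, 0), mi3(8, 16, 32, 1024, 0);
```

With eps=0 the example reduces to MI = log2(16) = 4 and logF = log2(8) = 3, so the MI*logF value is 12, while MI^3 is log2(1024) = 10.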
- compile_ld
-
$prf = $prf->compile_ld(%opts);
computes log-dice profile in $prf->{ld} a la Rychlý (2008); sets $prf->{score}='ld'.
%opts:
eps => $eps #-- clobber $prf->{eps}
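For orientation, the log-dice score of Rychlý (2008) is 14 + log2(2*f12 / (f1 + f2)). The sketch below applies that formula to raw counts; the function name log_dice() and the way eps is added to each count are assumptions, not necessarily what compile_ld() does internally.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

# Hypothetical sketch of log-dice on raw counts.
sub log_dice {
  my ($f12, $f1, $f2, $eps) = @_;
  $eps //= 0.5;   # assumed: eps is added to every count before scoring
  return 14 + log2( (2 * ($f12 + $eps)) / (($f1 + $eps) + ($f2 + $eps)) );
}

printf "%.3f\n", log_dice(8, 16, 16, 0);    # half of all occurrences co-occur
printf "%.3f\n", log_dice(10, 10, 10, 0);   # items always co-occur: theoretical maximum
```

Because the score depends only on f12, f1 and f2, it does not change with the total corpus size N, which is what makes it convenient for comparing date-slices of different sizes.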
- compile_ll
-
$prf = $prf->compile_ll(%opts);
computes 1-sided log-log-likelihood ratio in $prf->{ll} a la Evert (2008); sets $prf->{score}='ll'.
%opts:
eps => $eps #-- clobber $prf->{eps}
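As background for the log-likelihood score, the following sketch computes the standard one-sided log-likelihood ratio (G2) from a 2x2 contingency table built from f12, f1, f2 and N, negating the value when the pair co-occurs less often than expected. The function name ll_1sided() is hypothetical, and the sketch stops at the signed G2 statistic: any further log scaling or eps smoothing applied by compile_ll() is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical sketch: signed (one-sided) log-likelihood ratio G2.
sub ll_1sided {
  my ($f12, $f1, $f2, $N) = @_;
  # observed cell counts O11, O12, O21, O22
  my @obs = ($f12, $f1 - $f12, $f2 - $f12, $N - $f1 - $f2 + $f12);
  # expected cell counts under independence: row total * column total / N
  my @exp = ($f1 * $f2 / $N,        $f1 * ($N - $f2) / $N,
             ($N - $f1) * $f2 / $N, ($N - $f1) * ($N - $f2) / $N);
  my $g2 = 0;
  for my $k (0 .. 3) {
    # cells with zero observations contribute nothing (0 * log 0 := 0)
    $g2 += $obs[$k] * log($obs[$k] / $exp[$k]) if $obs[$k] > 0;
  }
  $g2 *= 2;
  # one-sided: negative score if the pair is rarer than expected
  return $obs[0] < $exp[0] ? -$g2 : $g2;
}

printf "attraction: %.2f\n", ll_1sided(50, 100, 100, 1000);
printf "repulsion:  %.2f\n", ll_1sided(1,  100, 100, 1000);
```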
Trimming
- which
-
\@keys = $prf->which(%opts);
returns 'good' keys for trimming options %opts:
cutoff => $cutoff,  ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
kbest  => $kbest,   ##-- retain only $kbest items
kbesta => $kbesta,  ##-- retain only $kbesta items (absolute value)
return => $which,   ##-- either 'good' (default) or 'bad'
as     => $as,      ##-- 'hash' or 'array'; default='array'
- trim
-
$prf = $prf->trim(%opts);
trim profile to contain only 'good' keys.
%opts:
kbest  => $kbest,   ##-- retain only $kbest items (by score value)
kbesta => $kbesta,  ##-- retain only $kbesta items (by score absolute value)
cutoff => $cutoff,  ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
keep   => $keep,    ##-- retain keys @$keep (ARRAY) or keys(%$keep) (HASH)
drop   => $drop,    ##-- drop keys @$drop (ARRAY) or keys(%$drop) (HASH)
NOTE: this COULD be factored out into something like $prf->trim($prf->which(%opts)), but it's about 15% faster inline.
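The selection logic described for which() and trim() can be sketched on plain hashes. The helpers which_sketch() and trim_sketch() are hypothetical, and the order in which cutoff and k-best are combined, as well as the tie-breaking by key, are assumptions; the keep, drop, return and as options are left out for brevity.

```perl
use strict;
use warnings;

# Hypothetical sketch: returns the 'good' keys of a score hash as an ARRAY-ref.
sub which_sketch {
  my ($score, %opts) = @_;
  my @keys = sort keys %$score;
  if (defined $opts{cutoff}) {
    @keys = grep { $score->{$_} >= $opts{cutoff} } @keys;
  }
  my $k = $opts{kbest} // $opts{kbesta};
  if (defined $k) {
    # kbest ranks by score value, kbesta by absolute score value
    my $val = defined $opts{kbest} ? sub { $score->{$_[0]} }
                                   : sub { abs($score->{$_[0]}) };
    @keys = sort { $val->($b) <=> $val->($a) || $a cmp $b } @keys;
    splice(@keys, $k) if @keys > $k;
  }
  return \@keys;
}

# Hypothetical sketch: destructively removes all non-'good' keys from the
# frequency hashes and from the selected score hash.
sub trim_sketch {
  my ($prf, %opts) = @_;
  my %good = map { ($_ => 1) } @{ which_sketch($prf->{ $prf->{score} }, %opts) };
  for my $field (grep { ref($prf->{$_}) eq 'HASH' } qw(f2 f12), $prf->{score}) {
    my $h = $prf->{$field};
    delete @$h{ grep { !$good{$_} } keys %$h };
  }
  return $prf;
}

my %ld = (a => 3, b => -5, c => 1, d => 2);
print join(" ", @{ which_sketch(\%ld, kbest  => 2) }), "\n";
print join(" ", @{ which_sketch(\%ld, kbesta => 2) }), "\n";
```

The difference between kbest and kbesta only matters for scores that can be negative, such as a one-sided log-likelihood or the scores of a difference profile.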
Stringification
- stringify_map
-
$i2s = $prf->stringify_map( $obj);
$i2s = $prf->stringify_map(\@key2str);
$i2s = $prf->stringify_map(\&key2str);
$i2s = $prf->stringify_map(\%key2str);
guts for stringify: get a map for stringification
- stringify
-
$prf = $prf->stringify( $obj);
$prf = $prf->stringify(\@key2str);
$prf = $prf->stringify(\&key2str);
$prf = $prf->stringify(\%key2str);
stringifies profile (destructive) via $obj->i2s($i2), $key2str->[$i2], $key2str->($i2), or $key2str->{$i2}.
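The dispatch over the four accepted argument types can be sketched as follows. Both helpers are hypothetical stand-ins working on a plain hash; for brevity only the f2 and f12 hashes are re-keyed here, whereas a full implementation would presumably treat compiled score hashes the same way.

```perl
use strict;
use warnings;

# Hypothetical sketch: builds an id-to-string map from an ARRAY-ref, CODE-ref,
# HASH-ref, or any object providing an i2s() method.
sub stringify_map_sketch {
  my ($keys, $map) = @_;
  my $type = ref($map);
  my %i2s;
  for my $i2 (@$keys) {
    $i2s{$i2} = $type eq 'ARRAY' ? $map->[$i2]
              : $type eq 'CODE'  ? $map->($i2)
              : $type eq 'HASH'  ? $map->{$i2}
              :                    $map->i2s($i2);
  }
  return \%i2s;
}

# Hypothetical sketch: destructively replaces integer keys by strings.
sub stringify_sketch {
  my ($prf, $map) = @_;
  my %seen;
  my @keys = grep { !$seen{$_}++ } map { keys %{ $prf->{$_} } } qw(f2 f12);
  my $i2s  = stringify_map_sketch(\@keys, $map);
  for my $field (qw(f2 f12)) {
    my $old = $prf->{$field};
    $prf->{$field} = { map { ($i2s->{$_} => $old->{$_}) } keys %$old };
  }
  return $prf;
}

my $prf = { f2 => { 0 => 20, 1 => 10 }, f12 => { 0 => 5, 1 => 2 } };
stringify_sketch($prf, [qw(dog cat)]);
print join(", ", map { "$_=$prf->{f12}{$_}" } sort keys %{ $prf->{f12} }), "\n";
```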
Algebraic operations
- _add
-
$prf = $prf->_add($prf2,%opts);
adds $prf2 frequency data to $prf (destructive); implicitly un-compiles $prf.
%opts:
N  => $bool,  ##-- whether to add N values (default:true)
f1 => $bool,  ##-- whether to add f1 values (default:true)
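A destructive addition of this kind can be sketched on plain hashes. The helper add_into() is hypothetical; in particular, modelling "un-compiling" as deleting the score selector and all compiled score hashes is an assumption about what that step involves.

```perl
use strict;
use warnings;

# Hypothetical sketch: adds the frequency data of $pb into $pa (destructive).
sub add_into {
  my ($pa, $pb, %opts) = @_;
  $pa->{N}  += $pb->{N}  if $opts{N}  // 1;   # add N values unless disabled
  $pa->{f1} += $pb->{f1} if $opts{f1} // 1;   # add f1 values unless disabled
  for my $field (qw(f2 f12)) {
    $pa->{$field}{$_} += $pb->{$field}{$_} for keys %{ $pb->{$field} };
  }
  # assumed meaning of "un-compile": scores are stale after adding, so drop them
  delete @$pa{qw(score mi mi3 ld ll fm lf lfm)};
  return $pa;
}

my $pa = { N => 100, f1 => 10, f2 => { x => 4 },         f12 => { x => 2 },
           score => 'ld', ld => { x => 9.1 } };
my $pb = { N => 200, f1 => 20, f2 => { x => 6, y => 3 }, f12 => { x => 1, y => 1 } };
add_into($pa, $pb);
print "N=$pa->{N} f1=$pa->{f1} f12{x}=$pa->{f12}{x} f12{y}=$pa->{f12}{y}\n";
```

Scores have to be recompiled after adding because association measures such as MI or log-dice are not additive: they must be recomputed from the summed frequencies.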
- add
-
$prf3 = $prf1->add($prf2,%opts);
returns sum of $prf1 and $prf2 frequency data as a new profile (unlike _add(), $prf1 is not modified). %opts: as for _add().
- _sum
-
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);
returns a profile representing sum of \@profiles, passing %opts to _add().
if called as a class method and \@profiles contains only 1 element, that element is returned;
otherwise, \@profiles are added to the (new) object.
- sum
-
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);
returns a new profile representing sum of \@profiles; see _sum().
- diff
-
$diff = $prf1->diff($prf2,%opts);
wraps DiaColloDB::Profile::Diff->new($prf1,$prf2,%opts).
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2016 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
REFERENCES
Didakowski, J. and Geyken, A. (2013). "From DWDS corpora to a German Word Profile – methodological problems and solutions". In: Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network "Internet Lexicography". Mannheim: Institut für Deutsche Sprache. (OPAL - Online publizierte Arbeiten zur Linguistik X/2012), S. 43-52. URL http://www.dwds.de/static/website/publications/pdf/didakowski_geyken_internetlexikografie_2012_final.pdf
Evert, S. (2008). "Corpora and collocations." In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58, pages 1212-1248. Mouton de Gruyter, Berlin. URL (extended manuscript): http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf
Kilgarriff, A. and Tugwell, D. (2002). "Sketching words". In M.-H. Corréard (ed.) Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. EURALEX, 125-137. URL http://www.kilgarriff.co.uk/Publications/2002-KilgTugwell-AtkinsFest.pdf
Rychlý, P. (2008). "A lexicographer-friendly association score". In P. Sojka and A. Horák (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing. RASLAN 2008, 6–9. URL http://www.muni.cz/research/publications/937193 , http://www.fi.muni.cz/usr/sojka/download/raslan2008/13.pdf
SEE ALSO
DiaColloDB::Persistent(3pm), DiaColloDB::Profile::Diff(3pm), DiaColloDB::Profile::Multi(3pm), DiaColloDB::Profile::MultiDiff(3pm), DiaColloDB::Relation(3pm), DiaColloDB(3pm), perl(1), ...