NAME
DiaColloDB::Profile - diachronic collocation db, (co-)frequency profile
SYNOPSIS
##========================================================================
## PRELIMINARIES
use DiaColloDB::Profile;
##========================================================================
## Constructors etc.
$prf = CLASS_OR_OBJECT->new(%args);
$prf2 = $prf->clone();
$prf2 = $prf->shadow();
##========================================================================
## Basic Access
$label = $prf->label();
\@titles_or_undef = $prf->titles();
@keys = $prf->scoreKeys();
$bool = $prf->empty();
##========================================================================
## I/O: JSON
*TO_JSON = \&TO_JSON__table;
##========================================================================
## I/O: Text
undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);
$bool = $prf->saveTextFh($fh, %opts);
##========================================================================
## I/O: HTML
$bool = $prf->saveHtmlFile($filename_or_handle, %opts);
##========================================================================
## Compilation
$prf = $prf->compile($func,%opts);
$prf = $prf->uncompile();
$prf = $prf->compile_f();
$prf = $prf->compile_lf();
$prf = $prf->compile_lfm();
$prf = $prf->compile_fm();
$prf = $prf->compile_mi(%opts);
$prf = $prf->compile_mi3(%opts);
$prf = $prf->compile_ld(%opts);
$prf = $prf->compile_ll(%opts);
##========================================================================
## Trimming
\@keys = $prf->which(%opts);
$prf = $prf->trim(%opts);
##========================================================================
## Stringification
$i2s = $prf->stringify_map( $obj);
$prf = $prf->stringify( $obj);
##========================================================================
## Algebraic operations
$prf = $prf->_add($prf2,%opts);
$prf3 = $prf1->add($prf2,%opts);
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);
$diff = $prf1->diff($prf2,%opts);
DESCRIPTION
DiaColloDB::Profile is a class for representing low-level collocate frequency profile data for a single date-slice as retrieved e.g. from a native index or DDC back-end. It includes methods for compiling profile scores via several score functions (e.g. frequency, pointwise mi * log-frequency, log Dice), k-best trimming, stringification, basic algebraic manipulation, and serialization (text, HTML, or JSON).
Globals & Constants
- Variable: @ISA
-
DiaColloDB::Profile inherits from DiaColloDB::Persistent.
Constructors etc.
- new
-
$prf = CLASS_OR_OBJECT->new(%args);
%args, object structure:
label  => $label,   ##-- string label (used by Multi; undef for none (default))
N      => $N,       ##-- total marginal relation frequency
f1     => $f1,      ##-- total marginal frequency of target word(s)
f2     => \%f2,     ##-- total marginal frequency of collocates: ($i2=>$f2, ...)
f12    => \%f12,    ##-- collocation frequencies, %f12 = ($i2=>$f12, ...)
titles => \@titles, ##-- item group titles (default: undef: unknown)
##
eps    => $eps,     ##-- smoothing constant (default=0.5)
score  => $func,    ##-- selected scoring function qw(f fm lf lfm mi mi3 ld ll)
mi     => \%mi12,   ##-- score: mutual information * logFreq a la Wortprofil; requires compile_mi()
mi3    => \%mi312,  ##-- score: mutual information^3 a la Rychlý (2008); requires compile_mi3()
ld     => \%ld12,   ##-- score: log-dice a la Wortprofil; requires compile_ld()
ll     => \%ll12,   ##-- score: 1-sided log-likelihood a la Evert (2008); requires compile_ll()
fm     => \%fm12,   ##-- frequency per million score; requires compile_fm()
lf     => \%lf12,   ##-- log-frequency; requires compile_lf()
lfm    => \%lfm12,  ##-- log-frequency per million; requires compile_lfm()
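To illustrate the raw frequency fields listed above, the following minimal sketch builds them as a plain Perl hash without loading DiaColloDB. The label, the collocate ids (7, 42) and all counts are invented for the example; a real profile would be an object created with new() and would normally be populated by an index or DDC back-end.

```perl
use strict;
use warnings;

## hypothetical raw profile data; label, ids and counts are invented for illustration
my %prf = (
  label => '2000',                   ##-- string label (here: a made-up date-slice)
  N     => 1_000_000,                ##-- total marginal relation frequency
  f1    => 500,                      ##-- total marginal frequency of target word(s)
  f2    => { 7 => 2000, 42 => 50 },  ##-- collocate marginals: ($i2 => $f2, ...)
  f12   => { 7 => 20,   42 => 10 },  ##-- collocation frequencies: ($i2 => $f12, ...)
  eps   => 0.5,                      ##-- smoothing constant
);

## sanity check: every collocate in f12 should also have a marginal frequency in f2
my @missing = grep { !exists $prf{f2}{$_} } keys %{$prf{f12}};
print "collocates without f2: ", scalar(@missing), "\n";
```

Both f2 and f12 are keyed by the same collocate ids, which is what lets the score functions combine them item by item.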
- clone
-
$prf2 = $prf->clone();
$prf2 = $prf->clone($keep_compiled);
clones the profile $prf; if $keep_compiled is true, compiled data is cloned too.
- shadow
-
$prf2 = $prf->shadow();
$prf2 = $prf->shadow($keep_compiled);
shadows %$prf; if $keep_compiled is true, compiled data is shadowed too (all zeroes).
Basic Access
- label
-
$label = $prf->label();
get profile label
- titles
-
\@titles_or_undef = $prf->titles();
get item titles
- scoreKeys
-
@keys = $prf->scoreKeys();
returns known score function keys
- empty
-
$bool = $prf->empty();
returns true iff profile is empty
I/O: JSON
- TO_JSON__table
-
$thingy = $obj->TO_JSON__table()
test alternative JSON format (small but slow).
- TO_JSON__flat
-
$thingy = $obj->TO_JSON__flat()
test alternative JSON format (small but slow).
I/O: Text
See also DiaColloDB::Persistent.
- saveTextHeader
-
undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);
prints column titles for text output.
- saveTextFh
-
$bool = $prf->saveTextFh($fh, %opts);
save flat TAB-separated text, format:
N F1 F2 F12 SCORE LABEL ITEM2...
%opts:
label  => $label,  ##-- override $prf->{label} (used by Profile::Multi), no tab-separators required
format => $fmt,    ##-- printf format for scores (default="%f")
header => $bool,   ##-- include header-row? (default=1)
hlabel => $hlabel, ##-- prefix header item-cells with $hlabel (used by Profile::Multi)
I/O: HTML
- saveHtmlFile
-
$bool = $prf->saveHtmlFile($filename_or_handle, %opts);
Save flat HTML table data with rows of the form
N F1 F2 F12 SCORE PREFIX? ITEM2...
%opts:
table  => $bool,   ##-- include <table>..</table> ? (default=1)
body   => $bool,   ##-- include <html><body>..</body></html> ? (default=1)
header => $bool,   ##-- include header-row? (default=1)
hlabel => $hlabel, ##-- prefix header item-cells with $hlabel (used by Profile::Multi), no '<th>..</th>' required
label  => $label,  ##-- prefix item-cells with $label (used by Profile::Multi), no '<td>..</td>' required
format => $fmt,    ##-- printf score formatting (default="%.4f")
Compilation
- compile
-
$prf = $prf->compile($func,%opts);
compile for score-function $func, one of qw(f fm lf lfm mi mi3 ld ll); default='f'
- uncompile
-
$prf = $prf->uncompile();
un-compiles all scores for $prf
- compile_f
-
$prf = $prf->compile_f();
just sets $prf->{score} = 'f12'
- compile_lf
-
$prf = $prf->compile_lf();
computes log-frequency profile in $prf->{lf}; sets $prf->{score}='lf'.
- compile_fm
-
$prf = $prf->compile_fm();
computes frequency-per-million in $prf->{fm}; sets $prf->{score}='fm'.
- compile_lfm
-
$prf = $prf->compile_lfm(%opts);
computes log-frequency-per-million in $prf->{lfm}; sets $prf->{score}='lfm'.
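As a rough illustration of the frequency-based scores, the sketch below computes frequency per million and a smoothed log-frequency per million from plain hashes. It does not use DiaColloDB; the base-2 logarithm and the way the smoothing constant enters the formula are assumptions made for the example, and the counts are invented.

```perl
use strict;
use warnings;

## log base 2 helper
sub log2 { return log($_[0]) / log(2); }

my $N   = 1_000_000;            ##-- hypothetical total frequency
my %f12 = (7 => 20, 42 => 10);  ##-- hypothetical collocation frequencies
my $eps = 0.5;                  ##-- smoothing constant (ASSUMED to be added before taking logs)

my (%fm, %lfm);
foreach my $i2 (keys %f12) {
  $fm{$i2}  = 1e6 * $f12{$i2} / $N;                 ##-- frequency per million
  $lfm{$i2} = log2(1e6 * ($f12{$i2} + $eps) / $N);  ##-- smoothed log2 frequency per million
}

printf "%d\tfm=%.2f\tlfm=%.4f\n", $_, $fm{$_}, $lfm{$_}
  foreach sort { $a <=> $b } keys %f12;
```

Normalizing by N makes values comparable across date-slices of different sizes, which is the usual reason for preferring per-million scores over raw frequencies.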
- compile_mi
-
$prf = $prf->compile_mi(%opts);
computes MI*logF profile in $prf->{mi} a la Wortprofil; sets $prf->{score}='mi'.
%opts:
eps => $eps #-- clobber $prf->{eps}
- compile_mi3
-
$prf = $prf->compile_mi3(%opts);
computes MI^3 profile in $prf->{mi3} a la Rychlý (2008); sets $prf->{score}='mi3'.
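The two mutual-information scores can be sketched in plain Perl as follows. This is an unsmoothed illustration only: the function names are hypothetical, the natural logarithm used for the log-frequency factor is an assumption, and the eps smoothing applied by the compile_* methods is left out.

```perl
use strict;
use warnings;

## log base 2 helper
sub log2 { return log($_[0]) / log(2); }

## pointwise mutual information: log2 of observed over expected co-frequency
sub pmi {
  my ($N, $f1, $f2, $f12) = @_;
  return log2( ($f12 * $N) / ($f1 * $f2) );
}

## MI * log-frequency (natural log is an ASSUMPTION of this sketch)
sub mi_logf {
  my ($N, $f1, $f2, $f12) = @_;
  return pmi($N, $f1, $f2, $f12) * log($f12);
}

## MI^3: as pmi(), but with the joint frequency cubed
sub mi3 {
  my ($N, $f1, $f2, $f12) = @_;
  return log2( ($f12**3 * $N) / ($f1 * $f2) );
}

## hypothetical counts: N=1024, f1=32, f2=32, f12=4
printf "pmi=%.2f mi*logf=%.4f mi3=%.2f\n",
  pmi(1024, 32, 32, 4), mi_logf(1024, 32, 32, 4), mi3(1024, 32, 32, 4);
```

Both variants counteract plain MI's preference for rare pairs: one multiplies by a log-frequency factor, the other cubes the joint frequency.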
- compile_ld
-
$prf = $prf->compile_ld(%opts);
computes log-dice profile in $prf->{ld} a la Rychlý (2008); sets $prf->{score}='ld'.
%opts:
eps => $eps #-- clobber $prf->{eps}
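For orientation, the log-Dice score of Rychlý (2008) is 14 + log2(2*f12 / (f1+f2)). The sketch below implements that formula in plain Perl; the function name is hypothetical, and adding eps to each of the three frequencies is an assumption about how smoothing might be applied, not a statement about what compile_ld() does internally.

```perl
use strict;
use warnings;

## log base 2 helper
sub log2 { return log($_[0]) / log(2); }

## log-Dice with optional smoothing constant $eps (default 0)
sub log_dice {
  my ($f1, $f2, $f12, $eps) = @_;
  $eps //= 0;
  ## ASSUMPTION: $eps is added to each of f1, f2 and f12
  return 14 + log2( 2 * ($f12 + $eps) / (($f1 + $eps) + ($f2 + $eps)) );
}

## hypothetical counts: f1=500, f2=2000, f12=20
printf "ld=%.4f (unsmoothed), ld=%.4f (eps=0.5)\n",
  log_dice(500, 2000, 20), log_dice(500, 2000, 20, 0.5);
```

Unlike the MI-based scores, log-Dice does not depend on N, and it has a fixed maximum of 14, reached when the two items only ever occur together.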
- compile_ll
-
$prf = $prf->compile_ll(%opts);
computes 1-sided log-log-likelihood ratio in $prf->{ll} a la Evert (2008); sets $prf->{score}='ll'.
%opts:
eps => $eps #-- clobber $prf->{eps}
Trimming
- which
-
\@keys = $prf->which(%opts);
returns 'good' keys for trimming options %opts:
cutoff => $cutoff, ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
kbest  => $kbest,  ##-- retain only $kbest items (by score value)
kbesta => $kbesta, ##-- retain only $kbesta items (by score absolute value)
return => $which,  ##-- either 'good' (default) or 'bad'
as     => $as,     ##-- 'hash' or 'array'; default='array'
- trim
-
$prf = $prf->trim(%opts);
trim profile to contain only 'good' keys.
%opts:
kbest  => $kbest,  ##-- retain only $kbest items (by score value)
kbesta => $kbesta, ##-- retain only $kbesta items (by score absolute value)
cutoff => $cutoff, ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
keep   => $keep,   ##-- retain keys @$keep (ARRAY) or keys(%$keep) (HASH)
drop   => $drop,   ##-- drop keys @$drop (ARRAY) or keys(%$drop) (HASH)
NOTE: this COULD be factored out into something like $prf->trim($prf->which(%opts)), but it's about 15% faster inline.
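The selection criteria used for trimming can be sketched over a plain score hash as follows. The helper names are hypothetical and the scores are invented; the sketch only shows what the kbest, kbesta and cutoff criteria select when applied one at a time, not how the module combines or implements them.

```perl
use strict;
use warnings;

## hypothetical compiled scores: ($i2 => $score, ...)
my %score = (7 => 8.03, 42 => 11.9, 99 => -9.5, 5 => 0.4);

## keys of the $k highest-scoring items, best first (cf. kbest)
sub kbest_keys {
  my ($scores, $k) = @_;
  my @sorted = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
  $k = @sorted if $k > @sorted;
  return @sorted[0 .. $k-1];
}

## keys of the $k items with the largest absolute score (cf. kbesta)
sub kbesta_keys {
  my ($scores, $k) = @_;
  my @sorted = sort { abs($scores->{$b}) <=> abs($scores->{$a}) } keys %$scores;
  $k = @sorted if $k > @sorted;
  return @sorted[0 .. $k-1];
}

## keys of all items scoring at least $cutoff (cf. cutoff)
sub cutoff_keys {
  my ($scores, $cutoff) = @_;
  return grep { $scores->{$_} >= $cutoff } keys %$scores;
}

print join(',', kbest_keys(\%score, 2)), "\n";   ##-- prints 42,7
print join(',', kbesta_keys(\%score, 2)), "\n";  ##-- prints 42,99
```

Selection by absolute value matters when scores can be strongly negative, so that large negative values are retained as well as large positive ones.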
Stringification
- stringify_map
-
$i2s = $prf->stringify_map( $obj);
$i2s = $prf->stringify_map(\@key2str);
$i2s = $prf->stringify_map(\&key2str);
$i2s = $prf->stringify_map(\%key2str);
guts for stringify: get a map for stringification
- stringify
-
$prf = $prf->stringify( $obj);
$prf = $prf->stringify(\@key2str);
$prf = $prf->stringify(\&key2str);
$prf = $prf->stringify(\%key2str);
stringifies profile (destructive) via $obj->i2s($key2), $key2str->($i2) or $key2str->{$i2}.
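The four accepted argument types can be illustrated with a small dispatch on ref(). This is a sketch over plain hashes; the helper name key2str_map and the sample strings are hypothetical, and the module's own implementation may differ.

```perl
use strict;
use warnings;

## build an ($i2 => $string) map from an ARRAY, CODE or HASH reference,
## or from any object that provides an i2s() method
sub key2str_map {
  my ($keys, $map) = @_;
  my $type = ref($map);
  my %i2s;
  foreach my $i2 (@$keys) {
    $i2s{$i2} = $type eq 'ARRAY' ? $map->[$i2]
              : $type eq 'CODE'  ? $map->($i2)
              : $type eq 'HASH'  ? $map->{$i2}
              :                    $map->i2s($i2);
  }
  return \%i2s;
}

my %f12     = (0 => 20, 2 => 10);   ##-- hypothetical frequencies keyed by integer id
my @strings = qw(Haus Hund Katze);  ##-- hypothetical id-to-string table

## re-key the frequency hash by string instead of by id
my $i2s  = key2str_map([keys %f12], \@strings);
my %f12s = map { $i2s->{$_} => $f12{$_} } keys %f12;

print "$_\t$f12s{$_}\n" foreach sort keys %f12s;
```

Re-keying replaces the integer ids for good, which is why stringify() is described as destructive.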
Algebraic operations
- _add
-
$prf = $prf->_add($prf2,%opts);
adds $prf2 frequency data to $prf (destructive); implicitly un-compiles $prf.
%opts:
N  => $bool, ##-- whether to add N values (default:true)
f1 => $bool, ##-- whether to add f1 values (default:true)
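A destructive addition of this kind can be sketched over plain hashes as below. The function name add_freqs is hypothetical, and the set of keys deleted to "un-compile" the result is an assumption based on the score keys listed for new().

```perl
use strict;
use warnings;

## add the frequency data of $q into $p in place and return $p
sub add_freqs {
  my ($p, $q, %opts) = @_;
  $opts{N}  //= 1;   ##-- add N values by default
  $opts{f1} //= 1;   ##-- add f1 values by default
  $p->{N}  += $q->{N}  if $opts{N};
  $p->{f1} += $q->{f1} if $opts{f1};
  $p->{f2}{$_}  += $q->{f2}{$_}  foreach keys %{$q->{f2}};
  $p->{f12}{$_} += $q->{f12}{$_} foreach keys %{$q->{f12}};
  ## ASSUMPTION: un-compiling means dropping all score data
  delete @$p{qw(score mi mi3 ld ll fm lf lfm)};
  return $p;
}

## hypothetical profiles; %p carries a compiled log-dice score
my %p = (N => 100, f1 => 10, f2 => {7 => 30},          f12 => {7 => 3}, score => 'ld', ld => {7 => 9.1});
my %q = (N => 200, f1 => 5,  f2 => {7 => 10, 42 => 8}, f12 => {7 => 1, 42 => 2});

add_freqs(\%p, \%q);
print "N=$p{N} f1=$p{f1} f12{7}=$p{f12}{7} f12{42}=$p{f12}{42}\n";
```

Scores have to be discarded after an addition because they were computed from the old frequencies and no longer match the summed ones.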
- add
-
$prf3 = $prf1->add($prf2,%opts);
returns sum of $prf1 and $prf2 frequency data as a new profile $prf3 (non-destructive).
%opts: as for _add().
- _sum
-
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);
returns a profile representing sum of \@profiles, passing %opts to _add().
if called as a class method and \@profiles contains only 1 element, that element is returned;
otherwise, \@profiles are added to the (new) object.
- sum
-
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);
returns a new profile representing sum of \@profiles; see _sum().
- diff
-
$diff = $prf1->diff($prf2,%opts);
wraps DiaColloDB::Profile::Diff->new($prf1,$prf2,%opts).
AUTHOR
Bryan Jurish <moocow@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2015-2016 by Bryan Jurish
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.
REFERENCES
Didakowski, J. and Geyken, A. (2013). "From DWDS corpora to a German Word Profile – methodological problems and solutions". In: Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network "Internet Lexicography". Mannheim: Institut für Deutsche Sprache. (OPAL - Online publizierte Arbeiten zur Linguistik X/2012), pp. 43-52. URL http://www.dwds.de/static/website/publications/pdf/didakowski_geyken_internetlexikografie_2012_final.pdf
Evert, S. (2008). "Corpora and collocations." In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58, pages 1212-1248. Mouton de Gruyter, Berlin. URL (extended manuscript): http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf
Kilgarriff, A. and Tugwell, D. (2002). "Sketching words". In M.-H. Corréard (ed.) Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. EURALEX, 125-137. URL http://www.kilgarriff.co.uk/Publications/2002-KilgTugwell-AtkinsFest.pdf
Rychlý, P. (2008). "A lexicographer-friendly association score". In P. Sojka and A. Horák (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing. RASLAN 2008, 69. URL http://www.muni.cz/research/publications/937193, http://www.fi.muni.cz/usr/sojka/download/raslan2008/13.pdf
SEE ALSO
DiaColloDB::Persistent(3pm), DiaColloDB::Profile::Diff(3pm), DiaColloDB::Profile::Multi(3pm), DiaColloDB::Profile::MultiDiff(3pm), DiaColloDB::Relation(3pm), DiaColloDB(3pm), perl(1), ...