NAME

DiaColloDB::Profile - diachronic collocation db, (co-)frequency profile

SYNOPSIS

##========================================================================
## PRELIMINARIES

use DiaColloDB::Profile;

##========================================================================
## Constructors etc.

$prf = CLASS_OR_OBJECT->new(%args);
$prf2 = $prf->clone();
$prf2 = $prf->shadow();

##========================================================================
## Basic Access

$label = $prf->label();
\@titles_or_undef = $prf->titles();
@keys = $prf->scoreKeys();
$bool = $prf->empty();

##========================================================================
## I/O: JSON

 *TO_JSON = \&TO_JSON__table;

##========================================================================
## I/O: Text

undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);
$bool = $prf->saveTextFh($fh, %opts);

##========================================================================
## I/O: HTML

$bool = $prf->saveHtmlFile($filename_or_handle, %opts);

##========================================================================
## Compilation

$prf = $prf->compile($func,%opts);
$prf = $prf->uncompile();
$prf = $prf->compile_f();
$prf = $prf->compile_lf();
$prf = $prf->compile_lfm();
$prf = $prf->compile_fm();
$prf = $prf->compile_milf(%opts);
$prf = $prf->compile_mi1(%opts);
$prf = $prf->compile_mi3(%opts);
$prf = $prf->compile_ld(%opts);
$prf = $prf->compile_ll(%opts);

##========================================================================
## Trimming

\@keys = $prf->which(%opts);
$prf   = $prf->trim(%opts);

##========================================================================
## Stringification

$i2s = $prf->stringify_map( $obj);
$prf = $prf->stringify( $obj);

##========================================================================
## Algebraic operations

$prf = $prf->_add($prf2,%opts);
$prf3 = $prf1->add($prf2,%opts);
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);
$diff = $prf1->diff($prf2,%opts);

DESCRIPTION

DiaColloDB::Profile is a class representing low-level collocate frequency profile data for a single date-slice, as retrieved e.g. from a native index or a DDC back-end. It provides methods for compiling profile scores with several score functions (e.g. frequency, pointwise MI * log-frequency, log-Dice), k-best trimming, stringification, basic algebraic manipulation, and serialization (text, HTML, or JSON).

Globals & Constants

Variable: @ISA

DiaColloDB::Profile inherits from DiaColloDB::Persistent.

Constructors etc.

new
$prf = CLASS_OR_OBJECT->new(%args);

%args, object structure:

label => $label,    ##-- string label (used by Multi; default: undef for none)
N   => $N,          ##-- total marginal relation frequency
f1  => $f1,         ##-- total marginal frequency of target word(s)
f2  => \%f2,        ##-- total marginal frequency of collocates: ($i2=>$f2, ...)
f12 => \%f12,       ##-- collocation frequencies, %f12 = ($i2=>$f12, ...)
titles => \@titles, ##-- item group titles (default:undef: unknown)
##
eps => $eps,        ##-- smoothing constant (default=0.5)
score => $func,     ##-- selected scoring function qw(f fm lf lfm mi1 mi3 milf ld ll)
milf => \%milf_12,  ##-- score: mutual information * logFreq a la Wortprofil; requires compile_milf()
mi1 => \%mi1_12,    ##-- score: mutual information; requires compile_mi1()
mi3 => \%mi3_12,    ##-- score: mutual information^3 a la Rychlý (2008); requires compile_mi3()
ld => \%ld_12,      ##-- score: log-dice a la Wortprofil; requires compile_ld()
ll => \%ll_12,      ##-- score: 1-sided log-likelihood a la Evert (2008); requires compile_ll()
fm => \%fm_12,      ##-- frequency per million score; requires compile_fm()
lf => \%lf_12,      ##-- log-frequency ; requires compile_lf()
lfm => \%lfm_12,    ##-- log-frequency per million; requires compile_lfm()

clone
$prf2 = $prf->clone();
$prf2 = $prf->clone($keep_compiled)

clones the profile $prf. If $keep_compiled is true, compiled data is cloned too.

shadow
$prf2 = $prf->shadow();
$prf2 = $prf->shadow($keep_compiled)

shadows %$prf. If $keep_compiled is true, compiled data is shadowed too (all zeroes).

Basic Access

label
$label = $prf->label();

get profile label

titles
\@titles_or_undef = $prf->titles();

get item titles

scoreKeys
@keys = $prf->scoreKeys();

returns known score function keys

empty
$bool = $prf->empty();

returns true iff profile is empty

I/O: JSON

TO_JSON__table
$thingy = $obj->TO_JSON__table()

test alternative JSON format (small but slow); per the SYNOPSIS, TO_JSON is currently aliased to this method.

TO_JSON__flat
$thingy = $obj->TO_JSON__flat()

test alternative JSON format (small but slow).

I/O: Text

See also DiaColloDB::Persistent.

saveTextHeader
undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);

prints column titles for text output.

saveTextFh
$bool = $prf->saveTextFh($fh, %opts);

save flat TAB-separated text, format:

N F1 F2 F12 SCORE LABEL ITEM2...

%opts:

label => $label,   ##-- override $prf->{label} (used by Profile::Multi), no tab-separators required
format => $fmt,    ##-- printf format for scores (default="%f")
header => $bool,   ##-- include header-row? (default=1)
hlabel => $hlabel, ##-- prefix header item-cells with $hlabel (used by Profile::Multi)
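The row layout above can be illustrated with a small stand-alone sketch. It does not call DiaColloDB: the profile is a hand-built hash with invented counts and scores, write_rows() is a hypothetical helper, and the descending-score row order is an assumption made for the example.

```perl
use strict;
use warnings;

## Hypothetical profile data; the 'ld' scores are made up for illustration.
my %prf = (
  N => 1000, f1 => 20, label => '1990',
  f2  => { Haus => 50,   Garten => 30 },
  f12 => { Haus => 5,    Garten => 2 },
  ld  => { Haus => 11.5, Garten => 10.25 },
);

## Write one TAB-separated row per collocate: N F1 F2 F12 SCORE LABEL ITEM2
## (row order by descending score is an assumption of this sketch).
sub write_rows {
  my ($fh, $prf, $fmt) = @_;
  my $score = $prf->{ld};
  for my $item (sort { $score->{$b} <=> $score->{$a} } keys %$score) {
    print {$fh} join("\t",
      $prf->{N}, $prf->{f1}, $prf->{f2}{$item}, $prf->{f12}{$item},
      sprintf($fmt, $score->{$item}), $prf->{label}, $item), "\n";
  }
}

## Use an in-memory filehandle so the sketch needs no temporary file.
open(my $fh, '>', \my $buf) or die "cannot open in-memory handle: $!";
write_rows($fh, \%prf, '%f');
close($fh);
print $buf;
```

Passing a different printf format (e.g. '%.2f') as the third argument changes only the SCORE column, which is the role the format option plays above.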

I/O: HTML

saveHtmlFile
$bool = $prf->saveHtmlFile($filename_or_handle, %opts);

Save flat HTML table data with rows of the form

N F1 F2 F12 SCORE PREFIX? ITEM2...

%opts:

table  => $bool,     ##-- include <table>..</table> ? (default=1)
body   => $bool,     ##-- include <html><body>..</body></html> ? (default=1)
header => $bool,     ##-- include header-row? (default=1)
hlabel => $hlabel,   ##-- prefix header item-cells with $hlabel (used by Profile::Multi), no '<th>..</th>' required
label  => $label,    ##-- prefix item-cells with $label (used by Profile::Multi), no '<td>..</td>' required
format => $fmt,      ##-- printf score formatting (default="%.4f")

Compilation

compile
$prf = $prf->compile($func,%opts);

compile for score-function $func, one of qw(f fm lf lfm mi1 mi3 milf ld ll); default='f' (emits a warning).

uncompile
$prf = $prf->uncompile();

un-compiles all scores for $prf

compile_f
$prf = $prf->compile_f();

just sets $prf->{score} = 'f12'

compile_lf
$prf = $prf->compile_lf();

computes log-frequency profile in $prf->{lf}; sets $prf->{score}='lf'.

compile_fm
$prf = $prf->compile_fm();

computes frequency-per-million in $prf->{fm}; sets $prf->{score}='fm'.

compile_lfm
$prf = $prf->compile_lfm(%opts);

computes log-frequency-per-million in $prf->{lfm}; sets $prf->{score}='lfm'.
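As a rough illustration of the three frequency-based scores above, the following sketch recomputes them by hand on toy counts. It does not call DiaColloDB; the exact formulas (base-2 logarithm, smoothing constant added to the co-frequency) are assumptions rather than a transcription of the module's code, and all counts are invented.

```perl
use strict;
use warnings;

## Base-2 logarithm helper (Perl's log() is the natural logarithm).
sub log2 { return log($_[0]) / log(2); }

## Toy counts (hypothetical): total N and co-frequencies per collocate id.
my $N   = 2_000_000;
my %f12 = (1 => 50, 2 => 10);
my $eps = 0.5;   ## smoothing constant; where it enters is an assumption here

my (%fm, %lf, %lfm);
for my $i2 (keys %f12) {
  $fm{$i2}  = 1e6 * $f12{$i2} / $N;                 ## frequency per million
  $lf{$i2}  = log2($f12{$i2} + $eps);               ## smoothed log-frequency
  $lfm{$i2} = log2(1e6 * ($f12{$i2} + $eps) / $N);  ## smoothed log-frequency per million
}

printf "%s\tfm=%.2f\tlf=%.3f\tlfm=%.3f\n", $_, $fm{$_}, $lf{$_}, $lfm{$_}
  for sort { $a <=> $b } keys %f12;
```

The per-million variants only rescale by N, so they are the ones to compare across date-slices of different sizes.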

compile_milf
$prf = $prf->compile_milf(%opts);

formerly compile_mi()

computes MI*logF-profile in $prf->{milf} a la Rychlý (2008); sets $prf->{score}='milf'. %opts:

eps => $eps  #-- clobber $prf->{eps}

compile_mi1
$prf = $prf->compile_mi1(%opts);

computes raw pointwise-MI profile in $prf->{mi1}; sets $prf->{score}='mi1'.

compile_mi3
$prf = $prf->compile_mi3(%opts);

computes MI^3 profile in $prf->{mi3} a la Rychlý (2008); sets $prf->{score}='mi3'.
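The three MI-based scores above differ only in how the joint count is weighted. The sketch below shows textbook forms of each on a single toy count tuple. It is stand-alone and does not call DiaColloDB; the helper names, the additive smoothing on every count, and the use of base-2 logarithms throughout are assumptions of this sketch, not a statement of what the module computes.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

## Pointwise MI with additive smoothing on every count (an assumption).
sub pmi {
  my ($N, $f1, $f2, $f12, $eps) = @_;
  return log2( (($f12 + $eps) * ($N + $eps)) / (($f1 + $eps) * ($f2 + $eps)) );
}

## MI^3: the joint count is cubed before taking the ratio.
sub mi3 {
  my ($N, $f1, $f2, $f12, $eps) = @_;
  return log2( (($f12 + $eps)**3 * ($N + $eps)) / (($f1 + $eps) * ($f2 + $eps)) );
}

## MI * log-frequency: pointwise MI weighted by the log of the joint count.
sub milf {
  my ($N, $f1, $f2, $f12, $eps) = @_;
  return pmi($N, $f1, $f2, $f12, $eps) * log2($f12 + $eps);
}

## Toy counts chosen as powers of two so the unsmoothed scores are round numbers.
my @counts = (1024, 16, 32, 8);   ## (N, f1, f2, f12)
printf "mi1=%.3f mi3=%.3f milf=%.3f\n",
  pmi(@counts, 0), mi3(@counts, 0), milf(@counts, 0);
## prints: mi1=4.000 mi3=10.000 milf=12.000
```

Raw pointwise MI tends to favour rare pairs; cubing the joint count or multiplying by its logarithm shifts weight back toward frequent collocates.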

compile_ld
$prf = $prf->compile_ld(%opts);

computes log-dice profile in $prf->{ld} a la Rychlý (2008); sets $prf->{score}='ld'.

%opts:

eps => $eps  #-- clobber $prf->{eps}

compile_ll
$prf = $prf->compile_ll(%opts);

computes 1-sided log-log-likelihood ratio in $prf->{ll} a la Evert (2008); sets $prf->{score}='ll'.

%opts:

eps => $eps  #-- clobber $prf->{eps}
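For the log-Dice and log-likelihood scores described under compile_ld and compile_ll, a hand-computed sketch may help. It is stand-alone and does not call DiaColloDB. log_dice() follows the usual 14 + log2(Dice) form, with the placement of the smoothing constant being an assumption; signed_llr() computes a signed G^2 statistic from the 2x2 contingency table and stops there, without attempting any further logarithmic scaling. Both helper names are hypothetical.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

## log-Dice: 14 + log2 of the Dice coefficient.
## Adding $eps to the counts is an assumption of this sketch.
sub log_dice {
  my ($f1, $f2, $f12, $eps) = @_;
  return 14 + log2( 2 * ($f12 + $eps) / ($f1 + $f2 + 2 * $eps) );
}

## Signed log-likelihood ratio G^2 from the 2x2 contingency table:
## positive if the pair co-occurs more often than expected, negative if less.
sub signed_llr {
  my ($N, $f1, $f2, $f12) = @_;
  my @observed = ($f12, $f1 - $f12, $f2 - $f12, $N - $f1 - $f2 + $f12);
  my @expected = map { $_ / $N }
    ($f1 * $f2, $f1 * ($N - $f2), ($N - $f1) * $f2, ($N - $f1) * ($N - $f2));
  my $g2 = 0;
  for my $i (0 .. 3) {
    next if $observed[$i] <= 0;   ## treat 0*log(0) as 0
    $g2 += $observed[$i] * log($observed[$i] / $expected[$i]);
  }
  $g2 *= 2;
  return $observed[0] < $expected[0] ? -$g2 : $g2;
}

printf "ld=%.3f\n",  log_dice(16, 16, 8, 0);
printf "llr=%.3f\n", signed_llr(100, 10, 20, 8);
```

Note that log-Dice uses only f1, f2, and f12, so it does not depend on N, whereas the likelihood ratio does.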

Trimming

which
\@keys = $prf->which(%opts);

returns 'good' keys for trimming options %opts:

cutoff => $cutoff,  ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
kbest  => $kbest,   ##-- retain only $kbest items
kbesta => $kbesta,  ##-- retain only $kbesta items (by absolute score value)
return => $which,   ##-- either 'good' (default) or 'bad'
as     => $as,      ##-- 'hash' or 'array'; default='array'

trim
$prf = $prf->trim(%opts);

trim profile to contain only 'good' keys.

%opts:

kbest => $kbest,    ##-- retain only $kbest items (by score value)
kbesta => $kbesta,  ##-- retain only $kbesta items (by score absolute value)
cutoff => $cutoff,  ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
keep => $keep,      ##-- retain keys @$keep (ARRAY) or keys(%$keep) (HASH)
drop => $drop,      ##-- drop keys @$drop (ARRAY) or keys(%$drop) (HASH)

NOTE: this COULD be factored out into something like $prf->trim($prf->which(%opts)), but it's about 15% faster inline.
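The difference between the cutoff, kbest, and kbesta options can be sketched on a plain score hash. This is a stand-alone illustration with invented scores and hypothetical helper names (by_cutoff, k_best); it does not call DiaColloDB and makes no claim about how the module breaks ties or combines options.

```perl
use strict;
use warnings;

## Hypothetical scores keyed by collocate; negative scores are possible
## for some score functions, which is where kbest and kbesta differ.
my %score = (a => 3.2, b => -5.0, c => 0.7, d => 1.9);

## Keep keys whose score is at least $cutoff.
sub by_cutoff {
  my ($score, $cutoff) = @_;
  return [sort { $a cmp $b } grep { $score->{$_} >= $cutoff } keys %$score];
}

## Keep the $k highest-scoring keys; with $abs true, rank by absolute value.
## Ties are broken alphabetically (a choice made for this sketch only).
sub k_best {
  my ($score, $k, $abs) = @_;
  my $val = $abs ? sub { abs($score->{$_[0]}) } : sub { $score->{$_[0]} };
  my @ranked = sort { $val->($b) <=> $val->($a) || $a cmp $b } keys %$score;
  $k = @ranked if $k > @ranked;
  return [@ranked[0 .. $k - 1]];
}

print "cutoff>=1: @{ by_cutoff(\%score, 1) }\n";
print "kbest=2:   @{ k_best(\%score, 2, 0) }\n";
print "kbesta=2:  @{ k_best(\%score, 2, 1) }\n";
```

Ranking by absolute value matters mainly for signed scores, where a strongly negative item may be as interesting as a strongly positive one.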

Stringification

stringify_map
$i2s = $prf->stringify_map( $obj);
$i2s = $prf->stringify_map(\@key2str);
$i2s = $prf->stringify_map(\&key2str);
$i2s = $prf->stringify_map(\%key2str);

guts for stringify: get a map for stringification

stringify
$prf = $prf->stringify( $obj);
$prf = $prf->stringify(\@key2str)
$prf = $prf->stringify(\&key2str)
$prf = $prf->stringify(\%key2str)

stringifies profile (destructive) via $obj->i2s($i2), $key2str->[$i2], $key2str->($i2), or $key2str->{$i2}.
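The three reference-based mapping forms accepted above can be pictured with a small dispatch on ref(). This is a stand-alone sketch: key2str() and rekey() are hypothetical helpers, the data is invented, and the object case ($obj with an i2s() method) is left out. Unlike the destructive method documented here, rekey() returns a new hash.

```perl
use strict;
use warnings;

## Resolve one key through an ARRAY, CODE, or HASH mapping.
sub key2str {
  my ($map, $key) = @_;
  my $type = ref($map);
  return $map->[$key] if $type eq 'ARRAY';
  return $map->($key) if $type eq 'CODE';
  return $map->{$key} if $type eq 'HASH';
  die "unsupported mapping type '$type'";
}

## Build a copy of a frequency hash keyed by strings instead of ids.
sub rekey {
  my ($freqs, $map) = @_;
  my %out;
  $out{ key2str($map, $_) } = $freqs->{$_} for keys %$freqs;
  return \%out;
}

my %f12     = (0 => 7, 1 => 3);       ## hypothetical id => co-frequency
my @id2word = ('Haus', 'Garten');     ## hypothetical id => string table

my $by_array = rekey(\%f12, \@id2word);
my $by_code  = rekey(\%f12, sub { "w$_[0]" });
my $by_hash  = rekey(\%f12, { 0 => 'Haus', 1 => 'Garten' });

print join(', ', map { "$_=$by_array->{$_}" } sort keys %$by_array), "\n";
## prints: Garten=3, Haus=7
```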

Algebraic operations

_add
$prf = $prf->_add($prf2,%opts);

adds $prf2 frequency data to $prf (destructive); implicitly un-compiles $prf.

%opts:

N  => $bool, ##-- whether to add N values (default:true)
f1 => $bool, ##-- whether to add f1 values (default:true)

add
$prf3 = $prf1->add($prf2,%opts);

returns the sum of $prf1 and $prf2 frequency data as a new profile $prf3 (non-destructive, unlike _add()). %opts: as for _add().

_sum
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);

  • returns a profile representing sum of \@profiles, passing %opts to _add().

  • if called as a class method and \@profiles contains only 1 element, that element is returned

  • otherwise, \@profiles are added to the (new) object

sum
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);

returns a new profile representing sum of \@profiles; see _sum().
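What adding frequency data means for the fields N, f1, f2, and f12 can be sketched with plain hashes. The helpers add_into() and sum_of() below are hypothetical and stand-alone; they do not call DiaColloDB, ignore the N and f1 options, and use invented counts.

```perl
use strict;
use warnings;

## Add frequency data of $src into $dst in place (destructive on $dst).
sub add_into {
  my ($dst, $src) = @_;
  $dst->{$_} += $src->{$_} for qw(N f1);
  for my $field (qw(f2 f12)) {
    $dst->{$field}{$_} += $src->{$field}{$_} for keys %{ $src->{$field} };
  }
  return $dst;
}

## Sum a list of profiles into a fresh structure, leaving the inputs untouched.
sub sum_of {
  my ($profiles) = @_;
  my $sum = { N => 0, f1 => 0, f2 => {}, f12 => {} };
  add_into($sum, $_) for @$profiles;
  return $sum;
}

## Hypothetical toy profiles, e.g. two adjacent date-slices.
my $p1 = { N => 100, f1 => 10, f2 => { x => 4 },         f12 => { x => 2 } };
my $p2 = { N => 200, f1 => 30, f2 => { x => 6, y => 5 }, f12 => { x => 1, y => 3 } };

my $sum = sum_of([$p1, $p2]);
printf "N=%d f1=%d f12{x}=%d f12{y}=%d\n",
  @$sum{qw(N f1)}, @{ $sum->{f12} }{qw(x y)};
## prints: N=300 f1=40 f12{x}=3 f12{y}=3
```

Only raw frequencies are summed here; scores are not additive, so any score would have to be recompiled from the summed counts.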

diff
$diff = $prf1->diff($prf2,%opts);

wraps DiaColloDB::Profile::Diff->new($prf1,$prf2,%opts).

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

REFERENCES

Didakowski, J. and Geyken, A. (2013). "From DWDS corpora to a German Word Profile – methodological problems and solutions". In: Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network "Internet Lexicography". Mannheim: Institut für Deutsche Sprache. (OPAL - Online publizierte Arbeiten zur Linguistik X/2012), pp. 43-52. URL http://www.dwds.de/static/website/publications/pdf/didakowski_geyken_internetlexikografie_2012_final.pdf

Evert, S. (2008). "Corpora and collocations." In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58, pages 1212-1248. Mouton de Gruyter, Berlin. URL (extended manuscript): http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf

Kilgarriff, A. and Tugwell, D. (2002). "Sketching words". In M.-H. Corréard (ed.) Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. EURALEX, 125-137. URL http://www.kilgarriff.co.uk/Publications/2002-KilgTugwell-AtkinsFest.pdf

Rychlý, P. (2008). "A lexicographer-friendly association score". In P. Sojka and A. Horák (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing. RASLAN 2008, 6-9. URL http://www.muni.cz/research/publications/937193, http://www.fi.muni.cz/usr/sojka/download/raslan2008/13.pdf

SEE ALSO

DiaColloDB::Persistent(3pm), DiaColloDB::Profile::Diff(3pm), DiaColloDB::Profile::Multi(3pm), DiaColloDB::Profile::MultiDiff(3pm), DiaColloDB::Relation(3pm), DiaColloDB(3pm), perl(1), ...