NAME

DiaColloDB::Profile - diachronic collocation db, (co-)frequency profile

SYNOPSIS

##========================================================================
## PRELIMINARIES

use DiaColloDB::Profile;

##========================================================================
## Constructors etc.

$prf = CLASS_OR_OBJECT->new(%args);
$prf2 = $prf->clone();
$prf2 = $prf->shadow();

##========================================================================
## Basic Access

$label = $prf->label();
\@titles_or_undef = $prf->titles();
@keys = $prf->scoreKeys();
$bool = $prf->empty();

##========================================================================
## I/O: JSON

*TO_JSON = \&TO_JSON__table;

##========================================================================
## I/O: Text

undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);
$bool = $prf->saveTextFh($fh, %opts);

##========================================================================
## I/O: HTML

$bool = $prf->saveHtmlFile($filename_or_handle, %opts);

##========================================================================
## Compilation

$prf = $prf->compile($func,%opts);
$prf = $prf->uncompile();
$prf = $prf->compile_f();
$prf = $prf->compile_lf();
$prf = $prf->compile_lfm();
$prf = $prf->compile_fm();
$prf = $prf->compile_mi(%opts);
$prf = $prf->compile_mi3(%opts);
$prf = $prf->compile_ld(%opts);
$prf = $prf->compile_ll(%opts);

##========================================================================
## Trimming

\@keys = $prf->which(%opts);
$prf   = $prf->trim(%opts);

##========================================================================
## Stringification

$i2s = $prf->stringify_map( $obj);
$prf = $prf->stringify( $obj);

##========================================================================
## Algebraic operations

$prf = $prf->_add($prf2,%opts);
$prf3 = $prf1->add($prf2,%opts);
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);
$diff = $prf1->diff($prf2,%opts);

DESCRIPTION

DiaColloDB::Profile is a class for representing low-level collocate frequency profile data for a single date-slice as retrieved e.g. from a native index or DDC back-end. It includes methods for compiling profile scores via several score functions (e.g. frequency, pointwise mi * log-frequency, log Dice), k-best trimming, stringification, basic algebraic manipulation, and serialization (text, HTML, or JSON).
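As a rough illustration of the data such a profile carries, the following plain-Perl sketch builds an unblessed hash using the field names listed under new() below. It does not load the module; the words, counts and string keys are invented stand-ins for whatever item identifiers a real back-end returns.

```perl
use strict;
use warnings;

## hypothetical toy profile for one date-slice; all counts are invented
my %prf = (
  label => '1990',                      ##-- date-slice label
  N     => 1_000_000,                   ##-- total relation frequency
  f1    => 500,                         ##-- frequency of the target word
  f2    => { tea => 2000, cup => 800 }, ##-- collocate marginal frequencies
  f12   => { tea => 40,   cup => 25  }, ##-- joint (co-)frequencies
);

## list collocates by descending co-frequency
my @by_f12 = sort { $prf{f12}{$b} <=> $prf{f12}{$a} } keys %{$prf{f12}};
printf "%s\t%d\t%d\n", $_, $prf{f2}{$_}, $prf{f12}{$_} for @by_f12;
```

Both f2 and f12 are keyed by the same collocate identifier, which is what lets the score functions combine them item by item.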

Globals & Constants

Variable: @ISA

DiaColloDB::Profile inherits from DiaColloDB::Persistent.

Constructors etc.

new
$prf = CLASS_OR_OBJECT->new(%args);

%args, object structure:

label => $label,    ##-- string label (used by Multi; undef for none (default))
N   => $N,          ##-- total marginal relation frequency
f1  => $f1,         ##-- total marginal frequency of target word(s)
f2  => \%f2,        ##-- total marginal frequency of collocates: ($i2=>$f2, ...)
f12 => \%f12,       ##-- collocation frequencies, %f12 = ($i2=>$f12, ...)
titles => \@titles, ##-- item group titles (default:undef: unknown)
##
eps => $eps,        ##-- smoothing constant (default=0.5)
score => $func,     ##-- selected scoring function (e.g. 'f12', 'mi', or 'ld'; see Compilation)
mi => \%mi12,       ##-- score: mutual information * logFreq a la Wortprofil; requires compile_mi()
mi3 => \%mi312,     ##-- score: mutual information^3 a la Rychlý (2008); requires compile_mi3()
ld => \%ld12,       ##-- score: log-dice a la Wortprofil; requires compile_ld()
ll => \%ll12,       ##-- score: 1-sided log-likelihood a la Evert (2008); requires compile_ll()
fm => \%fm12,       ##-- frequency per million score; requires compile_fm()
lf => \%lf12,       ##-- log-frequency ; requires compile_lf()
lfm => \%lfm12,     ##-- log-frequency per million; requires compile_lfm()

clone
$prf2 = $prf->clone();
$prf2 = $prf->clone($keep_compiled)

clones the profile $prf. if $keep_compiled is true, compiled data is cloned too.

shadow
$prf2 = $prf->shadow();
$prf2 = $prf->shadow($keep_compiled)

shadows %$prf. if $keep_compiled is true, compiled data is shadowed too (all zeroes).

Basic Access

label
$label = $prf->label();

get profile label

titles
\@titles_or_undef = $prf->titles();

get item titles

scoreKeys
@keys = $prf->scoreKeys();

returns known score function keys

empty
$bool = $prf->empty();

returns true iff profile is empty

I/O: JSON

TO_JSON__table
$thingy = $obj->TO_JSON__table()

test alternative JSON format (small but slow).

TO_JSON__flat
$thingy = $obj->TO_JSON__flat()

test alternative JSON format (small but slow).

I/O: Text

See also DiaColloDB::Persistent.

saveTextHeader
undef = $CLASS_OR_OBJECT->saveTextHeader($fh, hlabel=>$hlabel, titles=>\@titles);

prints column titles for text output.

saveTextFh
$bool = $prf->saveTextFh($fh, %opts);

save flat TAB-separated text, format:

N F1 F2 F12 SCORE LABEL ITEM2...

%opts:

label => $label,   ##-- override $prf->{label} (used by Profile::Multi), no tab-separators required
format => $fmt,    ##-- printf format for scores (default="%f")
header => $bool,   ##-- include header-row? (default=1)
hlabel => $hlabel, ##-- prefix header item-cells with $hlabel (used by Profile::Multi)
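The column order above can be sketched in plain Perl as follows. The helper text_row() is hypothetical and is not the module's own code; in particular, how the module handles an undefined label is not covered here.

```perl
use strict;
use warnings;

## hypothetical helper: format one TAB-separated row in the column order
## N F1 F2 F12 SCORE LABEL ITEM2...
sub text_row {
  my ($N, $f1, $f2, $f12, $score, $label, @item2) = @_;
  my $fmt = "%f";  ##-- documented default printf format for scores
  return join("\t", $N, $f1, $f2, $f12, sprintf($fmt, $score), $label, @item2) . "\n";
}

my $row = text_row(1_000_000, 500, 2000, 40, 7.5, '1990', 'tea', 'NN');
print $row;
```

Because ITEM2 may span several columns (one per item group title), it is placed last so that the fixed numeric columns keep stable positions.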

I/O: HTML

saveHtmlFile
$bool = $prf->saveHtmlFile($filename_or_handle, %opts);

Save flat HTML table data with rows of the form

N F1 F2 F12 SCORE PREFIX? ITEM2...

%opts:

table  => $bool,     ##-- include <table>..</table> ? (default=1)
body   => $bool,     ##-- include <html><body>..</body></html> ? (default=1)
header => $bool,     ##-- include header-row? (default=1)
hlabel => $hlabel,   ##-- prefix header item-cells with $hlabel (used by Profile::Multi), no '<th>..</th>' required
label  => $label,    ##-- prefix item-cells with $label (used by Profile::Multi), no '<td>..</td>' required
format => $fmt,      ##-- printf score formatting (default="%.4f")

Compilation

compile
$prf = $prf->compile($func,%opts);

compile for score-function $func, e.g. one of qw(f lf fm lfm mi mi3 ld ll) as handled by the compile_*() methods below; default='f'

uncompile
$prf = $prf->uncompile();

un-compiles all scores for $prf

compile_f
$prf = $prf->compile_f();

just sets $prf->{score} = 'f12'

compile_lf
$prf = $prf->compile_lf();

computes log-frequency profile in $prf->{lf}; sets $prf->{score}='lf'.

compile_fm
$prf = $prf->compile_fm();

computes frequency-per-million in $prf->{fm}; sets $prf->{score}='fm'.

compile_lfm
$prf = $prf->compile_lfm(%opts);

computes log-frequency-per-million in $prf->{lfm}; sets $prf->{score}='lfm'.
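The three frequency-based scores can be sketched in plain Perl as below. This is only an illustration under stated assumptions: that eps is added to f12 before taking logarithms, that logarithms are base 2, and that "per million" means scaling by 1e6/N. The module's actual arithmetic may differ in any of these details.

```perl
use strict;
use warnings;

## ASSUMPTIONS (not taken from the module source): eps-smoothed logs,
## base-2 logarithms, per-million scaling by 1e6/N
my ($N, $eps) = (2_000_000, 0.5);
my %f12 = (tea => 40, cup => 0);

sub log2 { return log($_[0]) / log(2); }

my (%fm, %lf, %lfm);
foreach my $i2 (keys %f12) {
  $fm{$i2}  = 1e6 * $f12{$i2} / $N;                 ##-- frequency per million
  $lf{$i2}  = log2($f12{$i2} + $eps);               ##-- log-frequency
  $lfm{$i2} = log2(1e6 * ($f12{$i2} + $eps) / $N);  ##-- log-frequency per million
}
printf "%s\tfm=%.2f\tlf=%.3f\tlfm=%.3f\n", $_, $fm{$_}, $lf{$_}, $lfm{$_}
  for sort keys %f12;
```

The smoothing constant keeps the logarithm defined for items with zero co-frequency, which is why the 'cup' entry above does not raise an error.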

compile_mi
$prf = $prf->compile_mi(%opts);

computes MI*logF-profile in $prf->{mi} a la Rychlý (2008); sets $prf->{score}='mi'. %opts:

eps => $eps  #-- clobber $prf->{eps}

compile_mi3
$prf = $prf->compile_mi3(%opts);

computes MI^3 profile in $prf->{mi3} a la Rychlý (2008); sets $prf->{score}='mi3'.

compile_ld
$prf = $prf->compile_ld(%opts);

computes log-dice profile in $prf->{ld} a la Rychlý (2008); sets $prf->{score}='ld'.

%opts:

eps => $eps  #-- clobber $prf->{eps}

compile_ll
$prf = $prf->compile_ll(%opts);

computes 1-sided log-log-likelihood ratio in $prf->{ll} a la Evert (2008); sets $prf->{score}='ll'.

%opts:

eps => $eps  #-- clobber $prf->{eps}
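For orientation, two of the association scores referenced in this section, MI^3 and log-Dice, can be written out as in Rychlý (2008). The sketch below uses those published formulas without any eps smoothing; the helper names are hypothetical, and how the module itself applies eps is not shown here.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

## published formulas from Rychly (2008), WITHOUT eps smoothing;
## hypothetical helper names, not the module's API
sub mi3_score {
  my ($N, $f1, $f2, $f12) = @_;
  return log2(($f12**3) * $N / ($f1 * $f2));
}
sub logdice_score {
  my ($f1, $f2, $f12) = @_;
  return 14 + log2(2 * $f12 / ($f1 + $f2));
}

my ($N, $f1, $f2, $f12) = (1_000_000, 512, 512, 256);
printf "mi3=%.3f ld=%.3f\n",
  mi3_score($N, $f1, $f2, $f12), logdice_score($f1, $f2, $f12);
```

Note that log-Dice does not depend on N and has a theoretical maximum of 14, reached when the two items only ever occur together; this corpus-size independence is the property Rychlý (2008) argues for.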

Trimming

which
\@keys = $prf->which(%opts);

returns 'good' keys for trimming options %opts:

cutoff => $cutoff,  ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
kbest  => $kbest,   ##-- retain only $kbest items
kbesta => $kbesta,  ##-- retain only $kbesta items (absolute value)
return => $which,   ##-- either 'good' (default) or 'bad'
as     => $as,      ##-- 'hash' or 'array'; default='array'

trim
$prf = $prf->trim(%opts);

trim profile to contain only 'good' keys.

%opts:

kbest => $kbest,    ##-- retain only $kbest items (by score value)
kbesta => $kbesta,  ##-- retain only $kbesta items (by score absolute value)
cutoff => $cutoff,  ##-- retain only items with $prf->{$prf->{score}}{$item} >= $cutoff
keep => $keep,      ##-- retain keys @$keep (ARRAY) or keys(%$keep) (HASH)
drop => $drop,      ##-- drop keys @$drop (ARRAY) or keys(%$drop) (HASH)

NOTE: this COULD be factored out into something like $prf->trim($prf->which(%opts)), but it's about 15% faster inline.
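The selection logic behind the cutoff, kbest and kbesta options can be sketched on a bare score hash. The helper good_keys() is a hypothetical stand-in rather than the module's which(); the order in which combined options are applied and the tie-breaking by key are assumptions of this sketch.

```perl
use strict;
use warnings;

## hypothetical stand-in for which(): select 'good' keys from a score hash
sub good_keys {
  my ($score, %opts) = @_;
  my @keys = keys %$score;
  @keys = grep { $score->{$_} >= $opts{cutoff} } @keys if defined $opts{cutoff};
  if (defined $opts{kbest}) {
    ## keep the k highest scores (ties broken by key for reproducibility)
    @keys = sort { $score->{$b} <=> $score->{$a} || $a cmp $b } @keys;
    splice(@keys, $opts{kbest}) if @keys > $opts{kbest};
  }
  if (defined $opts{kbesta}) {
    ## keep the k scores with the largest absolute value
    @keys = sort { abs($score->{$b}) <=> abs($score->{$a}) || $a cmp $b } @keys;
    splice(@keys, $opts{kbesta}) if @keys > $opts{kbesta};
  }
  return [sort @keys];
}

my %score  = (a => 3.0, b => -5.0, c => 1.5, d => 0.2);
my $kbest  = good_keys(\%score, kbest  => 2);
my $kbesta = good_keys(\%score, kbesta => 2);
my $cut    = good_keys(\%score, cutoff => 1.0);
print join(' ', @$_), "\n" for ($kbest, $kbesta, $cut);
```

The absolute-value variant matters mainly for signed scores such as differences, where strongly negative items are as interesting as strongly positive ones.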

Stringification

stringify_map
$i2s = $prf->stringify_map( $obj);
$i2s = $prf->stringify_map(\@key2str);
$i2s = $prf->stringify_map(\&key2str);
$i2s = $prf->stringify_map(\%key2str);

guts for stringify: get a map for stringification

stringify
$prf = $prf->stringify( $obj);
$prf = $prf->stringify(\@key2str)
$prf = $prf->stringify(\&key2str)
$prf = $prf->stringify(\%key2str)

stringifies profile (destructive) via $obj->i2s($key2), $key2str->[$i2], $key2str->($i2) or $key2str->{$i2}.
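The dispatch over the accepted map types can be sketched as follows. The helper key2str_sub() is hypothetical and omits the case of an object with an i2s() method; the sketch builds a new hash instead of modifying a profile in place.

```perl
use strict;
use warnings;

## hypothetical helper: turn an ARRAY, CODE or HASH reference into a
## key-to-string lookup closure (objects with an i2s() method omitted)
sub key2str_sub {
  my $map  = shift;
  my $type = ref($map);
  return sub { $map->[$_[0]] } if $type eq 'ARRAY';
  return sub { $map->{$_[0]} } if $type eq 'HASH';
  return $map                  if $type eq 'CODE';
  die "unsupported map type '$type'";
}

## relabel integer-keyed frequencies with strings
my %f12 = (0 => 40, 1 => 25);
my $k2s = key2str_sub(['tea', 'cup']);
my %f12_str;
$f12_str{ $k2s->($_) } = $f12{$_} for keys %f12;
print "$_\t$f12_str{$_}\n" for sort keys %f12_str;
```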

Algebraic operations

_add
$prf = $prf->_add($prf2,%opts);

adds $prf2 frequency data to $prf (destructive); implicitly un-compiles $prf.

%opts:

N  => $bool, ##-- whether to add N values (default:true)
f1 => $bool, ##-- whether to add f1 values (default:true)

add
$prf3 = $prf1->add($prf2,%opts);

returns sum of $prf1 and $prf2 frequency data as a new profile (non-destructive counterpart of _add()). %opts: as for _add().
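The relationship between in-place addition and the copying style suggested by the $prf3 = $prf1->add($prf2) signature can be sketched on plain hashes. The helper add_into() is hypothetical; that f2 is summed alongside f12, and that the copying variant amounts to "copy, then add in place", are assumptions of this sketch rather than statements about the module's code.

```perl
use strict;
use warnings;

## hypothetical stand-in for _add(): fold $src's frequencies into $dst
sub add_into {
  my ($dst, $src, %opts) = @_;
  $dst->{N}  += $src->{N}  if !exists($opts{N})  || $opts{N};
  $dst->{f1} += $src->{f1} if !exists($opts{f1}) || $opts{f1};
  foreach my $field (qw(f2 f12)) {
    $dst->{$field}{$_} += $src->{$field}{$_} for keys %{$src->{$field}};
  }
  return $dst;
}

my $p1 = { N => 100, f1 => 10, f2 => { tea => 5 }, f12 => { tea => 2 } };
my $p2 = { N => 200, f1 => 20, f2 => { tea => 7, cup => 3 },
           f12 => { tea => 1, cup => 1 } };

## copying variant: duplicate $p1 two levels deep, then add in place
my $copy = { %$p1, f2 => { %{$p1->{f2}} }, f12 => { %{$p1->{f12}} } };
my $sum  = add_into($copy, $p2);
print "N=$sum->{N} f1=$sum->{f1} f12(tea)=$sum->{f12}{tea}\n";
```

Passing N=>0 or f1=>0 skips the corresponding totals, mirroring the documented options for cases where the operands already share those marginals.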

_sum
$psum = $CLASS_OR_OBJECT->_sum(\@profiles,%opts);

  • returns a profile representing sum of \@profiles, passing %opts to _add().

  • if called as a class method and \@profiles contains only 1 element, that element is returned

  • otherwise, \@profiles are added to the (new) object

sum
$psum = $CLASS_OR_OBJECT->sum(\@profiles,%opts);

returns a new profile representing sum of \@profiles; see _sum().

diff
$diff = $prf1->diff($prf2,%opts);

wraps DiaColloDB::Profile::Diff->new($prf1,$prf2,%opts).

AUTHOR

Bryan Jurish <moocow@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2015-2016 by Bryan Jurish

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.

REFERENCES

Didakowski, J. and Geyken, A. (2013). "From DWDS corpora to a German Word Profile – methodological problems and solutions". In: Network Strategies, Access Structures and Automatic Extraction of Lexicographical Information. 2nd Work Report of the Academic Network "Internet Lexicography". Mannheim: Institut für Deutsche Sprache. (OPAL - Online publizierte Arbeiten zur Linguistik X/2012), S. 43-52. URL http://www.dwds.de/static/website/publications/pdf/didakowski_geyken_internetlexikografie_2012_final.pdf

Evert, S. (2008). "Corpora and collocations." In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 58, pages 1212-1248. Mouton de Gruyter, Berlin. URL (extended manuscript): http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf

Kilgarriff, A. and Tugwell, D. (2002). "Sketching words". In M.-H. Corréard (ed.) Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. EURALEX, 125-137. URL http://www.kilgarriff.co.uk/Publications/2002-KilgTugwell-AtkinsFest.pdf

Rychlý, P. (2008). "A lexicographer-friendly association score". In P. Sojka and A. Horák (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing. RASLAN 2008, 6-9. URL http://www.muni.cz/research/publications/937193 , http://www.fi.muni.cz/usr/sojka/download/raslan2008/13.pdf

SEE ALSO

DiaColloDB::Persistent(3pm), DiaColloDB::Profile::Diff(3pm), DiaColloDB::Profile::Multi(3pm), DiaColloDB::Profile::MultiDiff(3pm), DiaColloDB::Relation(3pm), DiaColloDB(3pm), perl(1), ...
