NAME

Lingua::TermWeight - Language-independent TermWeight calculator.

VERSION

version 0.01

SYNOPSIS

use feature 'say';
use Lingua::TermWeight;
use Lingua::TermWeight::WordSegmenter::SplitBySpace;

my $tf_idf_calc = Lingua::TermWeight->new(
  word_segmenter => Lingua::TermWeight::WordSegmenter::SplitBySpace->new,
);

my $document1 = 'Humpty Dumpty sat on a wall...';
my $document2 = 'Remember, remember, the fifth of November...';

my $tf = $tf_idf_calc->tf(document => $document1);
# TF of word "Dumpty" in $document1.
say $tf->{'Dumpty'};  # 2, if you are using the same text as I am.

my $idf = $tf_idf_calc->idf(documents => [$document1, $document2]);
say $idf->{'Dumpty'};  # log(2/1) ≒ 0.693147

my $tf_idfs = $tf_idf_calc->tf_idf(documents => [$document1, $document2]);
# TF-IDF of word "Dumpty" in $document1.
say $tf_idfs->[0]{'Dumpty'};  # 2 * log(2/1) ≒ 1.386294
# The same, but for $document2.
say $tf_idfs->[1]{'Dumpty'};  # 0

DESCRIPTION

Quoting Wikipedia:

tf–idf, short for term frequency–inverse document frequency, is a numerical
statistic that is intended to reflect how important a word is to a document
in a collection or corpus. It is often used as a weighting factor in
information retrieval and text mining.

This module provides methods for calculating TF, IDF, and TF-IDF.

METHODS

new(word_segmenter => $segmenter)

Constructor. Takes one mandatory parameter, word_segmenter.

CUSTOM WORD SEGMENTER

Although this distribution bundles some language-independent word segmenters, such as Lingua::TermWeight::WordSegmenter::SplitBySpace, language-specific word segmenters are sometimes more appropriate. You can pass a custom word segmenter object to the calculator.

The word segmenter is a plain Perl object that implements a segment method. The method takes one positional argument, $document, which is either a string or a reference to a string. It is expected to return a word iterator as a CodeRef.

Roughly speaking, a custom word segmenter is used like this:

my $document = 'foo bar baz';

# Can be called with a reference, like |->segment(\$document)|.
# Detecting data type is callee's responsibility.
my $iter = $word_segmenter->segment($document);

while (defined(my $word = $iter->())) {
   ...
}
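
As a minimal sketch of a segmenter that fits the loop above: the class name My::LowercaseSegmenter and its lowercasing behaviour are hypothetical and not part of this distribution. The sketch assumes the iterator signals exhaustion by returning undef, which is what the defined() check in the loop suggests.

```perl
use strict;
use warnings;

# Hypothetical segmenter: lowercases the text and yields runs of word characters.
package My::LowercaseSegmenter {
    sub new {
        my ($class) = @_;
        return bless {}, $class;
    }

    sub segment {
        my ($self, $document) = @_;

        # Accept either a plain string or a reference to a string.
        my $text = ref($document) eq 'SCALAR' ? $$document : $document;

        my @words = lc($text) =~ /(\w+)/g;
        my $i     = 0;

        # Iterator: returns the next word, or undef once all words are consumed.
        return sub { $i < @words ? $words[$i++] : undef };
    }
}

my $iter = My::LowercaseSegmenter->new->segment(\'Foo bar, BAZ!');

my @seen;
while (defined(my $word = $iter->())) {
    push @seen, $word;
}

print "@seen\n";    # foo bar baz
```

An object like this would be passed as the word_segmenter parameter of new. Building the whole word list up front keeps the sketch short; a segmenter for large documents might scan the string lazily instead.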

idf(documents => \@documents)

Calculates IDFs. The result is returned as a HashRef whose keys are words and whose values are the corresponding IDFs.
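
To make the numbers concrete, here is a hand-rolled sketch of the calculation that does not call this module. It assumes the formula idf(word) = log(N / df(word)) with the natural logarithm, which is inferred from the log(2/1) comment in the SYNOPSIS rather than from the module's source. The pre-split, lowercased documents are illustrative.

```perl
use strict;
use warnings;

# Each document is already split into words to keep the sketch short.
my @documents = (
    [qw(humpty dumpty sat on a wall)],
    [qw(remember remember the fifth of november)],
);

# Document frequency: the number of documents in which each word occurs.
my %df;
for my $words (@documents) {
    my %seen;
    $df{$_}++ for grep { !$seen{$_}++ } @$words;
}

# Assumed formula: idf(word) = log(N / df(word)), natural logarithm.
my $n   = @documents;
my %idf = map { $_ => log($n / $df{$_}) } keys %df;

printf "%.6f\n", $idf{dumpty};    # 0.693147
```

Under this formula a word that occurs in every document gets an IDF of log(1) = 0.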

tf(document => $document | \$document [, normalize => 0])

Calculates TFs. The result is returned as a HashRef whose keys are words and whose values are the corresponding TFs.

If the optional parameter normalize is set to a true value, the TFs are divided by the number of words in $document. This is useful when comparing TFs across documents of different lengths.
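
The effect of normalization can be sketched by hand as follows. This is an illustration of the description above, not the module's code; the word list is an assumed, already segmented and lowercased document.

```perl
use strict;
use warnings;

my @words = qw(humpty dumpty sat on a wall humpty dumpty had a great fall);

# Raw term frequency: the number of occurrences of each word.
my %tf;
$tf{$_}++ for @words;

# Normalized term frequency: raw count divided by the document length.
my $total         = @words;
my %tf_normalized = map { $_ => $tf{$_} / $total } keys %tf;

printf "%d %.4f\n", $tf{dumpty}, $tf_normalized{dumpty};    # 2 0.1667
```

The normalized values of one document sum to 1, so a long document does not outweigh a short one simply by containing more words.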

tf_idf(documents => \@documents [, normalize => 0])

Calculates TF-IDFs. The result is returned as an ArrayRef of HashRefs, in the same order as the input documents. Each HashRef contains the TF-IDF values for the corresponding document.
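
The shape of the result can be reproduced with a short hand-rolled sketch. It assumes tf-idf(word, document) = tf * log(N / df), matching the 2 log(2/1) comment in the SYNOPSIS, and it does not call this module. In this sketch a word absent from a document has no entry in that document's hash, so the lookup falls back to 0; whether the module stores explicit zeros is not stated here.

```perl
use strict;
use warnings;

my @documents = (
    [qw(humpty dumpty sat on a wall humpty dumpty had a great fall)],
    [qw(remember remember the fifth of november)],
);

# Per-document raw term frequencies, plus document frequencies across the corpus.
my (@tfs, %df);
for my $words (@documents) {
    my %tf;
    $tf{$_}++ for @$words;
    $df{$_}++ for keys %tf;    # each document counts once per distinct word
    push @tfs, \%tf;
}

# One hash of TF-IDF scores per document, in input order.
my $n = @documents;
my @tf_idfs;
for my $tf (@tfs) {
    my %scores = map { $_ => $tf->{$_} * log($n / $df{$_}) } keys %$tf;
    push @tf_idfs, \%scores;
}

printf "%.6f %s\n", $tf_idfs[0]{dumpty}, $tf_idfs[1]{dumpty} // 0;    # 1.386294 0
```

The word "a" occurs in only one of the two documents, so it still receives a non-zero score here; with a larger corpus, words common to most documents would be pushed towards 0.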

AUTHOR

Wesley Schwengle <waterkip@cpan.org>

COPYRIGHT AND LICENSE

This is free software, licensed under:

The MIT (X11) License

SEE ALSO

This module is a fork of Lingua::TFIDF.
