Document.pod - metacpan.org


            
=head1 NAME
  Text::Document - a text document subject to statistical analysis
=head1 SYNOPSIS
  my $t = Text::Document->new();
  $t->AddContent( 'foo bar baz' );
  $t->AddContent( 'foo barbaz; ' );
  my $freqList = $t->KeywordFrequency();   # reference to a list of pairs
  my $u = Text::Document->new();
  ...
  my $sj = $t->JaccardSimilarity( $u );
  my $sc = $t->CosineSimilarity( $u );
  my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );
=head1 DESCRIPTION
C<Text::Document> performs simple
Information-Retrieval-oriented statistics on pure-text documents.
Text can be added in chunks, so that the document may be
incrementally built, for instance by a class like
C<HTML::Parser>.
A simple algorithm splits the text into terms; the algorithm
may be redefined by subclassing and redefining C<ScanV>.
The C<KeywordFrequency> function computes term frequency
over the whole document.
=head1 FORESEEN REUSE
The package may be (re)used either by simple instantiation,
or by subclassing (defining a descendant package).  In the
latter case the methods meant to be redefined are
those ending with a C<V> suffix.  Redefining other methods
will require greater attention.
=head1 CLASS METHODS
=head2 new
The constructor.  The optional arguments are given in
I<(key,value)> form and specify whether
all keywords are transformed to lowercase (the default) and
whether the string representation (C<WriteToString>)
is compressed (the default).
  my $d = Text::Document->new();
  my $dNotCompressed = Text::Document->new( compressed => 0 );
  my $dPreserveCase = Text::Document->new( lowercase => 0 );
=head2 NewFromString
Take a string written by C<WriteToString> (see below)
and create a new C<Text::Document> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library is unavailable.
  my $b = Text::Document::NewFromString( $str );
The return value is a blessed reference; put another way,
this is an alternative constructor.
The string should have been written by C<WriteToString>;
you may of course tweak the string contents, but
at that point you're entirely on your own.
=head1 INSTANCE METHODS
=head2 AddContent
Used as
  $d->AddContent( 'foo bar baz foo9' );
  $d->AddContent( 'mary had a little lamb' );
Successive calls accumulate content; there is currently no way
of resetting the content to zero.
=head2 Terms
Returns a list of all distinct terms in the document, in no
particular order.
=head2 Occurrences
Returns the number of occurrences of a given term.
  $d->AddContent( 'foo baz bar foo foo');
  my $n = $d->Occurrences( 'foo' ); # now $n is 3
=head2 ScanV
Scan a string and return a list of terms.
Called internally as:
  my @terms = $self->ScanV( $text );
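As an illustration of the kind of logic a redefined C<ScanV> could wrap, here is a minimal standalone sketch. The function name C<scan_terms>, its C<$lowercase> flag, and the splitting rule (runs of non-alphanumeric characters act as separators) are assumptions made for this example; they are not taken from the module's own implementation.

```perl
use strict;
use warnings;

# Hypothetical scanner (not the module's code): split on runs of
# non-alphanumeric characters, drop empty strings, and optionally
# fold the terms to lowercase.
sub scan_terms {
    my ($text, $lowercase) = @_;
    my @terms = grep { length $_ } split /[^A-Za-z0-9]+/, $text;
    @terms = map { lc $_ } @terms if $lowercase;
    return @terms;
}

my @terms = scan_terms('Foo barbaz; foo9', 1);
print "@terms\n";   # foo barbaz foo9
```

A subclass would return such a list from its own C<ScanV> method.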
=head2 KeywordFrequency
Returns a reference to a list of I<[term,frequency]> pairs, sorted by
ascending frequency.
  my $listRef = $d->KeywordFrequency();
  foreach my $pair (@{$listRef}){
        my ($term,$frequency) = @{$pair};
        ...
  }
Each distinct term in the document is paired with its frequency of
occurrence, and the pairs are sorted by ascending frequency before
the list is returned to the user.
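The shape of the returned structure can be sketched independently of the module. The example below assumes that "frequency" means the raw occurrence count, and it breaks ties alphabetically so the output is deterministic; both choices, and the name C<keyword_frequency>, are assumptions of this sketch.

```perl
use strict;
use warnings;

# Count terms and return a reference to a list of [term, count] pairs,
# sorted by ascending count (ties broken alphabetically).
sub keyword_frequency {
    my (@terms) = @_;
    my %count;
    $count{$_}++ for @terms;
    return [
        sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
        map  { [ $_, $count{$_} ] } keys %count
    ];
}

my $pairs = keyword_frequency(qw(foo baz bar foo foo baz));
printf "%s=%d\n", @$_ for @$pairs;   # bar=1, baz=2, foo=3 (one per line)
```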
=head2 WriteToString
Convert the document (actually, some parameters
and the term counters) into a string which can be saved and
later restored with C<NewFromString>.
  my $str = $d->WriteToString();
The string begins with a header which encodes the
originating package, its version, and the parameters
of the current instance.
Whenever possible, C<Compress::Zlib> is used in order to
compress the bit vector in the most efficient way.
On systems without C<Compress::Zlib>, the bit string is
saved uncompressed.
=head2 JaccardSimilarity
Compute the Jaccard measure of document similarity, which is defined
as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be the sets
of terms occurring in I<D> and I<E>, respectively. Define I<S> as the
intersection of I<Ds> and I<Es>, and I<T> as their union. Then
the Jaccard similarity is the number of elements
of I<S> divided by the number of elements of I<T>.
It is called as follows:
  my $sim = $d->JaccardSimilarity( $e );
If neither document has any terms the result is C<undef> (a rare occurrence).
Otherwise the similarity is a real number between 0.0 (no terms in common)
and 1.0 (all terms in common).
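The definition above can be sketched as a standalone function over two hashes that map terms to occurrence counts. The function name and the hash representation are assumptions of this example, not the module's internals.

```perl
use strict;
use warnings;

# Jaccard similarity of two term-count hashes (term => occurrences):
# size of the intersection of the term sets over size of their union.
# Returns undef when neither document has any terms.
sub jaccard_similarity {
    my ($d_terms, $e_terms) = @_;
    my %union = map { $_ => 1 } keys %$d_terms, keys %$e_terms;
    return undef unless %union;
    my $common = grep { exists $e_terms->{$_} } keys %$d_terms;
    return $common / scalar(keys %union);
}

my %d = (foo => 2, bar => 1, baz => 1);
my %e = (foo => 1, qux => 3);
printf "%.2f\n", jaccard_similarity(\%d, \%e);   # 0.25: 1 shared term out of 4
```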
=head2 CosineSimilarity
Compute the cosine similarity between two documents I<D> and
I<E>.
Let I<Ds> and I<Es> be the sets
of terms occurring in I<D> and I<E>, respectively. Define I<T> as the
union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of I<T>.
Then the term vectors of I<D> and I<E> are
  Dv = (nD(t1), nD(t2), ..., nD(tN))
  Ev = (nE(t1), nE(t2), ..., nE(tN))
where nD(ti) is the number of occurrences of term ti in I<D>,
and nE(ti) the same for I<E>.
Now we are at last ready to define the cosine similarity I<CS>:
  CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))
Here (... , ...) is the scalar product and Norm is the Euclidean
norm (square root of the sum of squares).
C<CosineSimilarity> is called as
   $sim = $d->CosineSimilarity( $e );
The result is C<undef> if either I<D> or I<E> has no occurrence of any term.
Otherwise, it is a number between 0.0 and 1.0: since term occurrences
are always non-negative, the cosine is never negative.
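The formula for I<CS> can be sketched as follows, again over hashes that map terms to occurrence counts. The function name and data layout are assumptions of this example. Terms missing from one document contribute zero to the scalar product, so only shared terms need to be visited when computing it.

```perl
use strict;
use warnings;

# Cosine similarity of two term-count hashes (term => occurrences).
# Returns undef when either document has no terms.
sub cosine_similarity {
    my ($d, $e) = @_;
    my ($dot, $norm_d, $norm_e) = (0, 0, 0);
    $norm_d += $_ ** 2 for values %$d;
    $norm_e += $_ ** 2 for values %$e;
    return undef unless $norm_d && $norm_e;
    # Only terms present in both documents contribute to the scalar product.
    for my $term (keys %$d) {
        $dot += $d->{$term} * $e->{$term} if exists $e->{$term};
    }
    return $dot / sqrt($norm_d * $norm_e);
}

my %d = (foo => 1, bar => 1);
my %e = (foo => 1);
printf "%.4f\n", cosine_similarity(\%d, \%e);   # 0.7071, i.e. 1/sqrt(2)
```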
=head2 WeightedCosineSimilarity
Compute the weighted cosine similarity between two documents I<D> and
I<E>.
In the setting of C<CosineSimilarity>, the
term vectors of I<D> and I<E> become
  Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
  Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)
The weights are nonnegative real values; each term has an associated
weight. To achieve generality, weights are defined
by a function, as in:
  my $wcs = $d->WeightedCosineSimilarity(
        $e,
        \&function,
        $rock
  );
The C<function> will be called as follows:
  my $weight = function( $rock, 'foo' );
C<$rock> is a 'constant' object used for passing a I<context>
to the function.
For instance, a common way of defining weights is the IDF (inverse
document frequency), which is defined in L<Text::DocumentCollection>.
In this context, you can weigh terms with their IDF as
follows:
  $sim = $c->WeightedCosineSimilarity(
        $d,
        \&Text::DocumentCollection::IDF,
        $collection
  );
For each term, such as C<'foo'>, C<WeightedCosineSimilarity> will in effect call
  $collection->IDF( 'foo' );
which is what we expect.
Strictly speaking, the weight function should return the square root of the IDF, but this
detail is not important here.
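The calling convention for the weight function can be illustrated with a standalone sketch. Everything named here is hypothetical: C<weighted_cosine_similarity> mirrors the formula above over plain hashes, and C<table_weight> is an example weight function whose C<$rock> is a lookup table, with a default weight of 1 for terms that are not listed.

```perl
use strict;
use warnings;

# Weighted cosine similarity of two term-count hashes: every count is
# multiplied by $weight_fn->($rock, $term) before the cosine is taken.
# Returns undef when either weighted vector is all zeros.
sub weighted_cosine_similarity {
    my ($d, $e, $weight_fn, $rock) = @_;
    my %w = map { $_ => $weight_fn->($rock, $_) } keys %$d, keys %$e;
    my ($dot, $norm_d, $norm_e) = (0, 0, 0);
    for my $term (keys %w) {
        my $x = ($d->{$term} // 0) * $w{$term};
        my $y = ($e->{$term} // 0) * $w{$term};
        $dot    += $x * $y;
        $norm_d += $x * $x;
        $norm_e += $y * $y;
    }
    return undef unless $norm_d && $norm_e;
    return $dot / sqrt($norm_d * $norm_e);
}

# Hypothetical weight function: the rock is a hash of per-term weights.
sub table_weight {
    my ($table, $term) = @_;
    return $table->{$term} // 1;
}

my %d = (foo => 1, bar => 1);
my %e = (foo => 1);
# Giving 'bar' a zero weight leaves only 'foo', so the vectors coincide.
printf "%.4f\n", weighted_cosine_similarity(\%d, \%e, \&table_weight, { bar => 0 });   # 1.0000
```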
=head1 AUTHORS
  spinellia@acm.org (Andrea Spinelli)
  walter@humans.net (Walter Vannini)
=head1 HISTORY
  2001-11-02 - initial revision
  2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie>
=head1 DISCARDED CHOICES
We did not use C<Storable>, because we wanted to fine-tune
compression and version compatibility.  However, this
choice may be easily reversed by redefining C<WriteToString> and
C<NewFromString>.