=head1 NAME Text::Document - a text document subject to statistical analysis =head1 SYNOPSIS my $t = Text::Document->new(); $t->AddContent( 'foo bar baz' ); $t->AddContent( 'foo barbaz; ' ); my @freqList = $t->KeywordFrequency(); my $u = Text::Document->new(); ... my $sj = $t->JaccardSimilarity( $u ); my $sc = $t->CosineSimilarity( $u ); my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock ); =head1 DESCRIPTION C<Text::Document> allows to perform simple Information-Retrieval-oriented statistics on pure-text documents. Text can be added in chunks, so that the document may be incrementally built, for instance by a class like C<HTML::Parser>. A simple algorithm splits the text into terms; the algorithm may be redefined by subclassing and redefining C<ScanV>. The C<KeywordFrequency> function computes term frequency over the whole document. =head1 FORESEEN REUSE The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a C<V> suffix. Redefining other methods will require greater attention. =head1 CLASS METHODS =head2 new The creator method. The optional arguments are in the I<(key,value)> form and allow to specify whether all keywords are trasformed to lowercase (default) and whether the string representation (C<WriteToString>) will be compressed (default). my $d = Text::Document->new(); my $dNotCompressed = Text::Document( compressed => 0 ); my $dPreserveCase = Text::Document( lowercase => 0 ); =head2 NewFromString Take a string written by C<WriteToString> (see below) and create a new C<Text::Document> with the same contents; call C<die> whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable. my $b = Text::Document::NewFromString( $str ); The return value is a blessed reference; put in another way, this is an alternative contructor. The string should have been written by C<WriteToString>; you may of course tweak the string contents, but at this point you're entirely on you own. =head1 INSTANCE METHODS =head2 AddContent Used as $d->AddContent( 'foo bar baz foo9' ); $d->AddContent( 'mary had a little lamb' ); Successive calls accumulate content; there is currently no way of resetting the content to zero. =head2 Terms Returns a list of all distinct terms in the document, in no particular order. =head2 Occurrences Returns the number of occurrences of a given term. $d->AddContent( 'foo baz bar foo foo'); my $n = $d->Occurrences( 'foo' ); # now $n is 3 =head2 ScanV Scan a string and return a list of terms. Called internally as: my @terms = $self->ScanV( $text ); =head2 KeywordFrequency Returns a reference list of pairs I<[term,frequency]>, sorted by ascending frequency. my $listRef = $d->KeywordFrequency(); foreach my $pair (@{$listRef}){ my ($term,$frequency) = @{$pair}; ... } Terms in the document are sampled and their frequencies of occurrency are sorted in ascending order; finally, the list is returned to the user. =head2 WriteToString Convert the document (actually, some parameters and the term counters) into a string which can be saved and later restored with C<NewFromString>. my $str = $d->WriteToString(); The string begins with a header which encodes the originating package, its version, the parameters of the current instance. Whenever possible, C<Compress::Zlib> is used in order to compress the bit vector in the most efficient way. On systems without C<Compress::Zlib>, the bit string is saved uncompressed. =head2 JaccardSimilarity Compute the Jaccard measure of document similarity, which is defined as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be the set of terms occurring in I<D> and I<E>, respectively. Define I<S> as the intersection of I<Ds> and I<Es>, and I<T> as their union. Then the Jaccerd similarity is the the number of elements of I<S> divided by the number of elements of I<T>. It is called as follows: my $sim = $d->JaccardSimilarity( $e ); If neither document has any terms the result is undef (a rare evenience). Otherwise the similarity is a real number between 0.0 (no terms in common) and 1.0 (all terms in common). =head2 CosineSimilarity Compute the cosine similarity between two documents I<D> and I<E>. Let I<Ds> and I<Es> be the set of terms occurring in I<D> and I<E>, respectively. Define I<T> as the union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of I<T>. Then the term vectors of I<D> and I<E> are Dv = (nD(t1), nD(t2), ..., nD(tN)) Ev = (nE(t1), nE(t2), ..., nE(tN)) where nD(ti) is the number of occurrences of term ti in I<D>, and nE(ti) the same for I<E>. Now we are at last ready to define the cosine similarity I<CS>: CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev)) Here (... , ...) is the scalar product and Norm is the Euclidean norm (square root of the sum of squares). C<CosineSimilarity> is called as $sim = $d->CosineSimilarity( $e ); It is C<undef> if either I<D> or I<E> have no occurrence of any term. Otherwise, it is a number between 0.0 and 1.0. Since term occurrences are always non-negative, the cosine is obviously always non-negative. =head2 WeightedCosineSimilarity Compute the weighted cosine similarity between two documents I<D> and I<E>. In the setting of C<CosineSimilarity>, the term vectors of I<D> and I<E> are Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN) Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN) The weights are nonnegative real values; each term has associated a weight. To achieve generality, weights may be defined using a function, like: my $wcs = $d->WeightedCosineSimilarity( $e, \&function, $rock ); The C<function> will be called as follows: my $weight = function( $rock, 'foo' ); C<$rock> is a 'constant' object used for passing a I<context> to the function. For instance, a common way of defining weights is the IDF (inverse document frequency), which is defined in L<Text::DocumentCollection>. In this context, you can weigh terms with their IDF as follows: $sim = $c->WeightedCosineSimilarity( $d, \&Text::DocumentCollection::IDF, $collection ); C<WeightedCosineSimilarity> will call $collection->IDF( 'foo' ); which is what we expect. Actually, we should return the square root of IDF, but this detail is not necessary here. =head1 AUTHORS spinellia@acm.org (Andrea Spinelli) walter@humans.net (Walter Vannini) =head1 HISTORY 2001-11-02 - initial revision 2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie> =head DISCARDED CHOICES We did not use C<Storable>, because we wanted to fine-tune compression and version compatibility. However, this choice may be easily reversed redefining WriteToString and NewFromString.