NAME
AI::Categorize::VectorBased - Base class for other algorithms
SYNOPSIS
package Some::Other::Categorizer;
use AI::Categorize::VectorBased;
@ISA = qw(AI::Categorize::VectorBased);
...
DESCRIPTION
This class implements a few things that vector-based approaches to document categorization may need. It's not a complete categorization class in itself, but it can function as the parent for classes like AI::Categorize::kNN and AI::Categorize::SVM.
The rest of this document describes some of the implementation details of this class. Again, this is not useful in itself for categorization, but rather describes the shared interface between the parent and child classes.
METHODS
The following methods are provided.
dot_product(\%hash1, \%hash2)
Treats %hash1 and %hash2 as vectors, where the keys represent the vector coordinates and the values represent the coordinate values, and returns the dot product of the two vectors.
For instance, if %hash1 contains (x=>4, y=>5) and %hash2 contains (x=>2, y=>7), then dot_product(\%hash1, \%hash2) will return 4*2+5*7, i.e. 43. If any keys are present in one hash but not the other, they will be treated as if they have the value zero in the hash where they are nonexistent. So if %hash1 contains (x=>4, y=>5) and %hash2 contains (y=>1, z=>6), then dot_product(\%hash1, \%hash2) will return 5*1, i.e. 5.
Perl is actually pretty good at doing dot products, because the intersection of the set of keys of two hashes can be found very quickly.
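The sparse dot product described above can be sketched as a stand-alone function. This is only an illustration: the name sparse_dot_product is hypothetical, it is a plain function, not a method, and the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper (an assumption of this sketch,
# not the module's actual dot_product method).
sub sparse_dot_product {
    my ($v1, $v2) = @_;

    # Walk the smaller hash so the loop is proportional to its size.
    ($v1, $v2) = ($v2, $v1) if scalar(keys %$v2) < scalar(keys %$v1);

    my $sum = 0;
    for my $key (keys %$v1) {
        # A key missing from the other hash counts as zero there,
        # so only keys present in both hashes contribute.
        $sum += $v1->{$key} * $v2->{$key} if exists $v2->{$key};
    }
    return $sum;
}

my %hash1 = (x => 4, y => 5);
my %hash2 = (x => 2, y => 7);
my %hash3 = (y => 1, z => 6);

print sparse_dot_product(\%hash1, \%hash2), "\n";   # prints 43
print sparse_dot_product(\%hash1, \%hash3), "\n";   # prints 5
```

Only the keys shared by both hashes are ever multiplied, which is why the hash lookup makes this cheap for sparse document vectors.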
norm(\%hash)
Returns the Euclidean norm of the values of %hash, i.e. sqrt(sum(map { $_**2 } values %hash)).
normalize(\%hash)
Divides each value in %hash by norm(\%hash).
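As a rough sketch of the two operations, assuming the Euclidean norm is the square root of the sum of squared values and that normalization modifies the hash in place: the names euclidean_norm and normalize_in_place are hypothetical, and leaving a zero vector untouched is a choice made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helpers mirroring the descriptions of norm() and
# normalize(); they are not the module's own code.
sub euclidean_norm {
    my ($vec) = @_;
    my $sum_sq = 0;
    $sum_sq += $_ ** 2 for values %$vec;
    return sqrt($sum_sq);
}

# Scales the hash in place to unit length. A zero vector is returned
# unchanged (an assumption of this sketch, to avoid dividing by zero).
sub normalize_in_place {
    my ($vec) = @_;
    my $norm = euclidean_norm($vec);
    return $vec if $norm == 0;
    # values() yields aliases, so this updates the hash itself.
    $_ /= $norm for values %$vec;
    return $vec;
}

my %hash = (x => 3, y => 4);
print euclidean_norm(\%hash), "\n";      # prints 5
normalize_in_place(\%hash);
printf "%.1f %.1f\n", @hash{qw(x y)};    # prints 0.6 0.8
```

After normalization the vector has norm 1, so the dot product of two normalized vectors is their cosine similarity.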
trim_features($target)
Reduces the number of features (words) considered in the training data. We try to find the "best" features, i.e. the ones that will help us the most when we try to categorize documents later. Right now we just use a "Document Frequency" criterion, which means we keep the features that appear in the most documents. This is surprisingly reasonable considering its simplicity, as shown in Yiming Yang's paper "A Comparative Study on Feature Selection in Text Categorization" (http://www-2.cs.cmu.edu/~yiming/publications.html).
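A minimal sketch of a document-frequency criterion follows. The helper name top_features_by_df, the layout of the input data, and the reading of $target as an absolute number of features to keep are all assumptions of this example; the class may store its training data and interpret $target differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the $target features that occur in the
# most documents. $docs maps a document name to { feature => count }.
sub top_features_by_df {
    my ($docs, $target) = @_;

    # Document frequency: the number of documents containing each feature.
    my %df;
    for my $features (values %$docs) {
        $df{$_}++ for keys %$features;
    }

    # Highest document frequency first; ties are broken alphabetically
    # so the result is deterministic.
    my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;

    $target = @ranked if $target > @ranked;
    return @ranked[0 .. $target - 1];
}

my %docs = (
    doc1 => { perl => 3, hash => 1 },
    doc2 => { perl => 1, vector => 2 },
    doc3 => { perl => 2, hash => 4, norm => 1 },
);

my @kept = top_features_by_df(\%docs, 2);
print "@kept\n";   # prints "perl hash"
```

Note that the counts inside each document are ignored here: only whether a feature appears in a document matters for this criterion.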
AUTHOR
Ken Williams, ken@forum.swarthmore.edu
COPYRIGHT
Copyright 2000-2001 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
AI::Categorize(3)