NAME

AI::Categorize::VectorBased - Base class for other algorithms

SYNOPSIS

package Some::Other::Categorizer;
use AI::Categorize::VectorBased;
@ISA = qw(AI::Categorize::VectorBased);
...

DESCRIPTION

This class implements a few things that vector-based approaches to document categorization may need. It's not a complete categorization class in itself, but it can function as the parent for classes like AI::Categorize::kNN and AI::Categorize::SVM.

The rest of this document describes some of the implementation details of this class. Again, this is not useful in itself for categorization, but rather describes the shared interface between the parent and child classes.

METHODS

The following methods are provided.

  • dot_product(\%hash1, \%hash2)

    Treats %hash1 and %hash2 as vectors, where the keys represent the vector coordinates and the values represent the coordinate values, and returns the dot product of the two vectors.

    For instance, if %hash1 contains (x=>4, y=>5) and %hash2 contains (x=>2, y=>7), then dot_product(\%hash1, \%hash2) will return 4*2+5*7, i.e. 43. A key that is present in one hash but not the other is treated as having the value zero in the hash where it is missing. So if %hash1 contains (x=>4, y=>5) and %hash2 contains (y=>1, z=>6), then dot_product(\%hash1, \%hash2) will return 5*1, i.e. 5.

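    Perl is actually pretty good at doing dot products on sparse vectors like these, because a hash lookup takes constant time on average, so the keys shared by two hashes can be found very quickly without visiting the coordinates that are zero in both.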

  • norm(\%hash)

    Returns the Euclidean norm of the values of %hash, i.e. the square root of the sum of the squared values: sqrt(sum(map $_**2, values %hash)).

  • normalize(\%hash)

    Divides each value in %hash by norm(\%hash), so that the resulting vector has a Euclidean norm of 1.

  • trim_features($target)

    Reduces the number of features (words) considered in the training data. We try to find the "best" features, i.e. the ones that will help us the most when we try to categorize documents later. Right now we just use a "Document Frequency" criterion, which means we keep the features that appear in the most documents. This is surprisingly reasonable considering its simplicity, as shown in Yiming Yang's paper "A Comparative Study on Feature Selection in Text Categorization" (http://www-2.cs.cmu.edu/~yiming/publications.html).
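The operations described above can be sketched as plain functions. This is only an illustration written for this document, not the module's actual code: the functions are standalone rather than methods, top_features_by_df() is a hypothetical name, the document-name => { feature => count } layout of the training data is an assumption, and normalize() here returns a new hash and leaves a zero vector unchanged, which may differ from what the class does.

```perl
use strict;
use warnings;

# Dot product of two sparse vectors stored as hashes.
# Keys missing from either hash contribute zero.
sub dot_product {
    my ($v1, $v2) = @_;
    # Walk the smaller hash so fewer lookups are needed.
    ($v1, $v2) = ($v2, $v1) if keys(%$v1) > keys(%$v2);
    my $sum = 0;
    for my $k (keys %$v1) {
        $sum += $v1->{$k} * $v2->{$k} if exists $v2->{$k};
    }
    return $sum;
}

# Euclidean norm: square root of the sum of the squared values.
sub norm {
    my ($v) = @_;
    my $sum = 0;
    $sum += $_ ** 2 for values %$v;
    return sqrt $sum;
}

# Assumption: returns a scaled copy; a zero vector is copied unchanged.
sub normalize {
    my ($v) = @_;
    my $norm = norm($v);
    return { %$v } if $norm == 0;
    return { map { $_ => $v->{$_} / $norm } keys %$v };
}

# Document Frequency selection: keep the $keep features that occur
# in the most documents (ties broken alphabetically).
# $docs is assumed to look like: document name => { feature => count }.
sub top_features_by_df {
    my ($docs, $keep) = @_;
    my %df;
    for my $doc (values %$docs) {
        $df{$_}++ for keys %$doc;
    }
    my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
    $keep = @ranked if $keep > @ranked;
    return @ranked[0 .. $keep - 1];
}

print dot_product({ x => 4, y => 5 }, { x => 2, y => 7 }), "\n";   # 43
print dot_product({ x => 4, y => 5 }, { y => 1, z => 6 }), "\n";   # 5
print norm({ x => 3, y => 4 }), "\n";                              # 5

my %docs = (
    d1 => { perl => 2, hash => 1 },
    d2 => { perl => 1, vector => 3 },
    d3 => { perl => 1, hash => 4 },
);
print join(' ', top_features_by_df(\%docs, 2)), "\n";              # perl hash
```

Note that the Document Frequency ranking looks only at how many documents contain a feature, not at how often the feature occurs inside each one, which is why "vector" loses to "hash" in the example even though its raw count in d2 is higher than the count of "hash" in d1.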

AUTHOR

Ken Williams, ken@forum.swarthmore.edu

COPYRIGHT

Copyright 2000-2001 Ken Williams. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

AI::Categorize(3)
