NAME
AI::Categorize::VectorBased - Base class for other algorithms
SYNOPSIS
use AI::Categorize::VectorBased;
@ISA = qw(AI::Categorize::VectorBased);
...
DESCRIPTION
This class implements a few things that vector-based approaches to document categorization may need. It's not a complete categorization class in itself, but it can function as the parent for classes like AI::Categorize::kNN and AI::Categorize::SVM.
The rest of this document describes some of the implementation details of this class. Again, this is not useful in itself for categorization, but rather describes the shared interface between the parent and child classes.
METHODS
The AI::Categorize::kNN class inherits from the AI::Categorize class, so all of its methods are available unless explicitly mentioned here.
new()
The new() method accepts several parameters that help determine the behavior of the categorizer.
k
This is the k in k-Nearest-Neighbor. It is the number of similar documents to consider during the categorize() method. The default value is 20. Experiment to find a value that suits your needs.
ratio_held_out
This is the portion of the training corpus that will be used to determine the per-category membership threshold. The default value is 0.2, which means that for each category 80% of the training documents will be parsed, then the remaining 20% will be used to determine the threshold. The threshold will be set to a value that maximizes F1 on the held out data (see "F1" in AI::Categorize).
We require that there be at least 2 documents in the held out set for each category. If there aren't enough, some dumb default value will be used instead.
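The threshold selection described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the module's actual code: the helper names f1() and best_threshold(), the data layout (pairs of similarity score and membership flag), the rule of trying each observed score as a candidate threshold, and the sample scores are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: F1 from true positives, false positives, false negatives.
sub f1 {
    my ($tp, $fp, $fn) = @_;
    my $denom = 2 * $tp + $fp + $fn;
    return $denom ? 2 * $tp / $denom : 0;
}

# Hypothetical helper: given held-out documents as [score, is_member] pairs,
# try each observed score as a threshold and keep the one with the highest F1.
# A document is assigned to the category when its score >= threshold.
sub best_threshold {
    my @held_out = @_;
    my ($best_t, $best_f1) = (undef, -1);
    for my $t (sort { $a <=> $b } map { $_->[0] } @held_out) {
        my ($tp, $fp, $fn) = (0, 0, 0);
        for my $doc (@held_out) {
            my ($score, $is_member) = @$doc;
            my $assigned = $score >= $t;
            if    ($assigned  && $is_member)  { $tp++ }
            elsif ($assigned  && !$is_member) { $fp++ }
            elsif (!$assigned && $is_member)  { $fn++ }
        }
        my $f = f1($tp, $fp, $fn);
        ($best_t, $best_f1) = ($t, $f) if $f > $best_f1;
    }
    return ($best_t, $best_f1);
}

# Made-up similarity scores for one category's held-out set.
my @held_out = ([0.9, 1], [0.8, 1], [0.6, 0], [0.4, 1], [0.2, 0]);
my ($t, $f) = best_threshold(@held_out);
printf "threshold=%.2f F1=%.3f\n", $t, $f;
```

With so few held-out documents the chosen threshold depends heavily on individual scores, which is presumably one reason a minimum of 2 held-out documents per category is required.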
AUTHOR
Ken Williams, ken@forum.swarthmore.edu
COPYRIGHT
Copyright 2000-2001 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
AI::Categorize(3)
"A re-examination of text categorization methods" by Yiming Yang http://www.cs.cmu.edu/~yiming/publications.html