NAME

AI::Categorize::kNN - k-Nearest-Neighbor Algorithm For AI::Categorize

SYNOPSIS

use AI::Categorize::kNN;
my $c = AI::Categorize::kNN->new;
my $c = AI::Categorize::kNN->new(load_data => 'filename');

# See AI::Categorize for more details

DESCRIPTION

This is an implementation of the k-nearest-neighbor decision-making algorithm, applied to the task of document categorization (as defined by the AI::Categorize module). The basic concept behind the algorithm is to find the k documents from the training corpus that are most similar to the given document (k is often a number like 20 or so), then use the categories of those similar documents to determine a reasonable set of categories for the given document.

"Similarity" of two documents is defined to be the cosine of the angle between the documents' word-frequency vectors. Each word-frequency vector is a many-dimensional vector whose components represent all the words present in the document, and whose component values represent the number of times each word appears in the document.

After we find the k training documents most similar to the given document (we'll call them the "similar documents"), we look up the categories of the similar documents. The appropriateness score for each category is the sum of the similarity scores of the similar documents belonging to that category. If a similar document belongs to multiple categories, its score counts toward each of them.
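The category-scoring step described above might look like this in outline (illustrative only; the sub name and neighbor data structure are assumptions, not the module's internals):

```perl
use strict;
use warnings;

# Sum the similarity scores of the k nearest neighbors into
# per-category appropriateness scores. Each neighbor is a hashref
# holding its similarity score and its list of categories.
sub score_categories {
    my ($neighbors) = @_;
    my %score;
    for my $doc (@$neighbors) {
        $score{$_} += $doc->{score} for @{ $doc->{categories} };
    }
    return \%score;
}
```

A neighbor in two categories contributes its full similarity score to both of them.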

Once this procedure has been followed to determine the given document's appropriateness score for each category, we check each score against a per-category threshold (learned from the training data - see "ratio_held_out" below). If the score is higher than the threshold, the given document is assigned to that category. If not, the document is not assigned to that category.

At this stage, an appropriateness score for one category is not comparable to an appropriateness score for another category. To correct this, before returning its output the categorize() method normalizes all scores so that a score higher than 1 indicates category membership, and a score lower than 1 indicates category nonmembership. The details of this normalization may change in future releases of this code.
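In outline, the thresholding and normalization might be sketched as follows (hypothetical code under assumed names; not the module's internals):

```perl
use strict;
use warnings;

# Divide each raw category score by that category's learned threshold,
# then keep the categories whose normalized score exceeds 1.
sub assign_categories {
    my ($scores, $thresholds) = @_;
    my %normalized = map { $_ => $scores->{$_} / $thresholds->{$_} }
                     keys %$scores;
    return [ grep { $normalized{$_} > 1 } keys %normalized ];
}
```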

METHODS

The AI::Categorize::kNN class inherits from the AI::Categorize class, so all of its methods are available unless otherwise noted here.

new()

The new() method accepts several parameters that help determine the behavior of the categorizer.

  • k

This is the k in k-Nearest-Neighbor. It is the number of similar documents to consider during the categorize() method. The default value is 20. Experiment to find a value that suits your needs.

  • ratio_held_out

    This is the portion of the training corpus that will be used to determine the per-category membership threshold. The default value is 0.2, which means that for each category 80% of the training documents will be parsed, then the remaining 20% will be used to determine the threshold. The threshold will be set to a value that maximizes F1 on the held out data (see "F1" in AI::Categorize).

We require at least 2 documents in the held-out set for each category. If there aren't enough, a generic default threshold will be used instead.

  • features_kept

    This parameter determines what portion of the features (words) from the training documents will be kept and what features will be discarded. The parameter is a number between 0 and 1. The default is 0.2, indicating that 20% of the features will be kept. To determine which features should be kept, we use the document-frequency criterion, in which we keep the features that appear in the greatest number of training documents. This algorithm is simple to implement and reasonably effective.

    To keep all features, pass a features_kept parameter of 0.
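The document-frequency criterion described under features_kept can be sketched like this (hypothetical code; the sub name and the hashref-per-document layout are assumptions):

```perl
use strict;
use warnings;

# Keep the fraction $kept of features that appear in the greatest
# number of training documents. Each document is a hashref of
# word => count.
sub select_features {
    my ($docs, $kept) = @_;
    my %df;  # document frequency of each word
    for my $doc (@$docs) {
        $df{$_}++ for keys %$doc;
    }
    my @by_df = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
    my $n = int(@by_df * $kept + 0.5);
    return [ @by_df[0 .. $n - 1] ];
}
```

Words appearing in many documents rank first; ties are broken alphabetically so the selection is deterministic.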

TO DO

Something seems screwy with the threshold-setting procedure. It allows more categories to be assigned than it should, as evidenced by the fact that usually the top 1 or 2 scoring categories are correct, but additional false categories are thrown in too. This is probably because the thresholds are set using a document set whose category distribution doesn't reflect that of the training/testing corpora.

AUTHOR

Ken Williams, ken@forum.swarthmore.edu

COPYRIGHT

Copyright 2000-2001 Ken Williams. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

AI::Categorize(3)

"A re-examination of text categorization methods" by Yiming Yang http://www.cs.cmu.edu/~yiming/publications.html