NAME

AI::Categorize::NaiveBayes - Naive Bayes Algorithm For AI::Categorize

SYNOPSIS

use AI::Categorize::NaiveBayes;
my $c = AI::Categorize::NaiveBayes->new;
my $c = AI::Categorize::NaiveBayes->new(load_data => 'filename');

# See AI::Categorize for more details

DESCRIPTION

This is an implementation of the Naive Bayes decision-making algorithm, applied to the task of document categorization (as defined by the AI::Categorize module). See AI::Categorize for a complete description of the interface.

METHODS

This class inherits from the AI::Categorize class, so all of its methods are available unless explicitly mentioned here.

new()

The new() method accepts several parameters that help determine the behavior of the categorizer.

  • features_kept

    This parameter determines what portion of the features (words) from the training documents will be kept and what features will be discarded. The parameter is a number between 0 and 1. The default is 0.2, indicating that 20% of the features will be kept. To determine which features should be kept, we use the document-frequency criterion, in which we keep the features that appear in the greatest number of training documents. This algorithm is simple to implement and reasonably effective.

    To keep all features (that is, to disable feature selection entirely), pass a features_kept parameter of 0. A brief usage sketch follows this list.
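
The following is a small usage sketch, not taken from the module's documentation; it shows how the parameter described above might be passed to new(). The 0.5 value is purely illustrative.

use AI::Categorize::NaiveBayes;

# Keep the top 50% of features by document frequency (illustrative value).
my $c = AI::Categorize::NaiveBayes->new(features_kept => 0.5);

# Keep every feature by disabling feature selection entirely.
my $all = AI::Categorize::NaiveBayes->new(features_kept => 0);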

THEORY

Bayes' Theorem is a way of inverting a conditional probability. It states:

          P(y|x) P(x)
P(x|y) = -------------
             P(y)

The notation P(x|y) means "the probability of x given y." See also http://forum.swarthmore.edu/dr.math/problems/battisfore.03.22.99.html for a simple but complete example of Bayes' Theorem.

In this case, we want to know the probability of a given category given a certain string of words in a document, so we have:

                  P(words | cat) P(cat)
P(cat | words) = --------------------
                         P(words)

We have applied Bayes' Theorem because P(cat | words) is a difficult quantity to compute directly, but P(words | cat) and P(cat) are accessible (see below).

The greater the expression above, the greater the probability that the given document belongs to the given category. So we want to find the maximum value. We write this as

                               P(words | cat) P(cat)
Best category =   ArgMax      -----------------------
                 cat in cats          P(words)

Since P(words) doesn't change over the range of categories, we can get rid of it. That's good, because we didn't want to have to compute these values anyway. So our new formula is:

Best category =   ArgMax      P(words | cat) P(cat)
                 cat in cats

Finally, we note that if w1, w2, ... wn are the words in the document, then this expression is equivalent to:

Best category =   ArgMax      P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                 cat in cats

That's the formula I use in my document categorization code. The last step is the only non-rigorous one in the derivation, and this is the "naive" part of the Naive Bayes technique. It assumes that the probability of each word appearing in a document is unaffected by the presence or absence of each other word in the document. We assume this even though we know this isn't true: for example, the word "iodized" is far more likely to appear in a document that contains the word "salt" than it is to appear in a document that contains the word "subroutine". Luckily, as it turns out, making this assumption even when it isn't true may have little effect on our results, as the following paper by Pedro Domingos argues: http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz
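
As an illustration only (this is not the module's internal categorize() code), the ArgMax formula above could be sketched in Perl as follows. The hashes %$p_cat and %$p_word_given_cat are assumed to hold P(cat) and P(word|cat) estimated from training data.

# Hypothetical sketch of the ArgMax formula; not the module's actual code.
sub best_category {
    my ($words, $p_word_given_cat, $p_cat) = @_;
    my ($best, $best_score);
    for my $cat (keys %$p_cat) {
        my $score = $p_cat->{$cat};                 # start with P(cat)
        for my $word (@$words) {
            # Words never seen in this category would need smoothing in
            # real code; here we simply skip them.
            my $p = $p_word_given_cat->{$cat}{$word};
            $score *= $p if defined $p;             # P(w1|cat)*...*P(wn|cat)
        }
        ($best, $best_score) = ($cat, $score)
            if !defined($best_score) or $score > $best_score;
    }
    return $best;
}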

CALCULATIONS

The various probabilities used in the above calculations are found directly from the training documents. For instance, if there are 5000 total tokens (words) in the "sports" training documents and 200 of them are the word "curling", then P(curling|sports) = 200/5000 = 0.04. If there are 10,000 total tokens in the training corpus and 5,000 of them are in documents belonging to the category "sports", then P(sports) = 5,000/10,000 = 0.5.
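
As a worked illustration of those two estimates (a sketch using the numbers above; the variable names are assumptions, not the module's internals):

# Token counts as in the example above (illustrative data, not module code).
my %tokens_in_cat = ( sports => 5_000 );                 # tokens in "sports" documents
my %word_count    = ( sports => { curling => 200 } );    # "curling" tokens in "sports" documents
my $total_tokens  = 10_000;                              # tokens in the whole corpus

my $p_curling_given_sports = $word_count{sports}{curling} / $tokens_in_cat{sports};  # 0.04
my $p_sports               = $tokens_in_cat{sports} / $total_tokens;                 # 0.5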

Because the probabilities involved are often very small and we multiply many of them together, the result is often a tiny tiny number. This could pose problems of floating-point underflow, so instead of working with the actual probabilities we work with the logarithms of the probabilities. This also speeds up various calculations in the categorize() method.
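
For example (a self-contained sketch with made-up probability values; the variable names are assumptions, not the module's internals), comparing categories in log space looks like this:

use List::Util qw(sum);

# Made-up probabilities for illustration only.
my %p_word_given_cat = ( sports => { curling => 0.04, rink => 0.01 } );
my %p_cat            = ( sports => 0.5 );
my @words            = qw(curling rink);

for my $cat (keys %p_cat) {
    # Summing logs is equivalent to multiplying probabilities but avoids
    # floating-point underflow; log() is monotonic, so the ArgMax is unchanged.
    my $log_score = log( $p_cat{$cat} )
                  + sum( map { log $p_word_given_cat{$cat}{$_} } @words );
    printf "%s: log score = %.4f\n", $cat, $log_score;
}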

TO DO

More work is needed on the confidence scores: right now the winning category tends to dominate the scores overwhelmingly, when the scores should probably be more evenly distributed.

AUTHOR

Ken Williams, ken@forum.swarthmore.edu

COPYRIGHT

Copyright 2000-2001 Ken Williams. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

AI::Categorize(3)

"A re-examination of text categorization methods" by Yiming Yang http://www.cs.cmu.edu/~yiming/publications.html

"On the Optimality of the Simple Bayesian Classifier under Zero-One Loss" by Pedro Domingos "/www.cs.washington.edu/homes/pedrod/mlj97.ps.gz"" in "http:

A simple but complete example of Bayes' Theorem from Dr. Math http://www.mathforum.com/dr.math/problems/battisfore.03.22.99.html