Changes for version 0.06

  • Fixed a bug that produced incorrect probabilities in NaiveBayes categorize() calculations.
  • Threshold for Naive Bayes categorizer is now a settable parameter, letting you tune performance to balance precision and recall to suit your needs. Default threshold is 0.3 (used to be fixed at 0.5).
  • Added the precision() and recall() methods, which give additional measures of how good a categorizer is.
  • Wrote documentation for the VectorBased superclass - its docs were previously vestigial leftovers from the kNN module (oops).
  • No changes made to the kNN categorizer - however, the precision and recall scores below clearly show that some changes are needed. The main problem is the setting of thresholds. I've done some work in this area that has already improved scores, but it's not ready yet.
  • Current scores on the drmath-1.00 corpus with features_kept => 0.1:

      ************************ Summary ************************
      Name             miR    miP    miF1   error   time
      01-NaiveBayes:   0.226  0.280  0.239  0.018     79 sec   <- threshold=0.3
      01-NaiveBayes:   0.161  0.213  0.176  0.017     93 sec   <- threshold=0.5
      02-kNN:          0.650  0.109  0.178  0.105   2069 sec

      miR = micro-avg. recall    miP   = micro-avg. precision
      miF = micro-avg. F1        error = micro-avg. error rate
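
To make the threshold/precision/recall trade-off above concrete, here is a minimal sketch in plain Perl. It does not use the AI::Categorize API: the helper names assign() and prf(), the category names, and the scores are all hypothetical. It also assumes that a category is assigned when its score meets the threshold, which this changelog does not spell out. The numbers are for a single made-up document, so they are not comparable to the micro-averaged figures in the table.

```perl
use strict;
use warnings;

# Hypothetical per-category scores for one document (illustrative only).
my %scores = (algebra => 0.62, geometry => 0.41, calculus => 0.35);

# Hypothetical set of categories the document really belongs to.
my %truth = map { $_ => 1 } qw(algebra geometry);

# Assumption: a category is assigned when its score meets the threshold.
sub assign {
    my ($scores, $threshold) = @_;
    return grep { $scores->{$_} >= $threshold } sort keys %$scores;
}

# Precision, recall, and F1 for one document's assigned categories.
sub prf {
    my ($assigned, $truth) = @_;
    my $correct    = grep { $truth->{$_} } @$assigned;   # count of hits
    my $n_assigned = scalar @$assigned;
    my $n_truth    = scalar keys %$truth;
    my $p  = $n_assigned ? $correct / $n_assigned : 0;
    my $r  = $n_truth    ? $correct / $n_truth    : 0;
    my $f1 = ($p + $r)   ? 2 * $p * $r / ($p + $r) : 0;
    return ($p, $r, $f1);
}

for my $threshold (0.5, 0.3) {
    my @assigned = assign(\%scores, $threshold);
    my ($p, $r, $f1) = prf(\@assigned, \%truth);
    printf "threshold=%.1f assigned=%s P=%.3f R=%.3f F1=%.3f\n",
        $threshold, join(',', @assigned), $p, $r, $f1;
}
# prints:
# threshold=0.5 assigned=algebra P=1.000 R=0.500 F1=0.667
# threshold=0.3 assigned=algebra,calculus,geometry P=0.667 R=1.000 F1=0.800
```

In this toy case, lowering the threshold from 0.5 to 0.3 picks up the missed category (recall rises) at the cost of one wrong assignment (precision falls).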

Modules

Automatically categorize documents based on content
Automate and compare AI::Categorize modules
Naive Bayes Algorithm For AI::Categorize
Base class for other algorithms
k-Nearest-Neighbor Algorithm For AI::Categorize

Examples