Revision history for Perl extension AI::Categorize


0.07  Fri Feb 15 23:31:10 CST 2002
  - Major improvements to NaiveBayes - it now gives results on par with
    (slightly better than, actually) the Naive Bayes results in Yang's
    "Re-Examination" paper.

  - Corrected a floating-point underflow problem in NaiveBayes that
    occurred with long documents

  - When categorizing, NaiveBayes now correctly skips words that
    weren't in any training documents.
 
  - Added the threshold() accessor method to NaiveBayes

  - Fixed a crash that occurred when no stopwords were specified

  - Improved the formatting of the output for AI::Categorize::Evaluate

  - reuters-21578, features_kept => 0.1
    ******************* Summary *************************************
    *        Name         miR   miP   miF1  error      time         *
    *   01-NaiveBayes:   0.824 0.883 0.839  0.005     407 sec       *
    *****************************************************************

  - drmath-1.00,   features_kept => 0.1
    ******************* Summary *************************************
    *        Name         miR   miP   miF1  error      time         *
    *   01-NaiveBayes:   0.323 0.397 0.341  0.016     145 sec       *
    *          01-kNN:   0.636 0.144 0.220  0.078     990 sec       * <- features_kept => 0.2
    *          01-kNN:   0.606 0.149 0.223  0.073    1221 sec       * <- features_kept => 0.1
    *****************************************************************


0.06  Wed Nov  7 11:33:32 CST 2001
  - Fixed a bug which resulted in incorrect probabilities in
    NaiveBayes categorize() calculations.

  - The threshold for the Naive Bayes categorizer is now a settable
    parameter, letting you balance precision and recall to suit your
    needs.  The default threshold is 0.3 (it was previously fixed at 0.5).
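    As a rough illustration only (the constructor parameter name and
    calling style here are assumptions - see the NaiveBayes docs for the
    real interface):

        use AI::Categorize::NaiveBayes;

        # Hypothetical: set the assignment threshold when constructing
        # the categorizer.  Lower values favor recall; higher values
        # favor precision.
        my $nb = AI::Categorize::NaiveBayes->new(threshold => 0.3);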

  - Added the precision() and recall() methods, which provide two more
    measures of categorizer quality.

  - Wrote documentation for the VectorBased superclass - its previous
    docs were just vestigial leftovers from the kNN module (oops).

  - No changes made to the kNN categorizer - however, the precision
    and recall scores below clearly show that some changes are needed.
    The main problem is the setting of thresholds; I've done some work
    in this area that has already improved scores, but it isn't ready yet.

  - Current scores on the drmath-1.00 corpus with features_kept => 0.1:
    ******************* Summary *************************************
    *        Name         miR   miP   miF1  error      time         *
    *   01-NaiveBayes:   0.226 0.280 0.239  0.018      79 sec       * <- threshold=0.3
    *   01-NaiveBayes:   0.161 0.213 0.176  0.017      93 sec       * <- threshold=0.5
    *          02-kNN:   0.650 0.109 0.178  0.105    2069 sec       *
    *****************************************************************
    *  miR = micro-avg. recall         miP = micro-avg. precision   *
    *  miF1 = micro-avg. F1          error = micro-avg. error rate  *
    *****************************************************************
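
    For reference, a minimal sketch (not the module's actual code) of how
    the micro-averaged measures above can be computed from the pooled
    binary assignment decisions.  Here @decisions is a hypothetical list
    of hashes, one per (document, category) pair:

        my ($tp, $fp, $fn) = (0, 0, 0);
        for my $d (@decisions) {    # each: { assigned => 0|1, correct => 0|1 }
            $tp++ if  $d->{assigned} &&  $d->{correct};
            $fp++ if  $d->{assigned} && !$d->{correct};
            $fn++ if !$d->{assigned} &&  $d->{correct};
        }
        my $miP  = $tp / ($tp + $fp);                  # micro-avg. precision
        my $miR  = $tp / ($tp + $fn);                  # micro-avg. recall
        my $miF1 = 2 * $miP * $miR / ($miP + $miR);    # micro-avg. F1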


0.05 Sun Sep 30 14:41:52 CDT 2001

  - Made lots of improvements to the NaiveBayes categorizer.  It was
    essentially useless before; now it beats the kNN categorizer in F1,
    accuracy, and running time on my standard test corpus.  This
    improvement came from studying Tom Mitchell's excellent book
    "Machine Learning".

        01-NaiveBayes: F1=0.195  accuracy=0.981  time=  99 sec
               02-kNN: F1=0.169  accuracy=0.889  time=1199 sec

  - Increased the efficiency of the category map.  Added boolean
    is_in_category() and contains_document() methods.

  - Fixed a bug in the AI::Categorize::Evaluate class in which
    default arguments weren't being passed properly to the created
    classes.

  - Cleaned up the formatting of the AI::Categorize::Evaluate output,
    and added the accuracy score.

  - Fixed a small problem in kNN in which it was using k-1 similar
    documents instead of k.

  - Added accuracy() and error() methods to AI::Categorize.  They
    calculate the accuracy/error rate over all binary category-membership
    decisions and have the same interface as the existing F1() method.
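    A rough sketch only (not the actual implementation or calling
    convention): accuracy is the fraction of correct yes/no membership
    calls and error is its complement, where @decisions is a hypothetical
    list of per-(document, category) decisions as above:

        my $right    = grep { $_->{assigned} == $_->{correct} } @decisions;
        my $accuracy = $right / @decisions;    # fraction of correct decisions
        my $error    = 1 - $accuracy;          # micro-avg. error rate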

  - Fixed the F1() method to return 1 (perfect score) when you
    correctly assign zero categories.

  - Added a cat_map() method to the AI::Categorize class, which returns
    the AI::Categorize::Map object so you can query category membership
    directly.
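    A hypothetical usage sketch (the argument order of the Map methods is
    an assumption - check the Map docs):

        my $map = $c->cat_map;    # $c is any AI::Categorize subclass object
        # Boolean queries via the methods mentioned above:
        print "member\n"   if $map->is_in_category($doc_name, $cat_name);
        print "contains\n" if $map->contains_document($cat_name, $doc_name);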

0.04 Jul 18 00:15 2001

  - Reworked the AI::Categorize::Evaluate module so that it better
    handles specifying both general parameters for all tests and
    specific parameters for each test.  This makes it possible to
    compare, for instance, the results of different initialization
    parameters or of varying test sets.

  - Made some changes to the way AI::Categorize::Evaluate stores its results
    between stages of the testing.  This isn't stable yet.

  - Added a testing summary at the end of AI::Categorize::Evaluate->evaluate_test_set.

  - Created the 'drmath-1.00' corpus, which I'll use as a stable corpus
    for benchmarking the effects of various changes to the code.  It's
    large, so I'm not distributing it with the modules.  Write me if you
    want it.

  - The kNN and NaiveBayes classifiers now trim their list of corpus features 
    (words) to get rid of seldom-used features.  This can improve speed
    and quality.  Preliminary results (using F1 as a quality measure) are:
       corpus is drmath-1.00 with 12379 unique features.
        kNN using 100% of features: F1=0.180, testing time=1384 sec
        kNN using  20% of features: F1=0.178, testing time=1060 sec
        kNN using  10% of features: F1=0.180, testing time=1050 sec
        NB  using 100% of features: F1=0.037, testing time= 102 sec
        NB  using  20% of features: F1=0.041, testing time=  72 sec
        NB  using  10% of features: F1=0.039, testing time=  93 sec
    See the 'features_kept' item in the kNN and NaiveBayes docs; a usage
    sketch follows below.
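    A hypothetical usage sketch (whether 'features_kept' is passed to the
    constructor is an assumption - see the modules' docs for the real
    interface):

        use AI::Categorize::kNN;

        # Keep only the most-used 10% of corpus features when training.
        my $knn = AI::Categorize::kNN->new(features_kept => 0.1);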

  - Created the new AI::Categorize::VectorBased class, which kNN now inherits
    from, and which can be a base class for other classifiers (like SVM, hint 
    hint).

  - Started to clean up print() statements throughout the code.  They give feedback
    on training progress, but sometimes you probably don't want to see it.

  - Moved the example script 'evaluate.pl' to the new 'eg/' directory, because
    otherwise 'make install' would install it into site_perl/ .  If you installed
    previous versions of AI::Categorize, you may want to remove 'evaluate.pl'
    from your site_perl/ directory.


0.03  Tue May 22 18:01:46 CDT 2001
   - First release to CPAN
   - Added 'make test' procedure
   - Wrote docs for the major classes

0.02  May 10 01:17 2001
   - Added AI::Categorize::kNN class
   - Added AI::Categorize::Evaluate class and the evaluate.pl script

0.01  Thu Apr 12 23:42:11 2001
   - original version; created by h2xs 1.1.1.4 with options
     -XA -n AI::Categorize::NaiveBayes
   - Not released to CPAN.