NAME

AI::Categorize::Evaluate - Automate and compare AI::Categorize modules

SYNOPSIS

use AI::Categorize::Evaluate;
my $e = AI::Categorize::Evaluate->new
  (
   'packages'     => ['AI::Categorize::NaiveBayes','AI::Categorize::kNN'],
   'training_set' => 'text_dir',
   'test_set'     => 'test_dir',
   'categories'   => 'categories.txt',
   'stopwords'    => [qw(the a of to is that you for and)],
   'data_dir'     => 'data',
  );

$e->parse_training_data;
$e->crunch;
$e->categorize_test_set;

DESCRIPTION

This module facilitates automated testing and comparison of AI::Categorize modules. It can be used to compare the execution speed of the various stages of the categorizers, to compare their results to see which module categorizes documents more accurately, or both.
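To illustrate what comparing the speed of a stage can mean in practice, here is a minimal sketch that times arbitrary code references with Time::HiRes. The helper names time_stage() and compare_stage() are hypothetical and are not part of AI::Categorize::Evaluate; the module's own timing code may work quite differently.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper (not part of this module): run one stage and
# return the elapsed wall-clock time in seconds.
sub time_stage {
    my ($code) = @_;
    my $start = Time::HiRes::time();
    $code->();
    return Time::HiRes::time() - $start;
}

# Hypothetical helper: time the same stage for several named candidates.
# Takes name => coderef pairs and returns a hash reference of
# name => elapsed seconds.
sub compare_stage {
    my (%stage_for) = @_;
    my %elapsed;
    for my $name (keys %stage_for) {
        $elapsed{$name} = time_stage($stage_for{$name});
    }
    return \%elapsed;
}
```

Under these assumptions, each code reference would wrap one stage of one categorizer, and the returned hash could be sorted by value to rank the candidates.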

METHODS

new()

This method creates a new AI::Categorize::Evaluate object. Several parameters may be passed to the new() method as key/value pairs:

  • packages

    Required. A list reference containing the names of the packages you wish to load and evaluate. The modules will be automatically loaded by searching @INC.

  • training_set

    Required for parse_training_data(). This parameter specifies the directory in which the training documents may be found.

  • test_set

  • test_size

    Required for categorize_test_set(), optional for parse_training_data(). These parameters specify where to find the test documents that will be used to evaluate the performance of the categorizers. test_set simply specifies the directory that contains the test documents. Alternatively, test_size can be used to select a certain number of documents at random from the training set (specified with the training_set parameter). That number can be given either as a simple integer, or as a figure like 5% to use a certain percentage of the training documents.

  • categories

    Required for the parse_training_data() and categorize_test_set() methods. This parameter specifies the location of an existing text file which contains the mapping between categories and documents. It should contain category information for all the training documents and all the test documents. The format of the file is as follows:

    document1     category1   category2   category3  ...
    document2     category3   category7   category4  ...
    .
    .
    .

    The amount of whitespace separating document and category names is arbitrary. Because of the format of the file, whitespace is not allowed in the document or category names (if this becomes a problem, perhaps Text::CSV could be used in the future).

  • stopwords

    Optional (default []). A list of words to ignore when parsing and categorizing documents. This will be passed to the stopwords() method of the individual categorizers.

  • data_dir

    Optional (default '.'). Specifies the directory in which large data files will be created during the evaluation process.
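The parameter descriptions above can be made concrete with a short sketch. The three helpers below are hypothetical and are not part of this module: load_packages() loads each named package by searching @INC, resolve_test_size() turns a test_size value such as 20 or '5%' into a document count, and parse_categories() reads the whitespace-separated categories file. Rounding percentages down and capping the count at the size of the training set are assumptions made here, not documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: load every package named in 'packages' via @INC.
sub load_packages {
    my @packages = @_;
    for my $pkg (@packages) {
        # Validate the name before interpolating it into a string eval.
        die "Bad package name '$pkg'\n"
            unless $pkg =~ /\A\w+(?:::\w+)*\z/;
        eval "require $pkg; 1" or die "Cannot load $pkg: $@";
    }
    return scalar @packages;
}

# Hypothetical helper: convert test_size (an integer or a percentage
# such as '5%') into a number of documents to draw from the training set.
sub resolve_test_size {
    my ($test_size, $n_training) = @_;
    if ($test_size =~ /\A(\d+(?:\.\d+)?)%\z/) {
        # Assumption: fractional results are rounded down.
        return int($n_training * $1 / 100);
    }
    die "Bad test_size '$test_size'\n" unless $test_size =~ /\A\d+\z/;
    # Assumption: never ask for more documents than are available.
    return $test_size > $n_training ? $n_training : $test_size;
}

# Hypothetical helper: parse the categories file from an open filehandle.
# Returns a hash reference mapping document name => [category names].
sub parse_categories {
    my ($fh) = @_;
    my %categories_for;
    while (my $line = <$fh>) {
        # split ' ' ignores leading whitespace and splits on any run of it.
        my ($doc, @categories) = split ' ', $line;
        next unless defined $doc;    # skip blank lines
        $categories_for{$doc} = \@categories;
    }
    return \%categories_for;
}
```

For example, parse_categories() could be given a handle opened on the file named by the categories parameter, and resolve_test_size('5%', 200) would select 10 of 200 training documents under the rounding assumption above.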

parse_training_data()

Reads all the training documents and feeds them to the categorizers.

crunch()

categorize_test_set()

AUTHOR

Ken Williams, ken@forum.swarthmore.edu

COPYRIGHT

Copyright 2000-2001 Ken Williams. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perl(1), AI::Categorize(3)