NAME
AI::Categorize::Evaluate - Automate and compare AI::Categorize modules
SYNOPSIS
use AI::Categorize::Evaluate;
my $e = AI::Categorize::Evaluate->new(
  'packages' => ['AI::Categorize::NaiveBayes', 'AI::Categorize::kNN'],
  'training_set' => 'text_dir',
  'test_set' => 'test_dir',
  'categories' => 'categories.txt',
  'stopwords' => [qw(the a of to is that you for and)],
  'data_dir' => 'data',
);
$e->parse_training_data;
$e->crunch;
$e->categorize_test_set;
DESCRIPTION
This module facilitates automated testing and comparison of AI::Categorize modules. It can be used to compare the execution speed of the various stages of the categorizers, to compare their results to see which module categorizes documents more accurately, or both.
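As an illustration of what comparing the speed of individual stages can look like, here is a minimal sketch that times named stages with the core Time::HiRes module. It does not show how AI::Categorize::Evaluate measures anything internally; time_stages() is a hypothetical helper, and the two workloads are placeholders standing in for real categorizer stages.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper (not part of AI::Categorize::Evaluate): run each
# named stage once and record the elapsed wall-clock time in seconds.
sub time_stages {
    my @stages = @_;    # each element is [ $name, $code_ref ]
    my @timings;
    for my $stage (@stages) {
        my ($name, $code) = @$stage;
        my $start = [gettimeofday];
        $code->();
        push @timings, [ $name, tv_interval($start) ];
    }
    return @timings;
}

# Placeholder workloads; a real comparison would call categorizer methods here.
my @timings = time_stages(
    [ parse  => sub { my $n = 0; $n += $_ for 1 .. 50_000 } ],
    [ crunch => sub { my @sorted = sort { $a <=> $b } reverse 1 .. 50_000 } ],
);

printf "%-8s %.4f s\n", @$_ for @timings;
```

Running the same stages once per categorizer package and printing the timings side by side gives a rough speed comparison; a single run is noisy, so repeated runs are advisable.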
METHODS
new()
This method creates a new AI::Categorize::Evaluate object. Several parameters may be passed to the new() method as key/value pairs:
packages
Required. A list reference containing the names of the packages you wish to load and evaluate. The modules will be automatically loaded by searching @INC.
training_set
Required for parse_training_data(). This parameter specifies the directory in which the training documents may be found.
test_set
test_size
Required for categorize_test_set(), optional for parse_training_data(). These parameters specify where to find the test documents that will be used to evaluate the performance of the categorizers. test_set simply specifies the directory that contains the test documents. Alternatively, test_size can be used to select a certain number of documents at random from the training set (specified with the training_set parameter). That number can be given either as a simple integer, or as a figure like 5% to use a certain percentage of the training documents.
categories
Required for the parse_training_data() and categorize_test_set() methods. This parameter specifies the location of an existing text file which contains the mapping between categories and documents. It should contain category information for all the training documents and all the test documents. The format of the file is as follows:
  document1 category1 category2 category3 ...
  document2 category3 category7 category4 ...
  .
  .
  .
The amount of whitespace separating document and category names is arbitrary. Because of the format of the file, whitespace is not allowed in the document or category names (if this becomes a problem, perhaps Text::CSV could be used in the future).
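The file format described above can be read with a few lines of Perl. The following is a minimal sketch, not the module's own parser: read_category_map() is a hypothetical helper, and the sample data is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorize::Evaluate): parse lines of
# the form "document category category ..." into a hash of array refs.
sub read_category_map {
    my ($fh) = @_;
    my %categories_for;
    while (my $line = <$fh>) {
        # split ' ' ignores leading whitespace and splits on runs of whitespace,
        # so the amount of separating whitespace does not matter.
        my ($doc, @cats) = split ' ', $line;
        next unless defined $doc;    # skip blank lines
        $categories_for{$doc} = \@cats;
    }
    return \%categories_for;
}

# Illustrative data read through an in-memory file handle.
my $text = "doc1  sports   politics\ndoc2\tbusiness\n\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $map = read_category_map($fh);
close $fh;

print "$_: @{ $map->{$_} }\n" for sort keys %$map;
```

Because the split is on whitespace, a document or category name containing a space would be broken into several fields, which is why such names are not allowed.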
stopwords
Optional (default []). A list of words to ignore when parsing and categorizing documents. This will be passed to the stopwords() method of the individual categorizers.
data_dir
Optional (default '.'). Specifies the directory in which large data files will be created during the evaluation process.
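To make the two forms of test_size concrete, here is a minimal sketch of turning either an integer or a percentage into a document count and drawing that many documents at random. test_count() and split_documents() are hypothetical helpers; rounding the percentage down and removing the selected documents from the training list are assumptions of this sketch, not documented behavior of the module.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: convert a test_size value such as 20 or '5%'
# into a number of documents, given the size of the training set.
sub test_count {
    my ($test_size, $total) = @_;
    if ($test_size =~ /^(\d+(?:\.\d+)?)%$/) {
        return int($total * $1 / 100);    # assumption: round down
    }
    die "Unrecognized test_size '$test_size'\n" unless $test_size =~ /^\d+$/;
    return $test_size > $total ? $total : $test_size;
}

# Hypothetical helper: pick the test documents at random and return
# (test documents, remaining documents) as two array refs.
sub split_documents {
    my ($test_size, @docs) = @_;
    my $n = test_count($test_size, scalar @docs);
    my @shuffled = shuffle @docs;
    my @test = splice @shuffled, 0, $n;
    return (\@test, \@shuffled);
}

my @docs = map { "doc$_" } 1 .. 40;
my ($test, $train) = split_documents('5%', @docs);
printf "%d test documents, %d remaining documents\n",
    scalar @$test, scalar @$train;
```

With 40 documents, '5%' selects 2 of them; passing a plain integer such as 10 selects exactly that many, capped at the number of documents available.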
parse_training_data()
Reads all the training documents and feeds them to the categorizers.
crunch()
categorize_test_set()
AUTHOR
Ken Williams, ken@forum.swarthmore.edu
COPYRIGHT
Copyright 2000-2001 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
perl(1), AI::Categorize(3)