NAME

AI::Categorizer - Automatic Text Categorization

SYNOPSIS

use AI::Categorizer;
my $c = AI::Categorizer->new(...parameters...);

# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;

# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;

# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
  my $d = ...create a document...
  my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
  print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
  print "Best category: ", $hypothesis->best_category, "\n";
}

DESCRIPTION

AI::Categorizer is a framework for automatic text categorization. It consists of a collection of Perl modules that implement common categorization tasks, and a set of defined relationships among those modules. The various details are flexible - for example, you can choose what categorization algorithm to use, what features (words or otherwise) of the documents should be used (or how to automatically choose these features), what format the documents are in, and so on.

The basic process of using this module will typically involve obtaining a collection of pre-categorized documents, creating a knowledge set representation of those documents, training a categorizer on that knowledge set, and saving the trained categorizer for later use. There are several ways to carry out this process. The top-level AI::Categorizer module provides an umbrella class for high-level operations, or you may use the interfaces of the individual classes in the framework.

Disclaimer: the results of any of the machine learning algorithms are far from infallible (close to fallible?). Categorization of documents is often a difficult task even for humans well-trained in the particular domain of knowledge, and there are many things a human would consider that none of these algorithms consider. These are only statistical tests - at best they are neat tricks or helpful assistants, and at worst they are totally unreliable. If you plan to use this module for anything important, human supervision is essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual module.

FRAMEWORK COMPONENTS

This section gives a conceptual overview of the major pieces of the AI::Categorizer object framework. It does not cover interfaces or usage; see the documentation for the individual classes for those details.

A diagram of the various classes in the framework can be seen in doc/classes.png.

Knowledge Sets

A "knowledge set" is defined as a collection of documents, stored in a particular format, together with some information on the categories each document belongs to. Note that this term is somewhat unique to this project - other sources may call it a "training corpus", or "prior knowledge". A knowledge set also contains some information on how documents will be parsed and how their features (words) will be extracted and culled. In this sense, a knowledge set represents not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the AI::Categorizer::KnowledgeSet class. Before you can start playing with categorizers, you will have to start playing with knowledge sets, so that the categorizers have some data to train on. See the documentation for the AI::Categorizer::KnowledgeSet module for information on its interface.

Feature selection

Deciding which features are the most important is a very large part of the categorization task - you cannot simply consider all the words in all the documents when training, and all the words in the document being categorized. There are two main reasons for this - first, it would mean that your training and categorizing processes would take forever and use tons of memory, and second, the significant bits of the documents would get lost in the "noise" of the insignificant bits.

The process of selecting the most important features in the training set is called "feature selection". It is managed by the AI::Categorizer::KnowledgeSet class, and you will find the details of feature selection processes in that class's documentation.
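
As a rough illustration of the idea, the following sketch ranks words by document frequency (the number of documents each word appears in) and keeps only the top few. This is a minimal stand-alone example, not the KnowledgeSet implementation: the toy corpus, the variable names, and the choice of document frequency as the selection criterion are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical toy corpus: each document is a list of tokens.
my @docs = (
    [qw(cheap pills buy now)],
    [qw(meeting agenda for monday)],
    [qw(buy cheap watches now)],
    [qw(monday meeting moved)],
);

# Count how many documents each word occurs in (document frequency).
# %seen ensures a word is counted at most once per document.
my %doc_freq;
for my $doc (@docs) {
    my %seen;
    $doc_freq{$_}++ for grep { !$seen{$_}++ } @$doc;
}

# Rank by descending document frequency; break ties alphabetically
# so the result is deterministic.
my @ranked = sort { $doc_freq{$b} <=> $doc_freq{$a} or $a cmp $b }
             keys %doc_freq;

# Keep only the $keep highest-ranked words as features.
my $keep = 4;
my @selected = @ranked[0 .. $keep - 1];

print "Selected features: @selected\n";
```

Everything outside the selected list would simply be ignored when building feature vectors, which is where the time and memory savings come from.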

Collections

Because documents may be stored in lots of different formats, a Collection class has been created as an abstraction of a stored set of documents, together with a way to iterate through the set and return Document objects. A KnowledgeSet contains a single collection object. A Categorizer generally contains two collections, one for training and one for testing. A Learner can mass-categorize a collection.

Categorization Algorithms

Each categorization algorithm is a subclass of AI::Categorizer::Learner. Currently the framework only includes one categorizer in its default distribution, AI::Categorizer::Learner::NaiveBayes.

There will soon be a Neural Network categorizer. Next on the agenda will/may be a k-Nearest-Neighbor algorithm, a decision tree algorithm, a mixture-of-experts combiner, and/or a general interface to the "Weka" machine learning system. No timetable for their creation has yet been set.

Please see the documentation of these individual modules for more details on their guts and quirks. See the AI::Categorizer::Learner documentation for a description of the general categorizer interface.

Feature Vectors

Most categorization algorithms don't deal directly with a document's data; instead, they deal with a vector representation of the document's features. The features may be any properties of the document that seem indicative of its category, but they are usually some version of the "most important" words in the document. A list of features and their weights in each document is encapsulated by the AI::Categorizer::FeatureVector class. You may think of this class as roughly analogous to a Perl hash, where the keys are the names of features and the values are their weights.
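
To make the hash analogy concrete, here is a small sketch that uses plain Perl hashes in place of FeatureVector objects and compares two documents by cosine similarity. The feature names, the weights, and the dot() and norm() helpers are hypothetical; the real class has its own interface, described in its documentation.

```perl
use strict;
use warnings;

# Two feature vectors pictured as plain hashes: feature name => weight.
my %doc_a = (perl => 3, module => 2, text   => 1);
my %doc_b = (perl => 1, text   => 4, corpus => 2);

# Dot product over the features the two vectors have in common.
sub dot {
    my ($x, $y) = @_;
    my $sum = 0;
    for my $feature (keys %$x) {
        $sum += $x->{$feature} * $y->{$feature} if exists $y->{$feature};
    }
    return $sum;
}

# Euclidean length of a vector.
sub norm {
    my ($v) = @_;
    return sqrt(dot($v, $v));
}

my $cosine = dot(\%doc_a, \%doc_b) / (norm(\%doc_a) * norm(\%doc_b));
printf "Cosine similarity: %.3f\n", $cosine;
```

Features missing from a hash are treated as having weight zero, which is why a sparse representation like this works well for text.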

Hypotheses

The result of asking a categorizer to categorize a previously unseen document is called a hypothesis, because it is some kind of "statistical guess" of what categories this document should be assigned to. Since you may be interested in any of several pieces of information about the hypothesis (for instance, which categories were assigned, which category was the single most likely category, the scores assigned to each category, etc.), the hypothesis is returned as an object of the AI::Categorizer::Hypothesis class, and you can use its object methods to get information about the hypothesis. See its class documentation for the details.

Experiments

The AI::Categorizer::Experiment class helps you organize the results of categorization experiments. As you get lots of categorization results (Hypotheses) back from the Learner, you can feed these results to the Experiment class, along with the correct answers. When all results have been collected, you can get a report on accuracy, precision, recall, F1, and so on, with both micro-averaging and macro-averaging over categories. See the docs for AI::Categorizer::Experiment for more details.
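
The difference between micro-averaging and macro-averaging can be seen in a short stand-alone sketch. The per-category counts below are invented, the prf() helper is hypothetical, and macro-averaged F1 is taken here to mean the mean of the per-category F1 scores; the Experiment class may compute its figures differently.

```perl
use strict;
use warnings;

# Hypothetical per-category counts:
# tp = true positives, fp = false positives, fn = false negatives.
my %counts = (
    sports   => { tp => 8, fp => 2, fn => 2 },
    politics => { tp => 1, fp => 1, fn => 3 },
);

# Precision, recall and F1 from raw counts, guarding against division by zero.
sub prf {
    my ($tp, $fp, $fn) = @_;
    my $p  = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
    my $r  = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
    my $f1 = ($p + $r)   ? 2 * $p * $r / ($p + $r) : 0;
    return ($p, $r, $f1);
}

# Macro-averaging: score each category separately, then average the scores,
# so every category counts equally regardless of size.
my $macro_f1 = 0;
for my $cat (keys %counts) {
    my (undef, undef, $f1) = prf(@{ $counts{$cat} }{qw(tp fp fn)});
    $macro_f1 += $f1 / keys %counts;
}

# Micro-averaging: pool the counts over all categories, then score once,
# so large categories dominate the result.
my %total = (tp => 0, fp => 0, fn => 0);
for my $c (values %counts) {
    $total{$_} += $c->{$_} for qw(tp fp fn);
}
my (undef, undef, $micro_f1) = prf(@total{qw(tp fp fn)});

printf "Macro F1: %.3f\nMicro F1: %.3f\n", $macro_f1, $micro_f1;
```

In this toy data the small, poorly classified category pulls the macro average well below the micro average.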

METHODS

new()

Creates a new Categorizer object and returns it. Accepts lots of parameters controlling behavior. In addition to the parameters listed here, you may pass any parameter accepted by any class that we create internally (the KnowledgeSet, Learner, Experiment, or Collection classes). This is managed by the Class::Container module, so see its documentation for the details of how this works.

The specific parameters accepted here are:

progress_file: A string that indicates a place where objects will be saved during several of the methods of this class. The default value is the string save, which means files like save-01-knowledge_set will get created. The exact names of these files may change in future releases, since they're just used internally to resume where we last left off.
verbose: If true, a few status messages will be printed during execution.
data_root: A shortcut for setting the training_set, test_set, and category_file parameters separately. Sets training_set to $data_root/training, test_set to $data_root/test, and category_file (used by some of the Collection classes) to $data_root/cats.txt.
training_set: Specifies the path parameter that will be fed to the KnowledgeSet's scan_features() and read() methods during our scan_features() and read_training_set() methods.
test_set: Specifies the path parameter that will be used when creating a Collection during the evaluate_test_set() method.
stopword_file: Specifies a file containing a list of "stopwords", which are words that should automatically be disregarded when scanning/reading documents. The file should contain one word per line. The file will be parsed and then fed as the stopwords parameter to the KnowledgeSet new() method.
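
The following sketch illustrates two of the parameters above in plain Perl: how a data_root value maps to the three derived paths, and how a one-word-per-line stopword file could be read into a lookup hash. The expand_data_root() and read_stopwords() helpers and the /path/to/corpus directory are hypothetical and only mirror the behavior described in this list; they are not the module's own code.

```perl
use strict;
use warnings;

# Expand a data_root into the three paths described for that parameter.
sub expand_data_root {
    my ($root) = @_;
    return (
        training_set  => "$root/training",
        test_set      => "$root/test",
        category_file => "$root/cats.txt",
    );
}

# Read a one-word-per-line stopword list into a hash for fast lookup.
sub read_stopwords {
    my ($fh) = @_;
    my %stop;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace and the newline
        next unless length $line;   # skip blank lines
        $stop{$line} = 1;
    }
    return \%stop;
}

my %paths = expand_data_root('/path/to/corpus');
print "$_ => $paths{$_}\n" for sort keys %paths;

# An in-memory filehandle stands in for a real stopword file here.
my $contents = "the\nand\nof\n";
open my $fh, '<', \$contents or die "Cannot open in-memory file: $!";
my $stopwords = read_stopwords($fh);
close $fh;

print "Stopwords: ", join(', ', sort keys %$stopwords), "\n";
```

In practice you would pass data_root and stopword_file straight to new() and let the Categorizer do this work itself.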

learner()

Returns the Learner object associated with this Categorizer. If learner() is called before train(), the Learner will of course not be trained yet.

knowledge_set()

Returns the KnowledgeSet object associated with this Categorizer. If read_training_set() has not yet been called, the KnowledgeSet will not yet be populated with any training data.

run_experiment()

Runs a complete experiment on the training and testing data, reporting the results on STDOUT. Internally, this is just a shortcut for calling the scan_features(), read_training_set(), train(), and evaluate_test_set() methods, then printing the value of the stats_table() method.

scan_features()

Scans the Collection specified in the training_set parameter to determine the set of features (words) that will be considered when training the Learner. Internally, this calls the scan_features() method of the KnowledgeSet, then saves the KnowledgeSet for later use.

This step is not strictly necessary, but it can dramatically reduce memory requirements if you scan for features before reading the entire corpus into memory.

read_training_set()

Populates the KnowledgeSet with the data specified in the training_set parameter. Internally, this calls the read() method of the KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet object for later use.

train()

Calls the Learner's train() method, passing it the KnowledgeSet populated during read_training_set(). Returns the Learner object. Also saves the Learner object for later use.

evaluate_test_set()

Creates a Collection based on the value of the test_set parameter, and calls the Learner's categorize_collection() method using this Collection. Returns the resultant Experiment object. Also saves the Experiment object for later use in the stats_table() method.

stats_table()

Returns the value of the stats_table() method of the Experiment created by evaluate_test_set(). This is a string that shows various statistics (accuracy, precision, recall, F1, and so on) about the assignments made during testing.

HISTORY

This module is a revised and redesigned version of the previous AI::Categorize module by the same author. Note the added 'r' in the new name. The older module has a different interface, and no attempt at backward compatibility has been made - that's why I changed the name.

You can have both AI::Categorize and AI::Categorizer installed at the same time on the same machine, if you want. They don't know about each other or use conflicting namespaces.

AUTHOR

Ken Williams <kenw@ee.usyd.edu.au>

REFERENCES

http://www.d.umn.edu/~tpederse/nsp.html (could be used later for feature selection)

COPYRIGHT

This distribution is free software; you can redistribute it and/or modify it under the same terms as Perl itself. These terms apply to every file in the distribution - if you have questions, please contact the author.
