NAME
Mail::Classifier - Perl extension for probabilistic mail classification
SYNOPSIS
use Mail::Classifier;
$bb = Mail::Classifier->new();
$bb->train(
{ 'spam.mbox' => 'SPAM',
'nonspam.mbox' => 'NONSPAM'
}
);
%xval = $bb->crossval(
{ 'folds' => 4,
'threshold' => .9,
'corpus_list' => {
'spam.mbox' => 'SPAM',
'nonspam.mbox' => 'NONSPAM'
}
}
);
In practice, Mail::Classifier is just a stub that must be overridden in a subclass, but the general interface is documented here. See a subclass for implementation-specific options or extensions.
DESCRIPTION
Mail::Classifier is an abstract base class for mail classification. As such, it provides capabilities for defining working data tables (which may be stored in memory or on disk) that will persist across saves/restores. It also provides the message handling capabilities necessary to process mailboxes and conduct statistical validation.
Classes inherit from Mail::Classifier to implement a particular classification algorithm or technique. Derived classes must implement methods for learning and scoring messages. Typically, derived classes will also define methods for parsing messages into tokens for use in the learning and scoring methods.
Two derived classes are included with Mail::Classifier. The first, Mail::Classifier::Trivial, is an example of how to extend the base class. The second, Mail::Classifier::GrahamSpam, implements naive Bayesian filtering based on the article "A Plan For Spam" by Paul Graham (http://www.paulgraham.com/spam.html), and is a fully-functional spam filter. See the RESULTS section, below.
One of the key benefits of Mail::Classifier is built-in support for generating classification matrices, either in the standard approach of a test sample and a holdout sample or, more powerfully, through cross-validation. Cross-validation divides the training data into "N" folds and iteratively scores each fold using a model built on all the remaining folds, which maximizes the data available for model evaluation. [See "An Introduction to the Bootstrap" by Efron and Tibshirani (1998), p. 239, for more details.] The result is an out-of-sample evaluation of the performance (i.e. accuracy) of the classification engine that works on smaller training sets without explicit hold-out samples for validation. This is often preferable in development, because it validates the algorithm and the parameter settings used without requiring separate hold-out samples to be managed.
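The fold mechanics described above can be sketched in a few lines of plain Perl. This is only an illustration: make_folds is a hypothetical helper, the round-robin assignment is an assumption, and the way crossval actually partitions messages may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: deal item indices 0 .. $n-1 round-robin into $folds folds.
sub make_folds {
    my ($n, $folds) = @_;
    my @fold = map { [] } 1 .. $folds;          # start every fold empty
    push @{ $fold[ $_ % $folds ] }, $_ for 0 .. $n - 1;
    return @fold;
}

my @folds = make_folds(10, 4);
for my $k (0 .. $#folds) {
    my @test  = @{ $folds[$k] };
    my @train = map { @{ $folds[$_] } } grep { $_ != $k } 0 .. $#folds;
    # A real run would rebuild the model from @train here and score only @test.
    printf "fold %d: train on %d messages, test on %d\n",
        $k, scalar @train, scalar @test;
}
```

Every message is scored exactly once, and always by a model that never saw it during training.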
Mail::Classifier is not (yet) an efficient approach to high-volume classification. (It's in Perl, not C.) However, it is ideal for rapid experimentation and testing of classification algorithms, and benefits from Perl Regexp capabilities for exploring alternative message tokenization routines.
METHODS THAT SHOULD/MUST BE EXTENDED IN A SUBCLASS
* new
* init
* forget
* isvalid
* parse
* learn
* unlearn
* score
With the exception of new and init, these methods are little more than stubs. Subclass developers will want to extend these functions to implement a particular classification algorithm and the associated data structures.
In particular, the init function should be extended using _add_data_table to provide data structures used by the subclass, and forget will need to reflect an appropriate "reset" of these data structures.
The other functions are specific to the algorithm and message handling method chosen.
- new [options|FILENAME|CLASSIFIER]
-
Create a new classifier object, setting any class options by passing a hash-reference to key/value pairs. Alternatively, can be called with a filename from a previous saved classifier, or another classifier object, in which case the classifier will be cloned, duplicating all data and datafiles.
$bb = Mail::Classifier->new();
$bb = Mail::Classifier->new( { OPTION1 => 'foo', OPTION2 => 'bar' } );
$bb = Mail::Classifier->new( "/tmp/saved-classifier" );
$cc = Mail::Classifier->new( $bb );
- init
-
Called during new to initialize the class with any options specific to the class. This should include creating data tables with _add_data_table.
$self->init( {%options} );
- forget
-
Blanks out data and structures. Must be implemented by subclasses.
$bb->forget;
- isvalid MESSAGE
-
Confirm that a message can be handled -- e.g. text vs attachment, etc. MESSAGE is a Mail::Message object.
Stub function to be implemented by subclasses. Parent class only returns true.
$bb->isvalid($msg);
- parse MESSAGE
-
Breaks up a message into tokens -- this is just a stub for where/how class extensions should place parsing.
$bb->parse($msg);
- learn CATEGORY, MESSAGE
- unlearn CATEGORY, MESSAGE
-
learn processes a message as an example of a category according to some algorithm. MESSAGE is a Mail::Message.
unlearn reverses the process, for example to "unlearn" a message that has been falsely classified.
Stub functions to be implemented in subclasses. They do nothing in the parent class.
$bb->learn('SPAM', $msg);
$bb->unlearn('SPAM', $msg);
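To make the learn/unlearn symmetry concrete, here is a minimal sketch that assumes a subclass keeps per-category token counts. The %count table and the toy_learn and toy_unlearn functions are hypothetical and work on plain token lists rather than Mail::Message objects; they are not part of Mail::Classifier.

```perl
use strict;
use warnings;

my %count;   # hypothetical table: $count{CATEGORY}{TOKEN} = occurrences

# Record each token as one more example of the category.
sub toy_learn {
    my ($category, @tokens) = @_;
    $count{$category}{$_}++ for @tokens;
}

# Reverse toy_learn: decrement each token and drop entries that reach zero.
sub toy_unlearn {
    my ($category, @tokens) = @_;
    for my $t (@tokens) {
        next unless $count{$category}{$t};     # never learned: nothing to undo
        delete $count{$category}{$t} if --$count{$category}{$t} == 0;
    }
}

toy_learn('SPAM', qw(free money free));
toy_unlearn('SPAM', qw(free money free));
print "tokens left: ", scalar(keys %{ $count{SPAM} }), "\n";
```

Unlearning the same tokens that were learned returns the table to its previous state, which is the property a falsely classified message relies on.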
- score MESSAGE
-
Takes a message and returns a list of categories and probabilities in descending order of probability. MESSAGE is a Mail::Message.
Stub function to be implemented in subclasses. Parent returns ('NONE',1).
($best_cat, $best_cat_prob, @rest) = $bb->score($msg);
%probs = $bb->score($msg);
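Both calling styles work because a flat list of category/probability pairs can be assigned either to scalars or to a hash. The sketch below illustrates this with toy_score, a hypothetical stand-in that takes ready-made probabilities instead of a message; it is not the module's scoring code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a subclass's score(): takes category => probability
# pairs and returns them as a flat list, highest probability first.
sub toy_score {
    my %p = @_;
    return map { ($_, $p{$_}) } sort { $p{$b} <=> $p{$a} } keys %p;
}

# List style: the best category and its probability come first.
my ($best_cat, $best_cat_prob, @rest) = toy_score(NONSPAM => 0.08, SPAM => 0.92);

# Hash style: the same list, indexed by category.
my %probs = toy_score(NONSPAM => 0.08, SPAM => 0.92);

print "$best_cat $best_cat_prob\n";
print "NONSPAM: $probs{NONSPAM}\n";
```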
METHODS THAT (PROBABLY) DON'T NEED EXTENSION IN COMMON SUBCLASSES
* train/retrain
* classify
* crossval
* tagmsg
* tagmbox
* save
* setparse -- DEPRECATED
* setconfig
* saveconfig
* loadconfig
* debug
These functions are part of the "standard" interface to Mail::Classifier. In a properly written subclass, these functions will perform as expected with hopefully no modifications.
- train CORPUS-LIST
- retrain CORPUS-LIST
-
Takes a hash reference mapping training corpus filenames to categories, walks through each message and calls learn() on it -- may do some post-processing (e.g. trimming a resulting data set). Training corpora must be files that Mail::Box::Manager can recognize and process (e.g., Unix mbox format).
retrain is the same as train, but erases any prior training first.
$bb->train( { 'spam.mbox' => 'SPAM', 'nonspam.mbox' => 'NONSPAM'});
- classify OPTIONS
-
OPTIONS = { 'threshold' => .9, 'corpus_list' => { 'spam.mbox' => 'SPAM', 'nonspam.mbox' => 'NONSPAM' } }
Takes a probability threshold, plus a hash reference mapping corpus filenames to categories. For a message to be counted as scored into a category, the probability of the highest-probability category returned from score must exceed the threshold; otherwise the message is scored as the reserved category 'UNKNOWN'.
classify does not destroy prior training -- it merely creates a classification matrix for a given set of data using the existing probabilities.
%xval = $bb->classify( { 'threshold' => .9, 'corpus_list' => { 'spam.mbox' => 'SPAM', 'nonspam.mbox' => 'NONSPAM' } } );
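The threshold rule and the resulting classification matrix can be illustrated with made-up scores. In this sketch predicted_category, the @scored data, and the actual-by-predicted layout of %matrix are all assumptions for illustration; the structure of the hash that classify really returns is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: the best category counts only if its probability
# exceeds the threshold; otherwise the message is 'UNKNOWN'.
sub predicted_category {
    my ($best_cat, $best_prob, $threshold) = @_;
    return $best_prob > $threshold ? $best_cat : 'UNKNOWN';
}

# Made-up scored messages: [ actual category, best category, its probability ]
my @scored = (
    [ 'SPAM',    'SPAM',    0.99 ],
    [ 'SPAM',    'NONSPAM', 0.60 ],
    [ 'NONSPAM', 'NONSPAM', 0.95 ],
);

my %matrix;   # assumed layout: $matrix{ACTUAL}{PREDICTED} = count
for my $m (@scored) {
    my ($actual, $cat, $prob) = @$m;
    $matrix{$actual}{ predicted_category($cat, $prob, 0.9) }++;
}

for my $actual (sort keys %matrix) {
    for my $pred (sort keys %{ $matrix{$actual} }) {
        print "$actual -> $pred: $matrix{$actual}{$pred}\n";
    }
}
```

The second message lands in 'UNKNOWN' rather than 'NONSPAM' because its best probability, 0.60, does not exceed the 0.9 threshold.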
- crossval OPTIONS
-
OPTIONS = { 'folds' => 4, 'threshold' => .9, 'corpus_list' => { 'spam.mbox' => 'SPAM', 'nonspam.mbox' => 'NONSPAM' } }
Takes an integer number of folds, a probability threshold, plus a hash reference mapping corpus filenames to categories. Returns a classification table built with N-fold cross-validation. For a message to be counted as scored into a category, the probability of the highest-probability category returned from score must exceed the threshold; otherwise the message is scored as the reserved category 'UNKNOWN'.
crossval destroys prior training -- users should consider cloning and then cross-validating if they do not want to lose prior training. Because of this, cross-validation is a good test of a specific implementation of an algorithm and option settings. To test the validity of the model trained on a particular data set on a new data set, use classify instead.
%xval = $bb->crossval( { 'folds' => 4, 'threshold' => .9, 'corpus_list' => { 'spam.mbox' => 'SPAM', 'nonspam.mbox' => 'NONSPAM' } } );
- tagmsg OPTIONS
-
tagmsg takes a Mail::Message object, adds a header listing all categories with likelihood over a threshold, and returns a new Mail::Message object. Any previous headers of that type are deleted prior to the tagging. If verbose is set, details of the scoring are appended.
$bb->tagmsg( { 'msg' => $msg, 'threshold' => .9, 'header' => 'X-Mail-Classifier', 'verbose' => 0 } );
- tagmbox OPTIONS
-
tagmbox is like tagmsg, but tags an entire mailbox, named by the 'mailbox' option.
$bb->tagmbox( { 'mailbox' => '/home/fred/mbox', 'threshold' => .9, 'header' => 'X-Mail-Classifier', 'verbose' => 0 } );
- save FILENAME
-
Dump the entire classifier to a Perl Storable file, given by FILENAME.
$bb->save("/tmp/saved-classifier");
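For readers unfamiliar with Storable, the following sketch shows the kind of round trip that save and a later new(FILENAME) rely on. The $state structure is a hypothetical stand-in for a classifier object, since its real internal layout is not shown in this document.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical classifier state; the real object's layout may differ.
my $state = {
    options => { threshold => 0.9 },
    words   => { perl => [ 1, 2, 3 ] },
};

# Use a temporary file that is removed when the program exits.
my ($fh, $filename) = tempfile(UNLINK => 1);
close $fh;

store($state, $filename);             # serialize the whole structure to disk
my $restored = retrieve($filename);   # read it back as a new, independent copy

print "threshold: $restored->{options}{threshold}\n";
print "perl: @{ $restored->{words}{perl} }\n";
```

Nested hashes and arrays survive the round trip, which is what lets a single file hold both options and data tables.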
- setparse FUNCTION-REFERENCE
-
DEPRECATED: Used to optionally set an external function for parsing. Now, use of a separate parser should be done by subclassing and overriding the parsing function.
- saveconfig FILENAME
-
Save only the configuration options to a text file.
$bb->saveconfig('/tmp/options-only.txt');
- loadconfig FILENAME
-
Load configuration options from a file. Options are KEY=VALUE pairs. Comments (using '#') and lines with leading whitespace are ignored. Will clobber existing options with the same name.
$bb->loadconfig('/tmp/options-only.txt');
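The file format described above is simple enough to sketch a reader for. parse_config_lines below is a hypothetical helper, not the module's own code; it assumes that comments occupy a whole line and that everything after the first '=' is the value.

```perl
use strict;
use warnings;

# Hypothetical parser for the KEY=VALUE format described above.
sub parse_config_lines {
    my @lines = @_;
    my %opt;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s/;                 # leading whitespace: ignored
        next if $line =~ /^#/;                  # comment line: ignored
        next unless $line =~ /^([^=]+)=(.*)$/;  # skip anything without KEY=VALUE
        $opt{$1} = $2;                          # later values clobber earlier ones
    }
    return %opt;
}

my %opt = parse_config_lines(
    "# classifier options\n",
    "threshold=0.9\n",
    "  indented=ignored\n",
    "threshold=0.95\n",
);
print "threshold: $opt{threshold}\n";
```

The repeated threshold line shows the clobbering behaviour: the last value read for a key wins.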
- setconfig HASHREF
-
Sets parameters controlling how messages are processed (e.g. thresholds, default probabilities, caps/floors, etc.). This will clobber existing options, so use it with caution or override it to check and handle them appropriately.
$bb->setconfig( { option1 => 'foo', option2 => 'bar' });
- debug [INTEGER]
-
Accessor to get/set the debug level.
$bb->debug(1);
print "stuff" if ($bb->debug);
The stub parent uses the following levels:
0: no debugging info
1: basic flow info
5: detailed message-level info
INTERNAL METHODS
These methods should not be called by the end-user, but are listed as a reference for developers subclassing this file.
* _add_data_table
* _load_from_file
* _clone
* Lock/ReadLock/UnLock/LockAll/UnLockAll
* DESTROY
- _add_data_table HASHNAME [STORE-ON-DISK-FLAG]
-
Internal function to add a hash reference to $self object for data storage. If STORE-ON-DISK is true, then a temporary datafile is created with MLDBM::Sync, using MLDBM::Sync::SDBM_File as the underlying datastore and Storable as the data freezing method. Data structures created with this function will be appropriately saved/loaded/locked by other methods. This method will check to ensure existing class members are not overwritten. Note: Using temporary disk files can be exceedingly slow. Use with caution (until I implement a more efficient solution).
$bb->_add_data_table('cache');
$bb->_add_data_table('words', 1);
$bb->{words}{perl} = [ 1, 2, 3 ];
- _load_from_file FILENAME
-
Load the classifier from a file, overwriting $self. Internal function called from new.
$self->_load_from_file('/tmp/saved-classifier');
- _clone OBJECT
-
Clones a classifier/hash's options and data into a new object and returns the new object. Locking of the source OBJECT should be done outside this call.
$cc = $bb->_clone( $bb );
- DESTROY
-
Destructor function -- shouldn't be called by users. Blows away the temporary files when the object is done.
- Lock HASHNAME
- ReadLock HASHNAME
- UnLock HASHNAME
- LockAll
- UnLockAll
-
Wrappers around MLDBM::Sync locking calls. They manage locking on the data hash given in HASHNAME. If the hash lives only in memory and is not tied, these calls do nothing but are still safe to make.
$self->Lock('words');     # r/w-lock on $self->{words}
$self->ReadLock('words'); # read-lock on $self->{words}
$self->UnLock('words');   # unlocks $self->{words}
$self->LockAll;           # locks all data tables for $self
$self->UnLockAll;         # unlocks all data tables for $self
PREREQUISITES
MLDBM
MLDBM::Sync
File::Copy
Mail::Box
Mail::Address
File::Temp
File::Spec
BUGS
There are always bugs...
SEE ALSO
For more on cross-validation, see "An Introduction to the Bootstrap" by Efron and Tibshirani (1998), p. 239.
Inspiration for this kind of spam classification came from the article "A Plan For Spam" by Paul Graham: http://www.paulgraham.com/spam.html
For a specific implementation of Paul Graham's algorithm, see Mail::SpamTest::Bayesian.
Another module of this type, though not integrated with mailbox processing, is AI::Categorizer.
For a public corpus of spam and non-spam for testing, see the SpamAssassin site: http://spamassassin.org/publiccorpus/
AUTHOR
David Golden, <david@hyperbolic.net>
COPYRIGHT AND LICENSE
Copyright 2002 and 2003 by David Golden
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.