NAME

Mail::Classifier - Perl extension for probabilistic mail classification

SYNOPSIS

use Mail::Classifier;
$bb = Mail::Classifier->new();
$bb->train( 
    {   'spam.mbox' => 'SPAM', 
        'nonspam.mbox' => 'NONSPAM'
    }
);         
%xval = $bb->crossval(  
    {   'folds' => 4, 
        'threshold' => .9, 
        'corpus_list' => {   
            'spam.mbox' => 'SPAM',
            'nonspam.mbox' => 'NONSPAM'
        } 
    } 
);

In practice, Mail::Classifier is just a stub that must be overridden in a subclass, but the general interface is documented here. See a subclass for implementation-specific options or extensions.

DESCRIPTION

Mail::Classifier is an abstract base class for mail classification. As such, it provides capabilities for defining working data tables (which may be stored in memory or on disk) that persist across saves/restores. It also provides the message handling capabilities necessary to process mailboxes and conduct statistical validation.

Classes inherit from Mail::Classifier to implement a particular classification algorithm or technique. Derived classes must implement methods for learning and scoring messages. Typically, derived classes will also define methods for parsing messages into tokens for use in the learning and scoring methods.

Two derived classes are included with Mail::Classifier. The first, Mail::Classifier::Trivial, is an example of how to extend the base class. The second, Mail::Classifier::GrahamSpam, implements naive Bayesian filtering based on the article "A Plan For Spam" by Paul Graham (http://www.paulgraham.com/spam.html) and is a fully functional spam filter. See the RESULTS section, below.

One of the key benefits of Mail::Classifier is built-in support for generating classification matrices, either in the standard way with a test sample and a holdout sample or, more powerfully, through cross-validation. Cross-validation divides the training data into "N" folds and iteratively scores each fold using a model built on all the remaining folds, which maximizes the data available for model evaluation. [See "An Introduction to the Bootstrap" by Efron and Tibshirani (1998), p. 239, for more details.] The result is an out-of-sample evaluation of the performance (i.e. accuracy) of the classification engine that works on smaller training sets without explicit hold-out samples for validation. This is often preferable during development, as it validates the algorithm and the parameter settings used without requiring separate hold-out samples to be managed.
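
The fold logic described above can be sketched in plain Perl without the module installed. This is only an illustration: make_folds and fold_splits are hypothetical helpers, and dealing messages into folds round-robin is an assumption, since this document does not say how crossval assigns messages to folds.

```perl
use strict;
use warnings;

# Hypothetical helper: deal items round-robin into $n folds.
sub make_folds {
    my ( $n, @items ) = @_;
    my @folds = map { [] } 1 .. $n;
    push @{ $folds[ $_ % $n ] }, $items[$_] for 0 .. $#items;
    return @folds;
}

# Hypothetical helper: pair each held-out fold with a training set
# made of all the remaining folds.
sub fold_splits {
    my @folds = @_;
    my @splits;
    for my $i ( 0 .. $#folds ) {
        my @train = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $#folds;
        push @splits, { test => [ @{ $folds[$i] } ], train => \@train };
    }
    return @splits;
}

my @messages = map { "msg$_" } 1 .. 10;    # stand-ins for real messages
my @splits   = fold_splits( make_folds( 4, @messages ) );

for my $i ( 0 .. $#splits ) {
    printf "fold %d: train on %d, score %d\n",
        $i + 1, scalar @{ $splits[$i]{train} }, scalar @{ $splits[$i]{test} };
}
```

Every message lands in exactly one held-out fold, so each one is scored once by a model that never saw it during training.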

Mail::Classifier is not (yet) an efficient approach to high-volume classification. (It's in Perl, not C.) However, it is ideal for rapid experimentation and testing of classification algorithms, and it benefits from Perl's regular expression capabilities for exploring alternative message tokenization routines.

METHODS THAT SHOULD/MUST BE EXTENDED IN A SUBCLASS

* new 
* init
* forget
* isvalid
* parse
* learn
* unlearn
* score

With the exception of new and init, these methods are little more than stubs. Subclass developers will want to extend these functions to implement a particular classification algorithm and the associated data structures.

In particular, the init function should be extended using _add_data_table to provide data structures used by the subclass, and forget will need to reflect an appropriate "reset" of these data structures.

The other functions are specific to the algorithm and message handling method chosen.

new [options|FILENAME|CLASSIFIER]

Creates a new classifier object, setting any class options passed as a hash reference of key/value pairs. Alternatively, new can be called with the filename of a previously saved classifier, or with another classifier object, in which case the classifier is cloned, duplicating all data and datafiles.

$bb = Mail::Classifier->new();
$bb = Mail::Classifier->new( { OPTION1 => 'foo', OPTION2 => 'bar' } );
$bb = Mail::Classifier->new( "/tmp/saved-classifier" );
$cc = Mail::Classifier->new( $bb );
init

Called during new to initialize the class with any options specific to the class. This should include creating data tables with _add_data_table.

$self->init( {%options} );
forget

Blanks out data and structures. Must be implemented by subclasses.

$bb->forget;
isvalid MESSAGE

Confirms that a message can be handled (e.g. text vs. attachment). MESSAGE is a Mail::Message object.

Stub function to be implemented by subclasses. Parent class only returns true.

$bb->isvalid($msg);
parse MESSAGE

Breaks a message up into tokens. This is just a stub marking where class extensions should place their parsing.

$bb->parse($msg);
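
As a rough idea of what a subclass tokenizer might do, here is a minimal regex-based sketch. It is not the module's parser: tokenize is a hypothetical function, it takes the message text as a plain string rather than a Mail::Message object, and the token rules (letters, digits, apostrophes and hyphens, with purely numeric tokens dropped) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: plain string in, lowercase tokens out.
sub tokenize {
    my ($text) = @_;

    # Keep runs of letters, digits, apostrophes and hyphens.
    my @tokens = map { lc $_ } $text =~ /([A-Za-z0-9'-]+)/g;

    # Drop tokens that consist only of digits.
    return grep { !/^\d+$/ } @tokens;
}

my @tokens = tokenize("Subject: FREE offer!!! Don't miss 2003");
print join( ' ', @tokens ), "\n";
```

Changing the character class or the filter is the kind of experiment an overridden parse method makes cheap to try.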
learn CATEGORY, MESSAGE
unlearn CATEGORY, MESSAGE

learn processes a message as an example of a category according to some algorithm. MESSAGE is a Mail::Message.

unlearn reverses the process, for example to "unlearn" a message that has been falsely classified.

Stub functions to be implemented in subclasses. Both do nothing in the parent class.

$bb->learn('SPAM', $msg);
$bb->unlearn('SPAM', $msg);
score MESSAGE

Takes a message and returns a list of categories and probabilities in descending order of probability. MESSAGE is a Mail::Message.

Stub function to be implemented in subclasses. Parent returns ('NONE',1).

($best_cat, $best_cat_prob, @rest) = $bb->score($msg);
%probs = $bb->score($msg);
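
The two calling styles above both rely on score returning a flat list of category/probability pairs, best first. A minimal sketch of producing and consuming that shape follows; ranked_scores is a hypothetical helper standing in for a subclass's score method, and the probabilities are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten category probabilities into a
# (category, probability, ...) list, highest probability first.
sub ranked_scores {
    my (%prob) = @_;
    return map { ( $_ => $prob{$_} ) }
        sort { $prob{$b} <=> $prob{$a} || $a cmp $b } keys %prob;
}

my @scores = ranked_scores( NONSPAM => 0.03, SPAM => 0.97 );

# The first pair is the best category; the rest follow in order.
my ( $best_cat, $best_cat_prob, @rest ) = @scores;

# Assigning the same list to a hash gives lookup by category.
my %probs = @scores;

print "$best_cat ($best_cat_prob)\n";
```

Note that the hash form loses the ordering, so use the list form when only the best category matters.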

METHODS THAT (PROBABLY) DON'T NEED EXTENSION IN COMMON SUBCLASSES

* train/retrain
* classify
* crossval
* tagmsg
* tagmbox
* save
* setparse  -- DEPRECATED
* setconfig
* saveconfig
* loadconfig
* debug

These functions are part of the "standard" interface to Mail::Classifier. In a properly written subclass, these functions will perform as expected with hopefully no modifications.

train CORPUS-LIST
retrain CORPUS-LIST

Takes a hash of training corpus filenames and categories, walks through each message and calls learn() on it. May do some post-processing (e.g. trimming the resulting data set). Training corpora must be files that Mail::Box::Manager can recognize and process (e.g. Unix mbox format).

retrain is the same as train, but erases any prior training first.

$bb->train( {   'spam.mbox' => 'SPAM', 
                'nonspam.mbox' => 'NONSPAM'});         
classify OPTIONS
OPTIONS =   {   'threshold'   =>  .9,
                'corpus_list' =>  {   'spam.mbox' => 'SPAM',
                                    'nonspam.mbox' => 'NONSPAM' }
            }
            

Takes a probability threshold, plus a hash reference mapping corpus filenames to categories. For a message to be counted as scored into a category, the highest-probability category returned from score must exceed the threshold; otherwise the message is scored as the reserved category 'UNKNOWN'.

classify does not destroy prior training -- it merely creates a classification matrix for a given set of data using the existing probabilities.

%xval = $bb->classify(  {   'threshold' => .9, 
                            'corpus_list' => 
                                {   'spam.mbox' => 'SPAM',
                                    'nonspam.mbox' => 'NONSPAM' } } );
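
The threshold rule and the resulting tally can be sketched as follows. This is an illustration only: assign_category is a hypothetical helper, the scored messages are invented, and the nested-hash layout of %matrix is an assumption, not the documented structure of the value classify returns.

```perl
use strict;
use warnings;

# Hypothetical helper: the best category counts only if its
# probability strictly exceeds the threshold.
sub assign_category {
    my ( $threshold, $best_cat, $best_prob ) = @_;
    return $best_prob > $threshold ? $best_cat : 'UNKNOWN';
}

# Each row: actual category, best-scoring category, its probability.
my @scored = (
    [ SPAM    => 'SPAM',    0.99 ],
    [ SPAM    => 'SPAM',    0.70 ],    # below threshold
    [ NONSPAM => 'NONSPAM', 0.95 ],
    [ NONSPAM => 'SPAM',    0.92 ],    # a false positive
);

# Tally (actual, assigned) pairs into a nested hash of counts.
my %matrix;
for my $row (@scored) {
    my ( $actual, $best_cat, $best_prob ) = @$row;
    $matrix{$actual}{ assign_category( 0.9, $best_cat, $best_prob ) }++;
}

for my $actual ( sort keys %matrix ) {
    for my $assigned ( sort keys %{ $matrix{$actual} } ) {
        print "$actual scored as $assigned: $matrix{$actual}{$assigned}\n";
    }
}
```

Raising the threshold moves borderline messages into 'UNKNOWN', trading fewer false positives for more unclassified mail.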
crossval OPTIONS
OPTIONS =   {   'folds'       =>  4,
                'threshold'   =>  .9,
                'corpus_list' =>  {   'spam.mbox' => 'SPAM',
                                    'nonspam.mbox' => 'NONSPAM' }
            }
            

Takes an integer number of folds, a probability threshold, plus a hash reference mapping training corpus filenames to categories. Returns a classification table built with N-fold cross-validation. For a message to be counted as scored into a category, the highest-probability category returned from score must exceed the threshold; otherwise the message is scored as the reserved category 'UNKNOWN'.

crossval destroys prior training. Users who do not want to lose prior training should consider cloning the classifier and cross-validating the clone. Because it always trains from scratch, cross-validation is a good test of a specific implementation of an algorithm and its option settings. To test the validity of a model trained on one data set against a new data set, use classify instead.

%xval = $bb->crossval(  {   'folds' => 4, 'threshold' => .9, 
                            'corpus_list' => 
                                {   'spam.mbox' => 'SPAM',
                                    'nonspam.mbox' => 'NONSPAM' } } );
tagmsg OPTIONS

tagmsg takes a Mail::Message object, adds a header listing all categories whose likelihood exceeds a threshold, and returns a new Mail::Message object. Any previous headers of that type are deleted prior to the tagging. If verbose is true, details of the scoring are appended.

    $bb->tagmsg(   {    'msg' => $msg,
                        'threshold' =>  .9,
                        'header' => 'X-Mail-Classifier',
                        'verbose' => 0 } );
tagmbox OPTIONS

tagmbox is like tagmsg, but tags an entire mailbox, given by the filename in the 'mailbox' option.

    $bb->tagmbox(   {   'mailbox' => '/home/fred/mbox',
                        'threshold' =>  .9,
                        'header' => 'X-Mail-Classifier',
                        'verbose' => 0 } );
save FILENAME

Dump the entire classifier to a Perl Storable file, given by FILENAME.

$bb->save("/tmp/saved-classifier");
setparse FUNCTION-REFERENCE

DEPRECATED: Used to optionally set an external function for parsing. Now, use of a separate parser should be done by subclassing and overriding the parsing function.

saveconfig FILENAME

Saves only the configuration options to a text file.

$bb->saveconfig('/tmp/options-only.txt');
loadconfig FILENAME

Load configuration options from a file. Options are KEY=VALUE pairs. Comments (using '#') and lines with leading whitespace are ignored. Will clobber existing options with the same name.

$bb->loadconfig('/tmp/options-only.txt');
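
The file format described above is simple enough to sketch in a few lines. The parser below is hypothetical and is not the module's loadconfig: it reads from a string so that it runs without a file on disk, and it assumes that '#' comments occupy a whole line and that lines without an '=' are skipped.

```perl
use strict;
use warnings;

# Hypothetical parser for KEY=VALUE configuration text.
sub parse_config {
    my ($text) = @_;
    my %options;
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^\s/;    # leading whitespace: ignored
        next if $line =~ /^#/;     # comment line: ignored
        next unless $line =~ /^([^=]+)=(.*)$/;
        $options{$1} = $2;         # a later key clobbers an earlier one
    }
    return %options;
}

my %options = parse_config(<<'END');
# thresholds
threshold=.9
  indented=ignored
debug=0
debug=1
END

print "$_=$options{$_}\n" for sort keys %options;
```

The repeated debug key shows the clobbering behaviour: the last value read for a name is the one that survives.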
setconfig HASHREF

Sets parameters controlling how messages are processed (e.g. thresholds, default probabilities, caps/floors, etc.). This will clobber existing options, so use with caution or override it to check and handle them appropriately.

$bb->setconfig( { option1 => 'foo', option2 => 'bar' });
debug [INTEGER]

Accessor to get/set the debug level.

$bb->debug(1);
print "stuff" if ($bb->debug);

The stub parent uses the following levels:

   0: no debugging info
   1: basic flow info
   5: detailed message-level info

INTERNAL METHODS

These methods should not be called by the end user, but are listed as a reference for developers subclassing this class.

* _add_data_table
* _load_from_file
* _clone
* Lock/ReadLock/UnLock/LockAll/UnLockAll
* DESTROY
_add_data_table HASHNAME [STORE-ON-DISK-FLAG]

Internal function to add a hash reference to the $self object for data storage. If STORE-ON-DISK-FLAG is true, a temporary datafile is created with MLDBM::Sync, using MLDBM::Sync::SDBM_File as the underlying datastore and Storable as the data freezing method. Data structures created with this function will be appropriately saved/loaded/locked by other methods. This method checks that existing class members are not overwritten. Note: Using temporary disk files can be exceedingly slow. Use with caution (until I implement a more efficient solution).

$bb->_add_data_table('cache');
$bb->_add_data_table('words', 1);
$bb->{words}{perl} = [ 1, 2, 3 ];
_load_from_file FILENAME

Load the classifier from a file, overwriting $self. Internal function called from new.

$self->_load_from_file('/tmp/saved-classifier');
_clone OBJECT

Clones a classifier/hash's options and data into a new object and returns the new object. Locking of the source OBJECT should be done outside this call.

$cc = $bb->_clone( $bb );
DESTROY

Destructor function -- shouldn't be called by users. Blows away the temporary files when the object is done.

Lock HASHNAME
ReadLock HASHNAME
UnLock HASHNAME
LockAll
UnLockAll

Wrappers around MLDBM::Sync locking calls. Manages locking on the data hash given in HASHNAME. This will do nothing but is still safe if the hash only lives in memory and is not tied.

$self->Lock('words');      # r/w-lock on $self->{words}
$self->ReadLock('words');  # read-lock on $self->{words}
$self->UnLock('words');    # unlocks $self->{words}

$self->LockAll;            # Locks all data tables for $self
$self->UnLockAll;          # UnLocks all data tables for $self
    

PREREQUISITES

MLDBM 
MLDBM::Sync
File::Copy
Mail::Box
Mail::Address
File::Temp
File::Spec

BUGS

There are always bugs...

SEE ALSO

For more on cross-validation, see "An Introduction to the Bootstrap" by Efron and Tibshirani (1998), p. 239.

Inspiration for this kind of spam classification came from the article "A Plan For Spam" by Paul Graham: http://www.paulgraham.com/spam.html

For a specific implementation of Paul Graham's algorithm, see Mail::SpamTest::Bayesian

Another module of this type, though not integrated with mailbox processing, is AI::Categorizer.

For a public corpus of spam and non-spam for testing, see the SpamAssassin site: http://spamassassin.org/publiccorpus/

AUTHOR

David Golden, <david@hyperbolic.net>

COPYRIGHT AND LICENSE

Copyright 2002 and 2003 by David Golden

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.