NAME

Mail::Classifier::GrahamSpam - spam classification based on Paul Graham's algorithm

SYNOPSIS

use Mail::Classifier::GrahamSpam;
$bb = Mail::Classifier::GrahamSpam->new();
$bb->bias( 'NOTSPAM', 2);
$bb->train( { 'spam.mbox' => 'SPAM', 'notspam.mbox' => 'NOTSPAM' } );
my ($cat, $prob) = $bb->score( $msg );

ABSTRACT

Mail::Classifier::GrahamSpam - spam classification based on Paul Graham's algorithm

DESCRIPTION

This class is a specific implementation of a Mail::Classifier that uses Naive Bayesian methods to associate messages with a category. The specific implementation is based on the article "A Plan for Spam" by Paul Graham (thus the name).

For classic Graham behavior, set the bias on the non-spam category to 2 (see bias, below).

While this class was designed to classify spam and non-spam, there is no underlying limitation to two categories, so it may be used for more general purposes as well. (And should perhaps be renamed in a subsequent release.) For example, we might call

$bb->train ({   'perl.mbox' => 'PERL',
                'java.mbox' => 'JAVA',
                'php.mbox'  => 'PHP'    });

in order to train the classifier to identify other categories of mail.

METHODS THAT ARE EXTENDED IN THIS SUBCLASS

* new 
* init
* forget
* isvalid
* parse
* learn
* unlearn
* score

new [OPTIONS|FILENAME|CLASSIFIER]

Create a new classifier object, setting any class options by passing a hash reference of key/value pairs. Alternatively, it can be called with the filename of a previously saved classifier, or with another classifier object, in which case the classifier will be cloned, duplicating all data and datafiles.

$bb = Mail::Classifier::GrahamSpam->new();
$bb = Mail::Classifier::GrahamSpam->new( { OPTION1 => 'foo', OPTION2 => 'bar' } );
$bb = Mail::Classifier::GrahamSpam->new( "/tmp/saved-classifier" );
$cc = Mail::Classifier::GrahamSpam->new( $bb );

OPTIONS (with default) include:

    debug => 0,                     # Integer debug level

    on_disk => 0,                   # if true, will store large tables in
                                    # scratch db-files, but with poor
                                    # performance

    n_observations_required => 1,   # Ignore words with a count less than this

    number_of_predictors => 41,     # Score using this number of words

    minimum_word_prob => 0.01,      # Floor for any word's probability

    maximum_word_prob => 0.99,      # Cap for any word's probability

    score_delay => 1,               # Recalculate when learned message count
                                    # exceeds scored message count by this factor

    ignored_tokens => [],           # Tokens to ignore while parsing (case-insensitive)
	
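For example, a classifier that keeps its tables on disk and ignores rarely seen tokens might be constructed as follows; the option values shown are purely illustrative:

    use Mail::Classifier::GrahamSpam;

    # Illustrative option values -- any of the options above may be combined
    my $bb = Mail::Classifier::GrahamSpam->new( {
        on_disk                  => 1,    # keep big tables in scratch db-files
        n_observations_required  => 5,    # skip words seen fewer than 5 times
        number_of_predictors     => 15,   # Graham's article scored the top 15 tokens
        ignored_tokens           => [ 'mailto', 'http' ],
    } );
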
init

Called during new to initialize the class with default options specific to the class. This includes creating data tables with _add_data_table.

$self->init( {%options} );

forget

Blanks out the frequency data, resetting the classifier to its initial state.

$bb->forget;

isvalid MESSAGE

Confirm that a message can be handled -- e.g. text vs. attachment, etc. MESSAGE is a Mail::Message object. In this version, messages are valid if they are of MIME type "text/*".

$bb->isvalid($msg);

NOTE: Need to add something to limit by character set?

parse MESSAGE

Breaks up a message into tokens -- included are the subject and x-mailer headers; the name and e-mail address from a sender/from header (but not the comment, in case the message was redirected for analysis); and all the body lines from all "text" (plain or HTML) sections of the message. Returns an array of tokens. Splits on anything that isn't alphanumeric, a single-quote, underscore, dollar sign, or dash. Ignores single-character words and words that are all numbers.

This parsing could stand to be more intelligent, preserving IP addresses, e-mail addresses, URLs, etc. Perhaps when I learn Parse::RecDescent.
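
A minimal standalone sketch of just that splitting rule (this is not the module's parse method, which also walks the headers and MIME parts of the message):

    use strict;
    use warnings;

    # Split on anything that isn't alphanumeric, single-quote, underscore,
    # dollar-sign or dash; drop one-character and all-numeric tokens.
    sub tokenize_text {
        my ($text) = @_;
        my @tokens = split /[^\w'\$-]+/, $text;   # \w covers alphanumerics and underscore
        return grep { length($_) > 1 && $_ !~ /^\d+$/ } @tokens;
    }

    my @words = tokenize_text("Buy V1AGRA now - only \$9.99!!!");
    # => ( 'Buy', 'V1AGRA', 'now', 'only', '$9' )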

bias CATEGORY, [BIAS]

This accessor function gets/sets a bias on a category, effectively multiplying the weight of the tokens observed in that category. Paul Graham biased "good" tokens by a factor of two to cut down on false positives. YMMV. The bias must be > 0 or the call will silently fail.

Note: No bias is set by default, as the name of the "good" category is up to the user.

$bb->bias( 'NOTSPAM' => 2);

learn CATEGORY, MESSAGE
unlearn CATEGORY, MESSAGE

learn processes a message as an example of a category according to some algorithm. MESSAGE is a Mail::Message.

unlearn reverses the process, for example to "unlearn" a message that has been falsely classified.

In this class, messages are tokenized with parse and the results are added to a count by category for later use by updatepredictors.

$bb->learn('SPAM', $msg);
$bb->unlearn('SPAM', $msg);
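
For example, a batch of messages that was mistakenly trained as spam might be corrected as follows; the mbox filename is hypothetical and Mail::Box is just one convenient way to obtain Mail::Message objects:

    use Mail::Box::Manager;

    my $mgr    = Mail::Box::Manager->new;
    my $folder = $mgr->open( folder => 'corrections.mbox' );   # hypothetical file

    for my $msg ( $folder->messages ) {
        $bb->unlearn( 'SPAM',    $msg );   # reverse the mistaken training
        $bb->learn(  'NOTSPAM', $msg );    # re-learn under the right category
    }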

score MESSAGE [DETAILS]

Takes a message and returns a list of categories and probabilities in descending order. MESSAGE is a Mail::Message object.

DETAILS is an optional array to store notes about the details of the calculation. DETAILS will be overwritten.

In this class, score uses the probabilities of the most significant tokens, iterating over each category, and passes them to prediction.

my ($cat, $prob) = $bb->score( $msg );

Note: score will take a long time to execute the first time it is called, as it will need to call updatepredictors to refresh.
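
For example, to capture the calculation notes -- assuming here that DETAILS is supplied as an array reference, which is an assumption about the calling convention rather than a documented guarantee:

    my @details;                                        # will be overwritten by score
    my ($cat, $prob) = $bb->score( $msg, \@details );   # array-ref passing is assumed
    print "$cat ($prob)\n";
    print "  $_\n" for @details;                        # notes on how the score was computed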

updatepredictors

Updates the precalculated predictors hash. This function is called automatically whenever enough new messages have been learned since the last time it was called.

Per-token predictions are based on the formula used by Graham:

prob(bad) =
                         ( b / nb ) * bb
                ---------------------------------
                ( g / ng ) * gb + ( b / nb ) * bb


where   b = number of times a token appeared in "bad" messages
        nb = number of bad messages
        bb = bias factor for bad messages
        g = number of times a token appeared in "good" messages
        ng = number of good messages
        gb = bias factor for good messages

except that predictors generalize to the N-category case.

$self->updatepredictors;
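
The module generalizes this to N categories; as a worked two-category sketch (not the module's internal code) that also applies the minimum_word_prob/maximum_word_prob clamps from the constructor options:

    # Per-token "bad" probability for the two-category case (illustrative)
    sub token_bad_prob {
        my ( $b, $nb, $g, $ng, $bad_bias, $good_bias ) = @_;
        my $bad  = ( $b / $nb ) * $bad_bias;
        my $good = ( $g / $ng ) * $good_bias;
        my $p    = $bad / ( $good + $bad );
        $p = 0.01 if $p < 0.01;    # minimum_word_prob
        $p = 0.99 if $p > 0.99;    # maximum_word_prob
        return $p;
    }

    # A token seen 30 times in 1000 bad messages and once in 1000 good ones,
    # with Graham's good-message bias of 2:
    my $p = token_bad_prob( 30, 1000, 1, 1000, 1, 2 );   # about 0.94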

prediction ARRAY

prediction takes an array of token probabilities and returns the collective prediction based on all of them taken together.

Overall probability based on N tokens comes from Robinson using Fisher's method:

    P = prbx( -2 * sum( ln(1 - f(w)) ), 2*N )
    Q = prbx( -2 * sum( ln(f(w)) ), 2*N )

    prob(category) = (1 + Q - P) / 2

    where prbx is the inverse chi-squared probability and f(w) is p(category|word).

    $result = $bb->prediction( @predictors );
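
A standalone sketch of that combination; here chisqrprob from the CPAN module Statistics::Distributions is assumed to supply the upper-tail chi-squared probability that prbx denotes (that module is not a stated dependency, just a convenient stand-in):

    use Statistics::Distributions;

    # Combine per-token probabilities f(w) with Fisher's method (illustrative)
    sub fisher_combine {
        my @f = @_;                    # each f(w) already clamped to [0.01, 0.99]
        my $n = scalar @f;
        my ( $sum_ln_comp, $sum_ln ) = ( 0, 0 );
        for my $fw (@f) {
            $sum_ln_comp += log( 1 - $fw );
            $sum_ln      += log($fw);
        }
        my $P = Statistics::Distributions::chisqrprob( 2 * $n, -2 * $sum_ln_comp );
        my $Q = Statistics::Distributions::chisqrprob( 2 * $n, -2 * $sum_ln );
        return ( 1 + $Q - $P ) / 2;    # near 1 means a strong match for the category
    }

    my $prob = fisher_combine( 0.99, 0.93, 0.20 );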
    

PREREQUISITES

See Mail::Classifier.

BUGS

There are always bugs...

AUTHOR

David Golden, <david@hyperbolic.net>

COPYRIGHT AND LICENSE

Copyright 2002 and 2003 by David Golden

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.