NAME
Mail::Classifier::GrahamSpam - spam classification based on Paul Graham's algorithm
SYNOPSIS
use Mail::Classifier::GrahamSpam;
my $bb = Mail::Classifier::GrahamSpam->new();
$bb->bias( 'NOTSPAM', 2);
$bb->train( { 'spam.mbox' => 'SPAM', 'notspam.mbox' => 'NOTSPAM' } );
my ($cat, $prob) = $bb->score( $msg );
ABSTRACT
Mail::Classifier::GrahamSpam - spam classification based on Paul Graham's algorithm
DESCRIPTION
This class is a specific implementation of a Mail::Classifier that uses Naive Bayesian methods for associating messages with a category. The specific implementation is based on the article "A Plan for Spam" by Paul Graham (thus the name).
For classic Graham, make sure to set bias on non-spam to 2.
While this class was designed to classify spam and non-spam, there is no underlying limitation that only two categories be used and thus it may be used for more general purposes as well. (And should perhaps be renamed in a subsequent release.) For example, we might call
$bb->train ({ 'perl.mbox' => 'PERL',
'java.mbox' => 'JAVA',
'php.mbox' => 'PHP' });
in order to train the classifier to identify other categories of mail.
METHODS THAT ARE EXTENDED IN THIS SUBCLASS
* new
* init
* forget
* isvalid
* parse
* learn
* unlearn
* score
- new [OPTIONS|FILENAME|CLASSIFIER]
-
Create a new classifier object, setting any class options by passing a hash-reference to key/value pairs. Alternatively, can be called with a filename from a previous saved classifier, or another classifier object, in which case the classifier will be cloned, duplicating all data and datafiles.
$bb = Mail::Classifier::GrahamSpam->new();
$bb = Mail::Classifier::GrahamSpam->new( { OPTION1 => 'foo', OPTION2 => 'bar' } );
$bb = Mail::Classifier::GrahamSpam->new( "/tmp/saved-classifier" );
$cc = Mail::Classifier::GrahamSpam->new( $bb );
OPTIONS (with default) include:
debug => 0,                    # Integer debug level
on_disk => 0,                  # If true, will store large tables in
                               # scratch db-files, but with poor
                               # performance
n_observations_required => 1,  # Ignore words with a count less than this
number_of_predictors => 41,    # Score using this number of words
minimum_word_prob => 0.01,     # Floor for any word's probability
maximum_word_prob => 0.99,     # Cap for any word's probability
score_delay => 1,              # Recalculate when learned message count
                               # exceeds scored message count by this factor
ignored_tokens => []           # Tokens to ignore while parsing (case insensitive)
- init
-
Called during new to initialize the class with default options specific to the class. This includes creating data tables with _add_data_table.
$self->init( {%options} );
- forget
-
Blanks out the frequency data, resetting the classifier to its initial state.
$bb->forget;
- isvalid MESSAGE
-
Confirms that a message can be handled (e.g. text versus attachment). MESSAGE is a Mail::Message object. In this version, messages are valid if they are of MIME type "text/*".
$bb->isvalid($msg);
NOTE: Need to add something to limit by character set?
- parse MESSAGE
-
Breaks up a message into tokens -- included are subject and x-mailer headers; the name and e-mail address from a sender/from header (but not the comment, in case this was re-directed for analysis); and all the body lines from all "text" (plain or html) sections of the message. Returns an array of tokens. Splits on anything that isn't alphanumeric, single-quote, underscore, dollar-sign or dash. Ignores single-character words and words that are all numbers.
This parsing could stand to be updated to be more intelligent, preserving IP addresses, e-mail addresses, URLs, etc. Perhaps when I learn Parse::RecDescent.
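The splitting rule described above can be illustrated with a minimal sketch. It covers only the tokenizing of a plain string, not header or MIME handling, and tokenize_text is a hypothetical helper name, not a method of this class. Whether the module folds case is not stated here, so the sketch leaves case untouched.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Mail::Classifier::GrahamSpam.
sub tokenize_text {
    my ($text) = @_;

    # Split on runs of characters that are not alphanumeric,
    # single-quote, underscore, dollar-sign or dash.
    my @words = split /[^A-Za-z0-9'_\$\-]+/, $text;

    # Drop empty strings, single-character words and all-digit words.
    return grep { length($_) > 1 && !/^[0-9]+$/ } @words;
}

my @tokens = tokenize_text("Win \$1000 now! Call 5551234, it's a no-risk offer");
print join(' ', @tokens), "\n";
```

Note that "$1000" survives because the dollar sign keeps it from being all digits, while "5551234" and "a" are dropped.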
- bias CATEGORY, [BIAS]
-
This accessor function gets or sets a bias on a category, effectively multiplying the weight of the tokens observed in that category. Paul Graham biased "good" tokens by a factor of two to cut down on false positives. YMMV. BIAS must be greater than 0; otherwise the call silently fails and the bias is left unchanged.
Note: No bias is set by default, as the name of the "good" category is up to the user.
$bb->bias( 'NOTSPAM' => 2);
- learn CATEGORY, MESSAGE
- unlearn CATEGORY, MESSAGE
-
learn processes a message as an example of a category according to some algorithm. MESSAGE is a Mail::Message.
unlearn reverses the process, for example to "unlearn" a message that has been falsely classified.
In this class, messages are tokenized with parse and the results are added to a count by category for later use by updatepredictors.
$bb->learn('SPAM', $msg);
$bb->unlearn('SPAM', $msg);
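The per-category counting that learn and unlearn perform can be pictured with plain hashes. This is a sketch under assumptions: the table names and the helpers learn_tokens and unlearn_tokens are hypothetical, and the module's actual data tables (created with _add_data_table) are not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical in-memory tables standing in for the class's data tables.
my (%token_count, %message_count);

# Record one message's tokens as an example of $category.
sub learn_tokens {
    my ($category, @tokens) = @_;
    $message_count{$category}++;
    $token_count{$category}{$_}++ for @tokens;
}

# Reverse a previous learn_tokens call for the same tokens.
sub unlearn_tokens {
    my ($category, @tokens) = @_;
    return unless $message_count{$category};
    $message_count{$category}--;
    for my $token (@tokens) {
        next unless $token_count{$category}{$token};
        delete $token_count{$category}{$token}
            if --$token_count{$category}{$token} == 0;
    }
}

learn_tokens('SPAM', qw(free offer free));
learn_tokens('SPAM', qw(offer));
unlearn_tokens('SPAM', qw(offer));
```

After these calls the sketch holds one SPAM message, with "free" counted twice and "offer" once.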
- score MESSAGE [DETAILS]
-
Takes a message and returns a list of categories and probabilities in descending order. MESSAGE is a Mail::Message.
DETAILS is an optional array to store notes about the details of the calculation. DETAILS will be overwritten.
In this class, score uses the probabilities of the most significant tokens, iterating over each category, and passes them to prediction.
my ($cat, $prob) = $bb->score( $msg );
Note: score will take a long time to execute the first time it is called, as it will need to call updatepredictors to refresh.
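As an illustration of picking the "most significant" tokens, the sketch below assumes that significance means distance of a token's probability from the neutral value 0.5, as in Graham's article; this document does not state the exact criterion the class uses. The helper name most_significant is hypothetical, and its first argument plays the role of number_of_predictors.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n probabilities farthest from 0.5.
sub most_significant {
    my ($n, @probs) = @_;
    my @sorted = sort { abs($b - 0.5) <=> abs($a - 0.5) } @probs;
    $n = @sorted if $n > @sorted;    # fewer tokens than requested
    return @sorted[0 .. $n - 1];
}

my @top = most_significant(3, 0.50, 0.99, 0.40, 0.02, 0.65);
print "@top\n";
```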
- updatepredictors
-
Updates the precalculated predictors hash. This function is called automatically once enough new messages have been learned since the last time it was called (see the score_delay option).
Per-token predictions are based on the formula used by Graham:
prob(bad) =          (b / nb) * bb
            -------------------------------
            (g / ng) * gb + (b / nb) * bb

where
b  = number of times a token appeared in "bad" messages
nb = number of bad messages
bb = bias factor for bad messages
g  = number of times a token appeared in "good" messages
ng = number of good messages
gb = bias factor for good messages
except that predictors generalize to the N-category case.
$self->updatepredictors;
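A worked instance of the two-category formula above may help. The sketch below is not the module's code: token_probability is a hypothetical helper, the clamping uses the documented defaults for minimum_word_prob and maximum_word_prob, and returning 0.5 for a token never seen in either category is an assumption made only to avoid dividing by zero.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the two-category formula,
# clamped to the default minimum_word_prob / maximum_word_prob.
sub token_probability {
    my (%arg) = @_;
    my $bad  = ($arg{b} / $arg{nb}) * $arg{bb};
    my $good = ($arg{g} / $arg{ng}) * $arg{gb};
    return 0.5 if $bad + $good == 0;    # assumption: unseen token is neutral
    my $prob = $bad / ($good + $bad);
    $prob = 0.01 if $prob < 0.01;
    $prob = 0.99 if $prob > 0.99;
    return $prob;
}

# Token seen 30 times in 100 bad messages and 5 times in 200 good
# messages, with the "classic Graham" bias of 2 on the good category.
my $example = token_probability(
    b => 30, nb => 100, bb => 1,
    g => 5,  ng => 200, gb => 2,
);
printf "%.4f\n", $example;
```

Here the bad term is 0.3 and the biased good term is 0.05, so the result is 0.3 / 0.35, or about 0.8571.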
- prediction ARRAY
-
prediction takes an array of token probabilities and returns the collective prediction based on all of them taken together.
Overall probability based on N tokens comes from Robinson using Fisher's method:
P = prbx( -2 * sum(ln(1-f(w))), 2*N )
Q = prbx( -2 * sum(ln(f(w))), 2*N )
prob(category) = (1 + Q - P) / 2

where prbx is the inverse chi-squared probability and f(w) is p(category|word).
$result = $bb->prediction( @predictors );
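The combination rule above can be sketched in a few lines. This is an illustration under assumptions, not the module's implementation: chi2q and combine are hypothetical names, and prbx is assumed to be the upper-tail chi-squared probability, which has a closed-form series when the degrees of freedom are even (as 2*N always is). Inputs must lie strictly between 0 and 1, which the minimum_word_prob and maximum_word_prob limits guarantee.

```perl
use strict;
use warnings;

# Upper-tail chi-squared probability for an even number of degrees of
# freedom; assumed here to be what the text calls prbx.
sub chi2q {
    my ($x, $dof) = @_;
    my $m    = $x / 2;
    my $term = exp(-$m);
    my $sum  = $term;
    for my $i (1 .. $dof / 2 - 1) {
        $term *= $m / $i;
        $sum  += $term;
    }
    return $sum > 1 ? 1 : $sum;    # guard against rounding above 1
}

# Combine per-token probabilities f(w) into prob(category).
sub combine {
    my @f = @_;
    my $n = @f;
    my ($ln_f, $ln_not_f) = (0, 0);
    for my $f (@f) {
        $ln_f     += log($f);
        $ln_not_f += log(1 - $f);
    }
    my $P = chi2q(-2 * $ln_not_f, 2 * $n);
    my $Q = chi2q(-2 * $ln_f,     2 * $n);
    return (1 + $Q - $P) / 2;
}

printf "%.3f\n", combine(0.5, 0.5, 0.5);
```

With every token at the neutral value 0.5, P and Q are equal and the combined result is 0.5; tokens that all lean strongly toward the category push the result toward 1.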
PREREQUISITES
See Mail::Classifier
BUGS
There are always bugs...
AUTHOR
David Golden, <david@hyperbolic.net>
COPYRIGHT AND LICENSE
Copyright 2002 and 2003 by David Golden
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.