Changes for version 0.12 - 2004-02-24

  • tag() now tags with the reserved category "UNKNOWN" if no category meets the probability threshold. This guarantees that all messages passed to tag() will receive a header.
  • parse() augmented with a separate tokenize() function
  • parse() now tags 'to', 'from', 'subject', and 'mailer' tokens with context
  • tokenize() extracts href/src/mailto hosts and addresses, strips all HTML tags, decodes HTML entities, and handles words with embedded punctuation much more intelligently. It also strips ">>>" forwarding symbols.
  • prediction now uses the Robinson-Fisher inverse chi-squared method to combine individual word predictors
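
For readers unfamiliar with the Robinson-Fisher combining step named in the last item, the following is a minimal Python sketch of the general technique. It is not this module's Perl code. The function names chi2_survival and combine, the clamping bound of 1e-6, and the two-way reading (each word probability is the chance that the message belongs to one category rather than not) are assumptions made for illustration only.

```python
import math


def chi2_survival(x2, dof):
    """P(X >= x2) for a chi-squared variable with an even number of degrees of freedom.

    For even dof this has a closed form: exp(-m) * sum(m**i / i!) for
    i in 0 .. dof/2 - 1, where m = x2 / 2.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)  # i = 0 term; underflows to 0.0 for very large x2
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def combine(probs):
    """Combine per-word probabilities into one indicator in [0, 1].

    Values near 1 mean the words agree the message is in the category,
    values near 0 mean they agree it is not, and 0.5 means no evidence
    or conflicting evidence.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    # Assumed clamp: keep probabilities away from 0 and 1 so log() is defined.
    clamped = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]
    ln_p = sum(math.log(p) for p in clamped)
    ln_q = sum(math.log(1.0 - p) for p in clamped)
    h = chi2_survival(-2.0 * ln_p, 2 * n)  # near 1 when words point to the category
    s = chi2_survival(-2.0 * ln_q, 2 * n)  # near 1 when words point away from it
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    print(combine([0.99, 0.95, 0.98]))  # strongly in the category
    print(combine([0.9, 0.1]))          # conflicting evidence
```

One property of this style of combining is that conflicting evidence lands near 0.5 instead of being forced toward either extreme, which fits naturally with a fallback such as the "UNKNOWN" category when no score clears a threshold.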

Modules

Perl extension for probabilistic mail classification
spam classification based on Paul Graham's algorithm
a trivial subclass example