NAME

Text::SpamAssassin - Detects spamminess of arbitrary text, suitable for wiki and blog defense

VERSION

version 2.001

SYNOPSIS

use Text::SpamAssassin;

my $sa = Text::SpamAssassin->new(
    sa_options => {
        userprefs_filename => 'comment_spam_prefs.cf',
    },
);

$sa->set_text($content);

my $result = $sa->analyze;
print "result: $result->{verdict}\n";

DESCRIPTION

Text::SpamAssassin is a wrapper around Mail::SpamAssassin that makes it easy to check simple blocks of text or HTML for spam content. Its main purpose is to help integrate SpamAssassin into non-mail contexts like blog comments. It works by creating a minimal email message based on the text or HTML you pass it, then handing that email to SpamAssassin for analysis. See "MESSAGE GENERATION" for more details.

CONSTRUCTOR

my $sa = Text::SpamAssassin->new(
    sa_options => {
        userprefs_filename => 'comment_spam_prefs.cf',
    },
);

As well as initializing the object, the constructor creates a Mail::SpamAssassin object for the actual analysis work. The following options may be passed to the constructor:

sa_options

A hashref. This will be passed as-is to the Mail::SpamAssassin constructor. At the very least, you probably want to provide userprefs_filename, as the default configuration isn't particularly well suited to non-mail spam. See "SPAMASSASSIN CONFIGURATION" for details.

lazy

By default, the Mail::SpamAssassin object will be fully created in the Text::SpamAssassin constructor. This requires it to compile the rulesets and load any modules it needs, which can take a little while. If the lazy option is set to a true value, this setup will be deferred until the first scan is done.
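A minimal sketch of deferring the setup cost, for example in a web application where only some requests scan a comment. The helper name build_scanner is hypothetical; only the lazy and sa_options options come from this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a scanner whose Mail::SpamAssassin setup is
# deferred until the first scan is done.
sub build_scanner {
    my ($prefs_file) = @_;

    require Text::SpamAssassin;    # loaded the first time the helper runs

    return Text::SpamAssassin->new(
        lazy       => 1,
        sa_options => { userprefs_filename => $prefs_file },
    );
}
```

Calling build_scanner('comment_spam_prefs.cf') should return quickly; the ruleset compilation is then paid for by the first scan instead.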

METHODS

All the set_* and reset_* methods return the Text::SpamAssassin object they are invoked on to allow easy call chaining:

my $result = $sa->reset
                ->set_text("comments")
                ->set_metadata("ip", "127.0.0.1")
                ->analyze;

set_text

$sa->set_text("some comment text");

Stores some text content for later analysis. Any content previously set with set_text or set_html will be overwritten.

set_html

$sa->set_html("<p>see <a href='#'>here</a> for more info</p>");

Stores some HTML content for later analysis. Any content previously set with set_text or set_html will be overwritten.

set_header

$sa->set_header("Subject", "your blog is stupid");

Set a header that will be added to the constructed message that gets passed to SpamAssassin. This will override any header of the same name that would normally be generated by Text::SpamAssassin. To set multiple headers with the same name, provide an arrayref as the value instead.
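The two forms of set_header can be sketched as follows. The helper name apply_headers and the header name X-Comment-Tag are made up for illustration; $sa is assumed to be a Text::SpamAssassin object.

```perl
use strict;
use warnings;

# Hypothetical helper: $sa is a Text::SpamAssassin object.
sub apply_headers {
    my ($sa, $post_title) = @_;

    # A plain value overrides the Subject header that would otherwise be
    # generated.
    $sa->set_header('Subject', "Comment on: $post_title");

    # An arrayref asks for several headers with the same name, one per element.
    # X-Comment-Tag is an invented header name, used only as an example.
    $sa->set_header('X-Comment-Tag', [ 'blog', 'anonymous' ]);

    return $sa;
}
```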

set_metadata

$sa->set_metadata("ip", "127.0.0.1");

Sets metadata related to the text, usually taken from additional fields in a blog comment form. Some of these values are used when constructing the message header for SpamAssassin. When scanning text (but not HTML), this data will also be added to the message body so that it can be scanned. Any additional data that you want scanned (such as URLs) should be added here.
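As a sketch, the fields of a comment form might be copied into the scanner like this. The helper name load_comment, the shape of the $comment hashref and the "url" key are assumptions made for this example; author, email, subject and ip are the keys described under "MESSAGE GENERATION".

```perl
use strict;
use warnings;

# Hypothetical helper: $sa is a Text::SpamAssassin object and $comment is a
# hashref of form fields (an assumed structure, not part of the module).
sub load_comment {
    my ($sa, $comment) = @_;

    $sa->set_text($comment->{body});

    # "url" is an example of extra data you may want scanned.
    for my $key (qw(author email subject ip url)) {
        next unless defined $comment->{$key};
        $sa->set_metadata($key, $comment->{$key});
    }

    return $sa;
}
```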

reset

$sa->reset;

Calls reset_headers and reset_metadata to reset the object state. You should use this if you have a long-lived Text::SpamAssassin object that will be used multiple times.
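A sketch of reusing one long-lived object for several scans. The helper name scan_all is hypothetical; it relies only on the chaining behaviour described at the top of this section.

```perl
use strict;
use warnings;

# Hypothetical helper: scans each text with the same Text::SpamAssassin
# object and returns one result hashref per text.
sub scan_all {
    my ($sa, @texts) = @_;

    my @results;
    for my $text (@texts) {
        # reset clears headers and metadata left over from the previous scan;
        # set_text then overwrites the previous content.
        push @results, $sa->reset->set_text($text)->analyze;
    }

    return @results;
}
```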

reset_headers

$sa->reset_headers;

Removes any headers previously set with set_header.

reset_metadata

$sa->reset_metadata;

Removes any metadata previously set with set_metadata.

analyze

my $result = $sa->analyze;

Scan the previously-supplied data. Returns a hashref containing three values:

verdict

One of the following values:

OK

The message was considered to be clean by SpamAssassin.

SUSPICIOUS

The message was considered to be spam by SpamAssassin.

UNKNOWN

The scan failed for an unknown reason.

score

The score that SpamAssassin gave the message.

rules

The list of rules that SpamAssassin matched when considering the message.
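A sketch of acting on the result. The helper name describe_result and the suggested actions are illustrative only. The exact type of the rules value is not specified above, so the sketch accepts either an arrayref or a plain string.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: turns the hashref returned by analyze into a one-line
# summary with a suggested action.
sub describe_result {
    my ($result) = @_;

    my %action_for = (
        OK         => 'publish',
        SUSPICIOUS => 'hold for moderation',
        UNKNOWN    => 'retry or hold',
    );

    # Accept either an arrayref or a string for "rules" (assumption).
    my $rules     = $result->{rules};
    my $rule_text = ref $rules eq 'ARRAY' ? join(', ', @$rules) : ($rules // '');

    my $score  = $result->{score} // 'n/a';
    my $action = $action_for{ $result->{verdict} } // 'hold for moderation';

    return "$result->{verdict} (score $score; rules: $rule_text): $action";
}
```

Treating UNKNOWN like SUSPICIOUS is a conservative choice for a comment form; a failed scan says nothing about whether the text is clean.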

MESSAGE GENERATION

Because SpamAssassin only knows how to scan email messages, it's necessary for Text::SpamAssassin to generate a message from the data you provide. This section details how that message is created.

A message body is created from the supplied text or HTML data and the supplied metadata. If text is supplied, the message body contains the data supplied to set_metadata as lines of "key: value", one per line, followed by the supplied message text. If HTML is supplied, the body is wrapped in an HTML doctype and header, and the metadata is included as an unordered list.

The header is mostly hardcoded, but the following metadata items will be included if present.

author

Included in the From: header as the sender name.

email

Included in the From: header as the sender address.

subject

Used as-is for the Subject: header.

ip

Included in the Received: header as the originating IP.

Sane defaults will be used for any metadata that is not provided.

Additionally, the Content-Type: will be set to either text/plain or text/html depending on the type of message content provided.
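To make the plain-text body layout described above concrete, here is a standalone sketch. It is not the module's own code: the function name sketch_text_body, the sorted key order and the blank separator line are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative only: builds a plain-text body with one "key: value" line per
# metadata item, followed by the comment text. Key order and the blank
# separator line are assumptions, not documented behaviour.
sub sketch_text_body {
    my ($text, %metadata) = @_;

    my @lines = map { "$_: $metadata{$_}" } sort keys %metadata;

    return join("\n", @lines, '', $text) . "\n";
}

print sketch_text_body(
    'Nice post!',
    author => 'Alice',
    ip     => '127.0.0.1',
);
```

Because the metadata lines are part of the body, values such as URLs are visible to SpamAssassin's body rules as well as the comment text itself.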

SPAMASSASSIN CONFIGURATION

By default, SpamAssassin is configured to do a good job of detecting spam in email traffic. Many of the rules that work well in that context are unsuitable for other scenarios. For example, DUN/DUL rulesets, which check for known "dial-up" IP networks (such as those used by ISP customers), are almost always useless for scanning blog comments: you likely want home users to be able to comment on your blog, even though you'd never dream of accepting mail from them directly.

For this reason, it's highly recommended that you specify an alternate configuration using the userprefs_filename option in sa_options. Sample configuration files can be found in the examples directory of the Text-SpamAssassin distribution.
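As a rough illustration of what such a preferences file might contain (this is not one of the distributed samples, and the values are placeholders rather than recommendations), the fragment below adjusts the spam threshold and zeroes the score of one blocklist rule aimed at end-user IP ranges. Whether a rule named RCVD_IN_PBL is present depends on the rulesets installed with your copy of SpamAssassin.

```
# comment_spam_prefs.cf (illustrative fragment)

# Score at or above which a message is considered spam.
required_score 4.0

# Home and dial-up users are welcome to comment, so ignore a check that
# penalizes end-user IP ranges. The rule name depends on your installed rules.
score RCVD_IN_PBL 0
```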

BUGS

None known. Please report bugs via the CPAN Request Tracker at http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-SpamAssassin

FEEDBACK

If you find this module useful, please consider rating it on the CPAN Ratings service at http://cpanratings.perl.org/rate?distribution=Text-SpamAssassin.

If you like (or hate) this module, please tell the author! Send mail to <rob@eatenbyagrue.org>.

SEE ALSO

Mail::SpamAssassin

http://apthorpe.cynistar.net/code/babycart/

AUTHOR

Originally by Bob Apthorpe <apthorpe+babycart@cynistar.net>

Cleanup for 2.0 and CPAN release by Robert Norris <rob@eatenbyagrue.org>

COPYRIGHT AND LICENSE

Copyright 2004 by Bob Apthorpe

Copyright 2010 by Robert Norris

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.