NAME
Mail::Miner - Store and retrieve Useful Information from mail
DESCRIPTION
I'm very forgetful, and I tend to rely on my email as a surrogate memory. This is great until you get over 200M of email and can't actually find anything any more. You tend to remember things like "the phone number I need is in a message from Frank around September last year" or "someone sent me a JPG in a message about Tina". This doesn't really help you find the mail in most mail clients, though.
This is where Mail::Miner comes in. It's a generic system for extracting useful information from an email message, storing the information and the message, and allowing both to be retrieved through a complex search in the future.
ARCHITECTURE
The principal components of Mail::Miner are the database, the base modules, assets and recognisers. Let's look at each of these first, then we'll see how they all fit together.
Database
The database schema is provided in miner.sql; naturally, you'll need to create this database according to the schema, and give yourself appropriate permissions on the tables. You may or may not need to alter the DBI connect string at the top of DBI.pm too. Be warned that Mail::Miner only supports MySQL, because there is no standard way of getting the ID of the last inserted row back. (I'll write DBIx::LastInsertID sometime.)
Those were the database installation instructions. Huh.
Base modules
The base modules don't do very much. Mail::Miner, the module, does nothing at all, in fact, other than load up the other modules and provide this documentation. Mail::Miner::Message provides basic functions for dealing with messages, and Mail::Miner::Attachment does the same thing for attachments. Mail::Miner::Assets provides some functions which are useful for other modules which manipulate assets. So what are assets?
Assets
Mail::Miner is Very Stupid. It cares very little about a message; all it really needs to know is what attachments it has, what content the body has, who sent it and what the subject was. In fact, it doesn't really need to care about the last two, but they're used so often, it's convenient to.
Everything else that Mail::Miner finds out about a mail is an asset. For instance, a very trivial asset is the date it was sent. A more complex asset could be the fact that it looks like it contains a phone number, and what the phone number is.
Recognisers
So how does Mail::Miner acquire these assets? There is a class of plug-in recogniser modules that get handed a mail message, and store information about it. These are installed just like any other Perl module, and Mail::Miner automatically detects them and passes them emails. How does this happen?
Operation
Mail::Miner has two distinct phases of operation: getting data into the database, and getting it back out again.
The first stage happens when a mail is delivered. Mail::Audit users can use Mail::Audit::Miner, and procmail users can use the supplied utility mm_process to process the message - be warned that these will rewrite the message, so procmail should use it as a pipe and then continue delivery.
So, a mail comes in, and mm_process or Mail::Audit farms it off to Mail::Miner::Message::process(). This does two things with it: it creates an entry in the database for the mail, and then it strips non-text attachments, flattening the mail to a single piece of text. All attachments are replaced by text like the following:
[ image/jpeg attachment foo.jpg detached - use
mm --detach 12345
to recover ]
(Note that cutting-and-pasting that central line onto a shell prompt will dump foo.jpg into your current directory.)
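As a rough illustration of that replacement text, here is a minimal sketch of how such a placeholder could be built. The subroutine name detached_placeholder and its argument order are hypothetical and are not taken from Mail::Miner::Attachment; only the three-line layout comes from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mail::Miner): build the text left in
# the message body when an attachment is detached. The layout mirrors
# the three-line example shown above.
sub detached_placeholder {
    my ($content_type, $filename, $id) = @_;
    return "[ $content_type attachment $filename detached - use\n"
         . "mm --detach $id\n"
         . "to recover ]\n";
}

print detached_placeholder('image/jpeg', 'foo.jpg', 12345);
```

Keeping the mm command on a line of its own is what makes the cut-and-paste trick work.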
Next, process loads up all the Mail::Miner::* modules it can find in the Perl module search path, and calls their process subroutine too, if one exists. This allows them to call the various Mail::Miner::Assets routines to store their assets. After this, the final message, possibly modified by the various process subroutines, gets written out for delivery.
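The discovery step can be sketched roughly as follows. This is an illustration of the general technique (walk the module search path, load what you find, call process if it is defined), not the code Mail::Miner itself uses. The helper names find_modules and run_plugins are hypothetical, as is the assumption that process receives the message as its only argument.

```perl
use strict;
use warnings;
use File::Find ();
use File::Spec;

# Hypothetical helper: list every module below $namespace that can be
# found in @dirs (defaults to the plain directory entries of @INC).
sub find_modules {
    my ($namespace, @dirs) = @_;
    @dirs = grep { !ref } @INC unless @dirs;
    my @parts = split /::/, $namespace;
    my %seen;
    for my $dir (@dirs) {
        my $root = File::Spec->catdir($dir, @parts);
        next unless -d $root;
        File::Find::find({
            no_chdir => 1,    # keep $_ as the full path of each file
            wanted   => sub {
                return unless -f $_ && /\.pm\z/;
                my $rel = File::Spec->abs2rel($_, $dir);
                $rel =~ s/\.pm\z//;
                $seen{ join '::', File::Spec->splitdir($rel) } = 1;
            },
        }, $root);
    }
    return sort keys %seen;
}

# Hypothetical helper: load each module and call its process()
# subroutine, if it has one. Passing the message as the sole argument
# is an assumption made for this sketch.
sub run_plugins {
    my ($message, @modules) = @_;
    for my $module (@modules) {
        (my $file = "$module.pm") =~ s{::}{/}g;
        next unless eval { require $file; 1 };    # skip modules that fail to load
        my $code = $module->can('process') or next;
        $code->($message);
    }
    return $message;
}

# Lists whatever Mail::Miner::* modules happen to be installed, if any.
print "$_\n" for find_modules('Mail::Miner');
```

Checking with can before calling means a module without a process subroutine is simply skipped, which matches the "if one exists" behaviour described above.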
Here endeth the processing phase.
The next phase is the user-initiated query phase. This is what happens when you call mm from the command-line. The plugins register keywords that they can act as filters for. For instance, the Mail::Miner::Recogniser::Date recogniser module registers that it can handle the --dated command line option. If mm sees --dated on the command line, it'll pass the option to Mail::Miner::Recogniser::Date's toquery subroutine, which returns some information about how to create an SQL filter to find messages matching that date specification.
Once all the modules have submitted their SQL WHERE clauses, Mail::Miner combines them all together and executes the query. Now the list of candidate mails is sent back to each Mail::Miner::Recogniser module's postfilter subroutine to have another chance to filter based on the exact results of the search. Once that's done, we have a set of mails to display to the user.
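To make the two-pass filtering concrete, here is a small sketch of the idea. Everything in it is an assumption for illustration: the %recognisers table, the return shape of toquery (a clause plus bind values), the assets.date column, joining clauses with AND, and the arguments given to postfilter. The real recogniser interface may well differ; Mail::Miner::Recogniser::Date is the place to check.

```perl
use strict;
use warnings;

# Hypothetical registry keyed by command-line option name. In
# Mail::Miner these handlers would live in recogniser packages.
my %recognisers = (
    dated => {
        # First pass: turn the option value into a WHERE fragment and
        # its bind values. The column name is made up for this sketch.
        toquery => sub {
            my ($value) = @_;
            return ('assets.date LIKE ?', "$value%");
        },
        # Second pass: re-check the rows the SQL query returned.
        postfilter => sub {
            my ($value, @rows) = @_;
            return grep { $_->{date} =~ /^\Q$value\E/ } @rows;
        },
    },
);

# Collect one clause per recognised option and join them with AND
# (the choice of AND is an assumption of this sketch).
sub build_where {
    my (%options) = @_;
    my (@clauses, @binds);
    for my $name (sort keys %options) {
        my $rec = $recognisers{$name} or next;
        my ($clause, @values) = $rec->{toquery}->($options{$name});
        push @clauses, "($clause)";
        push @binds, @values;
    }
    my $where = @clauses ? 'WHERE ' . join(' AND ', @clauses) : '';
    return ($where, @binds);
}

# Give each recogniser that took part another look at the candidates.
sub run_postfilters {
    my ($options, @rows) = @_;
    for my $name (sort keys %$options) {
        my $rec = $recognisers{$name} or next;
        @rows = $rec->{postfilter}->($options->{$name}, @rows);
    }
    return @rows;
}

my ($where, @binds) = build_where(dated => '2002-09');
print "$where [@binds]\n";
```

The postfilter pass is useful when a condition is awkward to express in SQL: the WHERE clause can be deliberately loose, and the Perl code then trims the candidate list precisely.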
Mail::Miner::Recogniser::Date is intended to be used as an example of how to construct recogniser modules.
That's basically how Mail::Miner works. Have fun with it.
AUTHOR
Simon Cozens
SEE ALSO
Mail::Audit, Mail::Miner::Message, Mail::Miner::Attachment, Mail::Miner::Assets.