NAME
MIME::ParserBase - abstract class for parsing MIME mail
SYNOPSIS
This is an abstract class; however, here's how one of its concrete subclasses is used:
use MIME::Parser;
# Create a new parser object:
my $parser = new MIME::Parser;
# Parse an input stream:
$entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";
# Congratulations: you now have a (possibly multipart) MIME entity!
$entity->dump_skeleton; # for debugging
There are also some convenience methods:
# Parse an in-core MIME message:
$entity = $parser->parse_data($message)
|| die "couldn't parse MIME message";
# Parse already-split input (as "deliver" would give it to you):
$entity = $parser->parse_two("msg.head", "msg.body")
|| die "couldn't parse MIME files";
In case a parse fails, it's nice to know who sent it to us. So...
# Parse an input stream:
$entity = $parser->read(\*STDIN);
if (!$entity) { # oops!
my $decapitated = $parser->last_head; # last top-level head
}
You can also alter the behavior of the parser:
# Parse contained "message/rfc822" objects as nested MIME streams:
$parser->parse_nested_messages('REPLACE');
# Automatically attempt to RFC-1522-decode the MIME headers:
$parser->decode_headers(1);
DESCRIPTION
Where it all begins.
This is the class that contains all the knowledge for parsing MIME streams. It's an abstract class, containing no methods governing the output of the parsed entities: such methods belong in the concrete subclasses.
You can inherit from this class to create your own subclasses that parse MIME streams into MIME::Entity objects. One such subclass, MIME::Parser, is already provided in this kit.
PUBLIC INTERFACE
Construction, and setting options
- new ARGS...
-
Class method. Create a new parser object. Passes any subsequent arguments on to the init() method.
Once you create a parser object, you can then set up various parameters before doing the actual parsing. Here's an example using one of our concrete subclasses:
my $parser = new MIME::Parser;
$parser->output_dir("/tmp");
$parser->output_prefix("msg1");
my $entity = $parser->read(\*STDIN);
- decode_headers ONOFF
-
Instance method. If set true, then the parser will attempt to decode the MIME headers as per RFC-1522 the moment it sees them. This will probably be of most use to those of you who expect some international mail, especially mail from individuals with 8-bit characters in their names.
If set false, no attempt at decoding will be done.
With no argument, just returns the current setting.
Warning: some folks already have code which assumes that no decoding is done, and since this is pretty new and radical stuff, I have initially made "off" the default setting for backwards compatibility in 2.05. However, I will possibly change this in future releases, so please: if you want a particular setting, declare it when you create your parser object.
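To give a feel for what RFC-1522 decoding does to a header value, here is a minimal sketch. It deliberately does not use this class: it assumes the core Encode module and its "MIME-Header" codec, and decode_header_value() is a hypothetical helper name chosen for the example. The parser's own decoding may differ in detail.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode any RFC-1522 "encoded-words" found in one
# header value. This relies on Encode's MIME-Header codec, not on
# MIME::ParserBase.
sub decode_header_value {
    my ($raw) = @_;
    return decode('MIME-Header', $raw);
}

my $from = decode_header_value('=?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>');

binmode(STDOUT, ':encoding(UTF-8)');   # the decoded name contains an 8-bit character
print "$from\n";
```

With decoding off, your code would see the raw "=?ISO-8859-1?Q?...?=" text instead, which is why existing code may depend on the setting.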
- interface ROLE,[VALUE]
-
Instance method. During parsing, the parser normally creates instances of certain classes, like MIME::Entity. However, you may want to create a parser subclass that uses your own experimental head, entity, etc. classes (for example, your "head" class may provide some additional MIME-field-oriented methods).
If so, then this is the method that your subclass should invoke during init. Use it like this:
package MyParser;
@ISA = qw(MIME::Parser);
...
sub init {
    my $self = shift;
    $self->SUPER::init(@_);        # do my parent's init
    $self->interface(ENTITY_CLASS => 'MIME::MyEntity');
    $self->interface(HEAD_CLASS   => 'MIME::MyHead');
    $self;                         # return
}
With no VALUE, returns the VALUE currently associated with that ROLE.
- last_head
-
Instance method. Return the top-level MIME header of the last stream we attempted to parse. This is useful for replying to people who sent us bad MIME messages.
# Parse an input stream:
$entity = $parser->read(\*STDIN);
if (!$entity) {    # oops!
    my $decapitated = $parser->last_head;    # last top-level head
}
- parse_nested_messages OPTION
-
Instance method. Some MIME messages will contain a part of type message/rfc822: literally, the text of an embedded mail/news/whatever message. The normal behavior is to save such a message just as if it were a text/plain document, without attempting to decode it. However, you can change this: before parsing, invoke this method with the OPTION you want:
If OPTION is false, the normal behavior will be used.
If OPTION is true, the body of the message/rfc822 part is decoded (after all, it might be encoded!) into a temporary filehandle, which is then rewound and parsed by this parser, creating an entity object. What happens then is determined by the OPTION:
- NEST or 1
-
The contained message becomes a "part" of the message/rfc822 entity, as though the message/rfc822 were a special kind of multipart entity. However, the message/rfc822 header (and the content-type) is retained.
Warning: since it is not legal MIME for anything but multipart to have a "part", the message/rfc822 message will appear to have no content if you simply print() it out. You will have to get at the reparsed body manually, by the MIME::Entity::parts() method.
IMHO, this option is probably only useful if you're processing messages, but not saving or re-sending them. If you are saving or re-sending them, it is best not to use "parse nested" at all.
- REPLACE
-
The contained message replaces the message/rfc822 entity, as though the message/rfc822 "envelope" never existed.
Warning: notice that, with this option, all the header information in the message/rfc822 header is lost. This might seriously bother you if you're dealing with a top-level message, and you've just lost the sender's address and the subject line. :-/
Thanks to Andreas Koenig for suggesting this method.
Parsing messages
- parse_data DATA
-
Instance method. Parse a MIME message that's already in-core. You may supply the DATA in any of a number of ways...
A scalar which holds the message.
A ref to a scalar which holds the message. This is an efficiency hack.
A ref to an array of scalars. The array elements are simply joined to produce a scalar; no newlines are inserted!
Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.
Note: where the parsed body parts are stored (e.g., in-core vs. on-disk) is not determined by this class, but by the subclass you use to do the actual parsing (e.g., MIME::Parser). For efficiency, if you know you'll be parsing a small amount of data, it is probably best to tell the parser to store the parsed parts in core. For example, here's a short test program, using MIME::Parser:
use MIME::Parser;

my $msg = <<EOF;
Content-type: text/html
Content-transfer-encoding: 7bit

<H1>Hello, world!</H1>;
EOF

$parser = new MIME::Parser;
$parser->output_to_core('ALL');
$entity = $parser->parse_data($msg);
$entity->print(\*STDOUT);
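The three accepted forms of DATA can be reduced to a single representation before parsing. The following is only a sketch of that idea, not the code this class actually uses; data_to_scalar_ref() is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce the three accepted DATA forms to one
# reference to a scalar holding the whole message.
sub data_to_scalar_ref {
    my ($data) = @_;
    if (ref($data) eq 'SCALAR') {
        return $data;                     # already a ref: the message is not copied
    }
    elsif (ref($data) eq 'ARRAY') {
        my $joined = join('', @$data);    # joined as-is: no newlines are inserted
        return \$joined;
    }
    elsif (!ref($data)) {
        return \$data;                    # plain scalar holding the message
    }
    die "unsupported DATA type: " . ref($data) . "\n";
}

my @lines = ("Content-type: text/plain\n", "\n", "Hello, world!\n");
my $msg   = join('', @lines);

# All three forms describe the same message, so the lengths agree.
for my $form ($msg, \$msg, \@lines) {
    my $ref = data_to_scalar_ref($form);
    print length($$ref), "\n";
}
```

Note the array case: if your array elements have been chomp()ed, the header lines will run together, because nothing puts the newlines back.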
- parse_two HEADFILE, BODYFILE
-
Instance method. Convenience front-end onto read(), intended for programs running under mail-handlers like deliver, which splits the incoming mail message into a header file and a body file.
Simply give this method the paths to the respective files. These must be pathnames: Perl "open-able" expressions won't work, since the pathnames are shell-quoted for safety.
WARNING: it is assumed that, once the files are cat'ed together, there will be a blank line separating the head part and the body part.
Returns the parsed entity, or undef on error.
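If you are not sure whether your mail-handler leaves the blank line in place, you can check before parsing. The sketch below works on the text of the two files rather than on the files themselves; head_body_separated() is a hypothetical helper, not part of this class, and it only looks at the point where the two texts meet.

```perl
use strict;
use warnings;

# Hypothetical helper: true if HEAD text followed directly by BODY text
# would have a blank line at the junction, with either "\n" or "\r\n"
# as the line terminator.
sub head_body_separated {
    my ($head, $body) = @_;
    return 1 if $head =~ /\n\r?\n\z/;                     # head ends with a blank line
    return 1 if $head =~ /\n\z/ && $body =~ /\A\r?\n/;    # body begins with one
    return 0;
}

my $head = "From: me\nSubject: test\n";
print head_body_separated($head, "Hello\n")   ? "ok\n" : "missing blank line\n";
print head_body_separated($head, "\nHello\n") ? "ok\n" : "missing blank line\n";
```

If the check fails, appending a newline to the head file before calling parse_two() is one way to restore the separator.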
- read INSTREAM
-
Instance method. Takes a MIME-stream and splits it into its component entities, each of which is decoded and placed in a separate file in the splitter's output_dir().
The INSTREAM can be given as a readable FileHandle, a globref'd filehandle (like \*STDIN), or as any blessed object conforming to the MIME::IO (or IO::) interface.
Returns a MIME::Entity, which may be a single entity, or an arbitrarily-nested multipart entity. Returns undef on failure.
WRITING SUBCLASSES
All you have to do to write a subclass is to provide or override the following methods:
- init ARGS...
-
Instance method, private. Initialize the new parser object, with any args passed to new().
You don't need to override this in your subclass. If you override it, however, make sure you call the inherited method to init your parents!
package MyParser;
@ISA = qw(MIME::Parser);
...
sub init {
    my $self = shift;
    $self->SUPER::init(@_);    # do my parent's init
    # ...my init stuff goes here...
    $self;                     # return
}
Should return the self object on success, and undef on failure.
- new_body_for HEAD
-
Abstract instance method. Based on the HEAD of a part we are parsing, return a new body object (any desirable subclass of MIME::Body) for receiving that part's data (both will be put into the "entity" object for that part).
If you want the parser to do something other than write its parts out to files, you should override this method in a subclass. For an example, see MIME::Parser.
Note: the reason that we don't use the "interface" mechanism for this is that your choice of (1) which body class to use, and (2) how its new() method is invoked, may be very much based on the information in the header.
You are of course free to override any other methods as you see fit, like new.
NOTES
This is an abstract class. If you actually want to parse a MIME stream, use one of the children of this class, like the backwards-compatible MIME::Parser.
Under the hood
RFC-1521 gives us the following BNF grammar for the body of a multipart MIME message:
multipart-body  := preamble 1*encapsulation close-delimiter epilogue
encapsulation   := delimiter body-part CRLF
delimiter       := "--" boundary CRLF
                   ; taken from Content-Type field.
                   ; There must be no space between "--"
                   ; and boundary.
close-delimiter := "--" boundary "--" CRLF
                   ; Again, no space by "--"
preamble        := discard-text
                   ; to be ignored upon receipt.
epilogue        := discard-text
                   ; to be ignored upon receipt.
discard-text    := *(*text CRLF)
body-part       := <"message" as defined in RFC 822, with all
                   header fields optional, and with the specified
                   delimiter not occurring anywhere in the message
                   body, either on a line by itself or as a substring
                   anywhere. Note that the semantics of a part
                   differ from the semantics of a message, as
                   described in the text.>
From this we glean the following algorithm for parsing a MIME stream:
PROCEDURE parse
INPUT
    A FILEHANDLE for the stream.
    An optional end-of-stream OUTER_BOUND (for a nested multipart message).
RETURNS
    The (possibly-multipart) ENTITY that was parsed.
    A STATE indicating how we left things: "END" or "ERROR".
BEGIN
    LET OUTER_DELIM = "--OUTER_BOUND".
    LET OUTER_CLOSE = "--OUTER_BOUND--".
    LET ENTITY = a new MIME entity object.
    LET STATE = "OK".
    Parse the (possibly empty) header, up to and including the
    blank line that terminates it. Store it in the ENTITY.
    IF the MIME type is "multipart":
        LET INNER_BOUND = get multipart "boundary" from header.
        LET INNER_DELIM = "--INNER_BOUND".
        LET INNER_CLOSE = "--INNER_BOUND--".
        Parse preamble:
            REPEAT:
                Read (and discard) next line
            UNTIL (line is INNER_DELIM) OR we hit EOF (error).
        Parse parts:
            REPEAT:
                LET (PART, STATE) = parse(FILEHANDLE, INNER_BOUND).
                Add PART to ENTITY.
            UNTIL (STATE != "DELIM").
        Parse epilogue:
            REPEAT (to parse epilogue):
                Read (and discard) next line
            UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF
            LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.
    ELSE (if the MIME type is not "multipart"):
        Open output destination (e.g., a file)
        DO:
            Read, decode, and output data from FILEHANDLE
        UNTIL (line is OUTER_DELIM or OUTER_CLOSE) OR we hit EOF.
        LET STATE = "EOF", "DELIM", or "CLOSE" accordingly.
    ENDIF
    RETURN (ENTITY, STATE).
END
For reasons discussed in MIME::Entity, we can't just discard the "discard text": some mailers actually put data in the preamble.
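The procedure above can be sketched in a few dozen lines of Perl. This is an illustration under simplifying assumptions, not the code in this module: it reads from an in-memory array of lines instead of a filehandle, entities are plain hashes, folded headers are ignored, nothing is decoded, and the preamble is kept rather than discarded. The parse() function and the hash layout are made up for the example.

```perl
use strict;
use warnings;

# parse(\@lines, \$pos, $outer_bound) returns ($entity, $state), where
# $state is "DELIM", "CLOSE", "EOF", or "ERROR".
sub parse {
    my ($lines, $pos, $outer_bound) = @_;
    my $entity = { head => {}, preamble => [], body => [], parts => [] };
    my $state  = 'EOF';

    # "DELIM" or "CLOSE" if the line is an outer boundary line, else ''.
    my $classify = sub {
        my ($line) = @_;
        return '' unless defined $outer_bound;
        (my $bare = $line) =~ s/\r?\n\z//;
        return 'CLOSE' if $bare eq "--$outer_bound--";
        return 'DELIM' if $bare eq "--$outer_bound";
        return '';
    };

    # Header: "Name: value" lines, up to and including the blank line.
    while ($$pos < @$lines) {
        my $line = $lines->[$$pos++];
        last if $line =~ /\A\r?\n\z/;
        $entity->{head}{lc $1} = $2 if $line =~ /\A([^:\s]+):\s*(.*?)\r?\n?\z/;
    }

    my $type = $entity->{head}{'content-type'} || 'text/plain';
    if ($type =~ m{\Amultipart/}i) {
        my ($inner) = $type =~ /boundary="?([^";]+)"?/i
            or return ($entity, 'ERROR');

        # Preamble: read up to the first inner delimiter; EOF is an error.
        my $found = 0;
        while ($$pos < @$lines) {
            my $line = $lines->[$$pos++];
            (my $bare = $line) =~ s/\r?\n\z//;
            if ($bare eq "--$inner") { $found = 1; last }
            push @{ $entity->{preamble} }, $line;
        }
        return ($entity, 'ERROR') unless $found;

        # Parts: recurse until a part ends on something other than a delimiter.
        my $part_state;
        do {
            (my $part, $part_state) = parse($lines, $pos, $inner);
            push @{ $entity->{parts} }, $part;
        } while ($part_state eq 'DELIM');
        return ($entity, $part_state) unless $part_state eq 'CLOSE';

        # Epilogue: discard lines until an outer boundary line or EOF.
        while ($$pos < @$lines) {
            my $hit = $classify->($lines->[$$pos++]);
            if ($hit) { $state = $hit; last }
        }
    }
    else {
        # Single part: collect body lines until an outer boundary line or EOF.
        while ($$pos < @$lines) {
            my $line = $lines->[$$pos++];
            my $hit  = $classify->($line);
            if ($hit) { $state = $hit; last }
            push @{ $entity->{body} }, $line;
        }
    }
    return ($entity, $state);
}

my @msg = map { "$_\n" } (
    'Content-type: multipart/mixed; boundary="xyz"', '',
    'preamble text',
    '--xyz', 'Content-type: text/plain', '', 'part one',
    '--xyz', '', 'part two',
    '--xyz--',
    'epilogue text',
);
my $pos = 0;
my ($top, $state) = parse(\@msg, \$pos);
print scalar(@{ $top->{parts} }), " parts, final state $state\n";
```

Because each recursive call is handed the enclosing boundary, a nested multipart stops reading as soon as it sees its parent's delimiter, which is what lets the same routine serve at every level.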
Questionable practices
- Multipart messages are always read line-by-line
-
Multipart document parts are read line-by-line, so that the encapsulation boundaries may easily be detected. However, bad MIME composition agents (for example, naive CGI scripts) might return multipart documents where the parts are, say, unencoded bitmap files... and, consequently, where such "lines" might be veeeeeeeeery long indeed.
A better solution for this case would be to set up some form of state machine for input processing. This will be left for future versions.
- Multipart parts read into temp files before decoding
-
In my original implementation, the MIME::Decoder classes had to be aware of encapsulation boundaries in multipart MIME documents. While this decode-while-parsing approach obviated the need for temporary files, it resulted in inflexible and complex decoder implementations.
The revised implementation uses a temporary file (a la tmpfile()) during parsing to hold the encoded portion of the current MIME document or part. This file is deleted automatically after the current part is decoded and the data is written to the "body stream" object; you'll never see it, and should never need to worry about it.
Some folks have asked for the ability to bypass this temp-file mechanism, I suppose because they assume it would slow down their application. I considered accommodating this wish, but the temp-file approach solves a lot of thorny problems in parsing, and it also protects against hidden bugs in user applications (what if you've directed the encoded part into a scalar, and someone unexpectedly sends you a 6 MB tar file?). Finally, I'm just not convinced that the temp-file use adds significant overhead.
- Fuzzing of CRLF and newline on input
-
RFC-1521 dictates that MIME streams have lines terminated by CRLF ("\r\n"). However, it is extremely likely that folks will want to parse MIME streams where each line ends in the local newline character "\n" instead.
An attempt has been made to allow the parser to handle both CRLF and newline-terminated input.
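One simple way to tolerate both terminators is to strip either one before comparing a line against a boundary. The following is a sketch of that approach, not necessarily how this module does it; chomp_eol() and is_delimiter() are hypothetical names.

```perl
use strict;
use warnings;

# Strip one trailing line terminator, accepting either CRLF or a bare newline.
sub chomp_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# True if LINE is the delimiter for BOUNDARY, whichever terminator it carries.
sub is_delimiter {
    my ($line, $boundary) = @_;
    return chomp_eol($line) eq "--$boundary";
}

for my $line ("--frontier\r\n", "--frontier\n", "--frontier--\n") {
    printf "%-14s %s\n", chomp_eol($line),
        is_delimiter($line, 'frontier') ? 'delimiter' : 'other';
}
```

A plain chomp() would leave the "\r" behind on CRLF input, and the boundary comparison would then fail on every line.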
- Fuzzing of CRLF and newline on output
-
The "7bit" and "8bit" decoders will decode both a "\n" and a "\r\n" end-of-line sequence into a "\n".
The "binary" decoder (default if no encoding specified) still outputs stuff verbatim... so a MIME message with CRLFs and no explicit encoding will be output as a text file that, on many systems, will have an annoying ^M at the end of each line... but this is as it should be.
- Inability to handle multipart boundaries that contain newlines
-
First, let's get something straight: this is an evil, EVIL practice, and is incompatible with RFC-1521... hence, it's not valid MIME.
If your mailer creates multipart boundary strings that contain newlines when they appear in the message body, give it two weeks' notice and find another one. If your mail robot receives MIME mail like this, regard it as syntactically incorrect MIME, which it is.
Why do I say that? Well, in RFC-1521, the syntax of a boundary is given quite clearly:
boundary      := 0*69<bchars> bcharsnospace
bchars        := bcharsnospace / " "
bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" / "+" / "_"
                 / "," / "-" / "." / "/" / ":" / "=" / "?"
All of which means that a valid boundary string cannot have newlines in it, and any newlines in such a string in the message header are expected to be solely the result of folding the string (i.e., inserting to-be-removed newlines for readability and line-shortening only).
Yet, there is at least one brain-damaged user agent out there that composes mail like this:
MIME-Version: 1.0
Content-type: multipart/mixed; boundary="----ABC-
 123----"
Subject: Hi... I'm a dork!

This is a multipart MIME message (yeah, right...)

----ABC-
 123----

Hi there!
We have got to discourage practices like this (and the recent file upload idiocy where binary files that are part of a multipart MIME message aren't base64-encoded) if we want MIME to stay relatively simple, and MIME parsers to be relatively robust.
Thanks to Andreas Koenig for bringing a baaaaaaaaad user agent to my attention.
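If your mail robot wants to reject such messages up front, the boundary grammar quoted above translates directly into a regular expression. This is a sketch written from that grammar; is_valid_boundary() is a hypothetical helper, not something this module provides.

```perl
use strict;
use warnings;

# bcharsnospace, written as a character class.
my $NOSPACE = qr/[0-9A-Za-z'()+_,\-.\/:=?]/;

# boundary := 0*69<bchars> bcharsnospace
# That is: up to 69 characters which may include spaces, followed by
# one final character which may not be a space.
my $BOUNDARY = qr/\A(?:$NOSPACE| ){0,69}$NOSPACE\z/;

sub is_valid_boundary {
    my ($candidate) = @_;
    return $candidate =~ $BOUNDARY ? 1 : 0;
}

for my $b ("----ABC-123----", "----ABC-\n 123----", "trailing space ") {
    (my $shown = $b) =~ s/\n/\\n/g;    # make the newline visible when printing
    printf "%-22s %s\n", "\"$shown\"", is_valid_boundary($b) ? 'valid' : 'INVALID';
}
```

The check should be applied to the boundary parameter after header unfolding, since folding newlines in the header itself are legitimate.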
WARNINGS
- binmode
-
New, untested binmode() calls were added in module version 1.11... if binmode() is not a NOOP on your system, please pay careful attention to your output, and report any anomalies. It is possible that "make test" will fail on such systems, since some of the tests involve checking the sizes of the output files. That doesn't necessarily indicate a problem.
If anyone wants to test out this package's handling of both binary and textual email on a system where binmode() is not a NOOP, I would be most grateful. If stuff breaks, send me the pieces (including the original email that broke it, and at the very least a description of how the output was screwed up).
AUTHOR
Copyright (c) 1996 by Eryq / eryq@rhine.gsfc.nasa.gov
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
VERSION
$Revision: 3.203 $ $Date: 1997/01/22 08:40:01 $