Revision history for Email-Abuse-Investigator
0.07 Mon Mar 30 08:45:04 EDT 2026
Bug fixes
- Added Route 7 to abuse_contacts(): reply addresses found in the message
body are now looked up in the provider table and generate abuse contacts.
This catches a common advance-fee and investment scam pattern where the
From: and Return-Path: headers use a spoofed or compromised address (the
innocent victim of the fraud), while the real contact address -- a free
webmail account chosen by the spammer -- is mentioned explicitly in the
body text, e.g. "contact profcindyinvestments@hotmail.com for details".
The %TRUSTED_DOMAINS filter is intentionally bypassed for this route:
being hosted on a trusted provider (Hotmail, Gmail) is what makes these
addresses actionable, since the spammer chose free webmail precisely
because it is accessible and anonymous. Recipient domains (To:, Cc:)
are still excluded. Only domains present in %PROVIDER_ABUSE generate
contacts via this route; unknown domains in the body are ignored.
The role string includes the specific address found, e.g.
"Reply address in body (profcindyinvestments@hotmail.com)", so the
abuse desk knows exactly which account to investigate.
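A minimal sketch of the Route 7 idea, not the module's actual code. The
%provider_abuse table, the body_reply_contacts() helper, and the
hotmail.com -> abuse@microsoft.com mapping are illustrative assumptions;
the address pattern is deliberately simpler than real addr-spec parsing.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %PROVIDER_ABUSE; entries are illustrative only.
my %provider_abuse = (
    'hotmail.com' => { email => 'abuse@microsoft.com' },
    'gmail.com'   => { email => 'abuse@google.com' },
);

# $recipient_domains: hashref of To:/Cc: domains that must stay excluded.
sub body_reply_contacts {
    my ($body, $recipient_domains) = @_;
    my (@contacts, %seen);
    while ($body =~ /\b([A-Za-z0-9._%+-]+)\@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)/g) {
        my ($local, $domain) = ($1, lc $2);
        next if $recipient_domains->{$domain};          # recipients still excluded
        my $entry = $provider_abuse{$domain} or next;   # unknown domains ignored
        next if $seen{"$local\@$domain"}++;             # report each address once
        push @contacts, {
            address => $entry->{email},
            role    => "Reply address in body ($local\@$domain)",
        };
    }
    return @contacts;
}
```

Note that no trusted-domain check appears in the loop: in this sketch, as
in the entry above, only the recipient set and the provider table decide
whether a body address produces a contact.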
- Fixed abuse_contacts() generating registrar contacts for innocent domains
that appear only in spoofable sending headers (From:, Return-Path:,
Sender:). These headers are trivially forged; when a spammer uses a
victim's address as the envelope sender, reporting the victim's domain
registrar is both unhelpful and potentially harmful to the innocent party.
The registrar contact is now suppressed when the domain's only source is
one of these three headers AND the same domain does not also appear as a
URL host. If the From: domain appears in a URL as well, the spammer
controls it and the registrar contact is retained. Domains sourced from
Reply-To:, DKIM-Signature:, List-Unsubscribe:, Message-ID:, or the
message body are unaffected -- those all indicate deliberate spammer
choice. Discovered via a real advance-fee spam where qwestoffice.net
was spoofed as the sender but profcindyinvestments@hotmail.com was the
actual reply address, causing a false report to CSC Global (registrar
of qwestoffice.net).
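The suppression rule can be sketched as a small predicate. This is an
illustration under assumed inputs, not the module's implementation:
keep_registrar_contact() and its two parameters are hypothetical names.

```perl
use strict;
use warnings;

# Headers the entry above treats as trivially forged.
my %SPOOFABLE = map { $_ => 1 } qw(from return-path sender);

# $sources:     arrayref of lower-cased labels for where the domain was seen,
#               e.g. ['from', 'reply-to'] (hypothetical representation).
# $is_url_host: true if the same domain also appears as a URL host.
sub keep_registrar_contact {
    my ($sources, $is_url_host) = @_;
    return 1 if $is_url_host;                         # spammer controls the domain
    return 1 if grep { !$SPOOFABLE{$_} } @$sources;   # a deliberate source exists
    return 0;                                         # only spoofable headers
}
```

With these assumptions, keep_registrar_contact(['from'], 0) returns 0
(the qwestoffice.net case), while ['from', 'reply-to'] or a true
$is_url_host keeps the registrar contact.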
- Fixed abuse_contacts() generating spurious account-provider contacts
from SRS-rewritten Return-Path: and Sender: headers. SRS (Sender
Rewriting Scheme) addresses are generated by mail forwarders to preserve
SPF validity and take the form:
localpart+SRS=hash=timestamp=orig-domain=orig-local@forwarder
The forwarding domain is not responsible for the spam content and is a
false abuse target. Route 4 of abuse_contacts() and form_contacts() now
skips any addr-spec whose local part matches +SRS= or +SRS0= (case-
insensitive), covering both the standard SRS0 form and the re-forwarded
SRS1 variant. Discovered via a real spam message forwarded through
groups.outlook.com, which was generating an unwanted
"Account provider (return-path: ...@groups.outlook.com)" role and
routing an abuse report to abuse@microsoft.com via the wrong route.
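A sketch of the SRS check as described above, assuming the test is a
case-insensitive match for +SRS= or +SRS0= in the local part. The helper
name is_srs_rewritten() is hypothetical and the expression used by the
module may differ.

```perl
use strict;
use warnings;

sub is_srs_rewritten {
    my ($addr_spec) = @_;
    # Split at the last @ so that only the local part is inspected.
    my ($local) = $addr_spec =~ /\A(.*)\@[^@]+\z/s or return 0;
    return $local =~ /\+SRS0?=/i ? 1 : 0;
}
```

A caller would skip the account-provider lookup for any Return-Path: or
Sender: address for which this returns true.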
- Fixed false positive http_not_https risk flag and spurious Gandi abuse
contacts caused by W3C namespace and DTD URLs in HTML email templates.
Spam messages sent as HTML frequently contain boilerplate references such
as:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/1999/xhtml
These are injected by the ESP's HTML template engine and have no
connection to the spam content. Two fixes applied:
1. Added w3.org to %TRUSTED_DOMAINS so it is filtered from the domain
intelligence pipeline and does not generate abuse contacts.
2. Added a trusted-domain skip at the top of the URL check loop in
risk_assessment() so that http:// references to trusted domains do
not raise the http_not_https flag. The skip uses the same $bare
(www-stripped) hostname already computed for the shortener check.
Discovered via a Gandi autoresponse explaining that W3C receives a high
volume of false positive abuse reports for this reason and maintains a
FAQ at https://www.w3.org/Help/Webmaster#spam.
Note: w3c.org (the consortium's secondary domain) is deliberately not
added -- it rarely appears in HTML boilerplate and any w3c.org URL in a
spam message would be unusual enough to warrant investigation.
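The second fix can be illustrated with a short loop. This is a sketch
only: %TRUSTED stands in for %TRUSTED_DOMAINS with a few sample entries,
and insecure_untrusted_urls() is a hypothetical helper, not part of
risk_assessment().

```perl
use strict;
use warnings;

# Sample subset standing in for %TRUSTED_DOMAINS.
my %TRUSTED = map { $_ => 1 } qw(w3.org google.com gmail.com);

# Returns the http:// URLs that should still raise http_not_https.
sub insecure_untrusted_urls {
    my @urls = @_;
    my @flagged;
    for my $url (@urls) {
        my ($scheme, $host) = $url =~ m{\A(https?)://([^/:?#\s]+)}i or next;
        (my $bare = lc $host) =~ s/\Awww\.//;   # www-stripped hostname
        next if $TRUSTED{$bare};                # trusted-domain skip comes first
        push @flagged, $url if lc $scheme eq 'http';
    }
    return @flagged;
}
```

Given the two W3C boilerplate URLs above plus http://spam.example/offer,
only the last one is returned.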
0.06 Sun Mar 29 11:28:02 EDT 2026
New features
- Added form_contacts() public method, parallel to abuse_contacts(), which
returns the set of parties that require abuse reports to be submitted via
a web form rather than email. Returns a list of hashrefs each containing
the form URL, role, note, and instructions on what to paste and what file
to upload. Providers are identified via the new optional 'form',
'form_paste', and 'form_upload' keys in %PROVIDER_ABUSE entries.
- Added MarkMonitor and Global Domain Group to %PROVIDER_ABUSE as
form-only entries (no 'email' key). Both registrars explicitly reject
email abuse reports per their autoresponse:
markmonitor.com ->
https://corp.markmonitor.com/domain/ui/abuse-report
globaldomaingroup.com ->
https://globaldomaingroup.com/report-abuse
The form_paste and form_upload hints in each entry tell the user exactly
what to paste into the form and what file to attach.
- Added [ WHERE TO FILE WEB-FORM REPORTS ] section to report(), appearing
after [ WHERE TO SEND ABUSE REPORTS ] when form_contacts() returns
results. Each entry shows the form URL, role, paste instructions
(word-wrapped at 66 characters), and upload instructions.
- Added WEB-FORM REPORTS REQUIRED: block to abuse_report_text() listing
form-only contacts with their form URL, paste hint, and upload hint.
- Added MANUAL ACTION REQUIRED -- WEB FORM SUBMISSION block to the
--dry-run footer of submit_abuse_report.pl, listing form-only contacts
separately from email recipients so the user knows which parties require
manual browser submission.
Bug fixes / refactoring
- Fixed abuse_contacts() passing form-only provider addresses (e.g.
abusecomplaints@markmonitor.com) through to the email contact list.
Such addresses are syntactically valid but explicitly non-functional per
the provider's own autoresponse. The $add closure now checks whether the
address domain belongs to a form-only %PROVIDER_ABUSE entry (one with a
'form' key but no 'email' key) and suppresses it; the contact is surfaced
correctly via form_contacts() instead.
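The form-only test reduces to a key check on the provider entry. A
minimal sketch, assuming a cut-down %provider_abuse table and a
hypothetical is_form_only_address() helper in place of the real $add
closure:

```perl
use strict;
use warnings;

# Cut-down stand-in for %PROVIDER_ABUSE.
my %provider_abuse = (
    'markmonitor.com' => { form  => 'https://corp.markmonitor.com/domain/ui/abuse-report' },
    'google.com'      => { email => 'abuse@google.com' },
);

sub is_form_only_address {
    my ($address) = @_;
    my ($domain) = lc($address) =~ /\@([^@\s]+)\z/ or return 0;
    my $entry = $provider_abuse{$domain} or return 0;
    # Form-only means a 'form' key is present and no 'email' key is.
    return ($entry->{form} && !$entry->{email}) ? 1 : 0;
}
```

An address for which this returns true would be left out of the email
list and reported through form_contacts() instead.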
- Refactored the domain WHOIS fallback in _extract_and_resolve_urls() to
call _analyse_domain() instead of the separate _parse_domain_whois_abuse()
helper introduced in 0.05. _analyse_domain() is already cached in
$self->{_domain_info}, so when the same spam domain appears in multiple
URLs the domain WHOIS is performed only once rather than once per unique
hostname. _parse_domain_whois_abuse() has been removed.
- Added use utf8; pragma to Investigator.pm so that the em-dash characters
in user-facing detail strings are handled correctly under all Perl
configurations without relying on the source file's implicit encoding.
- Added GoDaddy to %PROVIDER_ABUSE as a form-only entry. GoDaddy
explicitly rejects email abuse reports per their autoresponse, directing
reporters to their web form instead:
godaddy.com -> https://supportcenter.godaddy.com/AbuseReport
The form_upload hint reflects that GoDaddy accepts screenshots or PDFs,
not .eml files.
- Added googlegroups.com and groups.google.com to %TRUSTED_DOMAINS.
Google Groups message IDs (e.g. @googlegroups.com) were entering the
domain intelligence pipeline and triggering a MarkMonitor web-form
contact, since Google uses MarkMonitor to register its infrastructure
domains. This was a false positive -- MarkMonitor cannot act on a
Google-owned domain. Both domains are now filtered at the same point
as google.com, gmail.com, and other trusted Google infrastructure.
- Added form_domain field to form_contacts() hashrefs, surfaced as
Domain/URL in report(), Domain in abuse_report_text(), and Domain in
the --dry-run footer of submit_abuse_report.pl. Providers such as
MarkMonitor and GoDaddy have a dedicated "Domain or URL" field in their
web forms; this field gives the user the exact value to paste into it
without having to work it out from context.
- Improved clarity of role strings in abuse_contacts():
1. Removed the "(provider table)" suffix from Sending ISP, URL host,
Web host, and DKIM signer roles. This was implementation detail
that added length without helping the recipient understand what action
to take. "Sending ISP" is clearer than "Sending ISP (provider
table)"; the via field already records how the contact was found.
2. Included the hostname in URL host roles -- "URL host: host.example"
rather than the generic "URL host". When multiple routes converge on
the same abuse address (e.g. a Blogspot URL and a Gmail sending IP
both map to abuse@google.com), the merged role now makes clear which
specific URL is being reported, giving the abuse team an actionable
reference without requiring them to read the full report headers.
3. Stripped the display name from Account provider roles. The role
previously included the full From: header value including the
spammer's chosen display name (e.g. "Account provider (from: Evil
Spammer <spam@gmail.com>)"). The display name is irrelevant to the
abuse report, may contain non-ASCII characters, and makes the merged
role string much longer than necessary. The role now shows only the
email address: "Account provider (from: spam@gmail.com)".
- Fixed "Wide character in syswrite" error when submitting reports for
messages whose decoded subject line contains non-ASCII characters (e.g.
emoji). submit_abuse_report.pl uses "use utf8" and
"use open qw(:std :encoding(UTF-8))", which causes all string operations
to produce Unicode-flagged strings. Net::SMTP->datasend() calls
syswrite() on a raw socket and cannot handle wide characters.
_build_mime_message() now calls Encode::encode('UTF-8', ...) on the
report body and original message after CRLF normalisation, converting
the internal Unicode strings to raw byte strings before transmission.
Encode has been a core module since Perl 5.8, so no new dependency is added.
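A minimal sketch of the conversion step, assuming a hypothetical
to_wire_bytes() helper; _build_mime_message() itself does more than
this.

```perl
use strict;
use warnings;
use Encode ();

sub to_wire_bytes {
    my ($text) = @_;
    $text =~ s/\r?\n/\r\n/g;                 # CRLF normalisation first
    return Encode::encode('UTF-8', $text);   # character string -> raw octets
}
```

The returned value is a byte string, so passing it to a raw socket write
no longer involves characters above 0xFF.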
0.05 Sat Mar 28 11:52:05 EDT 2026
Bug fixes
- Fixed _extract_and_resolve_urls() discarding the registrar abuse
contact for URL hosts that cannot be resolved to an IP at analysis
time. Previously, when _resolve_host() returned undef, _whois_ip()
was skipped entirely and the host was recorded with abuse=>'(unknown)',
which caused abuse_contacts() to produce no contact for that host even
though a domain WHOIS record (and therefore a registrar abuse address)
existed. _extract_and_resolve_urls() now falls back to a domain WHOIS
lookup on the registrable parent of the host when the IP WHOIS yields
no abuse address. A new private helper _parse_domain_whois_abuse()
performs this lookup without the full overhead of _analyse_domain().
Combined with the protocol-relative URL fix below, this means that the
badshamart.com spam campaign (PBS Health News / prostate supplement)
now correctly produces a registrar abuse contact in abuse_contacts()
even though all four badshamart.com URL hosts were unresolvable.
- Fixed _extract_http_urls() not extracting protocol-relative URLs
(scheme-omitted form //domain/path). These are used in spam messages
as tracking pixels and click-redirect links, e.g.:
<img src="//badshamart.com/o/2516/19142/347/US" ...>
The leading // was not matched by either the https?:// absolute-URL
regex or the HTML::LinkExtor filter, which also required a full scheme.
Both passes now recognise the //domain form and normalise it to
https://domain before adding it to the URL list. The regex pass
anchors the match to whitespace, quotes, or = to avoid false positives
on CSS path segments and HTML comments.
Discovered via a real spam message (PBS Health News / badshamart.com)
where three click-redirect hrefs and one tracking-pixel src all used
protocol-relative URLs, causing badshamart.com to be entirely absent
from embedded_urls() and therefore from abuse_contacts().
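The regex pass might look roughly like the following. This is a sketch
under assumptions: protocol_relative_urls() is a hypothetical helper, the
pattern is not the module's, and the path is kept when prefixing https:.

```perl
use strict;
use warnings;

sub protocol_relative_urls {
    my ($html) = @_;
    my @urls;
    # Only accept // when it starts the text or follows whitespace, a quote,
    # or '='. That rules out the // inside http:// and inside path segments.
    while ($html =~ m{(?:(?<=[\s"'=])|\A)//([A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?(?:/[^\s"'<>]*)?)}g) {
        push @urls, "https://$1";
    }
    return @urls;
}
```

Applied to the <img> tag quoted above, this yields
https://badshamart.com/o/2516/19142/347/US.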
- Fixed duplicate Salesforce Marketing Cloud comment block in
%PROVIDER_ABUSE. A leftover comment fragment introduced during 0.03
appeared immediately before the real Salesforce entries, causing
cosmetic confusion in the source. Removed the orphaned fragment.
- Fixed two stale references to Mail::Message::Abuse in the SUPPORT POD
section: the perldoc command example and the CPAN Testers Dependencies
URL both still named the old module. Both now correctly reference
Email::Abuse::Investigator.
New features
- Added Blogger/Blogspot and Google Sites to the built-in provider table
alongside the existing Google entries:
blogspot.com -> abuse@google.com
blogger.com -> abuse@google.com
sites.google.com -> abuse@google.com
Blogspot is one of the most commonly abused free hosting platforms for
spam landing pages. Subdomains (e.g. ruseriver.blogspot.com) are
resolved to blogspot.com by the existing subdomain-stripping logic.
Note: google.com is in %TRUSTED_DOMAINS and is therefore excluded from
the domain intelligence pipeline; these entries are effective via the
URL-host and account-provider lookup routes in abuse_contacts().
- Documented that the {logger} constructor slot may be populated by
Object::Configure from a configuration file, allowing log output to
be routed through any Log::* compatible logger rather than STDERR.
0.04 Fri Mar 27 22:01:05 EDT 2026
Bug fixes
- Fixed abuse_contacts() silently discarding discovery routes that resolve
to an address already seen. When the same abuse address is found via
multiple routes (e.g. Google as both the sending ISP via rDNS and the
owner of a blogspot.com URL in the body), the second and subsequent
roles are now accumulated rather than dropped. Each hashref in the
returned list gains a 'roles' arrayref holding the individual role
strings, and 'role' (singular) is set to their join(' and ', ...) for
backward compatibility. The dry-run footer in submit_abuse_report.pl
now reflects this: a merged entry shows both roles on one line and the
total line reads "N recipients (M contact routes merged)" when merging
has occurred.
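The accumulation of roles can be sketched as follows. merge_contacts()
and the 'address' key are hypothetical names; only 'role' and 'roles'
come from the entry above.

```perl
use strict;
use warnings;

# Each input hashref: { address => ..., role => ... }.
sub merge_contacts {
    my @found = @_;
    my (%by_addr, @order);
    for my $c (@found) {
        my $key = lc $c->{address};
        if (my $existing = $by_addr{$key}) {
            # Same address seen again: accumulate the role, do not drop it.
            push @{ $existing->{roles} }, $c->{role}
                unless grep { $_ eq $c->{role} } @{ $existing->{roles} };
        }
        else {
            push @order, $key;
            $by_addr{$key} = { address => $c->{address}, roles => [ $c->{role} ] };
        }
    }
    # 'role' (singular) stays available for backward compatibility.
    $_->{role} = join ' and ', @{ $_->{roles} } for values %by_addr;
    return map { $by_addr{$_} } @order;
}
```

Two routes that both resolve to abuse@google.com then produce a single
entry whose role reads, for example, "Sending ISP and URL host:
ruseriver.blogspot.com".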
- Fixed _decode_multipart() not recursing into nested multipart/* parts.
A message with Content-Type: multipart/mixed containing a nested
multipart/alternative (a common structure for HTML+plaintext mail) had
its body silently discarded, causing embedded_urls() to find no URLs
and abuse_contacts() to miss all URL-host contacts. _decode_multipart()
now detects nested multipart/* parts, extracts the inner boundary from
the Content-Type header, and recurses to decode the inner container.
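A simplified sketch of the recursion, not the real _decode_multipart().
collect_leaf_parts() is a hypothetical helper; it ignores transfer
encodings and assumes the inner and outer boundaries do not share a
prefix.

```perl
use strict;
use warnings;

# Returns the bodies of all non-multipart parts, descending into nested
# multipart/* containers.
sub collect_leaf_parts {
    my ($content_type, $body) = @_;
    return ($body) unless $content_type =~ m{\Amultipart/}i;
    my ($boundary) = $content_type =~ /boundary\s*=\s*"?([^";\s]+)"?/i or return;
    my @chunks = split /^--\Q$boundary\E(?:--)?[ \t]*\r?\n?/m, $body;
    shift @chunks;                                   # discard the preamble
    my @leaves;
    for my $chunk (@chunks) {
        my ($head, $part_body) = split /\r?\n\r?\n/, $chunk, 2;
        next unless defined $part_body;              # epilogue or malformed part
        my ($ct) = $head =~ /^Content-Type:\s*(.+(?:\r?\n[ \t].+)*)/mi;
        $ct = 'text/plain' unless defined $ct;
        push @leaves, collect_leaf_parts($ct, $part_body);   # recurse if nested
    }
    return @leaves;
}
```

For a multipart/mixed message wrapping a multipart/alternative, this
returns both the text/plain and the text/html bodies rather than nothing.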
- Fixed abuse_contacts() section 4 (account provider lookup) incorrectly
matching the domain of an @ sign appearing in a display name rather than
the actual addr-spec. A From: header of the form:
"evil@gmail.com" <real@hotmail.com>
was matching gmail.com instead of hotmail.com. The addr-spec is now
extracted from the rightmost angle-bracket pair before the domain is
parsed; without angle brackets the whole value is used as before.
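A short sketch of the corrected extraction, using a hypothetical
addr_spec_domain() helper rather than the module's own code:

```perl
use strict;
use warnings;

sub addr_spec_domain {
    my ($header_value) = @_;
    my $addr = $header_value;
    # The greedy .* makes this capture the rightmost <...> pair; without
    # angle brackets the whole value is used.
    if ($header_value =~ /.*<([^<>]*)>/s) {
        $addr = $1;
    }
    my ($domain) = $addr =~ /\@([^@\s>]+)\s*\z/ or return;
    return lc $domain;
}
```

For the From: value shown above this returns hotmail.com, because the
quoted display name is never consulted.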
New features
- Added implausible_timezone (MEDIUM, weight 2) risk flag. Numeric
timezone offsets in the Date: header are now validated against the
real-world range of +1400 (Line Islands) to -1200 (Baker Island).
Offsets outside that range, or with a minutes field >= 60, raise this
flag. Positive and negative bounds are checked separately; a symmetric
limit would wrongly accept values such as -1300.
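The asymmetric bounds check can be sketched like this. The helper name
implausible_timezone() mirrors the flag name but is hypothetical, and the
pattern assumes the numeric offset ends the Date: value, optionally
followed by a parenthesised comment.

```perl
use strict;
use warnings;

sub implausible_timezone {
    my ($date_header) = @_;
    my ($sign, $hh, $mm) =
        $date_header =~ /([+-])(\d{2})(\d{2})\s*(?:\([^)]*\))?\s*\z/
        or return 0;                        # no numeric offset to judge
    return 1 if $mm >= 60;                  # minutes field out of range
    my $offset = $hh * 60 + $mm;
    # Separate limits: +1400 on the positive side, -1200 on the negative.
    my $limit = $sign eq '+' ? 14 * 60 : 12 * 60;
    return $offset > $limit ? 1 : 0;
}
```

Under this sketch +1400 and -1200 pass, while -1300, +1500, and +0575
are flagged.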
- Added Blogger/Blogspot and Google Sites to the built-in provider table:
blogspot.com -> abuse@google.com
blogger.com -> abuse@google.com
sites.google.com -> abuse@google.com
Blogspot subdomains (e.g. ruseriver.blogspot.com) are handled by the
existing subdomain-stripping logic.
- Added ActiveCampaign to the built-in provider table:
activecampaign.com -> abuse@activecampaign.com
ac-tinker.com -> abuse@activecampaign.com (tracking domain)
0.03 Fri Mar 27 19:54:32 EDT 2026
Bug fixes
- Fixed spurious abuse reports being sent to the registrar or ISP of the
message recipient. Bulk mailers routinely embed the recipient's email
address in the message body (personalisation footers, unsubscribe
confirmations, "this email was sent to you@example.com" lines).
_extract_and_analyse_domains() was collecting domains from the body
without first excluding the To: and Cc: recipients, causing innocent
parties to receive abuse reports. The To:, Cc:, and Received: "for"
envelope-recipient domains are now built into an exclusion set --
including their registrable eTLD+1 parents -- before any body or header
scanning takes place.
- Fixed "no abuse contacts could be determined" when analysing email
sent via Salesforce Marketing Cloud (ExactTarget). Three separate
causes were identified and corrected:
1. Salesforce Marketing Cloud was absent from the built-in provider
table. Added salesforce.com, mc.salesforce.com, exacttarget.com,
and et.exacttarget.com, all mapping to abuse@salesforce.com.
2. Non-routable hostnames such as iad4s13mta756.xt.local (injected
by Salesforce's MTA into the Message-ID) were passing through the
domain collection pipeline and consuming a WHOIS lookup slot that
could never return an actionable result. The $record closure in
_extract_and_analyse_domains() now rejects any domain whose TLD is
not at least two alphabetic characters, and explicitly rejects the
pseudo-TLDs .local, .internal, .lan, .localdomain, and .arpa.
3. When a message carries multiple DKIM-Signature headers (common
with ESPs: the first signs for the customer domain, the second
for the ESP infrastructure), _parse_auth_results_cached() took
only the first d= tag and stopped. It now collects all d= domains
and sets dkim_domain to whichever one has a hit in the provider
table -- identifying the actionable ESP -- falling back to the
first if none match. All collected domains are fed into the
domain analysis pipeline via the new dkim_domains arrayref in the
auth results hashref.
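Causes 2 and 3 above can be sketched as two small helpers. Both names,
is_routable_domain() and pick_dkim_domain(), are hypothetical, and the
TLD test assumes "alphabetic" means ASCII letters only; the module's
provider lookup may also strip subdomains, which this sketch does not.

```perl
use strict;
use warnings;

my %PSEUDO_TLD = map { $_ => 1 } qw(local internal lan localdomain arpa);

# Cause 2: reject hostnames that can never yield an actionable WHOIS result.
sub is_routable_domain {
    my ($domain) = @_;
    my ($tld) = lc($domain) =~ /\.([^.]+)\z/ or return 0;
    return 0 unless $tld =~ /\A[a-z]{2,}\z/;
    return 0 if $PSEUDO_TLD{$tld};
    return 1;
}

# Cause 3: prefer the first d= domain found in the provider table,
# falling back to the first d= domain seen.
sub pick_dkim_domain {
    my ($provider_table, @domains) = @_;
    return unless @domains;
    for my $d (@domains) {
        return $d if $provider_table->{ lc $d };
    }
    return $domains[0];
}
```

For example, iad4s13mta756.xt.local is rejected by the first helper, and
the second returns exacttarget.com when it follows a customer domain that
has no provider-table entry.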
- The --dry-run output of submit_abuse_report.pl now appends a compact
recipient summary at the foot of the report:
Total: 2 recipients
abuse@tpg.com.au (Sending ISP)
abuse@godaddy.com (Domain registrar for firmluminary.com)
Previously only the count was shown. The summary allows a user to
confirm at a glance who would receive reports without scrolling back
through the full numbered table.
- submit_abuse_report now produces fully RFC 5965 (ARF) compliant
messages. The MIME structure changed from multipart/mixed (two parts)
to multipart/report; report-type=feedback-report (three parts):
Part 1 text/plain human-readable abuse report
Part 2 message/feedback-report ARF machine-readable metadata
Part 3 message/rfc822 original spam message verbatim
The feedback-report part includes Feedback-Type, Version, User-Agent,
Source-IP, Original-Mail-From, Original-Rcpt-To, Arrival-Date,
Reported-Domain, Reported-Uri (one per URL), and Authentication-Results.
0.02 Fri Mar 27 19:04:37 EDT 2026
- Added bin/submit_abuse_report
0.01 Fri Mar 27 14:23:09 EDT 2026
First draft