NAME

Email::Abuse::Investigator - Analyse spam email to identify originating hosts, hosted URLs, and suspicious domains

VERSION

Version 0.05

SYNOPSIS

use Email::Abuse::Investigator;

my $analyser = Email::Abuse::Investigator->new( verbose => 1 );
$analyser->parse_email($raw_email_text);

# Originating IP and its network owner
my $origin = $analyser->originating_ip();

# All HTTP/HTTPS URLs found in the body
my @urls  = $analyser->embedded_urls();

# All domains extracted from mailto: links and bare addresses in the body
my @mdoms = $analyser->mailto_domains();

# All domains mentioned anywhere (union of the above)
my @adoms = $analyser->all_domains();

# Full printable report
print $analyser->report();

DESCRIPTION

Email::Abuse::Investigator examines the raw source of a spam or phishing email and answers the questions abuse investigators ask:

METHODS

new( %options )

Constructs and returns a new Email::Abuse::Investigator analyser object. The object is stateless until parse_email() is called; all analysis results are stored on the object and retrieved via the public accessor methods documented below.

A single object may be reused for multiple emails by calling parse_email() again: all cached state from the previous message is discarded automatically.

Usage

# Minimal -- all options take safe defaults
my $analyser = Email::Abuse::Investigator->new();

# With options
my $analyser = Email::Abuse::Investigator->new(
    timeout        => 15,
    trusted_relays => ['203.0.113.0/24', '10.0.0.0/8'],
    verbose        => 0,
);

$analyser->parse_email($raw_rfc2822_text);
my $origin   = $analyser->originating_ip();
my @urls     = $analyser->embedded_urls();
my @domains  = $analyser->mailto_domains();
my $risk     = $analyser->risk_assessment();
my @contacts = $analyser->abuse_contacts();
print $analyser->report();

Arguments

All arguments are optional named parameters passed as a flat key-value list.

Returns

A blessed Email::Abuse::Investigator object. The object is immediately usable; no network I/O is performed during construction.

Side Effects

None. The constructor performs no I/O. All network activity is deferred until the first call to a method that requires it (originating_ip(), embedded_urls(), mailto_domains(), or any method that calls them).

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
{
    timeout => {
        type     => SCALAR,
        regex    => qr/^\d+$/,
        optional => 1,
        default  => 10,
    },
    trusted_relays => {
        type     => ARRAYREF,
        optional => 1,
        default  => [],
        # Each element: exact IPv4 address or CIDR in the form a.b.c.d/n
        # where n is an integer in the range 0..32
    },
    verbose => {
        type     => SCALAR,
        regex    => qr/^[01]$/,
        optional => 1,
        default  => 0,
    },
}
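
The element format described in the comments above (an exact IPv4 address, or a.b.c.d/n with n in 0..32) can be checked and matched with plain integer arithmetic. The following is a minimal sketch of one way to do that, not the module's own code; ip_to_int() and ip_in_trusted() are hypothetical helper names introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: dotted-quad string -> 32-bit integer, or undef if malformed.
sub ip_to_int {
    my ($ip) = @_;
    return unless defined $ip
        && $ip =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
    my @octets = ($1, $2, $3, $4);
    return if grep { $_ > 255 } @octets;
    return ($octets[0] << 24) | ($octets[1] << 16) | ($octets[2] << 8) | $octets[3];
}

# Hypothetical helper: true if $ip equals, or falls inside, any entry of
# the trusted_relays-style arrayref. Malformed entries are skipped.
sub ip_in_trusted {
    my ($ip, $trusted) = @_;
    my $addr = ip_to_int($ip);
    return 0 unless defined $addr;
    for my $entry (@$trusted) {
        my ($net, $bits) = split m{/}, $entry, 2;
        $bits = 32 unless defined $bits;          # exact address == /32
        next unless $bits =~ /^\d+$/ && $bits <= 32;
        my $base = ip_to_int($net);
        next unless defined $base;
        my $mask = $bits == 0 ? 0 : (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF;
        return 1 if ($addr & $mask) == ($base & $mask);
    }
    return 0;
}

my @trusted = ('203.0.113.0/24', '10.0.0.0/8');
for my $ip ('203.0.113.45', '10.1.2.3', '198.51.100.7') {
    printf "%-15s %s\n", $ip, ip_in_trusted($ip, \@trusted) ? 'trusted' : 'external';
}
```

Treating a bare address as a /32 keeps the two accepted forms on a single code path.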

Output

# Return::Set compatible specification
{
    type  => 'Email::Abuse::Investigator',  # blessed object
    isa   => 'Email::Abuse::Investigator',

    # Guaranteed slots on the returned object (public API):
    #   timeout        => non-negative integer
    #   trusted_relays => arrayref of strings
    #   verbose        => 0 or 1
    #
    # All other slots are private (_raw, _headers, etc.) and
    # must not be accessed or modified by the caller.
}

parse_email( $text )

Feeds a raw RFC 2822 email message to the analyser and prepares it for subsequent interrogation. This is the only method that must be called before any other public method; all analysis is driven by the message supplied here.

If the same object is used for a second message, calling parse_email() again completely replaces all state from the first message. No trace of the previous email survives.

Usage

# From a scalar
my $raw = do { local $/; <STDIN> };
$analyser->parse_email($raw);

# From a scalar reference (avoids copying large messages)
$analyser->parse_email(\$raw);

# Chained with new()
my $analyser = Email::Abuse::Investigator->new()->parse_email($raw);

# Re-use the same object for multiple messages
while (my $msg = $queue->next()) {
    $analyser->parse_email($msg->raw_text());
    my $risk = $analyser->risk_assessment();
    report_if_spam($analyser) if $risk->{level} ne 'INFO';
}

Arguments

Returns

The object itself ($self), allowing method chaining:

my $origin = Email::Abuse::Investigator->new()->parse_email($raw)->originating_ip();

Side Effects

The following work is performed synchronously, with no network I/O:

All network I/O (DNS lookups, WHOIS/RDAP queries) is deferred; it occurs only when a caller first invokes originating_ip(), embedded_urls(), or mailto_domains().

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# (positional argument, not named)
[
    {
        type => SCALAR | SCALARREF,
        # SCALAR:    the complete raw email text
        # SCALARREF: reference to the complete raw email text;
        #            the referent must be a defined string
        # Both LF and CRLF line endings are accepted.
    },
]
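
As an illustration of the input contract above, the sketch below accepts either a scalar or a scalar reference and separates headers from body with a pattern that tolerates both LF and CRLF line endings. split_message() is a hypothetical helper, not part of the module's API, and the module's own parsing may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: accept SCALAR or SCALARREF, return (headers, body).
sub split_message {
    my ($arg) = @_;
    my $text = ref($arg) eq 'SCALAR' ? $$arg : $arg;
    die "raw email text must be a defined, non-reference string\n"
        if !defined $text || ref $text;
    # The first empty line separates headers from body; allow \n or \r\n.
    my ($head, $body) = split /\r?\n\r?\n/, $text, 2;
    return ($head // '', $body // '');
}

my $raw = "Subject: test\r\nFrom: a\@example.com\r\n\r\nHello\r\n";
my ($head, $body) = split_message(\$raw);   # by reference: no copy at the call site
print "headers: $head\n";
```

The input string is never modified, so a caller that needs the verbatim text still has it.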

Output

# Return::Set compatible specification
{
    type => 'Email::Abuse::Investigator',  # the invocant, returned for chaining
    isa  => 'Email::Abuse::Investigator',

    # Guaranteed post-conditions on the returned object:
    #   sending_software()  returns a (possibly empty) list
    #   received_trail()    returns a (possibly empty) list
    #   All lazy-analysis caches are reset (undef or empty)
    #   _raw contains the verbatim input text
}

originating_ip()

Identifies the IP address of the machine that originally injected the message into the mail system, as opposed to any intermediate relay that passed it along. This is the address of the spammer's machine, their ISP's outbound mail server, or a compromised host -- the primary target for an ISP abuse report.

The method walks the Received: chain from oldest to newest, skips every hop whose IP is in a private, reserved, or trusted range, and returns the first remaining (external) IP, enriched with reverse DNS, network ownership, and abuse contact information gathered via rDNS, RDAP, and WHOIS.

If no usable IP can be found in the Received: chain, the method falls back to the X-Originating-IP header injected by some webmail providers.

The result is computed once and cached; subsequent calls on the same object return the same hashref without repeating any network I/O.

Usage

$analyser->parse_email($raw);
my $orig = $analyser->originating_ip();

if (defined $orig) {
    printf "Origin: %s (%s)\n",   $orig->{ip},  $orig->{rdns};
    printf "Owner:  %s\n",        $orig->{org};
    printf "Abuse:  %s\n",        $orig->{abuse};
    printf "Confidence: %s\n",    $orig->{confidence};
} else {
    print "Could not determine originating IP.\n";
}

# Confidence-gated reporting
if (defined $orig && $orig->{confidence} eq 'high') {
    send_abuse_report($orig->{abuse}, $analyser->abuse_report_text());
}

Arguments

None. parse_email() must have been called first.

Returns

{
  ip         => '209.85.218.67',
  rdns       => 'mail-ej1-f67.google.com',
  org        => 'Google LLC',
  abuse      => 'network-abuse@google.com',
  confidence => 'high',
  note       => 'First external hop in Received: chain',
}

On success, a hashref with the following keys (all always present):

Returns undef if no suitable originating IP can be determined (no Received: headers, all IPs are private or trusted, no usable X-Originating-IP header, or parse_email() has not been called).

Side Effects

The first call (or the first call after a parse_email()) performs the following network I/O, subject to the timeout set at construction:

All subsequent calls return the cached hashref. The cache is invalidated by parse_email().
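
The compute-once behaviour described above follows a common lazy-caching pattern. The sketch below shows that pattern in isolation; Demo::LazyAnalyser and its _expensive_lookup() method are hypothetical stand-ins, and the slot names are not taken from the module.

```perl
use strict;
use warnings;

package Demo::LazyAnalyser;    # hypothetical class, for illustration only

sub new { return bless { lookups => 0 }, shift }

# Storing a new message discards the cached result.
sub parse_email {
    my ($self, $text) = @_;
    $self->{_raw} = $text;
    delete $self->{_origin};
    return $self;
}

# Compute on the first call, then serve the cached value.
# exists() is used so that a cached undef is also honoured.
sub originating_ip {
    my ($self) = @_;
    $self->{_origin} = $self->_expensive_lookup()
        unless exists $self->{_origin};
    return $self->{_origin};
}

# Stand-in for the DNS/RDAP/WHOIS work; counts how often it runs.
sub _expensive_lookup {
    my ($self) = @_;
    $self->{lookups}++;
    return { ip => '192.0.2.1' };
}

package main;

my $demo = Demo::LazyAnalyser->new->parse_email('first message');
$demo->originating_ip() for 1 .. 3;
print "lookups after three calls: $demo->{lookups}\n";    # 1
$demo->parse_email('second message')->originating_ip();
print "lookups after re-parse:    $demo->{lookups}\n";    # 2
```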

Algorithm: Received: chain traversal

The Received: headers are walked from bottom (oldest) to top (most recent). For each header, the first IPv4 address is extracted in priority order:

An extracted IP is discarded if it:

All non-discarded IPs are collected; the first (oldest) one is reported as the origin. The count of non-discarded IPs determines the confidence level.
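
The traversal can be sketched as follows. This is a simplified illustration rather than the module's implementation: it takes only the first bracketed dotted quad from each header instead of applying the full priority order, is_private_ip() covers only a few well-known ranges, and both function names are hypothetical. The mapping from candidate count to confidence level is deliberately left out.

```perl
use strict;
use warnings;

# Assumption: RFC 1918, loopback and link-local only; the module's own
# private/reserved list may be longer.
sub is_private_ip {
    my ($ip) = @_;
    return $ip =~ /^(?:10\.|127\.|169\.254\.|192\.168\.|172\.(?:1[6-9]|2\d|3[01])\.)/ ? 1 : 0;
}

# $received: arrayref of Received: values in message order (newest first).
# $is_trusted: optional code ref returning true for trusted relay IPs.
# Returns (oldest external IP or undef, number of external IPs seen).
sub first_external_ip {
    my ($received, $is_trusted) = @_;
    my @candidates;
    for my $hdr (reverse @$received) {              # oldest hop first
        my ($ip) = $hdr =~ /\[(\d{1,3}(?:\.\d{1,3}){3})\]/ or next;
        next if is_private_ip($ip);
        next if $is_trusted && $is_trusted->($ip);
        push @candidates, $ip;
    }
    return ($candidates[0], scalar @candidates);
}

my @received = (
    'from mx.internal (mx.internal [10.0.0.5]) by inbound.example.net',
    'from relay.example.org (relay.example.org [198.51.100.20]) by mx.internal',
    'from sender.example (sender.example [203.0.113.9]) by relay.example.org',
);
my ($origin, $count) = first_external_ip(\@received);
print "origin $origin, $count external hop(s)\n";
```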

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments; invocant must be a Email::Abuse::Investigator object
# on which parse_email() has previously been called.
[]

Output

# Return::Set compatible specification

# On success:
{
    type => HASHREF,
    keys => {
        ip => {
            type  => SCALAR,
            regex => qr/^\d{1,3}(?:\.\d{1,3}){3}$/,  # dotted-quad IPv4
        },
        rdns => {
            type  => SCALAR,
            # hostname string, or the literal '(no reverse DNS)'
        },
        org => {
            type  => SCALAR,
            # organisation name, or the literal '(unknown)'
        },
        abuse => {
            type  => SCALAR,
            # email address, or the literal '(unknown)'
        },
        confidence => {
            type  => SCALAR,
            regex => qr/^(?:high|medium|low)$/,
        },
        note => {
            type => SCALAR,
        },
        country => {
            type     => SCALAR,
            optional => 1,  # present but may be undef
            regex    => qr/^[A-Z]{2}$/,
        },
    },
}

# On failure (no usable IP found):
undef

embedded_urls()

Extracts every HTTP and HTTPS URL from the message body and enriches each one with the hosting IP address, network organisation name, abuse contact, and country code of the web server it points to.

URL extraction runs across both the plain-text and HTML parts of the message. When HTML::LinkExtor is available, HTML href, src, and action attributes are parsed structurally; a plain-text regex pass then catches any remaining bare URLs in both parts.

Each unique URL is returned as a separate hashref. When multiple distinct URLs share the same hostname, DNS resolution and WHOIS are performed only once for that hostname; all URLs on that host share the cached result.

The result list is computed once and cached; subsequent calls on the same object return the same data without repeating any network I/O.

Usage

$analyser->parse_email($raw);
my @urls = $analyser->embedded_urls();

if (@urls) {
    for my $u (@urls) {
        printf "URL:   %s\n", $u->{url};
        printf "Host:  %s  IP: %s\n", $u->{host}, $u->{ip};
        printf "Owner: %s\n", $u->{org};
        printf "Abuse: %s\n", $u->{abuse};
        print  "\n";
    }
} else {
    print "No HTTP/HTTPS URLs found.\n";
}

# Collect unique abuse contacts from URL hosts
my %seen;
my @url_contacts = grep { !$seen{$_}++ }
                   map  { $_->{abuse} }
                   grep { $_->{abuse} ne '(unknown)' }
                   @urls;

# Check for URL shorteners
my @shorteners = grep { $_->{host} =~ /bit\.ly|tinyurl/i } @urls;
warn "Message contains URL shortener(s)\n" if @shorteners;

Arguments

None. parse_email() must have been called first.

Returns

A list (not an arrayref) of hashrefs, one per unique URL found in the body, in the order they were first encountered. Returns an empty list if the body contains no HTTP or HTTPS URLs, or if parse_email() has not been called.

{
    url   => 'https://spamsite.example/offer',
    host  => 'spamsite.example',
    ip    => '198.51.100.7',
    org   => 'Dodgy Hosting Ltd',
    abuse => 'abuse@dodgy.example',
}

Each hashref contains the following keys (all always present):

Side Effects

The first call (or first call after parse_email()) performs network I/O for each unique hostname found, subject to the timeout set at construction. For each unique hostname:

DNS and WHOIS are performed at most once per unique hostname per parse_email() call, regardless of how many distinct URLs share that hostname. All subsequent calls return the cached list. The cache is invalidated by parse_email().

Algorithm: URL extraction

URLs are extracted from the concatenation of the decoded plain-text body and the decoded HTML body, in that order. The two extraction passes are:

After both passes, the combined list is deduplicated (preserving first-seen order) and trailing punctuation is stripped from each URL. The host is then extracted and used as a cache key for DNS and WHOIS lookups.
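
The regex pass, punctuation stripping, de-duplication, and per-host caching can be sketched as below. This is an illustration under stated assumptions, not the module's code: extract_urls() and resolve_hosts() are hypothetical names, the URL pattern is a simplification, and trailing punctuation is stripped before de-duplication here so that "https://a.example/x" and "https://a.example/x." count as the same URL.

```perl
use strict;
use warnings;

# Pull http/https URLs out of text, strip trailing punctuation,
# and de-duplicate while keeping first-seen order.
sub extract_urls {
    my ($text) = @_;
    my (%seen, @urls);
    while ($text =~ m{(https?://[^\s<>"']+)}gi) {
        my $url = $1;
        $url =~ s/[.,;:!?)\]]+$//;    # sentence punctuation, not part of the URL
        next if $seen{$url}++;
        my ($host) = $url =~ m{^https?://(?:[^/\@\s]*\@)?([^/:?#\s]+)}i;
        next unless defined $host;
        push @urls, { url => $url, host => lc $host };
    }
    return @urls;
}

# Resolve each distinct host at most once. $resolver is a caller-supplied
# code ref standing in for the real DNS lookup. Returns the host count.
sub resolve_hosts {
    my ($urls, $resolver) = @_;
    my %cache;
    for my $u (@$urls) {
        $cache{ $u->{host} } //= $resolver->($u->{host});
        $u->{ip} = $cache{ $u->{host} };
    }
    return scalar keys %cache;
}

my @urls = extract_urls(
    'Visit https://spamsite.example/offer, or http://spamsite.example/other.'
);
my $calls = 0;
resolve_hosts(\@urls, sub { $calls++; '198.51.100.7' });
printf "%d URL(s), %d lookup(s)\n", scalar @urls, $calls;
```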

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification

# A list (possibly empty) of hashrefs:
(
    {
        type => HASHREF,
        keys => {
            url => {
                type  => SCALAR,
                regex => qr{^https?://}i,
            },
            host => {
                type  => SCALAR,
                # hostname without port; no leading scheme
            },
            ip => {
                type  => SCALAR,
                # dotted-quad IPv4, or the literal '(unresolved)'
            },
            org => {
                type  => SCALAR,
                # organisation name, or the literal '(unknown)'
            },
            abuse => {
                type  => SCALAR,
                # email address, or the literal '(unknown)'
            },
            country => {
                type     => SCALAR,
                optional => 1,  # present but may be undef
                regex    => qr/^[A-Z]{2}$/,
            },
        },
    },
    # ... one hashref per unique URL, in first-seen order
)

# Empty list when no HTTP/HTTPS URLs are present in the body.

mailto_domains()

Identifies every domain associated with the message as a contact, reply, or delivery address, then runs a full intelligence pipeline on each one to determine who hosts its web server, who handles its mail, who operates its DNS, and who registered it.

This answers POD description item 3: "Who owns the reply-to / contact domains?" A spammer may use one sending IP but route replies through an entirely different organisation's infrastructure. This method surfaces all of those parties so each can be contacted independently.

The result is computed once and cached; subsequent calls on the same object return the same list without repeating any network I/O.

Usage

$analyser->parse_email($raw);
my @domains = $analyser->mailto_domains();

for my $d (@domains) {
    printf "Domain : %s  (found in %s)\n", $d->{domain}, $d->{source};
    printf "  Web  : %s  owned by %s\n",   $d->{web_ip}  // 'none',
                                            $d->{web_org} // 'unknown';
    printf "  MX   : %s\n", $d->{mx_host} // 'none';
    printf "  Reg  : %s  (registered %s)\n", $d->{registrar}  // 'unknown',
                                              $d->{registered} // 'unknown';
    if ($d->{recently_registered}) {
        print  "  *** RECENTLY REGISTERED -- possible phishing domain ***\n";
    }
    print "\n";
}

# Collect registrar abuse contacts
my @reg_contacts = map  { $_->{registrar_abuse} }
                   grep { defined $_->{registrar_abuse} }
                   @domains;

# Find recently registered domains
my @fresh = grep { $_->{recently_registered} } @domains;

Arguments

None. parse_email() must have been called first.

Returns

A list (not an arrayref) of hashrefs, one per unique non-infrastructure domain, in the order each domain was first encountered across all sources. Returns an empty list if no qualifying domains are found, or if parse_email() has not been called.

{
    domain      => 'sminvestmentsupplychain.com',
    source      => 'mailto in body',

    # Web hosting
    web_ip      => '104.21.30.10',
    web_org     => 'Cloudflare Inc',
    web_abuse   => 'abuse@cloudflare.com',

    # Mail hosting (MX)
    mx_host     => 'mail.example.com',
    mx_ip       => '198.51.100.5',
    mx_org      => 'Hosting Corp',
    mx_abuse    => 'abuse@hostingcorp.example',

    # DNS authority (NS)
    ns_host     => 'ns1.example.com',
    ns_ip       => '198.51.100.1',
    ns_org      => 'DNS Provider Inc',
    ns_abuse    => 'abuse@dnsprovider.example',

    # Domain registration (WHOIS)
    registrar   => 'GoDaddy.com LLC',
    registered  => '2024-11-01',
    expires     => '2025-11-01',
    recently_registered => 1,   # flag: < 180 days old

    # Raw domain WHOIS text (first 2 KB)
    whois_raw   => '...',
}

Each hashref contains the following keys. Keys marked "(optional)" are absent from the hashref when the corresponding information is unavailable; test with exists $d->{key} or defined $d->{key} as appropriate.

Side Effects

The first call (or first call after parse_email()) performs network I/O for each unique domain collected, subject to the timeout set at construction. For each domain:

In the worst case (all records present, all IPs distinct, RDAP unavailable), each domain incurs: 3 A lookups + 1 MX lookup + 1 NS lookup + 3 WHOIS IP queries (6 TCP connections in total) + 2 domain WHOIS queries (2 TCP connections) = 5 DNS lookups and 8 TCP connections, or up to 13 network operations. In practice, shared hosting and cached DNS reduce this considerably.

All results are cached per domain within a single parse_email() lifetime. The cache is invalidated by parse_email().

Domain collection sources

Domains are collected from the following sources, in this order. A domain that appears in multiple sources is recorded only once, with the source label of its first occurrence.

In all cases, domain names are lower-cased, trailing dots are stripped, and domains in the infrastructure exclusion list are silently discarded.
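
The normalisation and first-occurrence rules above can be sketched as follows. collect_domains() is a hypothetical helper; the exclusion list shown is a placeholder rather than the module's actual list, and the 'Reply-To: header' label is invented for the example (only 'mailto in body' appears in this documentation).

```perl
use strict;
use warnings;

# Placeholder exclusion list; the module's real list is not reproduced here.
my %INFRASTRUCTURE = map { $_ => 1 } qw(w3.org schema.org);

# Takes a list of [ $domain, $source_label ] pairs in collection order.
# Returns hashrefs with the domain normalised and the first source kept.
sub collect_domains {
    my (@pairs) = @_;
    my (%seen, @out);
    for my $pair (@pairs) {
        my ($domain, $source) = @$pair;
        next unless defined $domain && length $domain;
        $domain = lc $domain;
        $domain =~ s/\.+$//;                 # strip trailing dot(s)
        next if $INFRASTRUCTURE{$domain};    # silently discard
        next if $seen{$domain}++;            # first occurrence wins
        push @out, { domain => $domain, source => $source };
    }
    return @out;
}

my @found = collect_domains(
    [ 'SpamCo.Example.', 'mailto in body'   ],
    [ 'spamco.example',  'Reply-To: header' ],   # hypothetical label
    [ 'w3.org',          'mailto in body'   ],
);
print "$_->{domain} (found in $_->{source})\n" for @found;
```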

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification

# A list (possibly empty) of hashrefs, one per domain:
(
    {
        type => HASHREF,
        keys => {
            # Always present:
            domain => { type => SCALAR },
            source => { type => SCALAR },

            # Optional -- absent when information is unavailable:
            web_ip    => { type => SCALAR, optional => 1,
                           regex => qr/^\d{1,3}(?:\.\d{1,3}){3}$/ },
            web_org   => { type => SCALAR, optional => 1 },
            web_abuse => { type => SCALAR, optional => 1 },

            mx_host  => { type => SCALAR, optional => 1 },
            mx_ip    => { type => SCALAR, optional => 1,
                          regex => qr/^\d{1,3}(?:\.\d{1,3}){3}$/ },
            mx_org   => { type => SCALAR, optional => 1 },
            mx_abuse => { type => SCALAR, optional => 1 },

            ns_host  => { type => SCALAR, optional => 1 },
            ns_ip    => { type => SCALAR, optional => 1,
                          regex => qr/^\d{1,3}(?:\.\d{1,3}){3}$/ },
            ns_org   => { type => SCALAR, optional => 1 },
            ns_abuse => { type => SCALAR, optional => 1 },

            registrar       => { type => SCALAR, optional => 1 },
            registrar_abuse => { type => SCALAR, optional => 1 },

            registered => { type => SCALAR, optional => 1,
                            regex => qr/^\d{4}-\d{2}-\d{2}$/ },
            expires    => { type => SCALAR, optional => 1,
                            regex => qr/^\d{4}-\d{2}-\d{2}$/ },

            recently_registered => { type => SCALAR, optional => 1,
                                     regex => qr/^1$/ },

            whois_raw => { type => SCALAR, optional => 1 },
        },
    },
    # ... one hashref per unique domain, in first-seen order
)

# Empty list when no qualifying domains are found.

all_domains()

Returns the union of every registrable domain seen anywhere in the message: URL hosts from embedded_urls() and contact domains from mailto_domains(), collapsed to their registrable eTLD+1 form and deduplicated.

This is the high-level answer to "what domains does this message reference?" It is suitable for bulk lookups, domain reputation checks, or feeds into external threat-intelligence systems where you want a flat, deduplicated list rather than the detailed per-domain hashrefs returned by the individual methods.

Unlike mailto_domains(), this method triggers no additional network I/O beyond what embedded_urls() and mailto_domains() already perform; it is a pure in-memory union and normalisation of their results.

Usage

$analyser->parse_email($raw);
my @domains = $analyser->all_domains();

# Print every unique registrable domain
print "$_\n" for @domains;

# Feed into a reputation lookup
for my $dom (@domains) {
    my $score = $reputation_api->lookup($dom);
    warn "Known bad domain: $dom\n" if $score > 0.8;
}

# Check for overlap with a known-bad domain list
my %blocklist = map { $_ => 1 } @known_bad_domains;
my @hits = grep { $blocklist{$_} } @domains;

Arguments

None. parse_email() must have been called first. Calling all_domains() before embedded_urls() or mailto_domains() is safe; it will trigger both lazily.

Returns

A list (not an arrayref) of plain strings, each being a registrable eTLD+1 domain name (see Algorithm below), lower-cased, with no duplicates, in first-seen order. Returns an empty list if the message contains no URLs and no contact domains, or if parse_email() has not been called.

The list contains plain scalars, not hashrefs. For the full intelligence detail associated with each domain, call embedded_urls() and mailto_domains() directly.

Side Effects

Triggers embedded_urls() and mailto_domains() if they have not already been called on the current message, which in turn performs network I/O as documented in those methods. No additional network I/O is performed beyond what those two methods require. Results are not independently cached; the caching is handled by embedded_urls() and mailto_domains().

Algorithm: eTLD+1 normalisation

Both input sources are normalised to their registrable domain (eTLD+1) before deduplication, using the following heuristic:

This heuristic handles the most common cases correctly. It is not a full Public Suffix List implementation; uncommon second-level delegations (e.g. .ltd.uk, .plc.uk, .asn.au) are not recognised. For a host under one of these, the result is the two-label suffix itself (for example, spamco.ltd.uk collapses to ltd.uk) rather than the three-label registrable domain.

The normalisation is applied to both sources:

This means a URL at www.spamco.example and a contact address at sub.spamco.example both collapse to spamco.example, and that domain appears only once in the result.
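
A heuristic of this general shape can be sketched as follows. The set of second-level labels and the rule that triggers a three-label result are assumptions made for the example, since the module's exact rules are not reproduced in this section; registrable_domain() is a hypothetical name.

```perl
use strict;
use warnings;

# Assumption: a short list of generic second-level labels under two-letter
# country-code TLDs. The module's actual list may differ.
my %SECOND_LEVEL = map { $_ => 1 } qw(co com net org gov edu ac);

# Returns the last two labels, or the last three when the host ends in
# <known second-level label>.<two-letter TLD>. Returns nothing for
# single-label or undefined input.
sub registrable_domain {
    my ($host) = @_;
    return unless defined $host;
    $host = lc $host;
    $host =~ s/\.+$//;
    my @labels = split /\./, $host;
    return if @labels < 2;
    my $keep = 2;
    $keep = 3
        if @labels >= 3
        && length($labels[-1]) == 2
        && $SECOND_LEVEL{ $labels[-2] };
    return join '.', @labels[ -$keep .. -1 ];
}

printf "%-22s -> %s\n", $_, registrable_domain($_)
    for qw(www.spamco.example sub.spamco.example shop.example.co.uk spamco.ltd.uk);
```

The last input shows the documented limitation: .ltd.uk is not in the list, so the result is ltd.uk.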

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification

# A list (possibly empty) of plain strings:
(
    {
        type  => SCALAR,
        regex => qr/^[a-z0-9](?:[a-z0-9.-]*[a-z0-9])?$/,
        # Lower-cased registrable domain; no trailing dot;
        # at least two dot-separated labels.
    },
    # ... one string per unique registrable domain, in first-seen order
)

# Empty list when the message contains no URLs and no contact domains.

sending_software()

Returns information extracted from headers that identify the software or server-side infrastructure used to compose or inject the message. These headers are injected by email clients, bulk-mailing libraries, and shared hosting control panels, and are often the most direct evidence of how the spam was sent and from which server.

Headers examined: X-Mailer, User-Agent, X-PHP-Originating-Script, X-Source, X-Source-Args, X-Source-Host.

The X-PHP-Originating-Script, X-Source, and X-Source-Host headers in particular are injected automatically by many shared hosting providers (cPanel, Plesk, DirectAdmin) and reveal the exact PHP script path and hostname responsible. A hosting abuse team can use these values to identify the compromised or malicious account immediately, without needing to search logs.

The data is extracted synchronously during parse_email() with no network I/O. This method simply returns the pre-built list.

Usage

$analyser->parse_email($raw);
my @sw = $analyser->sending_software();

for my $s (@sw) {
    printf "%-30s : %s\n", $s->{header}, $s->{value};
    printf "  Note: %s\n", $s->{note};
}

# Check for shared-hosting injection headers
my @hosting = grep {
    $_->{header} =~ /^x-(?:php-originating-script|source)/i
} @sw;

if (@hosting) {
    print "Shared-hosting script detected -- report to hosting abuse team:\n";
    print "  $_->{header}: $_->{value}\n" for @hosting;
}

# Extract the mailer name if present
my ($mailer) = grep { lc($_->{header}) eq 'x-mailer' } @sw;
printf "Sent with: %s\n", $mailer->{value} if $mailer;

Arguments

None. parse_email() must have been called first.

Returns

A list (not an arrayref) of hashrefs, one per recognised software-fingerprint header that was present in the message, in alphabetical order of header name. Returns an empty list if none of the watched headers are present, or if parse_email() has not been called.

{
    header => 'X-PHP-Originating-Script',
    value  => '1000:newsletter.php',
    note   => 'PHP script on shared hosting - report to hosting abuse team',
}

Each hashref contains exactly three keys, all always present:

Side Effects

None. All data is collected during parse_email() and this method only returns the pre-collected list. No network I/O is performed.

Algorithm: headers examined

The following six headers are examined during parse_email(). They are checked in alphabetical order; the result list preserves that order (i.e. user-agent appears before x-mailer which appears before x-php-originating-script, etc.). At most one entry per header name is produced even if the header appears more than once; the first occurrence is used.
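
The selection rule (alphabetical order, first occurrence only) can be sketched as below. This is an illustration, not the module's code: fingerprint_headers() is a hypothetical name, header names are lower-cased to match the Output specification, and the note annotation is omitted because its fixed strings are not reproduced here.

```perl
use strict;
use warnings;

# The six watched headers, sorted so results come out alphabetically.
my @WATCHED = sort qw(x-mailer user-agent x-php-originating-script
                      x-source x-source-args x-source-host);

# Takes the raw header block; returns one { header, value } hashref per
# watched header present, using the first occurrence of each.
sub fingerprint_headers {
    my ($header_block) = @_;
    (my $unfolded = $header_block) =~ s/\r?\n[ \t]+/ /g;   # unfold continuation lines
    my %first;
    for my $line (split /\r?\n/, $unfolded) {
        my ($name, $value) = $line =~ /^([!-9;-~]+):\s*(.*)$/ or next;
        $name = lc $name;
        $first{$name} = $value unless exists $first{$name};
    }
    return map  { +{ header => $_, value => $first{$_} } }
           grep { exists $first{$_} && length $first{$_} } @WATCHED;
}

my $headers = join "\n",
    'X-Mailer: PHPMailer 6.0',
    'Subject: hello',
    'User-Agent: ExampleClient/1.0',
    'X-Mailer: ignored duplicate',
    'X-Source-Host: shared.example';
printf "%-15s : %s\n", $_->{header}, $_->{value} for fingerprint_headers($headers);
```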

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification

# A list (possibly empty) of hashrefs, in alphabetical header-name order:
(
    {
        type => HASHREF,
        keys => {
            header => {
                type  => SCALAR,
                regex => qr/^(?:user-agent|x-mailer|x-php-originating-script
                               |x-source|x-source-args|x-source-host)$/x,
            },
            value => {
                type => SCALAR,
                # Verbatim header value; may be any non-empty string.
            },
            note => {
                type  => SCALAR,
                # Fixed annotation string; one of the six strings
                # documented in the Algorithm section above.
            },
        },
    },
    # ... one hashref per recognised header present, alphabetical order
)

# Empty list when none of the six watched headers are present.

received_trail()

Returns the per-hop tracking data extracted from the Received: header chain: the IP address, envelope recipient address, and server-assigned session ID for each relay that handled the message.

When filing an abuse report with a transit ISP or relay operator, these are the identifiers their postmaster team needs to look up the specific SMTP session in their mail logs. Without the session ID or envelope recipient, an ISP typically cannot locate a single message among billions of log entries; with them, the lookup takes seconds.

The data is extracted synchronously during parse_email() with no network I/O. This method simply returns the pre-built list.

Usage

$analyser->parse_email($raw);
my @trail = $analyser->received_trail();

for my $hop (@trail) {
    printf "Hop IP : %s\n",  $hop->{ip}       // '(unknown)';
    printf "  For  : %s\n",  $hop->{for}       if defined $hop->{for};
    printf "  ID   : %s\n",  $hop->{id}        if defined $hop->{id};
    printf "  Raw  : %s\n",  $hop->{received};
    print  "\n";
}

# Build a list of session IDs to include in an abuse report
my @ids = map  { sprintf '%s: id %s', $_->{ip} // '(unknown)', $_->{id} }
          grep { defined $_->{id} }
          @trail;

# Find which ISP handled a particular relay IP
my ($hop) = grep { ($_->{ip} // '') eq '91.198.174.5' } @trail;
if ($hop) {
    print "Session ID at that relay: $hop->{id}\n" if defined $hop->{id};
}

Arguments

None. parse_email() must have been called first.

Returns

A list (not an arrayref) of hashrefs, one per Received: hop from which at least one of an IP address, an envelope recipient address, or a server session ID could be extracted, in oldest-first order (i.e. the first element is the outermost relay, the last element is the most recent hop before your own server). Returns an empty list if no Received: headers are present or none yielded any extractable data, or if parse_email() has not been called.

(
  { received => '...raw header...', ip => '1.2.3.4',
    for => 'victim@example.com', id => 'ABC123' },
  ...
)

Each hashref contains exactly four keys:

Side Effects

None. All data is collected during parse_email() and this method only returns the pre-collected list. No network I/O is performed.

Algorithm: extraction and ordering

During parse_email(), the Received: headers are walked in reverse message order (i.e. oldest hop first, which is the same order as originating_ip()'s chain walk). For each header:

  1. The IP address is extracted using the same four-pattern priority sequence documented in originating_ip().
  2. The envelope recipient is extracted with the pattern \bfor\s+<?([^\s]+@[\w.-]+\.[\w]+)>? (case-insensitive). The domain portion of the address must contain at least one dot; single-label names such as postmaster are not matched.
  3. The session ID is extracted with the pattern \bid\s+([\w.-]+) (case-insensitive), capturing the first token of word characters, dots, and hyphens that follows the keyword id.
  4. If none of the three fields can be extracted (all are undef), the hop is silently discarded and does not appear in the result list. This suppresses internal or synthetic hops that carry no useful tracking information.

The result list therefore contains only hops that carry at least one actionable piece of tracking data.
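
The steps above can be sketched as follows, using the documented recipient and session-ID patterns (with the @ escaped). The IP extraction is simplified to the first bracketed dotted quad rather than the full priority sequence, and parse_hop() and build_trail() are hypothetical names, not the module's internals.

```perl
use strict;
use warnings;

# Returns a hop hashref, or nothing when no field could be extracted.
sub parse_hop {
    my ($received) = @_;
    my ($ip)  = $received =~ /\[(\d{1,3}(?:\.\d{1,3}){3})\]/;     # simplified
    my ($for) = $received =~ /\bfor\s+<?([^\s]+\@[\w.-]+\.[\w]+)>?/i;
    my ($id)  = $received =~ /\bid\s+([\w.-]+)/i;
    return unless defined $ip || defined $for || defined $id;
    return { received => $received, ip => $ip, for => $for, id => $id };
}

# $received: arrayref in message order (newest first).
# Result is oldest hop first; empty hops drop out of the list.
sub build_trail {
    my ($received) = @_;
    return map { parse_hop($_) } reverse @$received;
}

my @trail = build_trail([
    'from localhost by localhost; Mon, 1 Jan 2024 00:00:02 +0000',
    'from sender.example (sender.example [203.0.113.9]) by relay.example.org '
      . 'with ESMTP id ABC123 for <victim@example.com>; Mon, 1 Jan 2024 00:00:00 +0000',
]);
printf "%s  for=%s  id=%s\n", $_->{ip} // '-', $_->{for} // '-', $_->{id} // '-'
    for @trail;
```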

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification

# A list (possibly empty) of hashrefs, oldest-hop first:
(
    {
        type => HASHREF,
        keys => {
            received => {
                type => SCALAR,
                # Complete unfolded Received: header value; always defined.
            },
            ip => {
                type     => SCALAR,
                optional => 1,  # present but may be undef
                regex    => qr/^\d{1,3}(?:\.\d{1,3}){3}$/,
            },
            for => {
                type     => SCALAR,
                optional => 1,  # present but may be undef
                # Fully-qualified email address: local@domain.tld
                regex    => qr/^[^\s@]+\@[\w.-]+\.[a-zA-Z]{2,}$/,
            },
            id => {
                type     => SCALAR,
                optional => 1,  # present but may be undef
                regex    => qr/^[\w.-]+$/,
            },
        },
    },
    # ... one hashref per hop with at least one extractable field,
    #     in oldest-first (outermost relay first) order
)

# Empty list when no Received: headers are present or none yielded
# any extractable data.

risk_assessment()

Evaluates the message against a set of heuristic checks and returns an overall risk level, a weighted numeric score, and a list of every specific red flag that contributed to the score.

The assessment covers five categories: originating IP characteristics, email authentication results, Date: header validity, identity and header consistency, and URL and domain properties. Each finding is assigned a severity, a machine-readable flag name, and a human-readable detail string.

The result is computed once and cached; subsequent calls on the same object return the same hashref without repeating any analysis. Calling risk_assessment() also implicitly triggers originating_ip(), embedded_urls(), and mailto_domains() if they have not already been called, performing all associated network I/O.

Usage

$analyser->parse_email($raw);
my $risk = $analyser->risk_assessment();

printf "Risk level : %s (score: %d)\n", $risk->{level}, $risk->{score};

for my $f (@{ $risk->{flags} }) {
    printf "  [%-6s] %s\n    %s\n",
        $f->{severity}, $f->{flag}, $f->{detail};
}

# Gate an automated report on HIGH level only
if ($risk->{level} eq 'HIGH') {
    send_abuse_report($analyser->abuse_report_text());
}

# Collect only HIGH and MEDIUM flags for a summary
my @significant = grep { $_->{severity} =~ /^(?:HIGH|MEDIUM)$/ }
                  @{ $risk->{flags} };

# Check for a specific flag
my ($flag) = grep { $_->{flag} eq 'recently_registered_domain' }
             @{ $risk->{flags} };
warn "Phishing domain suspected\n" if $flag;

# INFO level means no actionable red flags
if ($risk->{level} eq 'INFO') {
    print "No significant red flags detected.\n";
}

Arguments

None. parse_email() must have been called first.

Returns

Returns a hashref with an overall risk level, a weighted numeric score, and a list of the specific red flags found in the message:

{
    level => 'HIGH',          # HIGH | MEDIUM | LOW | INFO
    score => 7,               # raw weighted score
    flags => [
        { severity => 'HIGH',   flag => 'recently_registered_domain',
          detail => 'firmluminary.com registered 2025-09-01 (< 180 days ago)' },
        { severity => 'MEDIUM', flag => 'residential_sending_ip',
          detail => 'rDNS 120-88-161-249.tpgi.com.au looks like a broadband line' },
        { severity => 'MEDIUM', flag => 'url_shortener',
          detail => 'bit.ly used - real destination hidden' },
        ...
    ],
}

The hashref has exactly three keys, all always present: level, score, and flags.

Side Effects

The first call triggers originating_ip(), embedded_urls(), and mailto_domains() if they have not already run on the current message. Each of those methods may perform network I/O as documented in their own entries. Specifically:

All results are cached. Subsequent calls to risk_assessment() on the same object return the cached hashref immediately. The cache is invalidated by parse_email().

Algorithm: flags and scoring

The following flags may be raised. They are evaluated in five groups, in the order shown. The same flag name is never raised more than once per message.

Group 1 -- Originating IP (requires originating_ip() to return a result):

Group 2 -- Email authentication (from Authentication-Results: header):

Group 3 -- Date: header:

Group 4 -- Header identity and consistency:

Group 5 -- URLs and domains (from embedded_urls() and mailto_domains()):

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification
{
    type => HASHREF,
    keys => {
        level => {
            type  => SCALAR,
            regex => qr/^(?:HIGH|MEDIUM|LOW|INFO)$/,
        },
        score => {
            type  => SCALAR,
            regex => qr/^\d+$/,  # non-negative integer
        },
        flags => {
            type => ARRAYREF,
            # Reference to a list (possibly empty) of hashrefs:
            # [
            #   {
            #     severity => qr/^(?:HIGH|MEDIUM|LOW|INFO)$/,
            #     flag     => qr/^[a-z][a-z0-9_]+$/,
            #     detail   => SCALAR,  # human-readable string
            #   },
            #   ...
            # ]
        },
    },
}
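
Because the shape of the result is fixed, downstream code can summarise it without calling the module again. The sketch below tallies flags by severity. tally_by_severity() is a hypothetical helper, not part of the module, and the sample hashref is hand-written to match the specification above.

```perl
use strict;
use warnings;

# Sample data in the documented shape; the values are illustrative only.
my $risk = {
    level => 'HIGH',
    score => 7,
    flags => [
        { severity => 'HIGH',   flag => 'recently_registered_domain', detail => 'example detail' },
        { severity => 'MEDIUM', flag => 'residential_sending_ip',     detail => 'example detail' },
        { severity => 'MEDIUM', flag => 'url_shortener',              detail => 'example detail' },
    ],
};

# tally_by_severity() is a hypothetical helper, not part of the module.
# Returns a hashref mapping each documented severity to its flag count.
sub tally_by_severity {
    my ($r) = @_;
    my %count = map { $_ => 0 } qw(HIGH MEDIUM LOW INFO);
    $count{ $_->{severity} }++ for @{ $r->{flags} };
    return \%count;
}

my $tally = tally_by_severity($risk);
printf "%-6s %d\n", $_, $tally->{$_} for qw(HIGH MEDIUM LOW INFO);
```

Pre-seeding the counts with zero means every severity appears in the summary, even when no flag of that severity was raised.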

abuse_report_text()

Produces a compact, plain-text string intended to be sent as the body of an abuse report email to an ISP or hosting provider. It summarises the risk level, lists every red flag with its detail, identifies the originating IP and its network owner, lists the abuse contacts, and appends the complete message headers so the recipient can trace the session in their own logs.

The message body is intentionally omitted to keep the report concise. Headers are sufficient for an ISP to locate the relevant mail session; the body adds bulk without aiding the investigation.

This method is the companion to abuse_contacts(): call abuse_contacts() to obtain the addresses to send the report to, and abuse_report_text() to obtain the text to send. Use report() instead when you want a comprehensive analyst-facing document rather than a send-ready ISP report.

Usage

$analyser->parse_email($raw);

my $text     = $analyser->abuse_report_text();
my @contacts = $analyser->abuse_contacts();

# Guard the dereference in case originating_ip() returned no result
my $origin = $analyser->originating_ip();
my $ip     = ($origin && $origin->{ip}) // 'unknown';

for my $c (@contacts) {
    send_email(
        to      => $c->{address},
        subject => "Abuse report: $ip",
        body    => $text,
    );
}

# Print to stdout for manual review before sending
print $text;

# Write to file for a ticketing system
open my $fh, '>', 'abuse_report.txt' or die $!;
print $fh $text;
close $fh;

Arguments

None. parse_email() must have been called first.

Returns

A plain scalar string containing the report text. The string is newline-terminated and uses Unix line endings (\n) throughout. The string is never empty; it always contains at least the boilerplate introduction and the risk-level line, even if no red flags were found.

The report is structured as follows, in order:

ORIGINAL MESSAGE HEADERS:

Side Effects

Calls risk_assessment(), originating_ip(), and abuse_contacts() if they have not already run, which in turn may perform network I/O as documented in those methods. All results are cached; the text is not itself cached, but re-computing it is cheap since all the underlying data is already cached.

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification
{
    type  => SCALAR,
    # Non-empty plain-text string, newline-terminated.
    # Always defined; never undef.
    # Line endings: Unix LF (\n) only.
    # Minimum content: introduction + risk-level line.
}

abuse_contacts()

Collates the complete set of parties that should receive an abuse report for this message: the ISP that owns the sending IP, the operators of every URL host, the web, mail, and DNS hosts of every contact domain, each domain's registrar, the webmail or ESP account provider identified from key headers, the DKIM signing organisation, and the ESP identified via the List-Unsubscribe: header.

For each party the method produces the role description, the abuse email address, a supporting note, and the source of the information. Addresses are deduplicated globally: if the same address is discovered through multiple routes (e.g. Google as both the sending ISP and the owner of a blogspot.com URL in the message body), it appears only once. The role string for that entry is the combined description of all routes that found it, joined by " and ", and the roles key holds the individual role strings as an arrayref.

This method is designed to be used together with abuse_report_text(): iterate over the returned contacts to obtain the list of addresses, and send the text from abuse_report_text() to each one.

Usage

$analyser->parse_email($raw);
my @contacts = $analyser->abuse_contacts();

for my $c (@contacts) {
    printf "Role    : %s\n", $c->{role};
    printf "Send to : %s\n", $c->{address};
    printf "Note    : %s\n", $c->{note}  if $c->{note};
    printf "Source  : %s\n", $c->{via};
    print  "\n";
}

# Collect addresses for sending
my @addresses = map { $_->{address} } @contacts;

# Filter to WHOIS-discovered contacts only
my @whois_contacts = grep { $_->{via} =~ /whois/ } @contacts;

# Check whether any registrar abuse contacts were found
my @registrar = grep { $_->{role} =~ /registrar/ } @contacts;

Arguments

None. parse_email() must have been called first.

Returns

A list (not an arrayref) of hashrefs, one per unique abuse contact address, in the order they were first discovered. Returns an empty list if no actionable abuse contacts can be determined, or if parse_email() has not been called.

Each hashref has the following form:

{
    role    => 'Sending ISP',          # human-readable role
    address => 'abuse@senderisp.example',
    note    => 'IP block 120.88.0.0/14 owner',
    via     => 'ip-whois',             # ip-whois | domain-whois | provider-table | rdap
}

Roles produced (in order):

Sending ISP            - network owner of the originating IP
URL host               - network owner of each unique web-server IP
Mail host (MX)         - network owner of the domain's MX record IP
DNS host (NS)          - network owner of the authoritative NS IP
Domain registrar       - registrar abuse contact from domain WHOIS
Account provider       - e.g. Gmail / Outlook for the From:/Sender: account
DKIM signer            - the organisation whose key signed the message
ESP / bulk sender      - identified via List-Unsubscribe: domain

Addresses are deduplicated so the same address never appears twice, even if it is discovered through multiple routes.

Each hashref contains the following keys, all always present:

Side Effects

Triggers originating_ip(), embedded_urls(), and mailto_domains() if they have not already run on the current message, performing all associated network I/O as documented in those methods. Additionally consults the built-in provider table and the cached authentication results; neither requires network I/O.

The result is not independently cached. Each call recomputes the contact list from the cached results of the underlying methods. Because those results are cached, subsequent calls are fast (no network I/O), but they do re-execute the collation and deduplication logic.

Algorithm: discovery routes

Contacts are discovered through six routes, applied in order. Deduplication is global across all routes: if an address is found by more than one route, a single entry is kept and the role strings from every route that found it are accumulated into roles and joined into role. An entry is suppressed entirely if its address is empty, does not contain an @ sign, or is the sentinel '(unknown)'.
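
The deduplication rules described above can be sketched in plain Perl. This is an illustration of the stated rules, not the module's own code: merge_contacts() is a hypothetical helper, the sample candidates are made up, and keeping the note and via values of the first route that found an address is an assumption.

```perl
use strict;
use warnings;

# merge_contacts() is a hypothetical helper that mirrors the rules described
# above; it is not the module's own code.
sub merge_contacts {
    my @candidates = @_;
    my (%seen, @ordered);

    for my $c (@candidates) {
        my $addr = lc($c->{address} // '');

        # Suppress entries with no usable address
        next if $addr eq '' || $addr !~ /\@/ || $addr eq '(unknown)';

        if (my $existing = $seen{$addr}) {
            # Same address found by another route: accumulate the role
            push @{ $existing->{roles} }, $c->{role};
            $existing->{role} = join ' and ', @{ $existing->{roles} };
            next;
        }

        # First sighting: note and via come from this route (assumption)
        my $entry = { %$c, address => $addr, roles => [ $c->{role} ] };
        $seen{$addr} = $entry;
        push @ordered, $entry;    # preserve first-discovered order
    }
    return @ordered;
}

# Illustrative candidates, as the discovery routes might produce them
my @contacts = merge_contacts(
    { role => 'Sending ISP',      address => 'Abuse@Example.net', note => '', via => 'ip-whois' },
    { role => 'URL host',         address => 'abuse@example.net', note => '', via => 'ip-whois' },
    { role => 'Domain registrar', address => '(unknown)',         note => '', via => 'domain-whois' },
    { role => 'DKIM signer',      address => 'abuse@example.org', note => '', via => 'provider-table' },
);

printf "%s <%s>\n", $_->{role}, $_->{address} for @contacts;
```

Keeping a hash for lookup alongside an array for ordering gives constant-time duplicate detection while preserving the order in which addresses were first found.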

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification

# A list (possibly empty) of hashrefs, in discovery order:
(
    {
        type => HASHREF,
        keys => {
            role => {
                type => SCALAR,
                # Human-readable role description; always defined,
                # may contain inline domain/IP/header values.
            },
            address => {
                type  => SCALAR,
                regex => qr/^[^\s@]+\@[^\s@]+$/,
                # Lower-cased email address; unique across the list.
            },
            note => {
                type => SCALAR,
                # Supporting detail; always defined, may be empty string.
            },
            via => {
                type  => SCALAR,
                regex => qr/^(?:provider-table|ip-whois|domain-whois)$/,
            },
        },
    },
    # ... one hashref per unique address, in first-discovered order
)

# Empty list when no actionable abuse contacts can be determined.

report()

Returns a formatted plain-text abuse report.

Produces a comprehensive, analyst-facing plain-text report covering all findings from every analysis method. It is the single-document summary of everything the module knows about a message: envelope fields, risk assessment, originating host, sending software, received chain tracking IDs, embedded URLs grouped by hostname, contact domain intelligence, and the recommended abuse contacts.

Use report() when you want a human-readable document for review, logging, or a ticketing system. Use abuse_report_text() when you want a compact string to transmit to an ISP abuse desk.

Usage

$analyser->parse_email($raw);
my $text = $analyser->report();
print $text;

# Write to a file
open my $fh, '>', 'report.txt' or die $!;
print $fh $analyser->report();
close $fh;

# Log the risk level line from the report
my ($level_line) = $analyser->report() =~ /(\[ RISK ASSESSMENT: [^\]]+\])/;
$logger->info($level_line) if defined $level_line;

# Check idempotency -- safe to call multiple times
my $r1 = $analyser->report();
my $r2 = $analyser->report();
# $r1 eq $r2 is always true for the same parsed message

Arguments

None. parse_email() must have been called first.

Returns

A plain scalar string containing the full report, newline-terminated, using Unix line endings (\n) throughout. The string is never empty; it always contains at least the header banner and envelope summary section.

The report is structured as nine sections separated by blank lines, in this fixed order:

========================================================================
Email::Abuse::Investigator Report (vX.XX)

The report ends with a closing row of 72 equals signs.

Side Effects

Calls risk_assessment(), originating_ip(), sending_software(), received_trail(), embedded_urls(), mailto_domains(), and abuse_contacts() if they have not already run on the current message, performing all associated network I/O as documented in those methods. All underlying results are cached; the report text itself is not cached, but re-computation is inexpensive since the data is already available.

Notes

API Specification

Input

# Params::Validate::Strict compatible specification
# No arguments.
[]

Output

# Return::Set compatible specification
{
    type  => SCALAR,
    # Non-empty plain-text string, newline-terminated (\n).
    # Always defined; never undef.
    # Line endings: Unix LF (\n) only.
    # Structure: nine fixed sections in the order documented above,
    #            separated by blank lines, framed by 72-character
    #            equals-sign banners.
}

AUTHOR

Nigel Horne, <njh at nigelhorne.com>

ALGORITHM: DOMAIN INTELLIGENCE PIPELINE

For each unique non-infrastructure domain found in the email, the module runs the following pipeline:

Domain name
    |
    +-- A record  --> web hosting IP  --> RDAP --> org + abuse contact
    |
    +-- MX record --> mail server hostname --> A --> RDAP --> org + abuse
    |
    +-- NS record --> nameserver hostname  --> A --> RDAP --> org + abuse
    |
    +-- WHOIS (TLD whois server via IANA referral)
           +-- Registrar name + abuse contact
           +-- Creation date  (-> recently-registered flag if < 180 days)
           +-- Expiry date    (-> expires-soon or expired flags)

Domains are collected from:

From:/Reply-To:/Sender:/Return-Path: headers
DKIM-Signature: d=  (signing domain)
List-Unsubscribe:   (ESP / bulk sender domain)
Message-ID:         (often reveals real sending platform)
mailto: links and bare addresses in the body

WHY WEB HOSTING != MAIL HOSTING != DNS HOSTING

A fraudster registering sminvestmentsupplychain.com might:

Each of these parties has an abuse contact, and each can independently take action to disrupt the spam/phishing operation. The module reports all of them separately.

RECENTLY-REGISTERED FLAG

Phishing domains are very commonly registered hours or days before the spam run. The module flags any domain whose WHOIS creation date is less than 180 days ago with recently_registered => 1.
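
The 180-day test itself is simple date arithmetic. The sketch below uses the core Time::Piece module and assumes the WHOIS creation date has already been normalised to YYYY-MM-DD, which real WHOIS output often is not. is_recently_registered() is a hypothetical helper, not the module's implementation.

```perl
use strict;
use warnings;
use Time::Piece;

# is_recently_registered() is a hypothetical helper, not the module's code.
# $created is a YYYY-MM-DD string; $now is an epoch time (defaults to now).
# Returns 1 if the domain is less than 180 days old, otherwise 0.
sub is_recently_registered {
    my ($created, $now) = @_;
    $now //= time;
    my $created_epoch = Time::Piece->strptime($created, '%Y-%m-%d')->epoch;
    my $age_days      = ($now - $created_epoch) / 86_400;
    return $age_days < 180 ? 1 : 0;
}

# Fixed reference date so the result is reproducible
my $ref = Time::Piece->strptime('2026-01-15', '%Y-%m-%d')->epoch;

print is_recently_registered('2025-09-01', $ref), "\n";   # 136 days old
print is_recently_registered('2024-01-01', $ref), "\n";   # more than two years old
```

Passing the reference time as a parameter keeps the check testable. Note that a creation date in the future yields a negative age and is therefore treated as recent in this sketch.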

SEE ALSO

Net::DNS, LWP::UserAgent, HTML::LinkExtor, MIME::QuotedPrint, ARIN RDAP

REPOSITORY

https://github.com/nigelhorne/Email-Abuse-Investigator

SUPPORT

This module is provided as-is without any warranty.

Please report any bugs or feature requests to bug-email-abuse-investigator at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Email-Abuse-Investigator. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

You can find documentation for this module with the perldoc command.

perldoc Email::Abuse::Investigator

You can also look for information at:

LICENCE AND COPYRIGHT

Copyright 2026 Nigel Horne.

Usage is subject to licence terms.

The licence terms of this software are as follows: