NAME

Algorithm::Classifier::IsolationForest - unsupervised anomaly detection via Isolation Forest or Extended Isolation Forest

SYNOPSIS

use Algorithm::Classifier::IsolationForest;

my @data = ([0.1, -0.2], [0.0, 0.1], [5.0, 6.0], ...);

# Classic, axis-parallel Isolation Forest
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees     => 100,
    sample_size => 256,
    seed        => 42,
);
$iforest->fit(\@data);

my $scores = $iforest->score_samples(\@data);  # arrayref, each in (0,1]
my $flags  = $iforest->predict(\@data, 0.6);    # arrayref of 0/1

# Save and reload
$iforest->save('model.json');
my $reloaded = Algorithm::Classifier::IsolationForest->load('model.json');

# Extended Isolation Forest (oblique hyperplane splits)
my $eif = Algorithm::Classifier::IsolationForest->new(
    mode => 'extended',
    seed => 42,
);
$eif->fit(\@data);

# Parallel training (fork-based, Unix-like platforms): build the
# n_trees across several worker processes.
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees      => 200,
    sample_size  => 256,
    seed         => 42,
    parallel_fit => 4,        # 4 forked workers
);
$iforest->fit(\@data);

# Pre-pack a dataset to skip the per-call input-walk cost when the
# same data gets scored many times (interactive tuning, dashboards).
my $packed = $iforest->pack_data(\@data);
my $scores = $iforest->score_samples($packed);
my $flags  = $iforest->predict($packed, 0.6);

# Get scores and labels as two flat arrayrefs in one call -- cheaper
# than score_predict_samples when you don't need the paired shape.
my ($s, $l) = $iforest->score_predict_split(\@data, 0.6);

DESCRIPTION

Isolation Forest (Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua, 2008) detects anomalies by random partitioning rather than by modelling normal points. Each tree repeatedly splits the data. Points that get isolated after only a few splits are likely anomalies. The score is the average isolation depth across many trees, normalised so values approach 1 for anomalies and stay below 0.5 for normal points.

In extended mode the module implements the Extended Isolation Forest variant. Each split is a random hyperplane instead of an axis-aligned cut, which removes the rectangular, axis-aligned bias in the score field and tends to help on elongated or multi-modal data.

psi referenced below is ψ or the pitchfork math symbol referenced in the paper, Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

... or max samples.

https://www.researchgate.net/publication/224384174_Isolation_Forest

NATIVE ACCELERATION (Inline::C and OpenMP)

Both the scoring hot path (score_samples, predict, path_lengths, score_predict_samples, and score_predict_split) and the fit() tree builder are automatically accelerated through Inline::C when it is installed and a working C compiler is reachable. If the toolchain also accepts -fopenmp and can link against libgomp, the per-point tree walk runs in parallel across all available CPU cores using OpenMP, and the extended-mode oblique dot product is vectorised via #pragma omp simd -- which on modern x86 compilers translates to an unrolled FMA / AVX gather chain that's substantially faster for high-feature-count extended models.

fit()'s tree builder (subsampling plus the recursive axis/oblique split search) runs in C the same way when use_c is on, replacing the per-node Perl arrayref copying with plain int-array partitioning -- typically an order of magnitude faster, and dramatically more so at higher feature counts where the pure-Perl per-cell loop dominates. Its random draws go through the same generator rand()/srand() use internally, in the same call order the pure-Perl builder uses, so a given seed produces bit-identical trees whether use_c is on or off -- switching backends changes only how fast the model is built, not the model itself.

By default this C builder is single-threaded per call, because Perl's RNG state isn't safe to share across OpenMP threads. Two ways to scale fit() across cores are available (see below for why they don't compose):

parallel_fit forks N worker processes, each building its share of the trees with the (still single-threaded) C builder. Fixed IPC/serialisation overhead per worker means this can cost more than it saves once a fit already completes in milliseconds; it's most useful once a single-process fit is large enough that the fork/Storable overhead is small relative to the work being split.
use_openmp_fit builds trees across OpenMP threads within a single process (one tree per thread), using a separate, thread-safe PRNG seeded per tree index instead of Perl's rand(). This means trees built with use_openmp_fit are not bit-identical to the default use_c path for the same seed -- but a fixed seed and n_trees still reproduce the same trees regardless of OMP_NUM_THREADS or how OpenMP schedules the work. It's off by default (unlike use_c/use_openmp, which only ever change speed, this changes which trees get built) and only takes effect when use_c is also on and OpenMP is linked in.

These two do NOT compose, despite both existing to parallelise fit(). A process that has run any OpenMP region -- including plain score_samples()/predict() with the default use_openmp -- and then fork()s (as parallel_fit does) hands each child a copy of libgomp's thread pool whose worker threads did not survive the fork. A child that then starts its own #pragma omp parallel region (as use_openmp_fit would) tries to reuse that now-invalid pool and hangs. This is a general limitation of combining fork() with OpenMP, not something fixable from Perl, so parallel_fit's forked workers always use the single-threaded C builder regardless of use_openmp_fit -- setting both just means parallel_fit wins and use_openmp_fit has no effect for that call.

Detection happens once when the module is loaded; the compiled artefact is cached under _Inline/ and reused on subsequent runs. Four package variables report what the build picked up:

$Algorithm::Classifier::IsolationForest::HAS_C       # 0/1
$Algorithm::Classifier::IsolationForest::HAS_OPENMP  # 0/1
$Algorithm::Classifier::IsolationForest::HAS_SIMD    # 0/1 (OpenMP 4.0+)
$Algorithm::Classifier::IsolationForest::OPT_LEVEL   # e.g. "-O3 -march=native", '' if HAS_C is 0

Neither dependency is required. Without Inline::C the module falls back to a pure-Perl implementation that produces identical results, just slower; without OpenMP the C backend runs single-threaded.

The bundled iforest accel subcommand performs a tiny fit + score and prints which backend is active (including the build flags below), which is the recommended way to verify the build picked up the optional dependencies on a given machine.

Tuning the C build

These environment variables are read once, the first time the module is loaded, so they must be set before that -- e.g. in the shell before running a script, not via %ENV inside the script itself.

IF_NO_C=1 -- skip attempting to build the C backend entirely. Equivalent to constructing every instance with use_c => 0, but without needing to touch every call site; useful for a clean pure-Perl timing baseline, or to avoid the compile attempt's overhead/noise on a host known to lack a C compiler (the attempt already fails gracefully without this, so it's a convenience, not a correctness fix).
IF_OPT=-O2 (or -O0/-O1/-Os/-Og/-Oz) -- override the default -O3, e.g. to shorten build time while iterating, or work around a miscompile on an unusual toolchain. Invalid values are ignored with a warning rather than passed through, since this string reaches a compiler command line.
IF_ARCH=<value> -- adds -march=<value> so the compiler can target specific instruction-set extensions (AVX2 gather + FMA, etc.) for the extended-mode oblique dot product and the fit-time min/max scan's #pragma omp simd loops. Accepts values like x86-64-v3, skylake, or znver3 -- whatever your compiler's -march= accepts. Also validated (a restricted character set, not passed through as-is) for the same reason as IF_OPT.
IF_NATIVE=1 -- shorthand for IF_ARCH=native; ignored if IF_ARCH is also set. Prefer a specific IF_ARCH value over this on a machine you don't control exclusively (a shared build host, a container base image): blanket -march=native pulls in whatever instruction sets the build host happens to have, including AVX-512 on some Intel CPUs -- which is known to trigger clock throttling under sustained heavy use and can make throughput worse than a conservative target like x86-64-v3 (AVX2, no AVX-512). If in doubt, benchmark both before committing to one.

Whichever of these are used, the cached artefact under _Inline/ is pinned to that build's instruction set -- delete _Inline/ (or use a separate one per host) if the directory is shared across machines with different CPUs, or a stale binary built for a narrower instruction set than the current host will simply keep being reused.

Tuning the OpenMP runtime

These are standard OpenMP environment variables libgomp already reads at run time (set before running your script, no module-specific handling needed) -- listed here because they matter most for exactly the workloads this module has: score_all_xs's per-point parallel loop and use_openmp_fit's per-tree parallel loop.

OMP_NUM_THREADS=N -- caps how many threads a parallel region uses. Useful to leave headroom for other work sharing the machine, or to pin down use_openmp_fit reproducibility checks (see its docs above: results don't depend on this, but it's a natural thing to vary when confirming that).
OMP_PROC_BIND=close / OMP_PLACES=cores -- on multi-socket or otherwise NUMA machines, pins each thread to a core near where its data already lives instead of letting the OS scheduler migrate threads across sockets mid-run. Both score_all_xs (each thread scans its own slice of the packed query buffer) and use_openmp_fit (each thread builds one tree from packed training data) benefit from this when the input is large enough to not fit comfortably in one socket's cache.

These cost nothing to try -- unlike IF_ARCH/IF_NATIVE, they're read fresh every run, not baked into a cached binary, so there's no downside to experimenting per invocation.

GENERAL METHODS

new(%args)

Inits the object.

- n_trees :: number of isolation trees in the ensemble
    default :: 100

- sample_size :: sub-sample size used to build each tree... max samples
    default :: 256

 - max_depth :: per-tree height limit... if not defined is set to ceil(log2(psi))
     default :: undef

 - seed :: optional integer to seed srand with for reproducible trees...
         see perldoc -f srand for more info. This number is processed via abs(int()).
     default :: undef

 - mode :: if it should be IF or EIF
      axis :: classic axis-parallel splits (IF)
      extended :: oblique hyperplane splits (EIF)
    default :: axis

 - extension_level :: extended mode only... how many features take partin each
         split. 0 behaves like a single-feature (axis) cut; the
         maximum (n_features - 1) uses every varying feature. undef
         => maximum. Clamped to [0, n_features - 1] at fit time.

  - contamination :: expected fraction of anomalies, in (0, 0.5]. When given,
        fit() learns a score threshold that flags this fraction of
        the training set, and predict() uses it by default. undef
        => no learned threshold (predict() falls back to 0.5).
      default :: undef

  - missing :: how fit() treats undef (missing) feature cells. Scoring always
        tolerates undef regardless of this setting; it governs fit().
          die    :: croak from fit() if the training data contains any
                    undef cell. Scoring still maps undef to 0 (the
                    long-standing behaviour), so a model fitted on clean
                    data can still score rows with missing features.
          zero   :: treat a missing cell as the value 0, at fit and score.
          impute :: replace a missing cell with the per-feature mean (or
                    median, see impute_with) learned from the present
                    values at fit time. The fill vector is stored on the
                    model and reused for scoring and persistence.
          nan    :: build feature ranges from present values only and route
                    a point missing the split feature to the right child,
                    consistently at fit and score time. Missingness is
                    preserved as signal rather than filled.
      default :: die

  - impute_with :: 'mean' or 'median'; the statistic used to compute the
        per-feature fill under missing => 'impute'. Ignored otherwise.
      default :: mean

  - parallel_fit :: positive integer N => build the trees across N forked
        worker processes during fit(). Each worker gets a derived seed
        (parent seed + worker_id * 1009) so the parallel fit is
        reproducible across runs at fixed worker count -- but the trees
        produced are NOT bit-identical to a serial fit with the same
        seed, because the RNG draws happen in a different order.
        Inference is unaffected. Falls back silently to serial on
        platforms without a real fork() (e.g. Windows without Cygwin).
      default :: undef (serial)

  - use_c :: boolean, override whether the Inline::C backend is used for
        both scoring and fit()'s tree builder.  When false the instance
        falls back to pure Perl for both even if the C backend compiled
        successfully.  When true (or unset) the C backend is used if
        available ($HAS_C).  fit() with use_c on produces bit-identical
        trees to use_c off for the same seed -- only build speed differs.
      default :: $HAS_C

  - use_openmp :: boolean, override whether OpenMP parallel scoring is
        used inside score_all_xs().  When false the C tree walk runs
        single-threaded even if OpenMP was linked in.  Ignored when
        use_c is false (pure Perl has no OpenMP path).
      default :: $HAS_OPENMP

  - use_openmp_fit :: boolean, build fit()'s trees across OpenMP threads
        (one tree per thread) instead of the single-threaded C builder.
        Opt-in and off by default: unlike use_c/use_openmp, this changes
        which trees get built. Perl's RNG isn't safe to call from
        multiple OS threads sharing one interpreter, so this path seeds
        an independent PRNG per tree from the tree index rather than
        Drand01() -- trees differ from the use_c (single-threaded)
        and pure-Perl paths even with the same seed, though a fixed
        seed and n_trees still reproduce the same trees regardless of
        OMP_NUM_THREADS or scheduling. Does NOT compose with
        parallel_fit: a forked child starting its own OpenMP region
        after the parent process has used OpenMP for anything can
        hang (a general fork()+libgomp limitation), so parallel_fit's
        workers always use the single-threaded C builder regardless
        of this setting -- setting both just means parallel_fit wins.
        Ignored (clamped to 0) when use_c is false or OpenMP isn't
        linked in.
      default :: 0

Note: log2 under Perl is as below...

log($psi) / log(2)

decision_threshold

The score cutoff predict uses by default; undef unless contamination was set.

fit

Trains the model on the specified data.

The data taken is an array of arrays. Each sub-array is one sample and must contain one or more numeric features. All samples must have the same number of features. There is no upper limit on dimensionality.

@training_data = (
    [ 3, 5 ],
    [ 2.3, 1 ],
    [ 5, 9 ],
    ...
);

# Three-feature example
@training_data = (
    [ 1.0, 2.0, 3.0 ],
    [ 1.1, 1.9, 3.1 ],
    ...
);

Below shows an example of building a gaussian cluster and using that for training.

# so it is reproducible
srand(7);

# build a gaussian cluster and add a handful of outliers...

use constant PI => 3.14159265358979;
sub gaussian {
    my ($mu, $sigma) = @_;
    my $u1 = rand() || 1e-12;
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
    return $mu + $sigma * $z;
}

# add some normal items
for (1 .. 500) {
    push @data,  [ gaussian(0, 1), gaussian(0, 1) ];
    push @truth, 0;
}
# add some outliers
for (1 .. 20) {
    my $angle  = rand() * 2 * PI;
    my $radius = 5 + rand() * 3;             # distance 5..8 from the origin
    push @data,  [ $radius * cos($angle), $radius * sin($angle) ];
    push @truth, 1;
}

$iforest->fit(\@data);

pack_data(\@data)

Returns an opaque, blessed wrapper around the input dataset that the scoring methods can use directly, skipping the per-call work of walking the arrayref-of-arrayrefs and converting each cell into a double. At high feature counts this is a meaningful win when the same dataset is scored repeatedly (e.g. interactive threshold tuning, dashboards, plotting that updates as parameters change).

Requires the Inline::C backend; croaks if use_c is false.

my $packed = $forest->pack_data(\@data);

# Now any of these accept either an arrayref or the packed wrapper:
my $scores = $forest->score_samples($packed);
my $flags  = $forest->predict($packed, 0.6);
my ($s, $l) = $forest->score_predict_split($packed);

The wrapper has n_pts and n_feats accessors for introspection. The feature count is matched against the model on every call; passing a packed dataset built for a different feature count is a fatal error.

path_lengths(\@data)

Returns an arrayref of the mean isolation depth per sample, for inspection.

my $lengths = $forest->path_lengths(\@data);

print "x, y, length\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$lengths->[$int]."\n";

    $int++;
}

predict(\@data, $threshold)

Returns an arrayref of 0/1 labels for the specified data.

If threshold is not specified it uses the contamination-learned cutoff (if fit was called with contamination), otherwise 0.5.

my $results = $forest->predict(\@data, $threshold);

print "x, y, result\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$results->[$int]."\n";

    $int++;
}

score_samples(\@data)

Returns an arrayref of anomaly scores, between 0 and 1.

Scores near 1 are strong anomalies (isolated quickly).

Scores well below 0.5 are normal.

Scores ~0.5 means the points are hard to tell apart.

my $scores = $forest->score_samples(\@data);

print "x, y, score\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$scores->[$int]."\n";

    $int++;
}

score_predict_samples

Returns an array ref of arrays. First value of each sub array is the score with the second being 0/1 for if it is a anomaly or not.

$threshold defaults the same way as in predict.

my $results = $forest->score_predict_samples(\@data, $threshold);

print "x, y, score, result\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$results->[$int][0].', '.$results->[$int][1]."\n";

    $int++;
}

score_predict_split(\@data, $threshold)

Same data as "score_predict_samples" but returned as two flat arrayrefs instead of an arrayref-of-pairs. Allocates roughly half as many Perl SVs per point (no inner AV, no RV per row), so it is meaningfully faster when both scores and labels are wanted but the paired shape is not.

In list context returns ($scores_aref, $labels_aref).

my ($scores, $labels) = $forest->score_predict_split(\@data);

for my $i (0 .. $#$scores) {
    printf "%s -> score %.4f, label %d\n",
        join(',', @{ $data[$i] }), $scores->[$i], $labels->[$i];
}

$threshold defaults to the contamination-learned cutoff (if fit was called with contamination) or 0.5.

MODEL SAVE/LOAD METHODS

to_json

Returns a JSON representation of the model.

Requires fit to have been called.

my $json = $iforest->to_json;

from_json($json)

Init the object from the model in the specified JSON string.

my $iforest = Algorithm::Classifier::IsolationForest->from_json($json);

save($path)

Saves the model to the specified path.

$iforest->save($path);

load($path)

Init the object from the model in the specified file.

my $iforest = Algorithm::Classifier::IsolationForest->load($path);

REFERENCES

Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.