NAME

Algorithm::Classifier::IsolationForest - unsupervised anomaly detection via Isolation Forest or Extended Isolation Forest

SYNOPSIS

use Algorithm::Classifier::IsolationForest;

my @data = ([0.1, -0.2], [0.0, 0.1], [5.0, 6.0], ...);

# Classic, axis-parallel Isolation Forest
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees     => 100,
    sample_size => 256,
    seed        => 42,
);
$iforest->fit(\@data);

my $scores = $iforest->score_samples(\@data);  # arrayref, each in (0,1]
my $flags  = $iforest->predict(\@data, 0.6);    # arrayref of 0/1

# Save and reload
$iforest->save('model.json');
my $reloaded = Algorithm::Classifier::IsolationForest->load('model.json');

# Extended Isolation Forest (oblique hyperplane splits)
my $eif = Algorithm::Classifier::IsolationForest->new(
    mode => 'extended',
    seed => 42,
);
$eif->fit(\@data);

# Parallel training (fork-based, Unix-like platforms): build the
# n_trees across several worker processes.
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees      => 200,
    sample_size  => 256,
    seed         => 42,
    parallel_fit => 4,        # 4 forked workers
);
$iforest->fit(\@data);

# Pre-pack a dataset to skip the per-call input-walk cost when the
# same data gets scored many times (interactive tuning, dashboards).
my $packed = $iforest->pack_data(\@data);
my $scores = $iforest->score_samples($packed);
my $flags  = $iforest->predict($packed, 0.6);

# Get scores and labels as two flat arrayrefs in one call -- cheaper
# than score_predict_samples when you don't need the paired shape.
my ($s, $l) = $iforest->score_predict_split(\@data, 0.6);

DESCRIPTION

Isolation Forest (Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua, 2008) detects anomalies by random partitioning rather than by modelling normal points. Each tree repeatedly splits the data. Points that get isolated after only a few splits are likely anomalies. The score is the average isolation depth across many trees, normalised so values approach 1 for anomalies and stay below 0.5 for normal points.
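The normalisation mentioned above can be sketched in a few lines of Perl. This is an illustration based on the formula in the 2008 paper, s = 2 ** (-E[h] / c(psi)), not a copy of the module's internals: c_factor and anomaly_score are hypothetical helper names, and the harmonic number is approximated with ln(n) + Euler's constant, which may differ from what the module does.

```perl
use strict;
use warnings;

use constant EULER_GAMMA => 0.5772156649;

# c(psi): average path length of an unsuccessful binary-search-tree
# lookup over psi points, used to normalise the isolation depth.
sub c_factor {
    my ($psi) = @_;
    return 0 if $psi < 2;
    return 1 if $psi == 2;
    my $harmonic = log( $psi - 1 ) + EULER_GAMMA;    # approximates H(psi - 1)
    return 2 * $harmonic - 2 * ( $psi - 1 ) / $psi;
}

# s = 2 ** (-mean_depth / c(psi)); only meaningful for psi >= 2.
sub anomaly_score {
    my ( $mean_depth, $psi ) = @_;
    return 2**( -$mean_depth / c_factor($psi) );
}

my $psi = 256;
printf "c(%d) = %.3f\n", $psi, c_factor($psi);
printf "mean depth %5.2f -> score %.3f\n", $_, anomaly_score( $_, $psi )
    for 2, 6, c_factor($psi), 14;
```

A mean depth equal to c(psi) maps to exactly 0.5; shallower averages push the score towards 1 and deeper ones push it below 0.5.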

In extended mode the module implements the Extended Isolation Forest variant. Each split is a random hyperplane instead of an axis-aligned cut, which removes the rectangular, axis-aligned bias in the score field and tends to help on elongated or multi-modal data.
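As a rough sketch of what an oblique split looks like, the code below follows the Extended Isolation Forest paper: a normal vector with Gaussian components, an intercept point drawn uniformly inside the node's feature ranges, and the branch test (x - p) . n <= 0. The names random_hyperplane, goes_left and std_normal are hypothetical, and the way active features are chosen for a given extension_level is an assumption rather than the module's confirmed behaviour.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

use constant TWO_PI => 2 * 3.14159265358979;

# One standard normal draw via Box-Muller.
sub std_normal {
    my $u1 = rand() || 1e-12;
    return sqrt( -2 * log($u1) ) * cos( TWO_PI * rand() );
}

# Build a random hyperplane inside the box [mins, maxs].
# extension_level + 1 features get a non-zero normal component (assumption).
sub random_hyperplane {
    my ( $mins, $maxs, $extension_level ) = @_;
    my $n_feats = @$mins;
    $extension_level = $n_feats - 1 if $extension_level > $n_feats - 1;
    my @active = ( shuffle 0 .. $n_feats - 1 )[ 0 .. $extension_level ];
    my @normal = (0) x $n_feats;
    $normal[$_] = std_normal() for @active;
    my @point = map { $mins->[$_] + rand() * ( $maxs->[$_] - $mins->[$_] ) }
        0 .. $n_feats - 1;
    return { normal => \@normal, point => \@point };
}

# Branch test: the left child takes points with (x - p) . n <= 0.
sub goes_left {
    my ( $x, $split ) = @_;
    my $dot = sum( map { ( $x->[$_] - $split->{point}[$_] ) * $split->{normal}[$_] }
            0 .. $#$x );
    return $dot <= 0 ? 1 : 0;
}

srand(42);
my $split = random_hyperplane( [ 0, 0, 0 ], [ 1, 1, 1 ], 1 );
print "normal: @{ $split->{normal} }\n";
print "goes left: ", goes_left( [ 0.2, 0.9, 0.5 ], $split ), "\n";
```

With extension_level 0 only one component of the normal is non-zero, so the test reduces to an ordinary axis-parallel cut.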

With voting => 'majority' the module implements the Majority Voting Isolation Forest (MVIForest) aggregation: each tree votes a sample anomalous or normal against the decision threshold and the label is the majority of the votes, with prediction stopping early once the majority is reached. Trees are built identically either way, so this composes with both axis and extended mode, and an existing model can be switched between mean and majority voting with "set_voting" without refitting; see voting under "new(%args)".

For data that arrives as a stream and may drift over time, the companion class Algorithm::Classifier::IsolationForest::Online implements Online Isolation Forest (Leveni et al. 2024): there is no fit(); instead, points are learned as they arrive and forgotten once they age out of a sliding window. Models saved by either class can be loaded through "load", which dispatches on the stored format tag.

psi, referenced below, is ψ (the Greek letter psi, the "pitchfork" math symbol) as used in the paper: the sub-sample size used to build each tree, also called max samples.

Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

https://www.researchgate.net/publication/224384174_Isolation_Forest

NATIVE ACCELERATION (Inline::C and OpenMP)

Both the scoring hot path (score_samples, predict, path_lengths, score_predict_samples, and score_predict_split) and the fit() tree builder are automatically accelerated through Inline::C when it is installed and a working C compiler is reachable. If the toolchain also accepts -fopenmp and can link against libgomp, the per-point tree walk runs in parallel across all available CPU cores using OpenMP, and the extended-mode oblique dot product is vectorised via #pragma omp simd -- which on modern x86 compilers translates to an unrolled FMA / AVX gather chain that's substantially faster for high-feature-count extended models.

fit()'s tree builder (subsampling plus the recursive axis/oblique split search) runs in C the same way when use_c is on, replacing the per-node Perl arrayref copying with plain int-array partitioning -- typically an order of magnitude faster, and dramatically more so at higher feature counts where the pure-Perl per-cell loop dominates. Its random draws go through the same generator rand()/srand() use internally, in the same call order the pure-Perl builder uses, so a given seed produces bit-identical trees whether use_c is on or off -- switching backends changes only how fast the model is built, not the model itself. On perls whose NV is wider than a C double (-Duselongdouble / -Dusequadmath) the pure-Perl builder rounds each stored value to double precision to preserve this parity; axis mode matches exactly, while extended mode can still differ on rare libm rounding ties (double vs long-double transcendentals).

By default this C builder is single-threaded per call, because Perl's RNG state isn't safe to share across OpenMP threads. Two ways to scale fit() across cores are available (see below for why they don't compose):

parallel_fit forks N worker processes, each building its share of the trees with the (still single-threaded) C builder. Fixed IPC/serialisation overhead per worker means this can cost more than it saves once a fit already completes in milliseconds; it's most useful once a single-process fit is large enough that the fork/Storable overhead is small relative to the work being split.
use_openmp_fit builds trees across OpenMP threads within a single process (one tree per thread), using a separate, thread-safe PRNG seeded per tree index instead of Perl's rand(). This means trees built with use_openmp_fit are not bit-identical to the default use_c path for the same seed -- but a fixed seed and n_trees still reproduce the same trees regardless of OMP_NUM_THREADS or how OpenMP schedules the work. It's off by default (unlike use_c/use_openmp, which only ever change speed, this changes which trees get built) and only takes effect when use_c is also on and OpenMP is linked in.

These two do NOT compose, despite both existing to parallelise fit(). A process that has run any OpenMP region -- including plain score_samples()/predict() with the default use_openmp -- and then fork()s (as parallel_fit does) hands each child a copy of libgomp's thread pool whose worker threads did not survive the fork. A child that then starts its own #pragma omp parallel region (as use_openmp_fit would) tries to reuse that now-invalid pool and hangs. This is a general limitation of combining fork() with OpenMP, not something fixable from Perl, so parallel_fit's forked workers always use the single-threaded C builder regardless of use_openmp_fit -- setting both just means parallel_fit wins and use_openmp_fit has no effect for that call.

Detection happens once when the module is loaded. When the distribution was installed with Inline::C available, the C backend was already compiled during make and the installed object is loaded directly (see "Compile at install time (the prebuilt object)" below); otherwise the backend is compiled on first load and the artefact is cached under _Inline/ and reused on subsequent runs. Five package variables report what the load picked up:

$Algorithm::Classifier::IsolationForest::HAS_C       # 0/1
$Algorithm::Classifier::IsolationForest::HAS_OPENMP  # 0/1
$Algorithm::Classifier::IsolationForest::HAS_SIMD    # 0/1 (OpenMP 4.0+)
$Algorithm::Classifier::IsolationForest::OPT_LEVEL   # e.g. "-O3 -march=native", '' if HAS_C is 0
$Algorithm::Classifier::IsolationForest::C_SOURCE    # 'prebuilt' / 'runtime', '' if HAS_C is 0

Neither dependency is required. Without Inline::C the module falls back to a pure-Perl implementation that produces identical results, just slower; without OpenMP the C backend runs single-threaded.

The bundled iforest accel subcommand performs a tiny fit + score and prints which backend is active (including the build flags below), which is the recommended way to verify the build picked up the optional dependencies on a given machine.
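A script can also read the five package variables directly. The sketch below folds them into one line; backend_summary is a hypothetical helper and not part of the module, and the require is wrapped in eval so the snippet still runs on a machine where the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the backend from the documented variables.
sub backend_summary {
    my (%v) = @_;
    return 'pure Perl' unless $v{HAS_C};
    my $threads = $v{HAS_OPENMP} ? 'OpenMP' : 'serial';
    my $simd    = $v{HAS_SIMD}   ? ', simd' : '';
    return "C ($v{C_SOURCE}, $threads$simd, flags: $v{OPT_LEVEL})";
}

if ( eval { require Algorithm::Classifier::IsolationForest; 1 } ) {
    no strict 'refs';
    my %vars;
    $vars{$_} = ${"Algorithm::Classifier::IsolationForest::$_"}
        for qw(HAS_C HAS_OPENMP HAS_SIMD OPT_LEVEL C_SOURCE);
    print backend_summary(%vars), "\n";
}
else {
    print "Algorithm::Classifier::IsolationForest is not installed\n";
}
```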

Compile at install time (the prebuilt object)

When Inline::C is usable while the distribution itself is being built, perl Makefile.PL arranges for the C backend to be compiled once during make and installed alongside the module like any XS object. At run time that object is loaded directly through XSLoader: no C compiler, no Inline modules, and no _Inline/ cache directory are needed on the machine the module ends up running on, and the first-load compile pause disappears entirely.

On x86-64 hardware from roughly the last decade, IF_ARCH=x86-64-v3 perl Makefile.PL is a reasonable configure line: it bakes AVX2 + FMA (without AVX-512) into the prebuilt object, which can speed up extended-mode scoring (how much is hardware-dependent -- benchmark with iforest bench before assuming) while avoiding the -march=native caveats described under "Tuning the C build". Bit-for-bit result parity with the pure-Perl backend is preserved either way (see IF_ARCH below).

The IF_* build flags described below are captured when perl Makefile.PL runs -- set them in the environment of that command, not of make -- and recorded in the generated Algorithm::Classifier::IsolationForest::BuildFlags module, which thereby also fixes what the prebuilt object was compiled with. At run time the recorded values serve as the defaults, so a process started with no IF_* variables set uses the prebuilt object as-is.

Setting IF_* variables at run time keeps working exactly as in releases without prebuilt support: if the requested flags differ from the recorded ones, the prebuilt object (compiled with the wrong flags for the request) is skipped and the module compiles at first load into _Inline/ -- which does need Inline::C and a compiler on that machine. Two related knobs exist:

IF_RUNTIME_BUILD=1 -- ignore the prebuilt object unconditionally and compile at first load even though the requested flags match the recorded ones. Useful when the installed object is suspect (built on a different CPU than it now runs on, linked against a libgomp that has since changed) or to A/B a fresh local build against the shipped one.
IF_INSTALL_BUILD=1 -- internal; set by the generated Makefile rule that performs the install-time compile. Not meant for manual use.

If the prebuilt object cannot be loaded for any reason (deleted, built against a different perl, version mismatch after an upgrade), the module quietly falls through the same chain as always: runtime Inline::C build first, pure Perl last.

Tuning the C build

These environment variables are read once, the first time the module is loaded, so they must be set before that -- e.g. in the shell before running a script, not via %ENV inside the script itself. They are also read by perl Makefile.PL to pick the flags baked into the prebuilt object (see above); at run time they override the recorded configure-time values, at the price of a runtime compile.

IF_NO_C=1 -- skip attempting to build the C backend entirely. Equivalent to constructing every instance with use_c => 0, but without needing to touch every call site; useful for a clean pure-Perl timing baseline, or to avoid the compile attempt's overhead/noise on a host known to lack a C compiler (the attempt already fails gracefully without this, so it's a convenience, not a correctness fix).
IF_OPT=-O2 (or -O0/-O1/-Os/-Og/-Oz) -- override the default -O3, e.g. to shorten build time while iterating, or work around a miscompile on an unusual toolchain. Invalid values are ignored with a warning rather than passed through, since this string reaches a compiler command line.
IF_ARCH=<value> -- adds -march=<value> so the compiler can target specific instruction-set extensions (AVX2 gather + FMA, etc.) for the extended-mode oblique dot product and the fit-time min/max scan's #pragma omp simd loops. Accepts values like x86-64-v3, skylake, or znver3 -- whatever your compiler's -march= accepts. Also validated (a restricted character set, not passed through as-is) for the same reason as IF_OPT. The special value none (or an empty string) opts out of any arch recorded at configure time, yielding a plain build. Whenever a -march is in effect the build also adds -ffp-contract=off: with FMA available the compiler would otherwise contract a*b+c into fused multiply-adds whose different rounding breaks the guarantee that use_c => 1 and use_c => 0 build bit-identical trees (the -march speedup comes from vectorization, not contraction, so this costs essentially nothing).
IF_NATIVE=1 -- shorthand for IF_ARCH=native; ignored if IF_ARCH is also set. Prefer a specific IF_ARCH value over this on a machine you don't control exclusively (a shared build host, a container base image): blanket -march=native pulls in whatever instruction sets the build host happens to have, including AVX-512 on some Intel CPUs -- which is known to trigger clock throttling under sustained heavy use and can make throughput worse than a conservative target like x86-64-v3 (AVX2, no AVX-512). If in doubt, benchmark both before committing to one.
IF_NO_OPENMP=1 -- build (or select) the serial C backend: the OpenMP compile attempt is skipped entirely, so the resulting object has no libgomp linkage and never starts an OpenMP runtime inside the process. This differs from OMP_NUM_THREADS=1, which merely runs the parallel code on one thread but still loads libgomp. Set at perl Makefile.PL time it yields a serial prebuilt object; set at run time against an OpenMP prebuilt install it triggers a runtime serial build (needing a compiler). An explicit IF_NO_OPENMP=0 re-enables OpenMP over a serial configure-time default.

Whichever of these are used, the cached artefact under _Inline/ is pinned to that build's instruction set -- delete _Inline/ (or use a separate one per host) if the directory is shared across machines with different CPUs, or a stale binary built for a narrower instruction set than the current host will simply keep being reused.

Tuning the OpenMP runtime

These are standard OpenMP environment variables libgomp already reads at run time (set before running your script, no module-specific handling needed) -- listed here because they matter most for exactly the workloads this module has: score_all_xs's per-point parallel loop and use_openmp_fit's per-tree parallel loop.

OMP_NUM_THREADS=N -- caps how many threads a parallel region uses. Useful to leave headroom for other work sharing the machine, or to pin down use_openmp_fit reproducibility checks (see its docs above: results don't depend on this, but it's a natural thing to vary when confirming that).
OMP_PROC_BIND=close / OMP_PLACES=cores -- on multi-socket or otherwise NUMA machines, pins each thread to a core near where its data already lives instead of letting the OS scheduler migrate threads across sockets mid-run. Both score_all_xs (each thread scans its own slice of the packed query buffer) and use_openmp_fit (each thread builds one tree from packed training data) benefit from this when the input is large enough to not fit comfortably in one socket's cache.

These cost nothing to try -- unlike IF_ARCH/IF_NATIVE, they're read fresh every run, not baked into a cached binary, so there's no downside to experimenting per invocation.

GENERAL METHODS

new(%args)

Initialises the object. The following arguments are accepted.

  - n_trees :: number of isolation trees in the ensemble
      default :: 100

  - sample_size :: sub-sample size (psi) used to build each tree, also
        known as max samples
      default :: 256

  - max_depth :: per-tree height limit; when undef it is set to
        ceil(log2(psi))
      default :: undef

  - seed :: optional integer to seed srand with for reproducible trees;
        see perldoc -f srand for more info. This number is processed via
        abs(int()).
      default :: undef

  - mode :: which kind of forest to build, IF or EIF
          axis     :: classic axis-parallel splits (IF)
          extended :: oblique hyperplane splits (EIF)
      default :: axis

  - extension_level :: extended mode only; how many features take part in
        each split. 0 behaves like a single-feature (axis) cut; the
        maximum (n_features - 1) uses every varying feature. undef
        => maximum. Clamped to [0, n_features - 1] at fit time.
      default :: undef

  - contamination :: expected fraction of anomalies, in (0, 0.5]. When given,
        fit() learns a score threshold that flags this fraction of
        the training set, and predict() uses it by default. undef
        => no learned threshold (predict() falls back to 0.5).
      default :: undef

  - missing :: how fit() treats undef (missing) feature cells. Scoring always
        tolerates undef regardless of this setting; it governs fit().
          die    :: croak from fit() if the training data contains any
                    undef cell. Scoring still maps undef to 0 (the
                    long-standing behaviour), so a model fitted on clean
                    data can still score rows with missing features.
          zero   :: treat a missing cell as the value 0, at fit and score.
          impute :: replace a missing cell with the per-feature mean (or
                    median, see impute_with) learned from the present
                    values at fit time. The fill vector is stored on the
                    model and reused for scoring and persistence.
          nan    :: build feature ranges from present values only and route
                    a point missing the split feature to the right child,
                    consistently at fit and score time. Missingness is
                    preserved as signal rather than filled.
      default :: die

  - impute_with :: 'mean' or 'median'; the statistic used to compute the
        per-feature fill under missing => 'impute'. Ignored otherwise.
      default :: mean

  - voting :: how the per-tree results are aggregated at scoring time.
        Trees are built identically in both settings -- only aggregation
        changes -- so the knob composes with either mode (axis or
        extended) and an existing model may switch it after the fact with
        set_voting() (which relearns a contamination threshold for the
        new mode).
          mean     :: classic Isolation Forest: a sample's path lengths
                      across all trees are averaged and normalised into
                      one anomaly score; predict() thresholds that score.
          majority :: Majority Voting Isolation Forest (MVIForest;
                      Chabchoub, Togbe, Boly & Chiky 2022 -- see
                      REFERENCES). Each tree scores the sample on its own
                      (s_i = 2**(-h_i / c(psi))) and votes it anomalous
                      when s_i >= the decision threshold; predict() flags
                      the sample when more than half of the trees
                      (int(n_trees/2) + 1) vote anomalous, and stops
                      walking trees per sample as soon as the outcome is
                      decided. The threshold argument/default of the
                      predict methods is therefore the PER-TREE cutoff
                      here, not a forest-level score cutoff.
                      score_samples() returns the fraction of trees
                      voting anomalous -- still in [0, 1], but discrete
                      in steps of 1/n_trees. contamination composes: fit()
                      learns the per-tree cutoff that flags the requested
                      fraction of the training set.
      default :: mean

  - parallel_fit :: positive integer N => build the trees across N forked
        worker processes during fit(). Each worker gets a derived seed
        (parent seed + worker_id * 1009) so the parallel fit is
        reproducible across runs at fixed worker count -- but the trees
        produced are NOT bit-identical to a serial fit with the same
        seed, because the RNG draws happen in a different order.
        Inference is unaffected. Falls back silently to serial on
        platforms without a real fork() (e.g. Windows without Cygwin).
      default :: undef (serial)

  - use_c :: boolean, override whether the Inline::C backend is used for
        both scoring and fit()'s tree builder.  When false the instance
        falls back to pure Perl for both even if the C backend compiled
        successfully.  When true (or unset) the C backend is used if
        available ($HAS_C).  fit() with use_c on produces bit-identical
        trees to use_c off for the same seed -- only build speed differs.
      default :: $HAS_C

  - use_openmp :: boolean, override whether OpenMP parallel scoring is
        used inside score_all_xs().  When false the C tree walk runs
        single-threaded even if OpenMP was linked in.  Ignored when
        use_c is false (pure Perl has no OpenMP path).
      default :: $HAS_OPENMP

  - use_openmp_fit :: boolean, build fit()'s trees across OpenMP threads
        (one tree per thread) instead of the single-threaded C builder.
        Opt-in and off by default: unlike use_c/use_openmp, this changes
        which trees get built. Perl's RNG isn't safe to call from
        multiple OS threads sharing one interpreter, so this path seeds
        an independent PRNG per tree from the tree index rather than
        Drand01() -- trees differ from the use_c (single-threaded)
        and pure-Perl paths even with the same seed, though a fixed
        seed and n_trees still reproduce the same trees regardless of
        OMP_NUM_THREADS or scheduling. Does NOT compose with
        parallel_fit: a forked child starting its own OpenMP region
        after the parent process has used OpenMP for anything can
        hang (a general fork()+libgomp limitation), so parallel_fit's
        workers always use the single-threaded C builder regardless
        of this setting -- setting both just means parallel_fit wins.
        Ignored (clamped to 0) when use_c is false or OpenMP isn't
        linked in.
      default :: 0

  - feature_names :: optional arrayref of per-feature labels enabling the
        *_tagged methods (and required by mungers below).
      default :: undef

  - mungers :: optional hashref of declarative Algorithm::ToNumberMunger
        specs, keyed as that module's compile() expects (scalar mungers by
        their output tag, expanding mungers by any label with an 'into'
        list, combining mungers by their output tag with a 'from' list).
        When set, every tagged row -- the *_tagged methods, fit_tagged,
        and tagged_row_to_array -- is munged from raw values (strings,
        timestamps, status codes, ...) into numbers through the compiled
        plan, and munge_rows() applies the scalar mungers to positional
        rows.  Requires feature_names; the plan compiles against them, so
        any spec error croaks here in new().  Algorithm::ToNumberMunger is
        an optional dependency, required only when a spec is given (or a
        loaded model carrying one is used with tagged data).  The spec is
        saved with the model, so a loaded model munges scoring input
        exactly as it did training input.  See "MUNGERS" for details
        and caveats.
      default :: undef

  - schema_version :: optional opaque string identifying the revision of
        the variable schema this model was built against.  Never parsed
        or compared numerically; saved with the model and shown by
        `iforest info`.  Usually set from a prototype (see
        "PROTOTYPES") rather than passed directly.
      default :: undef

  - schema_description :: optional opaque free-text description of what
        the variable schema is.  Same handling as schema_version.
      default :: undef

  - feature_descriptions :: optional hashref of 'feature name => free
        text' describing individual features.  Requires feature_names;
        every key must name an entry there (a description for a feature
        that does not exist croaks -- it is either a typo or a stale
        leftover from a schema change).  Partial coverage is fine.
        Saved with the model and shown beside each tag by
        `iforest info`.
      default :: undef
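To make the parallel_fit seed derivation above concrete, here is a small sketch of a worker plan. Only the formula parent seed + worker_id * 1009 comes from the text; that worker ids start at 0, and that leftover trees go to the lowest-numbered workers, are assumptions made for illustration, and worker_plan is a hypothetical name.

```perl
use strict;
use warnings;

# Assumptions: worker ids are 0-based and any remainder of
# n_trees / n_workers is handed to the first workers.
sub worker_plan {
    my ( $n_trees, $n_workers, $seed ) = @_;
    my $base  = int( $n_trees / $n_workers );
    my $extra = $n_trees % $n_workers;
    return map {
        +{
            worker_id => $_,
            n_trees   => $base + ( $_ < $extra ? 1 : 0 ),
            seed      => $seed + $_ * 1009,
        }
    } 0 .. $n_workers - 1;
}

printf "worker %d: %d trees, seed %d\n", @{$_}{qw(worker_id n_trees seed)}
    for worker_plan( 200, 4, 42 );
```

Because each worker's seed depends on its id, changing the worker count changes which trees are built, which is why reproducibility holds only at a fixed worker count.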

Note: log2 under Perl is as below...

log($psi) / log(2)
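When feeding that expression to ceil(), floating-point rounding can leave the quotient a hair above an integer for some exact powers of two, which would push the result one level too high. The sketch below sidesteps this with integer doubling; default_max_depth is a hypothetical helper, and whether the module itself uses the logarithm or an integer loop is not stated here.

```perl
use strict;
use warnings;

# Smallest depth d with 2**d >= psi, i.e. ceil(log2(psi)) without
# going through floating-point logarithms.
sub default_max_depth {
    my ($psi) = @_;
    my ( $depth, $capacity ) = ( 0, 1 );
    while ( $capacity < $psi ) {
        $capacity *= 2;
        $depth++;
    }
    return $depth;
}

printf "psi = %4d -> max_depth %d\n", $_, default_max_depth($_)
    for 64, 256, 257, 1024;
```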

decision_threshold

The score cutoff predict uses by default; undef unless contamination was set.
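One plausible way such a cutoff could be learned from contamination is as a quantile of the training scores. The sketch below is an illustration only: the rounding rule, the handling of ties, and the use of >= when flagging are assumptions, and learn_threshold is a hypothetical name rather than a method of this module.

```perl
use strict;
use warnings;

# Assumption: flag the round(contamination * n) highest-scoring training
# points (at least one); the cutoff is the lowest score among them.
sub learn_threshold {
    my ( $scores, $contamination ) = @_;
    my @sorted = sort { $b <=> $a } @$scores;
    my $k = int( $contamination * @sorted + 0.5 );
    $k = 1 if $k < 1;
    return $sorted[ $k - 1 ];
}

my @train_scores = ( 0.41, 0.44, 0.39, 0.72, 0.43, 0.40, 0.81, 0.42, 0.45, 0.38 );
my $cutoff       = learn_threshold( \@train_scores, 0.2 );
my @flagged      = grep { $_ >= $cutoff } @train_scores;
printf "cutoff %.2f flags %d of %d training points\n",
    $cutoff, scalar @flagged, scalar @train_scores;
```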

set_voting

Switches the scoring-time aggregation between 'mean' and 'majority' on an existing model and returns $self (so it chains). The forest itself is identical in both modes -- only the way per-tree results are combined changes -- so this never rebuilds a single tree.

$iforest->set_voting('majority');
$iforest->set_voting('mean', \@training_data);

The one thing that does not carry over is a contamination-learned "decision_threshold". That cutoff is a quantile of whichever per-point quantity the mode thresholds against -- the averaged anomaly score under 'mean', the per-tree majority pivot under 'majority' -- and those live in different spaces, so a threshold learned in one mode flags the wrong fraction in the other. When the model was fitted with contamination, set_voting therefore relearns the threshold for the target mode, which requires the original training data to be passed as the second argument (the model does not retain it). Switching a model that had no contamination needs no data: predict falls back to 0.5, which is meaningful in both modes.

Passing the current mode is a no-op (returns immediately, no data needed). Calling this before "fit" just records the mode for the eventual fit.

feature_names

Returns the arrayref of feature name strings stored with the model, or undef if none were provided at fit time.

my $names = $iforest->feature_names;

schema_version

Returns the user-owned schema version string stored with the model (usually via a prototype -- see "PROTOTYPES"), or undef if none was recorded.

my $sv = $iforest->schema_version;

schema_description

Returns the free-text description of the variable schema stored with the model, or undef if none was recorded.

feature_descriptions

Returns the hashref of per-feature description strings stored with the model, or undef if none were recorded. Keys are feature names; coverage may be partial.

fit

Trains the model on the specified data.

fit takes an arrayref of arrayrefs. Each inner arrayref is one sample and must contain one or more numeric features. All samples must have the same number of features. There is no upper limit on dimensionality.

@training_data = (
    [ 3, 5 ],
    [ 2.3, 1 ],
    [ 5, 9 ],
    ...
);

# Three-feature example
@training_data = (
    [ 1.0, 2.0, 3.0 ],
    [ 1.1, 1.9, 3.1 ],
    ...
);

The example below builds a Gaussian cluster, adds a handful of outliers, and uses the result for training.

# so it is reproducible
srand(7);

# build a gaussian cluster and add a handful of outliers...

use constant PI => 3.14159265358979;
sub gaussian {
    my ($mu, $sigma) = @_;
    my $u1 = rand() || 1e-12;
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
    return $mu + $sigma * $z;
}

my (@data, @truth);

# add some normal items
for (1 .. 500) {
    push @data,  [ gaussian(0, 1), gaussian(0, 1) ];
    push @truth, 0;
}
# add some outliers
for (1 .. 20) {
    my $angle  = rand() * 2 * PI;
    my $radius = 5 + rand() * 3;             # distance 5..8 from the origin
    push @data,  [ $radius * cos($angle), $radius * sin($angle) ];
    push @truth, 1;
}

$iforest->fit(\@data);
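When the training data contains undef cells and missing => 'impute' is set, fit() learns a per-feature fill vector as described under new(%args). The sketch below shows the idea in plain Perl; learn_fill and apply_fill are hypothetical names, and the even-count median rule and the 0 fallback for an all-missing column are assumptions.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Per-feature mean or median over the cells that are present.
sub learn_fill {
    my ( $rows, $impute_with ) = @_;
    my $n_feats = @{ $rows->[0] };
    my @fill;
    for my $f ( 0 .. $n_feats - 1 ) {
        my @present = sort { $a <=> $b } grep { defined } map { $_->[$f] } @$rows;
        if ( !@present ) { push @fill, 0; next; }    # assumption: nothing to learn from
        if ( $impute_with eq 'median' ) {
            my $mid = int( @present / 2 );
            push @fill, @present % 2
                ? $present[$mid]
                : ( $present[ $mid - 1 ] + $present[$mid] ) / 2;
        }
        else {
            push @fill, sum(@present) / @present;
        }
    }
    return \@fill;
}

# Replace undef cells with the learned fill, leaving the input untouched.
sub apply_fill {
    my ( $rows, $fill ) = @_;
    return [
        map {
            my $row = $_;
            [ map { defined $row->[$_] ? $row->[$_] : $fill->[$_] } 0 .. $#$fill ]
        } @$rows
    ];
}

my @rows   = ( [ 1, 2 ], [ 3, undef ], [ 8, 4 ], [ undef, 6 ] );
my $fill   = learn_fill( \@rows, 'mean' );
my $filled = apply_fill( \@rows, $fill );
print "fill vector: @$fill\n";
print "row $_: @{ $filled->[$_] }\n" for 0 .. $#$filled;
```

The same fill vector would be applied to scoring input, so a row with a missing cell is scored as if it carried the training-time statistic.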

fit_tagged(\@rows)

Trains the model on an arrayref of hashrefs of named feature values -- the tagged counterpart of "fit". Each row goes through "tagged_row_to_array" (and therefore through the munger plan when mungers is configured, which is the point: training data and scoring data are munged by the identical plan), then the positional rows are handed to fit.

$iforest->fit_tagged([
    { cpu => 0.9, mem => 0.4, disk => 0.1 },
    { cpu => 0.2, mem => 0.3, disk => 0.2 },
    ...
]);

Requires stored feature_names. Croaks under the same conditions as "tagged_row_to_array", naming the offending row by index.

pack_data(\@data)

Returns an opaque, blessed wrapper around the input dataset that the scoring methods can use directly, skipping the per-call work of walking the arrayref-of-arrayrefs and converting each cell into a double. At high feature counts this is a meaningful win when the same dataset is scored repeatedly (e.g. interactive threshold tuning, dashboards, plotting that updates as parameters change).

Requires the Inline::C backend; croaks if use_c is false.

my $packed = $forest->pack_data(\@data);

# Now any of these accept either an arrayref or the packed wrapper:
my $scores = $forest->score_samples($packed);
my $flags  = $forest->predict($packed, 0.6);
my ($s, $l) = $forest->score_predict_split($packed);

The wrapper has n_pts and n_feats accessors for introspection. The feature count is matched against the model on every call; passing a packed dataset built for a different feature count is a fatal error.

path_lengths(\@data)

Returns an arrayref of the mean isolation depth per sample, for inspection.

my $lengths = $forest->path_lengths(\@data);

print "x, y, length\n";

for my $i (0 .. $#data) {
    print "$data[$i][0], $data[$i][1], $lengths->[$i]\n";
}

predict(\@data, $threshold)

Returns an arrayref of 0/1 labels for the specified data.

If threshold is not specified it uses the contamination-learned cutoff (if fit was called with contamination), otherwise 0.5.

Under voting => 'majority' the threshold is the per-tree score cutoff each tree votes against, and a sample is labelled 1 when more than half of the trees (int(n_trees/2) + 1) vote it anomalous. Tree walking stops per sample as soon as the outcome is decided, so this is typically cheaper than scoring.

my $results = $forest->predict(\@data, $threshold);

print "x, y, result\n";

for my $i (0 .. $#data) {
    print "$data[$i][0], $data[$i][1], $results->[$i]\n";
}
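The majority rule described above for voting => 'majority' can be sketched as follows. This is an illustration, not the module's code: majority_label is a hypothetical name, the per-tree scores are passed in ready-made (a real forest would compute each one only when its tree is walked), and stopping early on a decided negative outcome is an assumption about what "decided" covers.

```perl
use strict;
use warnings;

# Returns (label, trees_walked) for one sample.
sub majority_label {
    my ( $tree_scores, $threshold ) = @_;
    my $n_trees = @$tree_scores;
    my $needed  = int( $n_trees / 2 ) + 1;
    my $votes   = 0;
    for my $i ( 0 .. $n_trees - 1 ) {
        $votes++ if $tree_scores->[$i] >= $threshold;
        my $remaining = $n_trees - 1 - $i;
        return ( 1, $i + 1 ) if $votes >= $needed;                 # majority reached
        return ( 0, $i + 1 ) if $votes + $remaining < $needed;     # majority impossible
    }
    return ( 0, $n_trees );
}

my ( $label, $walked ) = majority_label( [ 0.7, 0.7, 0.7, 0.1, 0.1 ], 0.6 );
print "label $label after $walked of 5 trees\n";
```

Here the threshold is the per-tree cutoff, so raising it makes each individual tree less likely to vote anomalous rather than moving a forest-level score cutoff.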

predict_tagged(\%row, $threshold)

Predicts whether a single sample is an anomaly using a hashref of named feature values. The model must have been fitted (or loaded from a model that was fitted) with feature names stored via feature_names.

$threshold defaults the same way as in predict.

Returns a scalar 1 (anomaly) or 0 (normal).

my $label = $forest->predict_tagged(
    { cpu => 0.9, mem => 0.4, disk => 0.1 },
);

Croaks if the model has no stored feature names, if the hashref contains a key that is not a known feature name, or if a feature name is absent from the hashref.

tagged_row_to_array(\%row, $caller)

Validates a hashref of named feature values against the model's stored feature_names and returns a positional arrayref ready to pass to any of the scoring or prediction methods.

$caller is a string used in error messages to identify which method triggered the validation (pass the calling method's name).

my $vec = $forest->tagged_row_to_array(\%row, 'my_method');
# returns e.g. [0.9, 0.4, 0.1] ordered by feature_names

Croaks if:

$row is not a hashref
the model has no stored feature_names
the hashref contains a key that is not a known feature name
a feature name is absent from the hashref
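The checks listed above amount to a strict hash-to-array conversion. A minimal standalone sketch is shown below, assuming no mungers are configured; row_to_array is a hypothetical stand-in that takes the feature names as an argument, and its error messages are invented for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Validate a tagged row and return its values in feature_names order.
sub row_to_array {
    my ( $row, $feature_names, $caller ) = @_;
    croak "$caller: row must be a hashref" unless ref $row eq 'HASH';
    croak "$caller: model has no stored feature_names"
        unless ref $feature_names eq 'ARRAY' && @$feature_names;
    my %known = map { ( $_ => 1 ) } @$feature_names;
    for my $key ( sort keys %$row ) {
        croak "$caller: unknown feature '$key'" unless $known{$key};
    }
    for my $name (@$feature_names) {
        croak "$caller: missing feature '$name'" unless exists $row->{$name};
    }
    return [ @{$row}{@$feature_names} ];
}

my $vec = row_to_array(
    { disk => 0.1, cpu => 0.9, mem => 0.4 },
    [qw(cpu mem disk)],
    'example',
);
print "positional row: @$vec\n";
```

Rejecting both unknown and absent keys means a typo in a feature name fails loudly instead of silently scoring a row with a misplaced or missing value.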

munge_rows(\@rows)

Applies the model's scalar mungers to positional rows (arrayrefs in feature_names order), returning a new arrayref of munged rows. A model without mungers returns the input unchanged, so callers such as the CLI can pass every dataset through unconditionally.

Croaks if the munger set contains expanding or combining mungers -- their inputs are named source fields that positional rows cannot express; use the tagged methods (or "fit_tagged") for those.

my $numeric = $iforest->munge_rows(\@raw_rows);

score_samples(\@data)

Returns an arrayref of anomaly scores, between 0 and 1.

Scores near 1 are strong anomalies (isolated quickly).

Scores well below 0.5 are normal.

Scores around 0.5 mean the points are hard to tell apart.

Under voting => 'majority' the returned value is instead the fraction of trees voting the sample anomalous at the model's decision threshold (the contamination-learned cutoff if present, otherwise 0.5) -- still in [0, 1], but discrete in steps of 1/n_trees, with a majority label corresponding to a fraction strictly above 0.5.

my $scores = $forest->score_samples(\@data);

print "x, y, score\n";

# One output line per input row: both coordinates, then the score.
for my $i (0 .. $#data) {
    print $data[$i][0].', '.$data[$i][1].', '.$scores->[$i]."\n";
}
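
For intuition about where the 0.5 boundary comes from, the normalisation in Liu et al. (2008) can be written out directly. The sketch below follows the paper's formula, s = 2 ** (-E[h(x)] / c(n)), and is not taken from the module's source; c_factor and anomaly_score are hypothetical names, and the sample size is assumed to be at least 2.

```perl
use strict;
use warnings;

use constant EULER_GAMMA => 0.5772156649;

# c(n) in Liu et al. (2008): the average path length of an unsuccessful
# binary search tree lookup over n points, used to normalise depth.
sub c_factor {
    my ($n) = @_;
    return 0 if $n <= 1;
    return 1 if $n == 2;
    my $harmonic = log($n - 1) + EULER_GAMMA;    # approximates H(n-1)
    return 2 * $harmonic - 2 * ($n - 1) / $n;
}

# s = 2 ** ( -E[h(x)] / c(n) ); assumes $sample_size >= 2.
sub anomaly_score {
    my ($mean_path_length, $sample_size) = @_;
    return 2 ** ( -$mean_path_length / c_factor($sample_size) );
}

printf "c(256)        = %.4f\n", c_factor(256);
printf "shallow point = %.4f\n", anomaly_score(2,  256);
printf "average point = %.4f\n", anomaly_score(c_factor(256), 256);
printf "deep point    = %.4f\n", anomaly_score(16, 256);
```

A point whose mean depth equals c(n) scores exactly 0.5. Shallower points score above it and deeper points below it, which matches the reading of the score ranges given above.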

score_sample_tagged(\%row)

Scores a single sample supplied as a hashref of named feature values. The model must have stored feature names (set via feature_names in new() or the -t CLI flag at fit time).

Returns a scalar anomaly score in (0, 1].

my $score = $forest->score_sample_tagged({ cpu => 0.9, mem => 0.4 });

Croaks if the model has no stored feature names, if the hashref contains a key that is not a known feature name, or if a feature name is absent from the hashref.

score_predict_samples

Returns an arrayref of arrayrefs, one per input row. The first element of each inner array is the score and the second is the label: 1 if the sample is an anomaly, 0 otherwise.

$threshold defaults the same way as in predict.

Under voting => 'majority' the score is the anomaly vote fraction at $threshold (used as the per-tree cutoff) and the label is the majority vote, matching score_samples/predict semantics in that mode.

my $results = $forest->score_predict_samples(\@data, $threshold);

print "x, y, score, result\n";

# One output line per input row: both coordinates, the score, then the label.
for my $i (0 .. $#data) {
    print $data[$i][0].', '.$data[$i][1].', '.$results->[$i][0].', '.$results->[$i][1]."\n";
}

score_predict_sample_tagged(\%row, $threshold)

Scores and classifies a single sample supplied as a hashref of named feature values. The model must have stored feature names.

$threshold defaults the same way as in predict.

Returns a two-element arrayref [$score, $label], matching the per-row shape that score_predict_samples returns for each row.

my $pair = $forest->score_predict_sample_tagged({ cpu => 0.9, mem => 0.4 });
printf "score %.4f  anomaly %d\n", $pair->[0], $pair->[1];

Croaks if the model has no stored feature names, if the hashref contains a key that is not a known feature name, or if a feature name is absent from the hashref.

score_predict_split(\@data, $threshold)

Same data as "score_predict_samples" but returned as two flat arrayrefs instead of an arrayref-of-pairs. Allocates roughly half as many Perl SVs per point (no inner AV, no RV per row), so it is meaningfully faster when both scores and labels are wanted but the paired shape is not.

In list context returns ($scores_aref, $labels_aref).

my ($scores, $labels) = $forest->score_predict_split(\@data);

for my $i (0 .. $#$scores) {
    printf "%s -> score %.4f, label %d\n",
        join(',', @{ $data[$i] }), $scores->[$i], $labels->[$i];
}

$threshold defaults to the contamination-learned cutoff (if fit was called with contamination) or 0.5.
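
The difference between the paired and the split shape is easiest to see side by side. The sketch below uses made-up scores and a hypothetical split_pairs helper to show the conversion. It does not call the module and says nothing about how the module builds either shape internally.

```perl
use strict;
use warnings;

# Hypothetical paired results, shaped like score_predict_samples output.
my $pairs = [ [0.41, 0], [0.44, 0], [0.83, 1] ];

# Paired -> split: two flat arrayrefs, index-aligned with the input rows.
sub split_pairs {
    my ($paired) = @_;
    my @scores = map { $_->[0] } @$paired;
    my @labels = map { $_->[1] } @$paired;
    return (\@scores, \@labels);
}

my ($scores, $labels) = split_pairs($pairs);

# Flat arrays keep aggregate work simple, e.g. listing anomalous rows.
my @anomalous = grep { $labels->[$_] } 0 .. $#$labels;
print "anomalous row indexes: @anomalous\n";    # anomalous row indexes: 2
```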

MUNGERS

With the optional Algorithm::ToNumberMunger module, a model can carry a declarative munger spec (see mungers under "new(%args)") that turns raw tagged values -- strings, timestamps, status codes, IPs -- into the numbers the forest needs, so callers hand the model the data they actually have:

my $forest = Algorithm::Classifier::IsolationForest->new(
    feature_names => [ 'method', 'bytes_log', 'host_entropy' ],
    mungers       => {
        method       => { munger => 'http_method_enum', default => -1 },
        bytes_log    => { munger => 'log', offset => 1, from => 'bytes' },
        host_entropy => { munger => 'entropy', from => 'host' },
    },
);
$forest->fit_tagged(\@raw_rows);
my $score = $forest->score_sample_tagged(
    { method => 'POST', bytes => 51234, host => 'kq3xv9z2.example' } );

The spec is pure data and is saved with the model, so a loaded model munges scoring input exactly as it did training input -- the consistency that makes munging part of the model rather than an upstream preprocessing step. Points worth knowing:

- Only tagged input is munged. Positional rows passed to fit or the scoring methods are taken as already numeric; "munge_rows" applies the scalar mungers to positional rows for callers (like the CLI) that want the same transformation there. Packed datasets ("pack_data(\@data)") are never munged.
- Under a munger plan, tagged-row validation is the plan's: a missing input field croaks (including munger from sources, which need not be tags), while unknown extra keys are ignored rather than rejected.
- Loading a model that carries mungers does not require Algorithm::ToNumberMunger -- inspection and positional scoring work without it; the first tagged call croaks with an install hint. A munger name unknown to an older installed Algorithm::ToNumberMunger croaks naming it; the model records munger_module_version (the version that authored the spec) to make that diagnosable.
- Munging happens before the missing strategy: for munged columns the strategy sees the munger's output, and most mungers define their own undef handling (length counts undef as 0, enum takes a default, ...). Raw columns behave exactly as without mungers.
- Caveats inherited from the munger set: the eps munger talks to an external service, so a saved model using it needs that service reachable wherever the model runs; frozen_freq_map/ngram count tables are part of the spec and therefore of the model file.

The munger spec composes with everything else -- modes, voting, contamination, the C backend (munging is input-side; accelerated paths are unchanged) -- and works identically on Algorithm::Classifier::IsolationForest::Online.
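
To make the tagged-input flow concrete, here is a rough sketch of a spec turning named raw fields into a positional numeric vector. Everything in it is a stand-in: munge_row_sketch is hypothetical, the log transform is assumed to be log(value + offset), and the entropy is assumed to be per-character Shannon entropy in bits. The real transforms belong to Algorithm::ToNumberMunger and may differ.

```perl
use strict;
use warnings;

# Illustrative stand-ins only; not the Algorithm::ToNumberMunger code.
my %munger = (
    # Assumed meaning of { munger => 'log', offset => 1 }.
    log => sub {
        my ($value, $spec) = @_;
        return log( $value + ($spec->{offset} // 0) );
    },
    # Per-character Shannon entropy in bits (the base is an assumption).
    entropy => sub {
        my ($value) = @_;
        my $len = length $value;
        return 0 unless $len;
        my %count;
        $count{$_}++ for split //, $value;
        my $h = 0;
        for my $c (values %count) {
            my $p = $c / $len;
            $h -= $p * log($p) / log(2);
        }
        return $h;
    },
);

# Build one positional row. Each feature reads its 'from' source field,
# or the field named after the feature when 'from' is absent; features
# without a spec entry pass through unmunged.
sub munge_row_sketch {
    my ($row, $feature_names, $spec) = @_;
    my @out;
    for my $name (@$feature_names) {
        my $s   = $spec->{$name};
        my $src = ($s && exists $s->{from}) ? $s->{from} : $name;
        push @out, $s ? $munger{ $s->{munger} }->($row->{$src}, $s)
                      : $row->{$src};
    }
    return \@out;
}

my $vec = munge_row_sketch(
    { bytes => 51234, host => 'aabb', status => 200 },
    [qw(bytes_log host_entropy status)],
    {
        bytes_log    => { munger => 'log', offset => 1, from => 'bytes' },
        host_entropy => { munger => 'entropy', from => 'host' },
    },
);
printf "%.4f %.4f %d\n", @$vec;
```

Because the spec is plain data, the same hash can be stored next to the trees and replayed at scoring time, which is the consistency argument made above.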

MODEL SAVE/LOAD METHODS

to_json

Returns a JSON representation of the model.

Requires fit to have been called.

my $json = $iforest->to_json;

from_json($json)

Creates a new object from the model in the specified JSON string.

my $iforest = Algorithm::Classifier::IsolationForest->from_json($json);

save($path)

Saves the model to the specified path.

$iforest->save($path);

load($path)

Creates a new object from the model stored in the specified file.

my $iforest = Algorithm::Classifier::IsolationForest->load($path);

PROTOTYPES

A prototype is a small JSON document that describes what a model should be before any data exists: the variable schema (feature names in column order, plus their munger specs, per-feature descriptions, and missing policy), a user-owned schema_version string, a human-readable schema_description, and optionally the tuning knobs. Creating a model from one -- "new_from_prototype($proto, %overrides)" here, or --prototype on iforest fit / iforest stream -- stamps the schema metadata into the model JSON, so every downstream consumer (iforest info, resumed streams, your own tooling) can tell which revision of the input schema a model was built against.

{
  "format": "Algorithm::Classifier::IsolationForest::Prototype",
  "version": 1,
  "class": "online",
  "schema_version": "2026.07.08-1",
  "schema_description": "HTTP request stream: method enum, path length, host entropy, raw byte count",
  "schema": {
    "feature_names": ["method", "path_len", "host_entropy", "bytes"],
    "feature_descriptions": {
      "method":       "HTTP request method, mapped via http_method_enum (-1 = unknown)",
      "path_len":     "character length of the request path",
      "host_entropy": "Shannon entropy of the Host header",
      "bytes":        "raw response byte count, passed through unmunged"
    },
    "mungers": {
      "method":       { "munger": "http_method_enum", "default": -1 },
      "path_len":     { "munger": "length",  "from": "path" },
      "host_entropy": { "munger": "entropy", "from": "host" }
    },
    "missing": "zero"
  },
  "params": {
    "n_trees": 150,
    "window_size": 4096,
    "max_leaf_samples": 32,
    "contamination": 0.02
  }
}

The fields, top to bottom:

- format :: required, always the string
      'Algorithm::Classifier::IsolationForest::Prototype'.  A prototype
      handed to load() (or a model handed to the prototype methods)
      dies with a clear message instead of half-working.

- version :: the prototype format version; this release reads version 1.
    default :: 1

- class :: required, 'batch' (this class) or 'online'
      (Algorithm::Classifier::IsolationForest::Online).  Prototypes
      are self-describing; `iforest fit` refuses an online prototype
      and `iforest stream` refuses a batch one.  Two model types with
      the same variables means two prototype files.

- schema_version :: required opaque string, never parsed or compared
      numerically.  User-owned: bump it when the variable schema
      changes.

- schema_description :: required opaque free-text string describing
      what this variable schema is, so a model file explains itself
      months later.

- schema :: required object holding the variable schema.
      feature_names is required (order = CSV column order); the
      optional keys are feature_descriptions ('feature name => free
      text', every key must name an entry in feature_names, partial
      coverage fine), mungers (see "MUNGERS"), missing, and -- batch
      prototypes only -- impute_with.  Unknown keys croak.

- params :: optional object of tuning knobs, whitelisted per class.
      Batch: n_trees, sample_size, max_depth, mode, extension_level,
      contamination, voting, seed.  Online: n_trees, window_size,
      max_leaf_samples, growth, subsample, contamination, seed.
      Unknown keys croak -- a typo'd knob silently falling back to its
      default is exactly the failure mode a prototype exists to
      prevent.  Machine-local knobs (use_c, use_openmp, use_openmp_fit,
      parallel_fit) are rejected: they describe the box the model runs
      on, not the model.

validate_prototype($proto)

Structurally validates a prototype -- a hashref or a JSON string -- and returns the decoded hashref; croaks describing the first problem found. Validation is structural only (no munger compilation), so it does not require Algorithm::ToNumberMunger even for a munger-bearing prototype.

my $proto = Algorithm::Classifier::IsolationForest->validate_prototype($json);
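
What structural validation means here can be shown with a short sketch that covers a subset of the field rules listed earlier. check_prototype_sketch and its messages are hypothetical; the module's own validate_prototype remains the authority and checks more than this (params whitelists and unknown schema keys, for instance).

```perl
use strict;
use warnings;
use Carp qw(croak);
use JSON::PP qw(decode_json);

my $FORMAT = 'Algorithm::Classifier::IsolationForest::Prototype';

# Hypothetical structural check; accepts a hashref or a JSON string.
sub check_prototype_sketch {
    my ($proto) = @_;
    $proto = decode_json($proto) unless ref $proto;
    croak 'prototype must be a JSON object' unless ref $proto eq 'HASH';

    croak 'format missing or wrong'
        unless defined $proto->{format} && $proto->{format} eq $FORMAT;
    croak "class must be 'batch' or 'online'"
        unless defined $proto->{class}
            && $proto->{class} =~ /\A(?:batch|online)\z/;
    for my $field (qw(schema_version schema_description)) {
        croak "$field is required" unless defined $proto->{$field};
    }

    my $schema = $proto->{schema};
    croak 'schema is required' unless ref $schema eq 'HASH';
    my $names = $schema->{feature_names};
    croak 'schema.feature_names is required'
        unless ref $names eq 'ARRAY' && @$names;

    # Every described feature must be a declared feature.
    my %known = map { $_ => 1 } @$names;
    for my $name (sort keys %{ $schema->{feature_descriptions} || {} }) {
        croak "feature_descriptions names unknown feature '$name'"
            unless $known{$name};
    }
    return $proto;
}

my $ok = check_prototype_sketch({
    format             => $FORMAT,
    class              => 'online',
    schema_version     => '1',
    schema_description => 'demo schema',
    schema             => { feature_names => [ 'cpu', 'mem' ] },
});
print "valid prototype for class $ok->{class}\n";
```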

new_from_prototype($proto, %overrides)

Creates a fresh, unfitted model from a prototype (a hashref or a JSON string) and returns it -- an instance of whichever class the prototype's class field names, so like load() this is a single entry point for both model types. Croaks on any validation failure; a munger-bearing prototype compiles its plan here, so a bogus munger spec dies at creation (and needs Algorithm::ToNumberMunger installed).

%overrides merge over the prototype's params block -- per-run knobs like seed -- and are held to the same per-class whitelist. Overriding the schema itself (feature_names, feature_descriptions, mungers, missing, impute_with, schema_version, schema_description) croaks: the schema is the prototype's, full stop; edit the prototype.

my $oif = Algorithm::Classifier::IsolationForest->new_from_prototype(
    $proto_json,
    seed => 42,
);
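
The override rules can be pictured as a guarded hash merge. The sketch below copies the knob whitelists from the params field description and is otherwise hypothetical: merge_params_sketch and its error text are illustrations, not the module's code.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Whitelists copied from the params field description.
my %ALLOWED = (
    batch  => [qw(n_trees sample_size max_depth mode extension_level
                  contamination voting seed)],
    online => [qw(n_trees window_size max_leaf_samples growth subsample
                  contamination seed)],
);
my @SCHEMA_KEYS = qw(feature_names feature_descriptions mungers missing
                     impute_with schema_version schema_description);

# Hypothetical merge: overrides win over the prototype's params, schema
# keys are refused, and anything off the class whitelist croaks.
sub merge_params_sketch {
    my ($class, $params, %overrides) = @_;
    my $list = $ALLOWED{$class} or croak "unknown class '$class'";
    my %allowed = map { $_ => 1 } @$list;
    my %schema  = map { $_ => 1 } @SCHEMA_KEYS;

    for my $key (sort keys %overrides) {
        croak "cannot override schema key '$key'; edit the prototype"
            if $schema{$key};
    }
    my %merged = (%$params, %overrides);    # later pairs win
    for my $key (sort keys %merged) {
        croak "unknown $class knob '$key'" unless $allowed{$key};
    }
    return \%merged;
}

my $knobs = merge_params_sketch(
    'online',
    { n_trees => 150, window_size => 4096, contamination => 0.02 },
    seed    => 42,
    n_trees => 200,
);
print join(', ', map { "$_=$knobs->{$_}" } sort keys %$knobs), "\n";
# contamination=0.02, n_trees=200, seed=42, window_size=4096
```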

load_prototype($path, %overrides)

Same as "new_from_prototype($proto, %overrides)", but reads the prototype JSON from the file at $path.

my $iforest = Algorithm::Classifier::IsolationForest->load_prototype(
    'proto.json', seed => 42 );

to_prototype

Returns a prototype JSON string extracted from this model: its variable schema (feature_names, feature_descriptions, mungers, missing policy) plus its current tuning knobs. This closes the loop -- extract a prototype from a good model and periodically create fresh models with an identical schema, the natural retrain workflow -- and means hand-writing a prototype is never mandatory.

Croaks when the model has no feature_names: a prototype's variable schema needs named variables. A model with no recorded schema_version / schema_description (fitted before prototype support, or without the knobs) gets placeholder values, since both are required in the file -- edit them in and bump from there. seed and max_depth resolved at fit time are not emitted; pass such per-run knobs as overrides when creating from the prototype.

my $proto_json = $iforest->to_prototype;

REFERENCES

Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

https://www.researchgate.net/publication/224384174_Isolation_Forest

https://ieeexplore.ieee.org/abstract/document/4781136

Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner (2020). Extended Isolation Forest. 1479 - 1489. 10.1109/TKDE.2019.2947676

https://ieeexplore.ieee.org/document/8888179

Yousra Chabchoub, Maurras Ulbricht Togbe, Aliou Boly, Raja Chiky (2022). An In-Depth Study and Improvement of Isolation Forest. IEEE Access, vol. 10, 10219 - 10237. 10.1109/ACCESS.2022.3144425 (the Majority Voting Isolation Forest implemented by voting => 'majority')

https://ieeexplore.ieee.org/document/9684896

Filippo Leveni, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, Giacomo Boracchi (2024). Online Isolation Forest. (the streaming variant implemented by Algorithm::Classifier::IsolationForest::Online)

https://arxiv.org/abs/2505.09593

https://github.com/ineveLoppiliF/Online-Isolation-Forest

https://proceedings.mlr.press/v235/leveni24a.html
