NAME

Algorithm::Classifier::IsolationForest - unsupervised anomaly detection via Isolation Forest or Extended Isolation Forest

SYNOPSIS

use Algorithm::Classifier::IsolationForest;

my @data = ([0.1, -0.2], [0.0, 0.1], [5.0, 6.0], ...);

# Classic, axis-parallel Isolation Forest
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees     => 100,
    sample_size => 256,
    seed        => 42,
);
$iforest->fit(\@data);

my $scores = $iforest->score_samples(\@data);  # arrayref, each in (0,1]
my $flags  = $iforest->predict(\@data, 0.6);    # arrayref of 0/1

# Save and reload
$iforest->save('model.json');
my $reloaded = Algorithm::Classifier::IsolationForest->load('model.json');

# Extended Isolation Forest (oblique hyperplane splits)
my $eif = Algorithm::Classifier::IsolationForest->new(
    mode => 'extended',
    seed => 42,
);
$eif->fit(\@data);

# Parallel training (fork-based, Unix-like platforms): build the
# n_trees across several worker processes.
my $iforest = Algorithm::Classifier::IsolationForest->new(
    n_trees      => 200,
    sample_size  => 256,
    seed         => 42,
    parallel_fit => 4,        # 4 forked workers
);
$iforest->fit(\@data);

# Pre-pack a dataset to skip the per-call input-walk cost when the
# same data gets scored many times (interactive tuning, dashboards).
my $packed = $iforest->pack_data(\@data);
my $scores = $iforest->score_samples($packed);
my $flags  = $iforest->predict($packed, 0.6);

# Get scores and labels as two flat arrayrefs in one call -- cheaper
# than score_predict_samples when you don't need the paired shape.
my ($s, $l) = $iforest->score_predict_split(\@data, 0.6);

DESCRIPTION

Isolation Forest (Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua, 2008) detects anomalies by random partitioning rather than by modelling normal points. Each tree repeatedly splits the data. Points that get isolated after only a few splits are likely anomalies. The score is the average isolation depth across many trees, normalised so values approach 1 for anomalies and stay below 0.5 for normal points.

In extended mode the module implements the Extended Isolation Forest variant. Each split is a random hyperplane instead of an axis-aligned cut, which removes the rectangular, axis-aligned bias in the score field and tends to help on elongated or multi-modal data.

With voting => 'majority' the module implements the Majority Voting Isolation Forest (MVIForest) aggregation: each tree votes a sample anomalous or normal against the decision threshold and the label is the majority of the votes, with prediction stopping early once the majority is reached. Trees are built identically either way, so this composes with both axis and extended mode, and an existing model can be flipped between the two modes with "set_voting" without refitting; see voting under "new(%args)".

psi referenced below is ψ or the pitchfork math symbol referenced in the paper, Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

... or max samples.

https://www.researchgate.net/publication/224384174_Isolation_Forest

NATIVE ACCELERATION (Inline::C and OpenMP)

Both the scoring hot path (score_samples, predict, path_lengths, score_predict_samples, and score_predict_split) and the fit() tree builder are automatically accelerated through Inline::C when it is installed and a working C compiler is reachable. If the toolchain also accepts -fopenmp and can link against libgomp, the per-point tree walk runs in parallel across all available CPU cores using OpenMP, and the extended-mode oblique dot product is vectorised via #pragma omp simd -- which on modern x86 compilers translates to an unrolled FMA / AVX gather chain that's substantially faster for high-feature-count extended models.

fit()'s tree builder (subsampling plus the recursive axis/oblique split search) runs in C the same way when use_c is on, replacing the per-node Perl arrayref copying with plain int-array partitioning -- typically an order of magnitude faster, and dramatically more so at higher feature counts where the pure-Perl per-cell loop dominates. Its random draws go through the same generator rand()/srand() use internally, in the same call order the pure-Perl builder uses, so a given seed produces bit-identical trees whether use_c is on or off -- switching backends changes only how fast the model is built, not the model itself. On perls whose NV is wider than a C double (-Duselongdouble / -Dusequadmath) the pure-Perl builder rounds each stored value to double precision to preserve this parity; axis mode matches exactly, while extended mode can still differ on rare libm rounding ties (double vs long-double transcendentals).

By default this C builder is single-threaded per call, because Perl's RNG state isn't safe to share across OpenMP threads. Two ways to scale fit() across cores are available (see below for why they don't compose):

  • parallel_fit forks N worker processes, each building its share of the trees with the (still single-threaded) C builder. Fixed IPC/serialisation overhead per worker means this can cost more than it saves once a fit already completes in milliseconds; it's most useful once a single-process fit is large enough that the fork/Storable overhead is small relative to the work being split.

  • use_openmp_fit builds trees across OpenMP threads within a single process (one tree per thread), using a separate, thread-safe PRNG seeded per tree index instead of Perl's rand(). This means trees built with use_openmp_fit are not bit-identical to the default use_c path for the same seed -- but a fixed seed and n_trees still reproduce the same trees regardless of OMP_NUM_THREADS or how OpenMP schedules the work. It's off by default (unlike use_c/use_openmp, which only ever change speed, this changes which trees get built) and only takes effect when use_c is also on and OpenMP is linked in.

These two do NOT compose, despite both existing to parallelise fit(). A process that has run any OpenMP region -- including plain score_samples()/predict() with the default use_openmp -- and then fork()s (as parallel_fit does) hands each child a copy of libgomp's thread pool whose worker threads did not survive the fork. A child that then starts its own #pragma omp parallel region (as use_openmp_fit would) tries to reuse that now-invalid pool and hangs. This is a general limitation of combining fork() with OpenMP, not something fixable from Perl, so parallel_fit's forked workers always use the single-threaded C builder regardless of use_openmp_fit -- setting both just means parallel_fit wins and use_openmp_fit has no effect for that call.

Detection happens once when the module is loaded. When the distribution was installed with Inline::C available, the C backend was already compiled during make and the installed object is loaded directly (see "Compile at install time (the prebuilt object)" below); otherwise the backend is compiled on first load and the artefact is cached under _Inline/ and reused on subsequent runs. Five package variables report what the load picked up:

$Algorithm::Classifier::IsolationForest::HAS_C       # 0/1
$Algorithm::Classifier::IsolationForest::HAS_OPENMP  # 0/1
$Algorithm::Classifier::IsolationForest::HAS_SIMD    # 0/1 (OpenMP 4.0+)
$Algorithm::Classifier::IsolationForest::OPT_LEVEL   # e.g. "-O3 -march=native", '' if HAS_C is 0
$Algorithm::Classifier::IsolationForest::C_SOURCE    # 'prebuilt' / 'runtime', '' if HAS_C is 0

Neither dependency is required. Without Inline::C the module falls back to a pure-Perl implementation that produces identical results, just slower; without OpenMP the C backend runs single-threaded.

The bundled iforest accel subcommand performs a tiny fit + score and prints which backend is active (including the build flags below), which is the recommended way to verify the build picked up the optional dependencies on a given machine.

Compile at install time (the prebuilt object)

When Inline::C is usable while the distribution itself is being built, perl Makefile.PL arranges for the C backend to be compiled once during make and installed alongside the module like any XS object. At run time that object is loaded directly through XSLoader: no C compiler, no Inline modules, and no _Inline/ cache directory are needed on the machine the module ends up running on, and the first-load compile pause disappears entirely.

On x86-64 hardware from roughly the last decade, IF_ARCH=x86-64-v3 perl Makefile.PL is a reasonable configure line: it bakes AVX2 + FMA (without AVX-512) into the prebuilt object, which can speed up extended-mode scoring (how much is hardware-dependent -- benchmark with iforest bench before assuming) while avoiding the -march=native caveats described under "Tuning the C build". Bit-for-bit result parity with the pure-Perl backend is preserved either way (see IF_ARCH below).

The IF_* build flags described below are captured when perl Makefile.PL runs -- set them in the environment of that command, not of make -- and recorded in the generated Algorithm::Classifier::IsolationForest::BuildFlags module, which thereby also fixes what the prebuilt object was compiled with. At run time the recorded values serve as the defaults, so a process started with no IF_* variables set uses the prebuilt object as-is.

Setting IF_* variables at run time keeps working exactly as in releases without prebuilt support: if the requested flags differ from the recorded ones, the prebuilt object (compiled with the wrong flags for the request) is skipped and the module compiles at first load into _Inline/ -- which does need Inline::C and a compiler on that machine. Two related knobs exist:

  • IF_RUNTIME_BUILD=1 -- ignore the prebuilt object unconditionally and compile at first load even though the requested flags match the recorded ones. Useful when the installed object is suspect (built on a different CPU than it now runs on, linked against a libgomp that has since changed) or to A/B a fresh local build against the shipped one.

  • IF_INSTALL_BUILD=1 -- internal; set by the generated Makefile rule that performs the install-time compile. Not meant for manual use.

If the prebuilt object cannot be loaded for any reason (deleted, built against a different perl, version mismatch after an upgrade), the module quietly falls through the same chain as always: runtime Inline::C build first, pure Perl last.

Tuning the C build

These environment variables are read once, the first time the module is loaded, so they must be set before that -- e.g. in the shell before running a script, not via %ENV inside the script itself. They are also read by perl Makefile.PL to pick the flags baked into the prebuilt object (see above); at run time they override the recorded configure-time values, at the price of a runtime compile.

  • IF_NO_C=1 -- skip attempting to build the C backend entirely. Equivalent to constructing every instance with use_c => 0, but without needing to touch every call site; useful for a clean pure-Perl timing baseline, or to avoid the compile attempt's overhead/noise on a host known to lack a C compiler (the attempt already fails gracefully without this, so it's a convenience, not a correctness fix).

  • IF_OPT=-O2 (or -O0/-O1/-Os/-Og/-Oz) -- override the default -O3, e.g. to shorten build time while iterating, or work around a miscompile on an unusual toolchain. Invalid values are ignored with a warning rather than passed through, since this string reaches a compiler command line.

  • IF_ARCH=<value> -- adds -march=<value> so the compiler can target specific instruction-set extensions (AVX2 gather + FMA, etc.) for the extended-mode oblique dot product and the fit-time min/max scan's #pragma omp simd loops. Accepts values like x86-64-v3, skylake, or znver3 -- whatever your compiler's -march= accepts. Also validated (a restricted character set, not passed through as-is) for the same reason as IF_OPT. The special value none (or an empty string) opts out of any arch recorded at configure time, yielding a plain build. Whenever a -march is in effect the build also adds -ffp-contract=off: with FMA available the compiler would otherwise contract a*b+c into fused multiply-adds whose different rounding breaks the guarantee that use_c => 1 and use_c => 0 build bit-identical trees (the -march speedup comes from vectorization, not contraction, so this costs essentially nothing).

  • IF_NATIVE=1 -- shorthand for IF_ARCH=native; ignored if IF_ARCH is also set. Prefer a specific IF_ARCH value over this on a machine you don't control exclusively (a shared build host, a container base image): blanket -march=native pulls in whatever instruction sets the build host happens to have, including AVX-512 on some Intel CPUs -- which is known to trigger clock throttling under sustained heavy use and can make throughput worse than a conservative target like x86-64-v3 (AVX2, no AVX-512). If in doubt, benchmark both before committing to one.

  • IF_NO_OPENMP=1 -- build (or select) the serial C backend: the OpenMP compile attempt is skipped entirely, so the resulting object has no libgomp linkage and never starts an OpenMP runtime inside the process. This differs from OMP_NUM_THREADS=1, which merely runs the parallel code on one thread but still loads libgomp. Set at perl Makefile.PL time it yields a serial prebuilt object; set at run time against an OpenMP prebuilt install it triggers a runtime serial build (needing a compiler). An explicit IF_NO_OPENMP=0 re-enables OpenMP over a serial configure-time default.

Whichever of these are used, the cached artefact under _Inline/ is pinned to that build's instruction set -- delete _Inline/ (or use a separate one per host) if the directory is shared across machines with different CPUs, or a stale binary built for a narrower instruction set than the current host will simply keep being reused.

Tuning the OpenMP runtime

These are standard OpenMP environment variables libgomp already reads at run time (set before running your script, no module-specific handling needed) -- listed here because they matter most for exactly the workloads this module has: score_all_xs's per-point parallel loop and use_openmp_fit's per-tree parallel loop.

  • OMP_NUM_THREADS=N -- caps how many threads a parallel region uses. Useful to leave headroom for other work sharing the machine, or to pin down use_openmp_fit reproducibility checks (see its docs above: results don't depend on this, but it's a natural thing to vary when confirming that).

  • OMP_PROC_BIND=close / OMP_PLACES=cores -- on multi-socket or otherwise NUMA machines, pins each thread to a core near where its data already lives instead of letting the OS scheduler migrate threads across sockets mid-run. Both score_all_xs (each thread scans its own slice of the packed query buffer) and use_openmp_fit (each thread builds one tree from packed training data) benefit from this when the input is large enough to not fit comfortably in one socket's cache.

These cost nothing to try -- unlike IF_ARCH/IF_NATIVE, they're read fresh every run, not baked into a cached binary, so there's no downside to experimenting per invocation.

GENERAL METHODS

new(%args)

Inits the object.

- n_trees :: number of isolation trees in the ensemble
    default :: 100

- sample_size :: sub-sample size used to build each tree... max samples
    default :: 256

 - max_depth :: per-tree height limit... if not defined is set to ceil(log2(psi))
     default :: undef

 - seed :: optional integer to seed srand with for reproducible trees...
         see perldoc -f srand for more info. This number is processed via abs(int()).
     default :: undef

 - mode :: if it should be IF or EIF
      axis :: classic axis-parallel splits (IF)
      extended :: oblique hyperplane splits (EIF)
    default :: axis

 - extension_level :: extended mode only... how many features take partin each
         split. 0 behaves like a single-feature (axis) cut; the
         maximum (n_features - 1) uses every varying feature. undef
         => maximum. Clamped to [0, n_features - 1] at fit time.

  - contamination :: expected fraction of anomalies, in (0, 0.5]. When given,
        fit() learns a score threshold that flags this fraction of
        the training set, and predict() uses it by default. undef
        => no learned threshold (predict() falls back to 0.5).
      default :: undef

  - missing :: how fit() treats undef (missing) feature cells. Scoring always
        tolerates undef regardless of this setting; it governs fit().
          die    :: croak from fit() if the training data contains any
                    undef cell. Scoring still maps undef to 0 (the
                    long-standing behaviour), so a model fitted on clean
                    data can still score rows with missing features.
          zero   :: treat a missing cell as the value 0, at fit and score.
          impute :: replace a missing cell with the per-feature mean (or
                    median, see impute_with) learned from the present
                    values at fit time. The fill vector is stored on the
                    model and reused for scoring and persistence.
          nan    :: build feature ranges from present values only and route
                    a point missing the split feature to the right child,
                    consistently at fit and score time. Missingness is
                    preserved as signal rather than filled.
      default :: die

  - impute_with :: 'mean' or 'median'; the statistic used to compute the
        per-feature fill under missing => 'impute'. Ignored otherwise.
      default :: mean

  - voting :: how the per-tree results are aggregated at scoring time.
        Trees are built identically in both settings -- only aggregation
        changes -- so the knob composes with either mode (axis or
        extended) and an existing model may switch it after the fact with
        set_voting() (which relearns a contamination threshold for the
        new mode).
          mean     :: classic Isolation Forest: a sample's path lengths
                      across all trees are averaged and normalised into
                      one anomaly score; predict() thresholds that score.
          majority :: Majority Voting Isolation Forest (MVIForest;
                      Chabchoub, Togbe, Boly & Chiky 2022 -- see
                      REFERENCES). Each tree scores the sample on its own
                      (s_i = 2**(-h_i / c(psi))) and votes it anomalous
                      when s_i >= the decision threshold; predict() flags
                      the sample when more than half of the trees
                      (int(n_trees/2) + 1) vote anomalous, and stops
                      walking trees per sample as soon as the outcome is
                      decided. The threshold argument/default of the
                      predict methods is therefore the PER-TREE cutoff
                      here, not a forest-level score cutoff.
                      score_samples() returns the fraction of trees
                      voting anomalous -- still in [0, 1], but discrete
                      in steps of 1/n_trees. contamination composes: fit()
                      learns the per-tree cutoff that flags the requested
                      fraction of the training set.
      default :: mean

  - parallel_fit :: positive integer N => build the trees across N forked
        worker processes during fit(). Each worker gets a derived seed
        (parent seed + worker_id * 1009) so the parallel fit is
        reproducible across runs at fixed worker count -- but the trees
        produced are NOT bit-identical to a serial fit with the same
        seed, because the RNG draws happen in a different order.
        Inference is unaffected. Falls back silently to serial on
        platforms without a real fork() (e.g. Windows without Cygwin).
      default :: undef (serial)

  - use_c :: boolean, override whether the Inline::C backend is used for
        both scoring and fit()'s tree builder.  When false the instance
        falls back to pure Perl for both even if the C backend compiled
        successfully.  When true (or unset) the C backend is used if
        available ($HAS_C).  fit() with use_c on produces bit-identical
        trees to use_c off for the same seed -- only build speed differs.
      default :: $HAS_C

  - use_openmp :: boolean, override whether OpenMP parallel scoring is
        used inside score_all_xs().  When false the C tree walk runs
        single-threaded even if OpenMP was linked in.  Ignored when
        use_c is false (pure Perl has no OpenMP path).
      default :: $HAS_OPENMP

  - use_openmp_fit :: boolean, build fit()'s trees across OpenMP threads
        (one tree per thread) instead of the single-threaded C builder.
        Opt-in and off by default: unlike use_c/use_openmp, this changes
        which trees get built. Perl's RNG isn't safe to call from
        multiple OS threads sharing one interpreter, so this path seeds
        an independent PRNG per tree from the tree index rather than
        Drand01() -- trees differ from the use_c (single-threaded)
        and pure-Perl paths even with the same seed, though a fixed
        seed and n_trees still reproduce the same trees regardless of
        OMP_NUM_THREADS or scheduling. Does NOT compose with
        parallel_fit: a forked child starting its own OpenMP region
        after the parent process has used OpenMP for anything can
        hang (a general fork()+libgomp limitation), so parallel_fit's
        workers always use the single-threaded C builder regardless
        of this setting -- setting both just means parallel_fit wins.
        Ignored (clamped to 0) when use_c is false or OpenMP isn't
        linked in.
      default :: 0

Note: log2 under Perl is as below...

log($psi) / log(2)

decision_threshold

The score cutoff predict uses by default; undef unless contamination was set.

set_voting

Switches the scoring-time aggregation between 'mean' and 'majority' on an existing model and returns $self (so it chains). The forest itself is identical in both modes -- only the way per-tree results are combined changes -- so this never rebuilds a single tree.

$iforest->set_voting('majority');
$iforest->set_voting('mean', \@training_data);

The one thing that does not carry over is a contamination-learned "decision_threshold". That cutoff is a quantile of whichever per-point quantity the mode thresholds against -- the averaged anomaly score under 'mean', the per-tree majority pivot under 'majority' -- and those live in different spaces, so a threshold learned in one mode flags the wrong fraction in the other. When the model was fitted with contamination, set_voting therefore relearns the threshold for the target mode, which requires the original training data to be passed as the second argument (the model does not retain it). Switching a model that had no contamination needs no data: predict falls back to 0.5, which is meaningful in both modes.

Passing the current mode is a no-op (returns immediately, no data needed). Calling this before "fit" just records the mode for the eventual fit.

feature_names

Returns the arrayref of feature name strings stored with the model, or undef if none were provided at fit time.

my $names = $iforest->feature_names;

fit

Trains the model on the specified data.

The data taken is an array of arrays. Each sub-array is one sample and must contain one or more numeric features. All samples must have the same number of features. There is no upper limit on dimensionality.

@training_data = (
    [ 3, 5 ],
    [ 2.3, 1 ],
    [ 5, 9 ],
    ...
);

# Three-feature example
@training_data = (
    [ 1.0, 2.0, 3.0 ],
    [ 1.1, 1.9, 3.1 ],
    ...
);

Below shows an example of building a gaussian cluster and using that for training.

# so it is reproducible
srand(7);

# build a gaussian cluster and add a handful of outliers...

use constant PI => 3.14159265358979;
sub gaussian {
    my ($mu, $sigma) = @_;
    my $u1 = rand() || 1e-12;
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
    return $mu + $sigma * $z;
}

# add some normal items
for (1 .. 500) {
    push @data,  [ gaussian(0, 1), gaussian(0, 1) ];
    push @truth, 0;
}
# add some outliers
for (1 .. 20) {
    my $angle  = rand() * 2 * PI;
    my $radius = 5 + rand() * 3;             # distance 5..8 from the origin
    push @data,  [ $radius * cos($angle), $radius * sin($angle) ];
    push @truth, 1;
}

$iforest->fit(\@data);

pack_data(\@data)

Returns an opaque, blessed wrapper around the input dataset that the scoring methods can use directly, skipping the per-call work of walking the arrayref-of-arrayrefs and converting each cell into a double. At high feature counts this is a meaningful win when the same dataset is scored repeatedly (e.g. interactive threshold tuning, dashboards, plotting that updates as parameters change).

Requires the Inline::C backend; croaks if use_c is false.

my $packed = $forest->pack_data(\@data);

# Now any of these accept either an arrayref or the packed wrapper:
my $scores = $forest->score_samples($packed);
my $flags  = $forest->predict($packed, 0.6);
my ($s, $l) = $forest->score_predict_split($packed);

The wrapper has n_pts and n_feats accessors for introspection. The feature count is matched against the model on every call; passing a packed dataset built for a different feature count is a fatal error.

path_lengths(\@data)

Returns an arrayref of the mean isolation depth per sample, for inspection.

my $lengths = $forest->path_lengths(\@data);

print "x, y, length\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$lengths->[$int]."\n";

    $int++;
}

predict(\@data, $threshold)

Returns an arrayref of 0/1 labels for the specified data.

If threshold is not specified it uses the contamination-learned cutoff (if fit was called with contamination), otherwise 0.5.

Under voting => 'majority' the threshold is the per-tree score cutoff each tree votes against, and a sample is labelled 1 when more than half of the trees (int(n_trees/2) + 1) vote it anomalous. Tree walking stops per sample as soon as the outcome is decided, so this is typically cheaper than scoring.

my $results = $forest->predict(\@data, $threshold);

print "x, y, result\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$results->[$int]."\n";

    $int++;
}

predict_tagged(\%row, $threshold)

Predicts whether a single sample is an anomaly using a hashref of named feature values. The model must have been fitted (or loaded from a model that was fitted) with feature names stored via feature_names.

$threshold defaults the same way as in predict.

Returns a scalar 1 (anomaly) or 0 (normal).

my $label = $forest->predict_tagged(
    { cpu => 0.9, mem => 0.4, disk => 0.1 },
);

Croaks if the model has no stored feature names, if the hashref contains a key that is not a known feature name, or if a feature name is absent from the hashref.

tagged_row_to_array(\%row, $caller)

Validates a hashref of named feature values against the model's stored feature_names and returns a positional arrayref ready to pass to any of the scoring or prediction methods.

$caller is a string used in error messages to identify which method triggered the validation (pass the calling method's name).

my $vec = $forest->tagged_row_to_array(\%row, 'my_method');
# returns e.g. [0.9, 0.4, 0.1] ordered by feature_names

Croaks if:

  • $row is not a hashref

  • the model has no stored feature_names

  • the hashref contains a key that is not a known feature name

  • a feature name is absent from the hashref

score_samples(\@data)

Returns an arrayref of anomaly scores, between 0 and 1.

Scores near 1 are strong anomalies (isolated quickly).

Scores well below 0.5 are normal.

Scores ~0.5 means the points are hard to tell apart.

Under voting => 'majority' the returned value is instead the fraction of trees voting the sample anomalous at the model's decision threshold (the contamination-learned cutoff if present, otherwise 0.5) -- still in [0, 1], but discrete in steps of 1/n_trees, with a majority label corresponding to a fraction strictly above 0.5.

my $scores = $forest->score_samples(\@data);

print "x, y, score\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$scores->[$int]."\n";

    $int++;
}

score_sample_tagged(\%row)

Scores a single sample supplied as a hashref of named feature values. The model must have stored feature names (set via feature_names in new() or the -t CLI flag at fit time).

Returns a scalar anomaly score in (0, 1].

my $score = $forest->score_sample_tagged({ cpu => 0.9, mem => 0.4 });

Croaks if the model has no stored feature names, if the hashref contains a key that is not a known feature name, or if a feature name is absent from the hashref.

score_predict_samples

Returns an array ref of arrays. First value of each sub array is the score with the second being 0/1 for if it is a anomaly or not.

$threshold defaults the same way as in predict.

Under voting => 'majority' the score is the anomaly vote fraction at $threshold (used as the per-tree cutoff) and the label is the majority vote, matching score_samples/predict semantics in that mode.

my $results = $forest->score_predict_samples(\@data, $threshold);

print "x, y, score, result\n";

my $int=0;
while (defined($data[$int])) {
    print $data[$int][0].', '.$data[$int][1].', '.$results->[$int][0].', '.$results->[$int][1]."\n";

    $int++;
}

score_predict_sample_tagged(\%row, $threshold)

Scores and classifies a single sample supplied as a hashref of named feature values. The model must have stored feature names.

$threshold defaults the same way as in predict.

Returns a two-element arrayref [$score, $label], matching the per-row shape that score_predict_samples returns for each row.

my $pair = $forest->score_predict_sample_tagged({ cpu => 0.9, mem => 0.4 });
printf "score %.4f  anomaly %d\n", $pair->[0], $pair->[1];

Croaks if the model has no stored feature names, if the hashref contains a key that is not a known feature name, or if a feature name is absent from the hashref.

score_predict_split(\@data, $threshold)

Same data as "score_predict_samples" but returned as two flat arrayrefs instead of an arrayref-of-pairs. Allocates roughly half as many Perl SVs per point (no inner AV, no RV per row), so it is meaningfully faster when both scores and labels are wanted but the paired shape is not.

In list context returns ($scores_aref, $labels_aref).

my ($scores, $labels) = $forest->score_predict_split(\@data);

for my $i (0 .. $#$scores) {
    printf "%s -> score %.4f, label %d\n",
        join(',', @{ $data[$i] }), $scores->[$i], $labels->[$i];
}

$threshold defaults to the contamination-learned cutoff (if fit was called with contamination) or 0.5.

MODEL SAVE/LOAD METHODS

to_json

Returns a JSON representation of the model.

Requires fit to have been called.

my $json = $iforest->to_json;

from_json($json)

Init the object from the model in the specified JSON string.

my $iforest = Algorithm::Classifier::IsolationForest->from_json($json);

save($path)

Saves the model to the specified path.

$iforest->save($path);

load($path)

Init the object from the model in the specified file.

my $iforest = Algorithm::Classifier::IsolationForest->load($path);

REFERENCES

Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.

https://www.researchgate.net/publication/224384174_Isolation_Forest

https://ieeexplore.ieee.org/abstract/document/4781136

Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner (2020). Extended Isolation Forest. 1479 - 1489. 10.1109/TKDE.2019.2947676

https://ieeexplore.ieee.org/document/8888179

Yousra Chabchoub, Maurras Ulbricht Togbe, Aliou Boly, Raja Chiky (2022). An In-Depth Study and Improvement of Isolation Forest. IEEE Access, vol. 10, 10219 - 10237. 10.1109/ACCESS.2022.3144425 (the Majority Voting Isolation Forest implemented by voting => 'majority')

https://ieeexplore.ieee.org/document/9684896