NAME
Algorithm::Classifier::IsolationForest - unsupervised anomaly detection via Isolation Forest or Extended Isolation Forest
SYNOPSIS
use Algorithm::Classifier::IsolationForest;
my @data = ([0.1, -0.2], [0.0, 0.1], [5.0, 6.0], ...);
# Classic, axis-parallel Isolation Forest
my $iforest = Algorithm::Classifier::IsolationForest->new(
n_trees => 100,
sample_size => 256,
seed => 42,
);
$iforest->fit(\@data);
my $scores = $iforest->score_samples(\@data); # arrayref, each in (0,1]
my $flags = $iforest->predict(\@data, 0.6); # arrayref of 0/1
# Save and reload
$iforest->save('model.json');
my $reloaded = Algorithm::Classifier::IsolationForest->load('model.json');
# Extended Isolation Forest (oblique hyperplane splits)
my $eif = IsolationForest->new(mode => 'extended', seed => 42);
$eif->fit(\@data);
# Parallel training (fork-based, Unix-like platforms): build the
# n_trees across several worker processes.
my $iforest = Algorithm::Classifier::IsolationForest->new(
n_trees => 200,
sample_size => 256,
seed => 42,
parallel_fit => 4, # 4 forked workers
);
$iforest->fit(\@data);
# Pre-pack a dataset to skip the per-call input-walk cost when the
# same data gets scored many times (interactive tuning, dashboards).
my $packed = $iforest->pack_data(\@data);
my $scores = $iforest->score_samples($packed);
my $flags = $iforest->predict($packed, 0.6);
# Get scores and labels as two flat arrayrefs in one call -- cheaper
# than score_predict_samples when you don't need the paired shape.
my ($s, $l) = $iforest->score_predict_split(\@data, 0.6);
DESCRIPTION
Isolation Forest (Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua, 2008) detects anomalies by random partitioning rather than by modelling normal points. Each tree repeatedly splits the data. Points that get isolated after only a few splits are likely anomalies. The score is the average isolation depth across many trees, normalised so values approach 1 for anomalies and stay below 0.5 for normal points.
In extended mode the module implements the Extended Isolation Forest variant. Each split is a random hyperplane instead of an axis-aligned cut, which removes the rectangular, axis-aligned bias in the score field and tends to help on elongated or multi-modal data.
psi refernced below is ψ or the pitchfork math symbol refrenced in paper, Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.
... or max samples.
https://www.researchgate.net/publication/224384174_Isolation_Forest
NATIVE ACCELERATION (Inline::C and OpenMP)
The scoring hot path (score_samples, predict, path_lengths, score_predict_samples, and score_predict_split) is automatically accelerated through Inline::C when it is installed and a working C compiler is reachable. If the toolchain also accepts -fopenmp and can link against libgomp, the per-point tree walk runs in parallel across all available CPU cores using OpenMP, and the extended-mode oblique dot product is vectorised via #pragma omp simd -- which on modern x86 compilers translates to an unrolled FMA / AVX gather chain that's substantially faster for high-feature-count extended models.
Detection happens once when the module is loaded; the compiled artefact is cached under _Inline/ and reused on subsequent runs. Three package variables report what the build picked up:
$Algorithm::Classifier::IsolationForest::HAS_C # 0/1
$Algorithm::Classifier::IsolationForest::HAS_OPENMP # 0/1
$Algorithm::Classifier::IsolationForest::HAS_SIMD # 0/1 (OpenMP 4.0+)
Neither dependency is required. Without Inline::C the module falls back to a pure-Perl implementation that produces identical results, just slower; without OpenMP the C backend runs single-threaded.
The bundled iforest accel subcommand performs a tiny fit + score and prints which backend is active, which is the recommended way to verify the build picked up the optional dependencies on a given machine.
The C backend is compiled with -O3 by default. Set the environment variable IF_NATIVE=1 before first load to add -march=native, which lets the compiler emit AVX2/AVX-512 gather and FMA instructions tuned to the build host. This typically gives a meaningful speedup on the extended-mode oblique dot product at high feature counts. The cached artefact under _Inline/ is pinned to the host CPU, so leave IF_NATIVE unset if the directory is shared across machines with different instruction-set support.
GENERAL METHODS
new(%args)
Inits the object.
- n_trees :: number of isolation trees in the ensemble
default :: 100
- sample_size :: sub-sample size used to build each tree... max samples
default :: 256
- max_depth :: per-tree height limit... if not defined is set to ceil(log2(psi))
default :: undef
- seed :: optional integer to seed srand with for reproducible trees...
see perldoc -f srand for more info. This number is processed via abs(int()).
default :: undef
- mode :: if it should be IF or EIF
axis :: classic axis-parallel splits (IF)
extended :: oblique hyperplane splits (EIF)
default :: axis
- extension_level :: extended mode only... how many features take partin each
split. 0 behaves like a single-feature (axis) cut; the
maximum (n_features - 1) uses every varying feature. undef
=> maximum. Clamped to [0, n_features - 1] at fit time.
- contamination :: expected fraction of anomalies, in (0, 0.5]. When given,
fit() learns a score threshold that flags this fraction of
the training set, and predict() uses it by default. undef
=> no learned threshold (predict() falls back to 0.5).
default :: undef
- parallel_fit :: positive integer N => build the trees across N forked
worker processes during fit(). Each worker gets a derived seed
(parent seed + worker_id * 1009) so the parallel fit is
reproducible across runs at fixed worker count -- but the trees
produced are NOT bit-identical to a serial fit with the same
seed, because the RNG draws happen in a different order.
Inference is unaffected. Falls back silently to serial on
platforms without a real fork() (e.g. Windows without Cygwin).
default :: undef (serial)
- use_c :: boolean, override whether the Inline::C scoring backend is
used. When false the instance falls back to pure Perl even if
the C backend compiled successfully. When true (or unset) the
C backend is used if available ($HAS_C).
default :: $HAS_C
- use_openmp :: boolean, override whether OpenMP parallel scoring is
used inside score_all_xs(). When false the C tree walk runs
single-threaded even if OpenMP was linked in. Ignored when
use_c is false (pure Perl has no OpenMP path).
default :: $HAS_OPENMP
Note: log2 under Perl is as below...
log($psi) / log(2)
decision_threshold
The score cutoff predict uses by default; undef unless contamination was set.
fit
Trains the model on the specified data.
The data taken is an array of arrays. Each sub-array is one sample and must contain one or more numeric features. All samples must have the same number of features. There is no upper limit on dimensionality.
@training_data = (
[ 3, 5 ],
[ 2.3, 1 ],
[ 5, 9 ],
...
);
# Three-feature example
@training_data = (
[ 1.0, 2.0, 3.0 ],
[ 1.1, 1.9, 3.1 ],
...
);
Below shows a example of building a gausing cluster and using that for training.
# so it is reproducible
srand(7);
# build a gaussian cluster and add a handful out outliers...
use constant PI => 3.14159265358979;
sub gaussian {
my ($mu, $sigma) = @_;
my $u1 = rand() || 1e-12;
my $u2 = rand();
my $z = sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
return $mu + $sigma * $z;
}
# add some normal items
for (1 .. 500) {
push @data, [ gaussian(0, 1), gaussian(0, 1) ];
push @truth, 0;
}
# add some outliers
for (1 .. 20) {
my $angle = rand() * 2 * PI;
my $radius = 5 + rand() * 3; # distance 5..8 from the origin
push @data, [ $radius * cos($angle), $radius * sin($angle) ];
push @truth, 1;
}
$iforest->fit(\@training_data);
pack_data(\@data)
Returns an opaque, blessed wrapper around the input dataset that the scoring methods can use directly, skipping the per-call work of walking the arrayref-of-arrayrefs and converting each cell into a double. At high feature counts this is a meaningful win when the same dataset is scored repeatedly (e.g. interactive threshold tuning, dashboards, plotting that updates as parameters change).
Requires the Inline::C backend; croaks if use_c is false.
my $packed = $forest->pack_data(\@data);
# Now any of these accept either an arrayref or the packed wrapper:
my $scores = $forest->score_samples($packed);
my $flags = $forest->predict($packed, 0.6);
my ($s, $l) = $forest->score_predict_split($packed);
The wrapper has n_pts and n_feats accessors for introspection. The feature count is matched against the model on every call; passing a packed dataset built for a different feature count is a fatal error.
path_lengths(\@data)
Returns the mean isolation depth per sample, for inspection.
my @lenghts = $forest->path_lengths(\@data);
print "x, y, length\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$lenghts[$int]."\n";
$int++;
}
predict(\@data, $threshold)
Returns an arrayref of 0/1 labels for the specified data.
If theshold is not specified it uses whatever the set default.
my $results = $forest->predict(\@data, $threshold);
print "x, y, result\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$results->[$int]."\n";
$int++;
}
score_samples(\@data)
Returns an arrayref of anomaly scores, between 0 and 1.
Scores near 1 are strong anomalies (isolated quickly).
Scores well below 0.5 are normal.
Scores ~0.5 means the points are hard to tell apart.
my $scores = $forest->path_lengths(\@data);
print "x, y, length\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$scores->[$int]."\n";
$int++;
}
score_predict_samples
Returns a array ref of arrays. First value of each sub array is the score with the second being 0/1 for if it is a anomaly or not.
my $results = $forest->predict(\@data, $threshold);
print "x, y, score, result\n";
my $int=0;
while (defined($data[$int])) {
print $data[$int][0].', '.$data[$int][1].', '.$results->[$int][0].', '.$results->[$int][1]."\n";
$int++;
}
score_predict_split(\@data, $threshold)
Same data as "score_predict_samples" but returned as two flat arrayrefs instead of an arrayref-of-pairs. Allocates roughly half as many Perl SVs per point (no inner AV, no RV per row), so it is meaningfully faster when both scores and labels are wanted but the paired shape is not.
In list context returns ($scores_aref, $labels_aref).
my ($scores, $labels) = $forest->score_predict_split(\@data);
for my $i (0 .. $#$scores) {
printf "%s -> score %.4f, label %d\n",
join(',', @{ $data[$i] }), $scores->[$i], $labels->[$i];
}
$threshold defaults to the contamination-learned cutoff (if fit was called with contamination) or 0.5.
MODEL SAVE/LOAD METHODS
to_json
Returns a JSON representation of the module.
Required being fit having to be called.
my $json = $iforest->to_json;
from_json($json)
Init the object from the model in the specified JSON string.
my $iforest = Algorithm::Classifier::IsolationForest->from_json($json);
save($path)
Saves the model to the specified path.
$iforest->save($path);
load($path);
Init the object from the model in the specified file.
my $iforest = Algorithm::Classifier::IsolationForest->load($path);
REFERENCES
Liu, Fei Tony & Ting, Kai & Zhou, Zhi-Hua. (2008). Isolation Forest. 413 - 422. 10.1109/ICDM.2008.17.
https://www.researchgate.net/publication/224384174_Isolation_Forest
https://ieeexplore.ieee.org/abstract/document/4781136
Sahand Hariri, Matias Carrasco Kind, Robert J. Brunner (2020). Extended Isolation Forest. 1479 - 1489. 10.1109/TKDE.2019.2947676