Synopsis
Get basic statistical functions working in Perl as if they were part of List::Util, like min, max, sum, etc. I've used Artificial Intelligence tools such as Claude, Gemini, and Grok to write this as well as using my own gray matter. There are other similar tools on CPAN, but I want speed and a form like List::Util, which I've gotten here with the help of AI, which often required many attempts to do correctly. This is meant to call subroutines directly through eXternal Subroutines (XS) for performance and portability.
There are other modules on CPAN that can do PARTS of this, but this works the way that I want it to.
Functions/Subroutines
========================================================================
add_data
Add data to an existing hash or array reference. This function acts as the equivalent of adding new rows, as well as an ljoin (described below). It dynamically infers your target data structure, handles deeply nested records, and seamlessly coerces mismatched data shapes to preserve the structural integrity of your primary reference.
Hash of Hashes (HoH)
When the target is a Hash of Hashes, incoming hash keys update existing rows, and new keys create new rows.
$data = { 'Jack Smith' => { age => 30 } };
$n = {
'Jack Smith' => { # Update existing (Hash)
dept => 'Engineering'
},
'Jane Doe' => { age => 25, dept => 'Sales' }, # Add new (Hash)
'Invalid' => 'Not a reference' # Edge case safety
};
add_data($data, $n);
Resulting Structure:
{
"Jack Smith": {
"age": 30,
"dept": "Engineering"
},
"Jane Doe": {
"age": 25,
"dept": "Sales"
}
}
Hash of Arrays (HoA)
When the target is a Hash of Arrays, incoming arrays are pushed onto the existing arrays, appending the new elements, similarly to R's rbind.
$data = { 'Project Alpha' => [ 'task1', 'task2' ] };
$n = {
'Project Alpha' => [ 'task3' ], # Appends to existing array
'Project Beta' => [ 'task1', 'task2' ] # Creates new array row
};
add_data($data, $n);
Resulting Structure:
{
"Project Alpha": [ "task1", "task2", "task3" ],
"Project Beta": [ "task1", "task2" ]
}
Array of Hashes / Arrays (AoH / AoA)
add_data now natively supports Array references at the root level. When targeting an Array, it iterates through the source array and merges data at the corresponding indices.
$data = [
{ id => 1, name => 'Alice' }
];
$n = [
{ role => 'Admin' }, # Updates index 0
{ id => 2, name => 'Bob' } # Creates index 1
];
add_data($data, $n);
Resulting Structure:
[
{ "id": 1, "name": "Alice", "role": "Admin" },
{ "id": 2, "name": "Bob" }
]
Advanced Structural Coercion & Cross-Merging
add_data strictly enforces the primary structure of your target reference (determined by inspecting its outer and inner bounds). If you mix Array and Hash types, the function automatically coerces the incoming data to match the target.
1. Inner Coercion (Mixing Rows):
Target is HoH: Source Array rows are read in pairs and converted to key-value pairs.
Target is HoA: Source Hash rows are flattened into key-value pairs and pushed onto the array.
2. Root-Level Coercion (Mixing Outer Containers):
Target is Array, Source is Hash: The function evaluates the Hash keys as numeric indices. (e.g., source key
"0"merges into target array index[0]). Non-numeric keys are safely ignored.Target is Hash, Source is Array: The function converts the Array indices into stringified Hash keys. (e.g., source array index
[1]merges into target hash key"1").
Source is a mixed Hash. Keys dictate the target array index!
$n = {
'0' => { y => 20 }, # Merges into $data->[0]
'1' => [ 'z', 30 ], # Array pair coerced to Hash, creates $data->[1]
'ignored' => { k => 'v' } # Ignored: cannot map to an array index
};
add_data($data, $n);
Resulting Structure strictly remains an Array of Hashes:
[
{ "x": 10, "y": 20 },
{ "z": 30 }
]
NB: If add_data is called on a completely empty target reference (e.g., $data = {} or $data = []), it will intelligently infer the required inner structure (Hashes vs Arrays) by inspecting the first valid row of the source data.
aoh2hoa
aoh2hoa($aoh) — transpose an array-of-hashes (row-major) into a hash-of-arrays (column-major).
my $hoa = aoh2hoa([ { a => 1, b => 2 }, { a => 3 } ]);
# $hoa = { a => [1, 3], b => [2, undef] }
Rows go in, columns come out: each distinct key across the input rows becomes one output column, and the values are gathered down that column in row order.
Arguments
$aoh — an array ref of hash refs, one hash per row. This is the only argument, and it is required. Passing anything that is not an array ref is fatal:
aoh2hoa({ a => 1 }); # dies: argument must be an arrayref of hashrefs
Returns
A hash ref of array refs. Each key is a column name (the union of all keys seen across the rows); each value is an array ref holding that column's cells. Every column has exactly scalar @$aoh elements, so the result is rectangular even when the input is ragged.
Behavior
The column set is the union of every row's keys — a key that appears in only some rows still produces a full-length column, with undef in the rows that lacked it.
Each column is padded to exactly the row count. Cells missing from a given row come through as undef, including trailing gaps (a column whose last contributing row is early still runs the full length). These absent cells are cheap holes in the array, not stored SVs.
Values are copied (newSVsv), so the returned structure is independent of the input — mutating $aoh afterward won't disturb the result. The copy is shallow: a value that is itself a reference is copied the same way $col->[$i] = $row->{$k} would, i.e. the ref is duplicated but its referent is shared.
Keys are handled SV-first (hv_iterkeysv / hv_fetch_ent), so UTF-8 and otherwise non-trivial hash keys round-trip correctly.
A row that is not a hash ref is skipped rather than fatal: it contributes undef to every column at its index. So a stray undef or scalar in the input thins the columns at that position instead of dying.
Notes
The output column order follows hash iteration order and is therefore not guaranteed — sort the keys if you need a stable layout. Round-tripping through hoa2aoh (or the reverse) reconstructs the data but not necessarily the original key/row ordering, and rows originally absent a key will gain it as an explicit undef.
aoh2hoh
Index an Array-of-Hashes into a Hash-of-Hashes, keyed by the value of one column.
my $hoh = aoh2hoh($aoh, $key);
Where aoh2hoa transposes rows into columns, aoh2hoh indexes rows by a chosen field, turning a sequential list into a lookup table. The chosen field is treated as a primary key: it must be unique across the rows, and a repeat is fatal.
Signature
| Argument | Type | Meaning |
|---|---|---|
$aoh |
arrayref | The rows: an arrayref of hashrefs. |
$key |
scalar | The column name whose value indexes each row. |
Returns a hashref. Each top-level key is a row's $row->{$key} value; each value is a shallow copy of that row.
my $rows = [
{ id => 'p1', kd => 12.4, chain => 'A' },
{ id => 'p2', kd => 3.1, chain => 'B' },
];
my $by_id = aoh2hoh($rows, 'id');
# {
# p1 => { id => 'p1', kd => 12.4, chain => 'A' },
# p2 => { id => 'p2', kd => 3.1, chain => 'B' },
# }
$by_id->{p2}{kd}; # 3.1 -- O(1) lookup instead of a linear scan
Semantics
These choices are the parts most worth keeping in mind, because the AoH->HoH mapping is ambiguous where a transpose is not.
Duplicate keys are fatal. If two rows share the same key value, the call dies rather than silently dropping a row:
aoh2hoh([ { id => 'a', x => 1 }, { id => 'a', x => 9 } ], 'id');
# dies: aoh2hoh: duplicate key 'a' has >= 2 occurrences
This makes the chosen column an enforced primary key: the result is only returned if every row maps to a distinct bucket. If your data legitimately has repeats and you want to keep them, you want a hash-of-arrays-of-rows instead -- a different return shape. If you want last-wins or first-wins collapse, dedup the input before calling.
The key column is retained inside each inner hash (the copy is of the whole row). Drop it deliberately if you don't want the redundancy.
Shallow copy. Inner hashes are fresh, so adding or removing keys on the output never touches the input. But a value that is itself a reference is shared, exactly like $out{$rk}{$_} = $row->{$_}:
my $shared = [ 1, 2, 3 ];
my $out = aoh2hoh([ { id => 'a', data => $shared } ], 'id');
push @{ $out->{a}{data} }, 4; # $shared now has 4 elements too
A row that is not a hashref, or that lacks a defined value at $key, is fatal.
Numeric vs string keys collide. Hash keys are strings, so 1 and "1" map to the same bucket and therefore trip the duplicate-key die. Normalize the key column first if a row could carry both forms.
Use cases
Join / enrichment lookups. Build an index once, then attach fields from one dataset onto another by shared id without an O(n*m) nested loop -- and the duplicate-key die guarantees the join side really is keyed uniquely:
my $meta = aoh2hoh($pdb_metadata, 'pdb_id');
for my $hit (@$results) {
$hit->{resolution} = $meta->{ $hit->{pdb_id} }{resolution};
}
Primary-key validation. Because a repeat is fatal, the call doubles as an assertion that a column is unique -- a cheap way to catch a malformed table (duplicate accession, duplicate peptide id) at load time rather than downstream.
Random-access reshaping of tabular data. After parsing a CSV/TSV into an array of row-hashes, re-index by a primary key so downstream code can fetch a row by name rather than scanning. Pairs naturally with the CSV-parsing side of the toolkit.
Set membership and difference. exists $hoh->{$k} gives a cheap presence test, useful for asking which ids in one table are missing from another.
Relationship to aoh2hoa
| Function | Output shape | Indexed by | Typical question it answers |
|---|---|---|---|
aoh2hoa |
hash of arrayrefs | column name | "give me every value in column X" |
aoh2hoh |
hash of hashrefs | a row's key val | "give me the whole row whose id is Y" |
Reach for aoh2hoa when you want columns (vectors to feed a statistic or a plot); reach for aoh2hoh when you want addressable rows keyed by a unique field.
Implementation note
The operation is a single pass over the rows with one hash insert per row -- the same asymptotics in pure Perl as in XS, and Perl's hash operations are already C underneath. There is no meaningful speed or memory advantage to an XS implementation here, so pure Perl is preferred unless it must live in the same .xs for packaging parity. The duplicate check is a single exists per row and does not change that. (An XS version would croak on the duplicate before allocating the second copy, so there is no extra cleanup to manage.)
aov
Warning: assumes normal distribution
aov(
{
yield => [5.5, 5.4, 5.8, 4.5, 4.8, 4.2],
ctrl => [1, 1, 1, 0, 0, 0]
},
'yield ~ ctrl');
which returns
{
ctrl {
Df 1,
"F value" 25.6000000000001,
"Mean Sq" 1.70666666666667,
Pr(>F) 0.00718232855871859,
"Sum Sq" 1.70666666666667
},
Residuals {
Df 4,
"Mean Sq" 0.0666666666666665,
"Sum Sq" 0.266666666666666
}
}
You can also perform Two-Way ANOVA with categorical interactions using the * operator. The parser will implicitly evaluate the main effects alongside the interaction:
my $res_2way = aov($data_2way, 'len ~ supp * dose');
It is robust against rank deficiency; collinear terms will gracefully receive 0 degrees of freedom and 0 sum of squares, matching R's behavior.
Input Parameters
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
data_sv |
HashRef or ArrayRef |
(Required) | The dataset to analyze. Accepts a Hash of Arrays (HoA) or Array of Hashes (AoH). If no formula is provided, it must be an HoA to allow automatic stacking (mimicking R's stack() on a named list). |
|
formula_sv |
String |
undef |
A symbolic description of the model to be fitted. If omitted, the formula automatically defaults to 'Value ~ Group' and the input data is stacked. |
'yield ~ N * P' |
Output Variables
The function returns a single HashRef containing the evaluated statistical results. Because the keys map dynamically to the terms parsed from your formula, the structure will vary based on your inputs.
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
| (Term Name) | HashRef |
undef |
A nested hash for each independent term in the formula (e.g., 'Group', 'N:P'), containing its ANOVA table statistics. |
{'Df' => 1, 'Sum Sq' => 14.2, 'Mean Sq' => 14.2, 'F value' => 25.81, 'Pr(>F)' => 0.0004} |
Residuals |
HashRef |
undef |
A nested hash containing the residual (error) statistics for the fitted model. | {'Df' => 10, 'Sum Sq' => 5.5, 'Mean Sq' => 0.55} |
group_stats |
HashRef |
undef |
A nested hash containing descriptive statistics (mean and size / count) for every column evaluated in the original unstacked data structure. |
{'mean' => {'A' => 2.1, 'B' => 5.4}, 'size' => {'A' => 10, 'B' => 10}} |
omitting formula
As of version 0.07, in the case of an omitted formula, stacking is done:
aov(
{
yield => [5.5, 5.4, 5.8, 4.5, 4.8, 4.2],
ctrl => [1, 1, 1, 0, 0, 0]
},
);
is the equivalent of:
yield <- c(5.5, 5.4, 5.8, 4.5, 4.8, 4.2)
ctrl <- c(1, 1, 1, 0, 0, 0)
# Combine them into a named list (the R equivalent of your hash)
my_list <- list(yield = yield, ctrl = ctrl)
# Convert the list into a "long" dataframe
# This creates two columns: "values" and "ind" (the group name)
my_data <- stack(my_list)
# Rename columns for clarity (optional but good practice)
colnames(my_data) <- c("Value", "Group")
anova_model <- aov(Value ~ Group, data = my_data)
summary(anova_model)
in R
assign
Add new columns to a data frame, computed from the columns already there.
Usage
assign($df, new_name => sub { ... }, another => sub { ... }, ...);
$df— your data frame, in either shape:AoH — arrayref of row hashrefs:
[ {weight=>70, height=>1.75}, ... ]HoA — hashref of column arrayrefs:
{ weight=>[70,...], height=>[1.75,...] }
new_name => sub { ... }— one or more pairs. Each sub computes one cell of the new column and is called once per row:$_is the current row as a hashref, so you read other columns with$_->{colname}.$_[1]is the row's index (0-based), if you need it.whatever the sub returns becomes the cell value.
It changes $df in place and also returns it (handy for chaining).
Example
my $df = [
{ weight => 70, height => 1.75 },
{ weight => 90, height => 1.80 },
];
assign($df, bmi => sub { $_->{weight} / $_->{height} ** 2 });
# $df is now:
# [ { weight=>70, height=>1.75, bmi=>22.86 },
# { weight=>90, height=>1.80, bmi=>27.78 } ]
Good to know
Pairs run in order, so a later column can use one you just made:
assign($df, bmi => sub { $->{weight} / $->{height} ** 2 }, class => sub { $_->{bmi} > 25 ? 'high' : 'ok' }, # uses bmi );
Same recipe, both shapes. The exact same
sub { $_->{weight} / ... }works whether$dfis AoH or HoA; you always read the row through$_.It modifies your data frame. If you need to keep the original, pass a copy:
assign(clone($df), ...).Reusing a column name overwrites that column.
binom_test
binom_test answers one question: you ran a yes/no experiment n times and got x successes — is that consistent with some assumed success rate, or is it too far off to be chance? It is the exact binomial test, the same as R's binom.test.
A toddler and two cards
Show a toddler two cards each round and ask them to point at the one with the star. If he/she is only guessing, he/she will be right half the time, so the "pure guessing" success rate is p = 0.5.
You play 10 rounds and the toddler gets 6 right. Real skill, or just luck?
use Stats::LikeR 'binom_test';
my $r = binom_test(6, 10, p => 0.5); # 6 wins, 10 rounds, guessing rate 0.5
print $r->{p_value}; # 0.7539
The full result is a hashref:
{
statistic => 6, # times the toddler was right
parameter => 10, # rounds played
estimate => 0.6, # observed rate, 6/10
null_value => 0.5, # the "pure guessing" rate we test against
p_value => 0.7539,
conf_int => [0.262, 0.878],
conf_level => 0.95,
alternative => 'two.sided',
method => 'Exact binomial test',
}
Reading the p-value
The p-value is the chance of seeing a result at least this surprising if the toddler were really just guessing.
Here p = 0.75. That is enormous: a pure guesser scores 6/10 (or something even further from 5) about three times out of four. So 6/10 is completely ordinary luck — no evidence of skill.
The common cutoff is 0.05. Below it, you start to believe something real is going on. Above it, chance explains the result fine. 0.75 is nowhere close, so we call this just chance.
What "legit" would look like
Suppose the toddler had gone 9 for 10 instead:
my $r = binom_test(9, 10, p => 0.5);
print $r->{p_value}; # 0.0215
Now p = 0.02, under 0.05. A pure guesser almost never does that well, so this is good evidence the toddler can actually tell the cards apart.
The confidence interval
conf_int is the plausible range for the toddler's true success rate. For 6/10 it runs from about 0.26 to 0.88 — wide, and it comfortably includes 0.5. That overlap with the guessing rate is another way of seeing that luck cannot be ruled out. For 9/10 the interval would sit well above 0.5.
Options
pis the assumed success rate (default0.5).alternativeis'two.sided'(default),'less', or'greater'. Use'greater'when you only care whether the toddler beats guessing, not whether they do worse.conf_levelsets the interval width (default0.95).
You can also pass the counts as binom_test([6, 4]) — 6 right, 4 wrong — when you have wins and losses instead of wins and a total.
cfilter
Select columns out of a table and return it in the same shape. A column is the inner (second-level) key of a hash of hashes or an array of hashes, or the outer key of a hash of arrays:
use Stats::LikeR;
my %hoa = ( x => [1,2,3], y => [4,5,6], z => [0,0,0] );
cfilter(\%hoa, keep => ['x','y']); # { x => [1,2,3], y => [4,5,6] }
cfilter(\%hoa, remove => ['z']); # { x => [1,2,3], y => [4,5,6] }
cfilter takes exactly one of keep or remove. keep returns only the matching columns; remove returns everything except them. The result is the same shape as the input (HoH → HoH, HoA → HoA, AoH → AoH), with cell values copied and the original structure left untouched.
Selecting by name
Pass an array ref of column names. Naming a column that is not present in the data is an error (it catches typos), and a row that happens not to contain a kept column simply comes back without it:
my @aoh = ( { a => 1, b => 2 }, { a => 3 } );
cfilter(\@aoh, keep => ['b']); # [ { b => 2 }, {} ]
Selecting by a predicate
Instead of names, keep/remove accept a predicate — a CODE ref or a function name — evaluated once per column. It is called as
$predicate->($column_values, $column_name)
where $column_values is an array ref of the column's defined cells (undef and missing cells are dropped, so functions like sd get clean input). With keep, columns for which the predicate is true are kept; with remove, those columns are dropped.
# Keep only the constant columns (standard deviation zero):
my $const = cfilter(\%hoa, keep => sub { sd($_[0]) == 0 }); # { z => [0,0,0] }
# Drop the constant columns instead:
my $varying = cfilter(\%hoa, remove => sub { sd($_[0]) == 0 }); # { x=>..., y=>... }
# A bare function name resolves in Stats::LikeR:: (use a package for your own):
cfilter(\%hoa, keep => 'some_predicate');
A bare string is always treated as a function name, not a single column name, so to keep one column by name use an array ref: keep => ['x'].
Errors
cfilter dies (via croak) when:
neither
keepnorremoveis given, or both are,a named column is not present in the data,
the selector is neither an array ref nor a code ref / function name, or the function name cannot be resolved,
an unknown option is given, or the options are not
name => valuepairs,the data is not a hash/array reference of the expected shape (a hash of hash refs or array refs, or an array of hash refs).
chisq_test
The chisq_test function performs chi-squared contingency table tests and goodness-of-fit tests. It natively accepts both arrays and hashes (1D and 2D) and mathematically mirrors R's chisq.test(), returning a structured hash reference of the results.
For 2x2 matrices, Yates' Continuity Correction is applied automatically.
Accepted Inputs
| Input Type | Data Structure | Applied Test |
|---|---|---|
| 1D Array | [ $v1, $v2, ... ] |
Chi-squared test for given probabilities |
| 2D Array | [ [ $v1, $v2 ], [ $v3, $v4 ] ] |
Pearson's Chi-squared test (Yates' correction if 2x2) |
| 1D Hash | { key1 => $v1, key2 => $v2 } |
Chi-squared test for given probabilities |
| 2D Hash | { row1 => { c1 => $v1, c2 => $v2 } } |
Pearson's Chi-squared test (Yates' correction if 2x2) |
Output Object Structure
The function returns a single Hash Reference containing the following key-value pairs. The internal structure of expected and observed will always identically match the structure of your input.
| Key | Data Type | Description |
|---|---|---|
| data.name | String | Identifies the input type (e.g., "Perl ArrayRef" or "Perl HashRef"). |
| expected | Array/Hash Ref | The expected frequencies, matching the geometry of the input. |
| method | String | The specific statistical test applied. |
| observed | Array/Hash Ref | The original data passed to the function. |
| p.value | Float | The calculated p-value of the test. |
| parameter | Hash Ref | Contains the degrees of freedom (df). |
| statistic | Hash Ref | Contains the test statistic (X-squared). |
Two-Dimensional Array
Passing an Array of Arrays (AoA) triggers a standard Pearson's Chi-squared test. If the input is exactly a 2x2 matrix, Yates' continuity correction is applied automatically.
my $test_data = [
[762, 327, 468],
[484, 239, 477]
];
my $res = chisq_test($test_data);
Output:
{
'data.name' => 'Perl ArrayRef',
'expected' => [
[ 703.671381936888, 319.645266594124, 533.683351468988 ],
[ 542.328618063112, 246.354733405876, 411.316648531012 ]
],
'method' => "Pearson's Chi-squared test",
'observed' => [
[ 762, 327, 468 ],
[ 484, 239, 477 ]
],
'p.value' => 2.95358918321176e-07,
'parameter' => { 'df' => 2 },
'statistic' => { 'X-squared' => 30.0701490957547 }
}
1-Dimensional Array (Goodness of Fit)
Passing a flat Array Reference triggers a Goodness of Fit test, assuming equal expected probabilities across all items.
my $data = [10, 20, 30];
my $res = chisq_test($data);
Output:
{
'data.name' => 'Perl ArrayRef',
'expected' => [ 20, 20, 20 ],
'method' => 'Chi-squared test for given probabilities',
'observed' => [ 10, 20, 30 ],
'p.value' => 0.00673794699908547,
'parameter' => { 'df' => 2 },
'statistic' => { 'X-squared' => 10 }
}
2-Dimensional Hash (Pearson's Chi-squared)
Passing a Hash of Hashes (HoH) applies the exact same logic as a 2D Array, but preserves your nested string keys in the output. This is particularly useful when mapping data extracted directly from JSON, databases, or categorical mappings.
my $data = {
GroupA => { Success => 10, Failure => 15 },
GroupB => { Success => 20, Failure => 5 }
};
my $res = chisq_test($data);
Output:
{
'data.name' => 'Perl HashRef',
'expected' => {
'GroupA' => { 'Failure' => 10, 'Success' => 15 },
'GroupB' => { 'Failure' => 10, 'Success' => 15 }
},
'method' => "Pearson's Chi-squared test with Yates' continuity correction",
'observed' => {
'GroupA' => { 'Failure' => 15, 'Success' => 10 },
'GroupB' => { 'Failure' => 5, 'Success' => 20 }
},
'p.value' => 0.00937475878430379,
'parameter' => { 'df' => 1 },
'statistic' => { 'X-squared' => 6.75 }
}
One-Dimensional Hash (Goodness of Fit)
Flat Hash References evaluate Goodness of Fit while preserving your categorical keys in the expected and observed output blocks.
my $data = {
Apples => 10,
Oranges => 20,
Bananas => 30
};
my $res = chisq_test($data);
chunk
Split an array into contiguous, roughly equal groups by position. Unlike #qcut, chunk does not inspect values, sort, or compute cutpoints; it slices the array in the order given. Use it for batching work, paginating, or grouping non-numeric data such as strings.
Signature
my @groups = chunk($data, size => $n); # fixed elements per group
my @groups = chunk($data, parts => $k); # fixed number of groups
$data— an array reference. Its contents are never examined or sorted; elements are grouped in input order.
Pass exactly one of size or parts. Passing both, or neither, is a fatal error — the two readings of "equal groups" differ (see below), so the caller chooses which one is meant rather than relying on a default.
size => $n— each group holds$nelements; the final group holds whatever remains.parts => $k— the array is divided into$kgroups as equal as possible, with any remainder spread across the leading groups.
Return value
A list of array references, in input order — call it in list context:
my @groups = chunk($data, parts => 4);
Passing more parts than there are elements yields trailing empty groups (matching numpy.array_split), so no elements are ever dropped. An empty input array returns an empty list.
Examples
size fixes the elements per group; the last group is the remainder. Splitting the 26 letters into groups of five leaves one over:
my @groups = chunk(['a' .. 'z'], size => 5);
# 6 groups, sizes 5,5,5,5,5,1
# [a b c d e] [f g h i j] [k l m n o] [p q r s t] [u v w x y] [z]
parts fixes the number of groups; the remainder is absorbed by the leading groups instead:
my @groups = chunk(['a' .. 'z'], parts => 5);
# 5 groups, sizes 5,5,5,5,6
# [a b c d e] [f g h i j] [k l m n o] [p q r s t] [u v w x y z]
When the split is even the two forms agree:
my @a = chunk([1 .. 10], size => 2);
my @b = chunk([1 .. 10], parts => 5);
# identical: 5 groups of 2
Order is preserved — chunk never sorts. Sort the array yourself first if you want ordered groups:
my @groups = chunk([3, 1, 2], size => 2);
# ([3, 1], [2])
More parts than elements gives empty trailing groups, losing nothing:
my @groups = chunk([1, 2, 3], parts => 5);
# 5 groups; flattening them back gives (1, 2, 3)
col2col
Apply a two-column function to every pair of columns in a table and collect the answers in a hash of hashes.
It's the workhorse behind things like correlation matrices: give it your data and the name of a function that takes two columns (cor, t_test, …) and you get back every column compared against every other column.
use Stats::LikeR;
my %data = (
height => [ 170, 165, 180, 175 ],
weight => [ 70, 60, 85, 77 ],
age => [ 30, 41, 25, 38 ],
);
my $result = col2col(\%data, 'cor');
# $result->{height}{weight} == correlation of height vs weight
# $result->{height}{age} == correlation of height vs age
# ...and so on for every pair
========================================================================
Arguments
col2col( $data, $command, $cols, %options )
col2col( $data, $command, \%options ) # options in place of $cols
| Position | Argument | What it is |
|---|---|---|
| 1 | $data |
Your table, as a reference (see Data shapes below). |
| 2 | $command |
A code block or the name of a two-column function. |
| 3 | $cols |
(optional) Which columns to use as the "from" side. Omit for all. |
| 4+ | %options |
(optional) na, skip.errors, ⦠(see Options). |
========================================================================
Data shapes
col2col understands three layouts. In every case a column is the thing that gets compared, and the result is keyed by column name.
Hash of arrays (HoA) — keys are column names:
my %hoa = ( a => [1, 2, 3], b => [4, 5, 6] );
Hash of hashes (HoH) — First keys are row names, second keys are columns:
my %hoh = (
row1 => { a => 1, b => 4 },
row2 => { a => 2, b => 5 },
);
Array of hashes (AoH) — each element is a row, inner keys are columns:
my @aoh = ( { a => 1, b => 4 }, { a => 2, b => 5 } );
All three produce the same result for the same underlying numbers. Missing or undef cells are handled by the na option (below).
========================================================================
The command
The second argument is the function applied to each pair of columns. It is called as:
$command->( $column_a, $column_b ) # two ARRAY refs
so inside a block the two columns arrive in @_:
my $result = col2col(\%data, sub {
my ($x, $y) = @_; # $x and $y are array refs
cor($x, $y);
});
You can also pass a function name as a string. A bare name is looked up in Stats::LikeR::, so these two are equivalent:
col2col(\%data, 'cor');
col2col(\%data, sub { cor($_[0], $_[1]) });
========================================================================
The result
Always a hash of hashes: $result->{from}{to}.
for my $from (sort keys %$result) {
for my $to (sort keys %{ $result->{$from} }) {
printf "%s vs %s = %s\n", $from, $to, $result->{$from}{$to};
}
}
A column is never compared with itself, so $result->{a}{a} does not exist.
========================================================================
Restricting columns ($cols)
By default every column is used as the "from" side. The third argument narrows that down — handy when you only care about one variable.
# all columns vs all columns
my $all = col2col(\%data, 'cor');
# just ONE column vs every other column
my $one = col2col(\%data, 'cor', 'height');
my $cors = $one->{height}; # { weight => ..., age => ... }
# a FEW specific columns vs every other column
my $few = col2col(\%data, 'cor', ['height', 'weight']);
The "to" side is always every other column; $cols only limits the outer keys.
========================================================================
Options
Options can be given two ways:
col2col(\%data, 'cor', $cols, 'skip.errors' => 0); # after $cols
col2col(\%data, 'cor', { 'skip.errors' => 0 }); # hash ref, no $cols needed
The hash-ref form is convenient when you have no column restriction — it saves you from passing a placeholder. (A hash ref replaces $cols, so you can't use it to restrict columns at the same time; use the trailing form for that.)
na — how undefined values are handled
Real data has gaps. na decides what the function sees.
| Value | Behaviour | Use for |
|---|---|---|
'pairwise' (default) |
A row is used for a pair only if both columns are defined there. The two columns arrive aligned and equal-length. | Paired stats like cor. |
'omit' |
Each column drops its own undefined values independently. The two columns may end up different lengths. | Unpaired tests like t_test, kruskal_test, where a gap in one sample shouldn't discard a value in the other. |
'keep' |
Every row is passed through, undef and all. |
When your function does its own missing-data handling. |
# correlation: keep only complete pairs (the default)
col2col(\%data, 'cor');
# two-sample test: each column keeps its own values
col2col(\%data, 't_test', undef, na => 'omit');
col2col(\%data, 't_test', { na => 'omit' }); # same, no placeholder
rm.undef / rm.na remain as boolean aliases for backward compatibility: true means 'pairwise', false means 'keep'. Don't combine them with na.
skip.errors — keep going when a pair fails (default: true)
Some functions croak on degenerate input — for example cor dies if a column has zero variance. By default col2col traps that croak per pair: instead of aborting the whole run, it stores the first line of the error message in that cell, so the result tells you which pair failed and why. Every other cell is computed normally.
my $r = col2col(\%data, 'cor');
# a good pair: $r->{a}{b} == 0.83
# a bad pair: $r->{a}{const} eq 'cor: standard deviation of y is 0'
To restore the old "die on the first error" behaviour, turn it off:
col2col(\%data, 'cor', undef, 'skip.errors' => 0);
col2col(\%data, 'cor', { 'skip.errors' => 0 });
Only errors from your function are trapped. Mistakes in the call itself (unknown column, bad data, unknown function name, unknown option) always die.
========================================================================
Worked examples
Full correlation matrix:
my $m = col2col(\%data, 'cor');
One variable against all others, sorted strongest first, skipping failures:
my $col = 'Testosterone, total (nmol/L)';
my $cors = col2col($hoa, 'cor', $col)->{$col};
for my $other (sort { ($cors->{$b} // -2) <=> ($cors->{$a} // -2) } keys %$cors) {
next unless $cors->{$other} =~ /^-?\d/; # skip cells holding an error message
printf "%-30s % .3f\n", $other, $cors->{$other};
}
Two-sample test across columns of unequal completeness:
my $t = col2col($hoa, 't_test', undef, na => 'omit');
Find which pairs could not be computed:
my $m = col2col($hoa, 'cor');
for my $from (sort keys %$m) {
for my $to (sort keys %{ $m->{$from} }) {
my $v = $m->{$from}{$to};
warn "$from vs $to: $v\n" if defined $v && $v !~ /^-?\d/; # non-numeric = error
}
}
========================================================================
Gotchas
Your function receives two array refs,
($col_a, $col_b)— not a column and a name. Unpack withmy ($x, $y) = @_;.'pairwise'can still hit a constant subset. A column with overall variance can be flat on just the rows it shares with one partner, socormay still croak for that pair. With the defaultskip.errors, that shows up as a message in the single offending cell rather than killing the run.col2coldoes not modify your data. It reads the table and returns a new hash of hashes.In the error message, "x" is the first column and "y" is the second — i.e.
yis the inner ("to") key. So$result->{A}{B}reading…deviation of y is 0means columnBis the degenerate one for that pair.
cor
cor($array1, $array2, $method = 'pearson'),
that is, pearson is the default and will be used if $method is not specified.
Just like R, pearson, spearman, and kendall are available
If you provide an array of arrays (a matrix), cor will compute the correlation matrix automatically.
cor_test
my $result = cor_test(
'x' => $x,
'y' => $y,
alternative => 'two.sided',
method => 'pearson',
continuity => 1
);
cor_test safely handles undef (or NA) values seamlessly by computing over pairwise complete observations.
cov
cov($array1, $array2, 'pearson')
or
cov($array1, $array2, 'spearman')
or
cov($array1, $array2, 'kendall')
csort
Sort a table by a column or by a custom comparator. Works on both common Perl table shapes and can transpose between them on the way out. Stable, non-destructive.
Signature
my $sorted = csort($data, $by);
my $sorted = csort($data, $by, $output);
$data— your table, in either shape:AoH — arrayref of hashrefs (a list of rows):
[ {id=>1, v=>10}, {id=>2, v=>20} ]HoA — hashref of arrayrefs (parallel columns):
{ id=>[1,2], v=>[10,20] }
$by— how to sort:a column name (string), or
a comparator coderef using
$a/$b, just like Perl'ssort
$output(optional) —'aoh'or'hoa'(case-insensitive). Defaults to the input shape;undefalso means "same as input".
Returns a new structure. The input is never modified.
What it does
Column-name sort — numeric if every defined value in that column looks like a number, otherwise string comparison. Missing /
undefvalues sort last (matching R'sna.last).Comparator sort —
$aand$bare set in the comparator's own package, so a named sub from another package still sees its own$a/$b. For AoH they're the row hashrefs; for HoA they're per-row hashref views synthesized from the columns.Stable — equal keys keep their original order (merge sort, same as Perl
sortand Rorder()).Shape control — keep the input shape, or transpose: AoH→HoA builds the union of all row keys (ordered by first appearance, gaps filled with
undef); HoA→AoH emits one hashref per row.
Examples
# by column, ascending numeric, AoH in / AoH out
my $rows = csort($aoh, 'score');
# custom comparator (descending), HoA in / HoA out
my $cols = csort($hoa, sub { $b->{score} <=> $a->{score} });
# sort an AoH but hand it back as columns (HoA)
my $cols = csort($aoh, 'name', 'hoa');
Notes
Non-destructive: AoH output reuses the original row hashrefs (re-ordered); HoA output permutes every column in lockstep.
Empty and single-row tables are handled for all four in/out combinations.
An invalid
$outputvalue croaks.
dnorm
gives the density of the normal distribution, with the specified mean and standard deviation.
In other words, the predicted height of the value x, given a mean, standard deviation, and whether or not to use a log value.
returns a single scalar/number if a single value is given, otherwise returns an array reference.
Usage:
dnorm(4) # assumes a mean of 0 and standard deviation of 1
but default mean, standard deviation, and log can be passed as parameters:
$x = dnorm(0, mean => 0, sd => 2, 'log' => 0);
dropna
Drop missing data from a data frame, loosely modeled on pandas' dropna. Works on all three shapes: AoH [ {A=>..}, .. ], HoA { A=>[..], .. }, and HoH { r1=>{A=>..}, .. }.
Usage
# NA mode: drop rows that are undef in the named columns
dropna($df, cols => ['A', 'B']);
dropna($df, cols => ['A', 'B'], how => 'all');
# deletion mode: remove specific rows outright
dropna($df, rows => [2, 5]); # indices for AoH/HoA, keys for HoH
You pass exactly one of cols or rows.
cols — drop rows with missing values
Inspect only the named columns and drop the rows where they're undef. Columns you don't name are never inspected, but they stay aligned (their cell at a dropped row goes too). A missing key counts as undef.
how controls the threshold:
'any'(default) — drop a row if any named column is undef there.'all'— drop a row only if every named column is undef there.my $df = { A => [1, 2, undef], B => [1, 2, 3], C => [undef, 2, 4] }; dropna($df, cols => ['A', 'B']);
{ A => [1, 2], B => [1, 2], C => [undef, 2] }
Index 2 is dropped because A is undef there. C is not consulted, so its own undef at index 0 doesn't trigger a drop — but index 2 is still removed from C so every column stays the same length.
rows — delete specific rows
Remove exactly the rows you list — no missing-value logic. Rows are 0-based indices for AoH and HoA, or the outer keys for HoH. Anything not present is ignored.
dropna({ A => [10, 20, 30] }, rows => [1]); # { A => [10, 30] }
Good to know
Returns a new data frame; the original is never modified. For HoA the column arrays are rebuilt (cell values copied); for AoH and HoH the surviving row references are reused, not deep-copied (dropna never mutates a row). Clone the result if you need full independence.
It dies on: a non-ref data frame; passing both or neither of
cols/rows; a non-arrayref selector; acolsname absent from a non-empty HoA or AoH; an invalidhow; an unknown argument; or a hashref that mixes array and hash values (ambiguous HoA vs HoH).An empty AoH or HoA returns empty rather than erroring.
HoH results come back in hash order, since HoH rows are unordered.
filter
Return a new data frame containing only the rows of $df that match a predicate. The original $df is never modified.
my $adults = filter($df, col('age') >= 18);
filter accepts a predicate in one of two forms:
- 1. a
col()expression — a small, composable comparison built with overloaded operators, and - 2. a code reference — for anything the operators can't express (multiple columns, regexes, arbitrary logic), in the same spirit as the
filteroption of #.
Both filter and col are exported by default.
Arguments
| Position | Name | Description | |
|---|---|---|---|
| 1 | $df |
The data frame: an array of hashes (AoH, the default read_table output), a hash of arrays (HoA), or a hash of hashes (HoH, e.g. read_table with 'output.type' => 'hoh'). |
|
| 2 | predicate | A col() comparison object or a CODE reference. |
|
| 3 + | `'output.type' => 'aoh'\ | 'hoa'` | Optional. The shape of the returned frame. Omit it to keep the input's own shape. 'out' and 'output_type' are accepted aliases, and a bare filter($df, $pred, 'aoh') also works. |
The col() form
col('name') is a deferred reference to a column. It carries no data — only the column name — so it can be compared with a literal to build a predicate that filter evaluates once per row.
filter($df, col('age') >= 18); # keep rows where age >= 18
filter($df, col('sex') eq 'f'); # keep rows where sex is 'f'
filter($df, 18 <= col('age')); # operands may be in either order
| Kind | Operators | Comparison |
|---|---|---|
| Numeric | > < >= <= == != |
numeric (cell and value compared as numbers) |
| String | gt lt ge le eq ne |
string (cell and value compared as strings) |
Predicates compose with bitwise & (and), | (or), and ! (not):
filter($df, (col('age') > 18) & (col('sex') eq 'f')); # and
filter($df, (col('grp') eq 'a') | (col('grp') eq 'c')); # or
filter($df, !(col('x') > 100)); # not
Comparison operators bind more tightly than & and |, so (col('a') > 4) & (col('b') < 2) is parsed correctly, but the parentheses are recommended for readability.
> Note: C<< col('age') E<gt> 32 >> works because C<col('age')> is an object whose C<< E<gt> >> is overloaded. A B<bare string> cannot do this — C<< 'age' E<gt> 32 >> is computed by Perl to a plain boolean (the string numifies to 0) before C<filter> is ever called, so the column name is lost. Always wrap the column in C<col(...)>.
The code-reference form
For logic the operators can't express, pass a sub. It is called once per row; the row is a hash reference, available both as $_ and as the first argument $_[0]. Return a true value to keep the row.
filter($df, sub { $_->{x} > 4 && $_->{grp} eq 'a' });
filter($df, sub { $_->{name} =~ /^A/ });
filter($df, sub { $_->{age} % 2 == 0 }); # things col() has no operator for
filter($df, sub { $_[0]{score} > $_[0]{threshold} });
For a HoA, each row is assembled into a temporary { column => value, ... } hash before the sub (or the col() test) is called, so the same $_->{column} syntax works regardless of the input shape.
Choosing the output shape
By default filter returns a frame of the same shape as the input (AoH → AoH, HoA → HoA, HoH → HoH). Pass output.type to convert while filtering:
my $aoh = read_table('patients.csv'); # array of hashes
my $hoa = filter($aoh, col('Age') >= 18, 'output.type' => 'hoa');
# $hoa->{Age}, $hoa->{Sex}, ... are all the same length and row-aligned
The two selectable output types are 'aoh' and 'hoa'. 'hoh' is not selectable, because producing a hash of hashes would require choosing which column becomes the row key; an HoH input keeps its keys only when the output shape is left at the default (HoH → HoH).
Examples
use Stats::LikeR;
my $df = read_table('patients.csv'); # array of hashes
my $adults = filter($df, col('Age') >= 18); # numeric threshold
my $target = filter($df, (col('Age') >= 18) & (col('Sex') eq 'f')); # combine
my $flagged = filter($df, sub { $_->{ALT} > 40 || $_->{AST} > 40 }); # coderef
# hash of arrays in -> hash of arrays out (columns filtered in parallel)
my $hoa = read_table('patients.csv', 'output.type' => 'hoa');
my $sub = filter($hoa, col('Age') > 32);
# hash of hashes in -> the same row keys, fewer of them
my $hoh = read_table('patients.csv', 'output.type' => 'hoh');
my $keep = filter($hoh, col('Age') > 32);
# convert shape while filtering
my $as_hoa = filter($df, col('Age') > 32, 'output.type' => 'hoa');
Behavior and notes
The input is never modified.
filterbuilds and returns a new frame;$dfis left untouched.A missing or
undefcell never matches acol()comparison.col('x') > 0silently drops any row whosexis absent orundef; for numeric operators a non-numeric cell is likewise dropped. With a coderef,undefis whatever your sub makes of it.Rows are shared, not deep-copied, wherever possible. When an AoH or HoH row is kept (output left as AoH/HoH, or converted to
aoh), the returned frame references the same inner row hashes as the input. Mutating such a row in the result would also change it in the original. HoA inputs and anyhoaoutput build fresh arrays and fresh cell values.Keep-all / keep-none are well defined. A predicate true for every row returns the whole frame in the chosen shape; true for none returns an empty frame:
[]foraoh, a hash of empty (but present) columns forhoa, and{}forhoh.Supported shapes are AoH, HoA, and HoH. A non-reference, an AoH element that is not a hash reference, a HoA column that is not an array reference, or a HoH row that is not a hash reference all raise a descriptive error; a bare
col('x')with no comparison is also an error. An empty hash{}is treated as an empty frame.Perl 5.10 compatible. The
col()/operator layer is pure Perl (operator overloading building a per-row closure); filtering and any reshaping run in XS.
See also
read_table (whose filter option applies the same coderef convention while reading a file), col2col.
fisher_test
array reference entry
my $array_data = [
[10, 2],
[3, 15]
];
my $res1 = fisher_test($array_data);
which returns a hash reference:
{
alternative "two.sided",
conf_int [
[0] 2.75343836564204,
[1] 300.682787419401
],
conf_level 0.95,
estimate {
"odds ratio" 21.3053312750168
},
method "Fisher's Exact Test for Count Data",
p_value 0.000536724119143435
}
hash reference entry
$ft = fisher_test( {
Guess => {
Milk => 3, Tea => 1
},
Truth => {
Milk => 1, Tea => 3
}
});
get_union
my @all = get_union(\@a, \@b, \@c); # every distinct value, any list
my $count = get_union(\@a, \@b, \@c); # how many distinct values
Takes one or more array references and returns every value that appears in at least one of them. Duplicates collapse and the result keeps first-appearance order. In scalar context it returns the count. Values are compared by their string form (like Perl hash keys), so 1, "1" and 1.0 are one element, while a UTF-8 flagged string stays distinct from the same bytes without the flag. A non-array-ref argument or an undef element is fatal. Mirrors List::Compare's get_union.
my @a = (1, 2, 3, 3);
my @b = (3, 4);
my @u = get_union(\@a, \@b); # (1, 2, 3, 4)
get_unique
my @only_first = get_unique(\@a, \@b, \@c);
my $count = get_unique(\@a, \@b, \@c);
Takes one or more array references and returns the values that appear in the first reference and in no other reference; with a single reference it returns that list's distinct values. Duplicates collapse, the result keeps first-appearance order, and scalar context returns the count. Values are compared by string form (see get_union). A non-array-ref argument or an undef element is fatal. Mirrors List::Compare's get_unique, which likewise defaults to the first list.
my @a = (1, 2, 3);
my @b = (3, 4, 5);
my @c = (5, 6);
my @u = get_unique(\@a, \@b, \@c); # (1, 2) -- 3 is also in @b
glm
takes a hash of an array as input
my %tooth_growth = (
dose => [qw(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0)],
len => [qw(4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3 22.5
17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
15.2 21.5 17.6 9.7 14.5 10.0 8.2 9.4 16.5 9.7 19.7 23.3 23.6 26.4 20.0
25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0)],
supp => [qw(VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ)]
);
my $glm_teeth = glm(
data => \%tooth_growth,
formula => 'len ~ dose + supp',
family => 'gaussian'
);
In addition to the gaussian default, it fully supports logistic regression using the binomial family parameter via Iteratively Reweighted Least Squares (IRLS):
my $glm_bin = glm(formula => 'am ~ wt + hp', data => \%mtcars, family => 'binomial');
Input Parameters
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
formula |
String |
None (Required) | A symbolic description of the model to be fitted. Supports operators like +, :, *, ^, and -1 (to remove the intercept). |
'am ~ wt + hp', 'y ~ x - 1' |
data |
HashRef or ArrayRef |
None (Required) | The dataset containing the variables used in the formula. Accepts either a Hash of Arrays (HoA) or an Array of Hashes (AoH). | \%mtcars, [{x => 1, y => 2}, ...] |
family |
String |
'gaussian' |
A description of the error distribution and link function to be used in the model. Currently supports 'gaussian' (identity link) and 'binomial' (logit link). |
'binomial' |
Output variables
| Variable | Type | Description | Example |
|---|---|---|---|
aic |
Double |
Akaike's Information Criterion for the fitted model. | 123.45 |
boundary |
Integer (Boolean) |
1 if the fitted values computationally reached the 0 or 1 boundary (specific to the binomial family), 0 otherwise. |
0 |
coefficients |
HashRef |
A hash mapping the expanded model term names to their estimated coefficient values. | {'Intercept' => 1.5, 'wt' => -0.5} |
converged |
Integer (Boolean) |
1 if the Iteratively Reweighted Least Squares (IRLS) algorithm converged within the maximum iterations, 0 otherwise. |
1 |
deviance |
Double |
The residual deviance of the fitted model. | 15.2 |
deviance.resid |
HashRef |
A hash mapping data row names to their computed deviance residuals. | {'Mazda RX4' => 0.12} |
df.null |
Integer |
The residual degrees of freedom for the null model. | 31 |
df.residual |
Integer |
The residual degrees of freedom for the fitted model. | 30 |
family |
String |
The statistical family used to fit the model. | "gaussian" |
fitted.values |
HashRef |
A hash mapping data row names to the fitted mean values (the model's predictions on the scale of the response). | {'Mazda RX4' => 0.85} |
iter |
Integer |
The number of IRLS iterations performed before convergence or hitting the iteration limit. | 4 |
null.deviance |
Double |
The deviance for the null model (a baseline model containing only an intercept, or an offset of 0 if the intercept is removed). | 43.5 |
rank |
Integer |
The numeric rank of the fitted linear model (the number of estimated, non-aliased parameters). | 2 |
summary |
HashRef |
A nested hash mapping each term to its detailed summary statistics, including Estimate, Std. Error, t value / z value, and Pr(> t ) / Pr(> z ). Aliased parameters return "NaN". |
{'wt' => {'Estimate' => -0.5, 'Std. Error' => 0.1, ...}} |
terms |
ArrayRef |
An ordered list of the expanded term names included in the model matrix. | ['Intercept', 'wt', 'hp'] |
group_by
Take a hash of arrays, hash of hashes, or array of hashes, and group a column by another column.
my $aoh_data = [
{ 'Gender' => 'Male', 'Testosterone, total (nmol/L)' => 20.5 },
{ 'Gender' => 'Female', 'Testosterone, total (nmol/L)' => 1.8 },
{ 'Gender' => 'Male', 'Testosterone, total (nmol/L)' => 18.2 },
{ 'Gender' => 'Female' } # Intentional missing target value
];
as well as
$hoh_data = {
'Patient_A' => { 'Gender' => 'Male', 'Testosterone, total (nmol/L)' => 20.5 },
'Patient_B' => { 'Gender' => 'Female', 'Testosterone, total (nmol/L)' => 1.8 },
'Patient_C' => { 'Gender' => 'Male', 'Testosterone, total (nmol/L)' => 18.2 },
'Patient_D' => { 'Gender' => 'Female' }, # Intentional missing target value
'Patient_E' => { 'Gender' => 'Female', 'Testosterone, total (nmol/L)' => undef } # Explicit undef
};
and
my $hoa_data = {
'Gender' => ['Male', 'Female', 'Male', 'Female'],
'Testosterone, total (nmol/L)' => [22.1, 2.5, 19.4, undef ]
};
then run the function thus:
group_by( $hoa_data, 'Testosterone, total (nmol/L)', 'Gender');
The output can be thought of like a hash, with the first string broken down by the second.
all become hash of arrays:
{
Female [
[0] 1.8
],
Male [
[0] 18.2,
[1] 20.5
]
}
returns an empty array of hashes if neither target nor group keys are found.
Filtering
Data can be further broken down with filter/subs like in read_table:
my $testosterone = group_by($d, # group testosterone by "Gender"
'Testosterone, total (nmol/L)',
'Gender',
{ 'Race/Hispanic origin w/ NH Asian' => sub { $_ eq $n } },# filter
{ 'Testosterone, total (nmol/L)' => sub { $_ ne 'NA' } } # filter
);
where each filter filters on the columns, e.g. second hash keys.
hoa2aoh
Turn a hash-of-arrays into an array-of-hashes.
Usage
my $aoh = hoa2aoh($hoa);
$hoa— a hashref whose values are arrayrefs, one per column:{ id => [1, 2, 3], name => ['a', 'b', 'c'] }
returns — an arrayref of row hashrefs:
[ { id => 1, name => 'a' }, { id => 2, name => 'b' }, { id => 3, name => 'c' } ]
It builds a brand-new structure and copies every cell, so the result is completely independent of the input — changing one never affects the other.
Example
my $hoa = { mpg => [21, 22.8, 18.1], cyl => [6, 4, 6] };
my $aoh = hoa2aoh($hoa);
$aoh->[1]{mpg}; # 22.8
$hoa->{mpg}[1]; # still 22.8 — unaffected by edits to $aoh
Good to know
Row count is the length of the longest column. If columns have different lengths, the short ones are padded with
undefin the missing rows.undefcells are kept asundef.An empty hash, or one whose columns are all empty, gives back
[].It dies if the argument isn't a hashref, or if any column value isn't an arrayref (the message names the offending column).
See also
hoa2aoh is the reverse of aoh2hoa
hoa2hoh( \%hoa, $key )
Converts a hash-of-arrays (column-major) into a hash-of-hashes keyed by the $key column, i.e. { $rowname => { col => value, ... } }. Analogous to hoa2aoh, but rows are indexed by their $key value instead of positionally.
my %hoa = (
id => [ qw(a b c) ],
x => [ 1, 2, 3 ],
y => [ 4, 5, 6 ],
);
my $hoh = hoa2hoh( \%hoa, 'id' );
# { a => { id => 'a', x => 1, y => 4 }, b => {...}, c => {...} }
The $key column is retained in each inner row. Columns are copied by value. Shorter columns are padded with undef, matching hoa2aoh.
Dies if: the first argument is not a hashref of arrayrefs; $key is undef or names a missing/non-array column; the $key column holds an undefined value for any row; or two rows share the same $key value.
hoh2hoa
Convert a hash of hashes (row-major: outer key = row, inner key = column) into a hash of arrays (column-major: key = column, value = that column's cells down the rows).
use Stats::LikeR;
my %hoh = (
'r1' => { 'a' => 1, 'b' => 2 },
'r2' => { 'a' => 3, 'b' => 4 },
);
my $hoa = hoh2hoa(\%hoh);
which returns { a => [1, 3], b => [2, 4], }
Behavior
Columns are the union of every inner key, so a key that appears in only some rows still becomes a column.
Rows are emitted in sorted outer-key (row-name) order, and that one order is used for every column, so the arrays stay aligned and the result is reproducible regardless of hash ordering.
Gaps — a missing inner key, or a cell whose value is
undef— are filled with the fill value (seeundef.valbelow). Every column therefore has exactly one entry per row.Values are copied into the result; the original structure is left untouched.
An empty hash of hashes returns an empty hash of arrays (it is not an error).
Options
Options are passed as trailing name => value pairs.
| Option | Default | Meaning |
|---|---|---|
undef.val |
undef |
Value used to fill a missing key or an undef cell. Any defined scalar works, including 0 and ''. Passing undef keeps the default. |
row.names |
(none) | If set to a string, an extra column of that name is added holding the sorted row labels, aligned with the data. Dies if the name collides with an existing column. |
# Ragged input with an explicit fill string:
my %ragged = (
'r1' => { 'a' => 1, 'b' => 2 },
'r2' => { 'a' => 3, 'c' => 9 },
);
my $hoa = hoh2hoa(\%ragged, 'undef.val' => 'NA');
# {
# a => [1, 3 ],
# b => [2, 'NA'],
# c => ['NA', 9 ],
# }
# Keep the row labels as a column:
my $with_ids = hoh2hoa(\%ragged, 'row.names' => 'id');
# {
# id => ['r1', 'r2'],
# a => [1, 3 ],
# b => [2, undef],
# c => [undef, 9 ],
# }
Errors
hoh2hoa dies (via croak) when:
the argument is not a hash reference,
any value in the hash is not itself a hash reference,
an unknown option is given, or the options are not
name => valuepairs,row.namesis not a plain string, or it names an already-present column.
hist
Computes the histogram of the given data values, operating in single $O(N)$ pass performance. It returns the bin counts, computed breaks, midpoints, and density.
my $res = hist([1, 2, 2, 3, 3, 3, 4, 4, 5], breaks => 4);
If breaks is not explicitly provided, it defaults to calculating the number of bins using Sturges' formula.
intersection
Returns the set intersection (∩) of a list of array references: the values that appear in every array ref given.
use Stats::LikeR;
my @i = intersection([1, 2, 3], [2, 3, 4]); # (2, 3)
my @t = intersection([1, 2, 3, 4], [2, 3, 4], [3, 4]); # (3, 4)
my $n = intersection([1, 2, 3], [2, 3, 4]); # 2
Every argument must be an array reference: each one is treated as a set. Unlike mean and uniq, bare scalars are not accepted; passing a non-reference (or a non-array reference) croaks.
The result is deduplicated and ordered by first appearance in the first array ref. Duplicate values within any single ref are counted once, so intersection([1, 2, 2, 3], [2, 3, 3, 4]) is (2, 3), not (2, 2, 3).
Values are compared by stringification — the same eq semantics used by uniq. 1, 1.0, and "1" are treated as equal, while "3" and "3.0" are distinct. The UTF-8 flag is part of the comparison key, so a UTF-8 string and a byte-identical non-UTF-8 string are kept separate.
In list context intersection returns the shared values; in scalar context it returns the cardinality (the number of shared values).
With a single array ref, the result is simply that ref's unique values. If any ref is empty, the intersection is empty.
intersection croaks on degenerate or ill-formed input, reporting the offending position:
intersection(); # croaks: intersection needs >= 1 array ref
intersection([1, 2], 3); # croaks: argument 1 is not an array ref
intersection([1, undef, 3]); # croaks: undefined value at array ref index 1 (argument 0)
This matches the undef-handling of mean and uniq and the rest of the numeric reducers in Stats::LikeR.
kruskal_test
Essentially the test determines if all groups have the same median (same distribution) (an excellent review is at https://library.virginia.edu/data/articles/getting-started-with-the-kruskal-wallis-test)
Performs a Kruskal-Wallis rank sum test, see https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kruskal.test
hash of array entry
I feel that this is better, and more easily read, than what you get in R:
my %x = (
'normal.subjects' => [2.9, 3.0, 2.5, 2.6, 3.2],
'obs. airway disease' => [3.8, 2.7, 4.0, 2.4],
'asbestosis' => [2.8, 3.4, 3.7, 2.2, 2.0]
);
$kt = kruskal_test(\%x);
R-like array entry
my @xk = (2.9, 3.0, 2.5, 2.6, 3.2); # normal subjects
my @yk = (3.8, 2.7, 4.0, 2.4); # with obstructive airway disease
my @zk = (2.8, 3.4, 3.7, 2.2, 2.0); # with asbestosis
my @x = (@xk, @yk, @zk);
my @g = (
(map {'Normal subjects'} 0..4),
(map {'Subjects with obstructive airway disease'} 0..3),
map {'Subjects with asbestosis'} 0..4
);
my $kt = kruskal_test(\@x, \@g);
ks_test
The Kolmogorov–Smirnov test checks whether two samples are drawn from the same distribution (two-sample), or whether a single sample is drawn from a given reference distribution (one-sample). It works by comparing the empirical cumulative distribution functions (ECDFs) and measuring the largest gap between them.
Two-sample form — pass two array references:
$ks = ks_test(\@x, \@y);
$ks = ks_test(\@x, \@y, alternative => 'greater');
One-sample form — pass one array reference and the name of a reference CDF. Currently only 'pnorm' is supported, i.e. the standard normal distribution (mean 0, standard deviation 1):
$ks = ks_test(\@x, 'pnorm');
Arguments may be given positionally (as above) or by name:
$ks = ks_test(x => \@x, y => \@y, alternative => 'less', exact => 1);
Non-numeric and undefined elements are silently dropped before the test runs.
alternative selects which gap between the ECDFs is measured:
'two.sided'(default) — the largest gap in either direction, D = sup |F_x − F_y|.'greater'— the largest gap where x's ECDF rises above the other, D⁺ = sup (F_x − F_y).'less'— the largest gap in the other direction, D⁻ = sup (F_y − F_x).
These follow R's ks.test convention: 'greater'/'less' describe which CDF lies above the other, which (because a higher CDF means smaller values) is the opposite of which sample tends to be larger.
exact controls how the p-value is computed. Omit it to let the test choose: the exact distribution is used for small samples (two-sample when nx·ny 10000, one-sample when n < 100) and the asymptotic (Kolmogorov limiting) approximation otherwise. Pass exact => 1 to force the exact computation or exact => 0 to force the asymptotic one. Exact p-values cannot be computed when the data contain ties; if ties are present on the exact path, the test warns and falls back to the asymptotic p-value. (The exact one-sample test is only available for the two-sided alternative; a one-sided one-sample request also falls back to asymptotic.)
Return value
ks_test returns a hash reference with four keys:
statistic— the KS statistic for the chosenalternative: D, D⁺, or D⁻. It is the maximum distance between the two ECDFs (or, for the one-sample test, between the ECDF and the reference CDF), always in the range [0, 1]. Larger values mean the distributions are further apart.p_value— the probability, under the null hypothesis that the samples share a distribution, of observing a statistic at least this large. It is clamped to [0, 1]; a small value (e.g. < 0.05) is evidence against the null.method— a human-readable description of exactly what was run, handy for logging or reproducing a result. One of:"Two-sample Kolmogorov-Smirnov exact test","Two-sample Kolmogorov-Smirnov test (asymptotic)","One-sample Kolmogorov-Smirnov exact test", or"One-sample Kolmogorov-Smirnov test (asymptotic)".alternative— the alternative hypothesis that was applied ('two.sided','greater', or'less'), echoed back so the result is self-describing.
For example:
my $ks = ks_test(\@x, \@y);
if ($ks->{p_value} < 0.05) {
printf "reject H0: D=%.4f, p=%.4g (%s)\n",
$ks->{statistic}, $ks->{p_value}, $ks->{method};
}
ljoin
Consider a hash: $h{$row}{$col}, and another hash $i{$row}{$col2}. ljoin will add information for $col in %i for each $row to %h, where $row exists in both %h and %i. Similar to cbind in R.
For example,
{
"Jack Smith" {
age 30
}
}
and a second hash, { "Jack Smith" { dept "Engineering" }, "Jane Doe" { age 25 } }
in this case, running ljoin(\%h, \%i) will modify \%h to result:
{
"Jack Smith" {
age 30,
dept "Engineering"
}
}
lm
This is the linear models function.
$lm = lm(formula => 'mpg ~ wt + hp', data => $mtcars);
where $mtcars is a hash of hashes
lm also supports generating interaction terms directly within the formula using the * operator:
my $lm = lm(formula => 'mpg ~ wt * hp^2', data => \%mtcars);
If your data contains missing numbers (NA or undef), lm handles listwise deletion dynamically to ensure mathematical integrity before fitting.
the dot operator also works:
$lm = lm(formula => 'y ~ .', data => $dot_data);
Lonly
my @left_only = Lonly(\@left, \@right);
my $count = Lonly(\@left, \@right);
Takes exactly two array references and returns the values in the left list that are absent from the right list. Duplicates collapse, the result keeps left-list order, and scalar context returns the count. Values are compared by string form (see get_union). A non-array-ref argument, an undef element, or anything other than two references is fatal. Mirrors List::Compare's get_Lonly.
my @a = (1, 2, 3, 4);
my @b = (3, 4, 5);
my @l = Lonly(\@a, \@b); # (1, 2)
matrix
my $mat1 = matrix(
data => [1..6],
nrow => 2
);
You can also pass byrow => 1 if you want the matrix populated row-wise instead of column-wise.
As of version 0.10, parameters do not need to be named, so that matrix works more like R:
my $d = matrix(rnorm(32000), 1000, 32);
works as data, nrow, and ncol
max
max(1,2,3);
or
my @arr = 1..8;
max(@arr, 4, 5)
as of version 0.02, max will die if any undefined values are provided
mean
mean(1,2,3);
or
my @arr = 1..8;
mean(@arr, 4, 5)
or
mean([1,1], [2,2]) # 1.5
as of version 0.02, mean will die if any undefined values are provided
median
works like mean, taking array references and arrays:
median( $test_data[$i][0] )
as of version 0.02, median will die if any undefined values are provided
min
min(1,2,3);
or
my @arr = 1..8;
min(@arr, 4, 5)
as of version 0.02, min will die if any undefined values are provided
mode
Takes either an array or an array reference, and returns an array of the most common scalars (numbers or strings)
@arr = mode([1,3,3,3]); # returns (3)
@arr = mode('a','a','c','c','z'); # returns ('a', 'c')
oneway_test
A one-way test for equality of group means that, unlike aov/ANOVA, does not assume equal variances. By default it performs Welch's one-way test (the same default as R's oneway.test), so the residual degrees of freedom are usually fractional. Pass var_equal => 1 for the classic equal-variance form.
use Stats::LikeR qw(oneway_test);
Input
oneway_test accepts your data in one of three shapes. In every case each group is a vector of at least two numeric observations.
| Shape | What it means | Group labels |
|---|---|---|
Hash of arrays { a => [...], b => [...] } |
Each key is a group (R's stack() view of a named list) |
the hash keys |
Array of arrays [ [...], [...] ] |
Each element is a group | "Index 0", "Index 1", ⦠|
Hash + formula { resp => [...], grp => [...] }, formula => 'resp ~ grp' |
Long-format columns split by a factor column | the distinct values of the factor |
Options
| Option | Default | Meaning |
|---|---|---|
var_equal (alias var.equal) |
0 (false) |
0 â Welch's test (unequal variances). 1 â pooled-variance test. |
formula |
none | 'response ~ factor'. Only valid with a hash input; an error with an array of arrays. |
Data validation
Every observation must be defined and numeric; an undef or non-numeric cell makes the call die with the offending group and position. This matches the rest of Stats::LikeR (mean, sum, cor, … all die on undef) and prevents missing values from being silently treated as 0.
Each group needs at least two observations, and you need at least two groups.
Output
A hash reference with three top-level keys:
| Key | Value |
|---|---|
factor name (Group, or the formula's factor, e.g. supp) |
the between-groups row: Df, Sum Sq, Mean Sq, F value, Pr(>F) |
Residuals |
the within-groups row: Df, Sum Sq, Mean Sq (Df is fractional under Welch) |
group_stats |
{ mean => { group => mean, ⦠}, size => { group => n, ⦠} } |
Examples
Hash of arrays (each key is a group)
my $res = oneway_test({
yield => [5.5, 5.4, 5.8, 4.5, 4.8, 4.2],
ctrl => [1, 1, 1, 0, 0, 0 ],
});
{
Group => {
Df => 1,
"Sum Sq" => 61.6533333333333,
"Mean Sq" => 61.6533333333333,
"F value" => 177.504798464491,
"Pr(>F)" => 1.31343255160843e-07,
},
Residuals => {
Df => 9.81767348326473, # fractional: Welch correction
"Sum Sq" => 3.47333333333333,
"Mean Sq" => 0.353783749200256,
},
group_stats => {
mean => { ctrl => 0.5, yield => 5.03333333333333 },
size => { ctrl => 6, yield => 6 },
},
}
Array of arrays (groups named by index)
my $res = oneway_test([
[5.5, 5.4, 5.8, 4.5, 4.8, 4.2],
[1, 1, 1, 0, 0, 0 ],
]);
Identical to the hash form, except group_stats is keyed by position:
group_stats => {
mean => { "Index 0" => 5.03333333333333, "Index 1" => 0.5 },
size => { "Index 0" => 6, "Index 1" => 6 },
}
Long format with a formula
When your data is in columns rather than pre-split groups, name the response and factor columns with a formula. The factor's values become the groups and the factor's name becomes the top-level key:
my $res = oneway_test(
{
len => [4.2, 11.5, 7.3, 16.5, 17.3, 13.6, 23.6, 18.5, 33.9],
supp => [qw(VC VC VC OJ OJ OJ HI HI HI)],
},
formula => 'len ~ supp',
);
# $res->{supp}, $res->{Residuals}, $res->{group_stats} ...
Classic equal-variance form
my $res = oneway_test(\%groups, var_equal => 1); # or 'var.equal' => 1
Notes
The default (Welch) does not require equal group sizes or equal variances; the pooled form (
var_equal => 1) assumes equal variances.formulais only meaningful for a hash input. Passing it with an array of arrays is an error.Group order in the output is not guaranteed for hash inputs (it follows hash iteration order); read results by name, not position.
Avoid naming a factor
Residualsorgroup_statsin a formula, since those are reserved top-level keys in the result.
p_adjust
Returns array of false-discovery-rate-corrected p-values, where methods available are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr"
my @q = p_adjust(\@pvalues, $method);
power_t_test
$test_data = power_t_test(
n => 30, delta => 0.5,
sd => 1.0, sig_level => 0.05
);
It also allows configuring the test type (type => 'one.sample', 'two.sample', 'paired') and alternative hypothesis (alternative => 'one.sided'). You can also pass strict => 1 to strictly evaluate both tails of the distribution.
| Parameter | Type | Default | Description |
|---|---|---|---|
n |
Float | undef |
Number of observations (per group for two-sample, pairs for paired). |
delta |
Float | undef |
True difference in means. |
sd |
Float | 1.0 | Standard deviation. |
sig_level |
Float | 0.05 | Significance level (Type I error probability). Also accepts sig.level. |
power |
Float | undef |
Power of test (1 minus Type II error probability). |
type |
String | "two.sample" |
Type of t-test: "two.sample", "one.sample", or "paired". |
alternative |
String | "two.sided" |
One- or two-sided test: "two.sided", "one.sided", "greater", or "less". |
strict |
Boolean | 0 (False) | Use strict interpretation of two-sided power calculations. |
tol |
Float | ~1.22e-4 |
Numerical tolerance used for the internal root-finding algorithm. |
prcomp
Principal Component Analysis
Options
| Option | Type | Default | Description |
|---|---|---|---|
center |
Boolean | 1 (True) |
If true, the variables are shifted to be zero-centered before the analysis takes place. |
scale |
Boolean | 0 (False) |
If true, the variables are scaled to have unit variance before the analysis takes place. Note: If a column has zero variance, the function will croak to prevent division by zero. |
retx |
Boolean | 1 (True) |
If true, the rotated data (the original data multiplied by the rotation matrix) is returned under the key x. |
tol |
Number | undef |
A value indicating the magnitude below which components should be omitted. Components are omitted if their standard deviation is less than or equal to tol times the standard deviation of the first component. |
rank |
Integer | undef |
Optionally specify a strict limit on the number of principal components to return. The function will return min(rank, rows, columns) components. |
Results
Returned Data Structure
The prcomp function returns a HashRef containing the following keys representing the results of the Principal Component Analysis:
| Key | Type | Description |
|---|---|---|
sdev |
ArrayRef[Number] | The standard deviations of the principal components. Mathematically, these are the square roots of the eigenvalues of the covariance matrix. |
rotation |
ArrayRef[ArrayRef] | A 2D array representing the matrix of variable loadings (the eigenvectors). Each inner array represents a row, and the columns correspond to the principal components. |
x |
ArrayRef[ArrayRef] | A 2D array containing the rotated data (often referred to as PCA scores). This is the original data projected onto the principal components. Note: Only present if the retx option is true. |
center |
ArrayRef[Number] or 0 |
The centering values used (typically the column means). Returns false (0) if centering was disabled. |
scale |
ArrayRef[Number] or 0 |
The scaling values used (typically the column standard deviations). Returns false (0) if scaling was disabled. |
varnames |
ArrayRef[String] | The sorted names of the original variables. Note: Only present if the input data was a Hash of Arrays (HoA) or a Hash of Hashes (HoH). |
Using array of arrays
my $aoa = [
[2, 4],
[4, 2],
[6, 6]
];
my $pca = prcomp($aoa);
which returns
{
center [
[0] 4,
[1] 4
],
rotation [
[0] [
[0] 0.707106781186547,
[1] 0.707106781186548
],
[1] [
[0] 0.707106781186548,
[1] -0.707106781186547
]
],
scale 0,
sdev [
[0] 2.44948974278318,
[1] 1.4142135623731
],
x [
[0] [
[0] -1.41421356237309,
[1] -1.4142135623731
],
[1] [
[0] -1.4142135623731,
[1] 1.41421356237309
],
[2] [
[0] 2.82842712474619,
[1] 2.22044604925031e-16
]
]
}
Hash of Arrays
my $hoa = { B => [4, 2, 6], A => [2, 4, 6] };
my $pca = prcomp($hoa);
predict
R-style prediction for the fitted objects returned by lm and glm. It rebuilds each row's linear predictor from the model's coefficients and (for glm) applies the inverse link.
Usage
my $fit = lm(formula => 'mpg ~ wt + hp', data => $train);
my $yhat = predict($fit, $newdata); # predictions on new rows
my $resp = predict($logit_fit, $newdata); # glm: response scale (default)
my $eta = predict($logit_fit, $newdata, type => 'link'); # linear predictor
my $fitted = predict($fit); # no newdata -> stored fitted.values
$model— a fittedlm/glmhashref.predictreads itscoefficients(and, forglm, itsfamily).$newdata— a HoA, AoH, or HoH of new observations. Omit it (or passundef) to get the model's ownfitted.valuesback.type—'response'(default) returns predictions on the response scale (the inverse link applied — logistic for binomial);'link'returns the linear predictor. Forlmand gaussianglmthe link is the identity, so the two are the same.
What it returns
A hashref keyed by row name → prediction, exactly like lm/glm key fitted.values: a row.names column (or HoH key) if present, otherwise 1-based integer labels.
my $m = lm(formula => 'y ~ x + I(x^2)', data => $train);
my $p = predict($m, { x => [1, 2, 3] });
# { 1 => ..., 2 => ..., 3 => ... }
How it works
For each new row the prediction is
eta = Intercept + Σ coef[term] · term(row)
where each term is evaluated with the same engine used to fit the model, so interactions (x:z → product) and transforms (I(x^2) → power) behave identically to fitting. Coefficients that the fit marked aliased (stored as NaN) contribute nothing, just as they were excluded from the fitted values. For glm with family => 'binomial' and type => 'response', eta is passed through the logistic function 1 / (1 + exp(-eta)); otherwise eta is returned as is.
A consequence worth noting: predicting on the training data reproduces the model's fitted.values for any model built from continuous terms, interactions, or I() transforms.
Good to know
A prediction comes back as NaN when a required term can't be evaluated in the new data (a missing column, or a value that makes the term undefined).
Factors are a limitation. The fitted object stores only the dummy term names (e.g.
genderM), not the underlying factor levels, sopredictcannot re-expand a raw categorical column in new data. Either pass pre-expanded 0/1 dummy columns whose names match the coefficient names, or extendlm/glmto retain the factor levels.It dies on: a model that isn't a hashref or has no
coefficients; an invalidtype; ornewdatathat isn't a HoA/HoH hashref or AoH arrayref.
qcut
Equal-frequency binning of a numeric column, which is the analog of pandas qcut. Where cut would slice a value range into equal-width intervals (and dump most of a skewed distribution into one bin), qcut chooses cutpoints so each bin holds roughly the same number of observations. This is the binning you usually want for ranked-list work: deciles, quartiles, top-5% tranches.
Cutpoints are computed by linear interpolation between order statistics, the same method as numpy/pandas, so results match pandas.qcut exactly. Bins are right-closed, (a, b], with the lowest bin closed on both ends, [a, b], so the minimum value is always included.
Signature
qcut($data, $q, %options)
$data— an array reference of numbers.undefentries are treated as missing (NA): they are skipped when computing cutpoints and, when codes are requested, come back asundefin their original positions.$q— either a positive integer (the number of equal-frequency bins) or an array reference of probabilities in[0, 1]giving explicit cut boundaries, e.g.[0, 0.5, 0.95, 1].
For a usage reminder at the prompt, call qcut('h') (or qcut('H')); it dies with a short help message.
What it returns
By default qcut returns the edge vector as a flat list — the cheap, common query — so call it in list context:
my @edges = qcut($data, 4); # ($e0, $e1, $e2, $e3, $e4)
The per-element bin assignment (the expensive part) is opt-in. Ask for it with codes => 1 and you get an array reference parallel to $data:
my $codes = qcut($data, 4, codes => 1);
Ask for both in a single pass and you get two references, ($codes, $edges):
my ($codes, $edges) = qcut($data, 4, codes => 1, edges => 1);
Options
edges => 1— include the edge vector. On by default; turned off automatically when you request codes, so set it explicitly to get both.codes => 1— include the 0-based integer bin codes.labels => [...]— map the bin codes onto your own labels (impliescodes => 1). The list length must equal the number of bins.labels => 'interval'— label each element with its interval string, e.g.(3.25, 5.5](also implies codes).duplicates => 'drop'— if tied data produces non-unique cutpoints, merge them into fewer bins instead of dying. The default,'raise', throws an error (as pandas does).
Examples
Quartile edges (the default). The cutpoints match pandas exactly:
my @edges = qcut([1 .. 10], 4);
# @edges = (1, 3.25, 5.5, 7.75, 10)
Bin codes. They are 0-based; note the tie distribution matches pandas (inner bins take 2 here, outer bins 3):
my $codes = qcut([1 .. 10], 4, codes => 1);
# $codes = [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
Edges and codes together, computed in one pass:
my ($codes, $edges) = qcut([1 .. 10], 4, codes => 1, edges => 1);
Equal frequency on clean data — 100 values into 4 bins of 25:
my $codes = qcut([1 .. 100], 4, codes => 1);
# 25 elements in each of bins 0, 1, 2, 3
An explicit probability vector, for an asymmetric top-5% tranche:
my @edges = qcut([1 .. 100], [0, 0.5, 0.95, 1]);
my $codes = qcut([1 .. 100], [0, 0.5, 0.95, 1], codes => 1);
# bin 0: lower half (50), bin 1: next 45%, bin 2: top 5%
Named labels instead of integer codes (implies codes):
my $labels = qcut([1 .. 10], 4, labels => [qw/Q1 Q2 Q3 Q4/]);
# ['Q1','Q1','Q1','Q2','Q2','Q3','Q3','Q4','Q4','Q4']
Interval-string labels:
my $iv = qcut([1 .. 10], 4, labels => 'interval');
# $iv->[0] eq '[1, 3.25]'
# $iv->[-1] eq '(7.75, 10]'
Missing values are ignored for cutpoints, and (when codes are requested) pass straight through:
my $codes = qcut([1, 2, undef, 4, 5, 6, 7, 8, 9, 10], 4, codes => 1);
# $codes->[2] is undef; the rest are binned as usual
Tied data and duplicates. Heavy ties can make adjacent cutpoints equal; the default raises, 'drop' merges:
my @tied = ((0) x 8, 1, 2, 3, 4);
qcut(\@tied, 4); # dies: bin edges are not unique
my @edges = qcut(\@tied, 4, duplicates => 'drop');
# fewer than 5 edges; the empty quantile bands are collapsed
Get the usage summary and stop:
qcut('h'); # dies with the help text above
quantile
Calculates sample quantiles using R's continuous Type 7 interpolation.
my $quantile = quantile('x' => [1..99], probs => [0.05, 0.1, 0.25]);
If the probs parameter is omitted, it behaves identically to R by defaulting to the 0, 25, 50, 75, and 100 percentiles (c(0, .25, .5, .75, 1)). The returned hash keys match R's standardized naming convention (e.g., "25%", "33.3%").
Ronly
my @right_only = Ronly(\@left, \@right);
my $count = Ronly(\@left, \@right);
Takes exactly two array references and returns the values in the right list that are absent from the left list. Duplicates collapse, the result keeps right-list order, and scalar context returns the count. Values are compared by string form (see get_union). A non-array-ref argument, an undef element, or anything other than two references is fatal. Mirrors List::Compare's get_Ronly, and is the reverse of Lonly: Ronly(\@a, \@b) equals Lonly(\@b, \@a).
my @a = (1, 2, 3, 4);
my @b = (3, 4, 5);
my @r = Ronly(\@a, \@b); # (5)
rbinom
Create a binomial distribution of numbers
my $binom = rbinom( n => $n, prob => 0.5, size => 9);
It hooks directly into Perl's internal PRNG system, respecting srand() seeds.
read_table
I've tried to make this as simple as possible, trying to follow from R:
my $test_data = read_table('t/HepatitisCdata.csv');
options
| Option | Description | Example |
|---|---|---|
comment |
Comment character, by default # |
comment => % or whatever; does not apply with header |
output.type |
data type for output: array of hash, hash of array, or hash of hash | 'output.type' => 'aoh' |
filter |
Only take in rows with a certain filter | filter => { Sex => sub {$_ eq 'f'} } |
row.names |
include row names in retrieved data; off by default | |
sep |
field separator character; synonym with delim |
sep => "\t" |
delim |
field separator character; synonym with sep |
delim => "\t" |
output types can be AOH (aoa), HOA (hoa), HOH (hoh)
read_table($filename, 'output.type' => 'aoh');
read_table($filename, 'output.type' => 'hoa');
and, like Text::CSV_XS, filters can be applied in order to save RAM on big files:
$test_data = read_table(
't/HepatitisCdata.csv',
filter => {
Sex => sub {$_ eq 'f'} # where "Sex" is the column name, and "$_" is the value for that column
},
'output.type' => 'aoh'
);
the default delimiter is , Suffixes .csv and .tsv are automatically detected from file names, but if specified, are overridden by delim and/or sep. sep is given priority.
rnorm
Make a normal distribution of numbers, with pre-set mean mean, standard deviation sd, and number n.
my ($rmean, $sd, $n) = (10, 2, 9999);
my $normals = rnorm( n => $n, mean => $rmean, sd => $sd);
runif
Make an approximately uniform distribution into an array
named arguments
my $unif = runif( n => $n, min => 0, max => 1);
where n is the number of items, the values are between min and max
positional args
this is to match R's behavior:
runif( 9 )
will make 9 numbers in [0,1]
runif(9, 0, 99)
will match n, min, and max respectively
sample
take a sample of hash or array slices.
my $h = sample(\%h, 4); # take 4 hash keys and their values into $h
or, alternatively, with arrays:
my $arr = sample(\@arr, 3); # take 3 indices of an array
scale
my @scaled_results = scale(1..5);
You can also pass an options hash to disable centering or scaling:
my @scaled_results = scale(1..5, { center => false, scale => 1 });
It fully supports matrix operations. By passing an array of arrays, scale processes the data column by column independently:
my $scaled_mat = scale([[1, 2], [3, 4], [5, 6]]);
sd
my $stdev = sd(2,4,4,4,5,5,7,9);
Correct answer is 2.1380899352994
sd can accept both array references as well as arrays:
my $stdev = sd([2,4,4,4,5,5,7,9]);
As of version 0.02, sd will croak/die if any undefined values are provided.
seq
Works as closely as I can to R's seq, which is very similar to Perl's for loops. Returns an array, not an array reference.
Standard integer sequence
say 'seq(1, 5):';
my @seq = seq(1, 5);
say join(', ', @seq), "\n";
say 'seq(1, 2, 0.25):';
@seq = seq(1, 2, 0.25);
Fractional steps
say 'seq(1, 2, 0.25):';
@seq = seq(1, 2, 0.25);
say join(", ", @seq), "\n";
for (my $idx = 2; $idx >= 1; $idx -= 0.25) { # count down to pop
is_approx(pop @seq, $idx, "seq item $idx with fractional step");
}
Negative steps
say 'seq(10, 5, -1):';
@seq = seq(10, 5, -1);
say join(", ", @seq), "\n";
for (my $idx = 5; $idx <= 10; $idx++) { # count down to pop
is_approx(pop @seq, $idx, "seq item $idx with negative step");
}
shapiro_test
tests to see if an array reference is normally distributed, returns a p-value and a statistic
my $shapiro = shapiro_test(
[1..5]
);
and returns the hash reference:
{
p.value 0.589650577093106,
p_value 0.589650577093106,
statistic 0.960870680168535,
W 0.960870680168535
}
sum
returns sum, but using both arrays and array references.
my $test_data = [1..8];
sum($test_data)
which I prefer, compared to List::Util's required casting into an array:
sum(@{ $test_data });
which passing a reference is shorter and much easier to read. Stats::LikeR, however, will work for both
as of version 0.02, sum will cause the script to die if any undefined values are provided
summary
Analogous to R's summary, but does not deal with outputs from other functions. summary only describes data as it is entered. An option nrows or its synonym nrow specifies the maximum number of rows that will print.
array of array input
my @arr;
foreach my $i (0..18) {
push @arr, runif(22);
}
and then summary(\@arr), or summary(@arr)
---------------------------------------------------------------------------
Index # values Min. 1st Qu. Median Mean 3rd Qu. Max.
---------------------------------------------------------------------------
0 22 0.04312 0.286 0.4975 0.5121 0.7296 0.9633
1 22 0.05932 0.1483 0.495 0.4737 0.7699 0.9371
2 22 0.02742 0.1588 0.4045 0.4325 0.6682 0.9878
3 22 0.009233 0.2552 0.5398 0.5147 0.7755 0.9808
4 22 0.06727 0.2432 0.5019 0.4855 0.7121 0.9043
5 22 0.001032 0.1646 0.3021 0.3727 0.5704 0.9556
hash of array input
$test_data = summary(
{
A => runif(9),
B => runif(9)
},
);
t_test
There are 1-sample and 2-sample t-tests, from one or two arrays:
my $t_test = t_test( $array1, mu => 0.2334 );
or 2-sample:
$t_test = t_test(
$array1, $array2,
paired => 1
);
returns a hash reference, which looks like:
conf_int => [
-0.06672889, 0.25672889
],
df => 5,
estimate => 0.095,
p_value => 0.19143688433660,
statistic => 1.50996688705414
the two groups compared can be specified, though not necessarily, as x and y, just like in R:
$t_test = t_test(
'x' => $array1, 'y' => $array2,
paired => 1
);
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
x |
Array Reference | Required | The first vector of data. Must contain at least 2 elements. |
y |
Array Reference | undef |
The second vector of data. Required for two-sample or paired tests. |
mu |
Float | 0.0 | The true value of the mean (or difference in means) for the null hypothesis. |
paired |
Boolean | FALSE |
If true, performs a paired t-test. x and y must be the same length. |
var_equal |
Boolean | FALSE |
If true, assumes equal variances (standard two-sample). If false, performs Welch's t-test with unequal variances. |
conf_level |
Float | 0.95 | Confidence level for the returned confidence interval. Must be between 0 and 1. |
alternative |
String | "two.sided" |
Direction of the alternative hypothesis: "two.sided", "less", or "greater". |
Return Hash
| Key | Description |
|---|---|
statistic |
The computed t-statistic. |
df |
Degrees of freedom for the test. |
p_value |
The calculated p-value based on the test directionality. |
conf_int |
An Array Reference containing two elements: [lower_bound, upper_bound]. |
estimate |
The estimated mean of x (one-sample) OR the mean of the differences (paired). |
estimate_x |
The estimated mean of the x vector (only returned in two-sample tests). |
estimate_y |
The estimated mean of the y vector (only returned in two-sample tests). |
transpose
Transposes a two-dimensional data structure, swapping rows and columns. Accepts either an array of arrays or a hash of hashes. Returns a new reference of the same type; the input is never modified.
Array of array input
Takes a reference to an array of array references and returns a new AoA where output[j][i] = input[i][j].
my $matrix = [[1, 2, 3], [4, 5, 6]];
my $t = transpose($matrix);
# [[1, 4],
# [2, 5],
# [3, 6]]
All rows must be the same length; a ragged input is a fatal error. undef is valid as an element value and is preserved exactly. An empty outer array or an array of empty rows both return [].
Dies if: - any inner element is not an array reference - rows differ in length (ragged array)
Hash of hash input
Takes a reference to a hash of hash references and returns a new HoH where output{col}{row} = input{row}{col}.
my $table = { alice => { score => 97, grade => 'A' }, bob => { score => 84, grade => 'B' } };
my $t = transpose($table);
# { score => { alice => 97, bob => 84 },
# grade => { alice => 'A', bob => 'B' } }
Inner keys do not need to be uniform across rows. If a given column key appears in only some rows, the output hash for that column will simply contain only those rows — no padding or undef-filling is performed.
my $sparse = {
a => { x => 1, y => 2 },
b => { x => 3, z => 4 } };
my $t = transpose($sparse);
# { x => { a => 1, b => 3 },
# y => { a => 2 },
# z => { b => 4 } }
An empty outer hash or an outer hash whose inner hashes are all empty both return {}.
Dies if any inner element is not a hash reference
uniq
Returns the distinct values of its arguments, in first-seen order.
use Stats::LikeR;
my @u = uniq(1, 2, 2, 3, 1); # (1, 2, 3)
my @s = uniq(qw/a b a c/); # ('a', 'b', 'c')
my @f = uniq(1, [2, 2, 3], [3, 4]); # (1, 2, 3, 4)
my $n = uniq(1, 2, 2, 3, 1); # 3
uniq accepts a flat list of scalars, array references, or any mix of the two. Array references are expanded one level — their elements are treated as additional arguments, but nested array references are not recursed into and are compared as opaque values.
Values are compared by stringification, the same eq semantics used by List::Util::uniq: 1, 1.0, and "1" all collapse to a single result, and the first value seen is the one returned (as a fresh copy, never an alias to the input). Order of first appearance is preserved.
In list context uniq returns the distinct values. In scalar context it returns the count of distinct values, matching List::Util::uniq.
The UTF-8 flag is part of the comparison key, so a UTF-8 string and a byte-identical non-UTF-8 string are kept distinct — they are different strings. Strings that are logically equal and consistently encoded collapse as expected.
Unlike List::Util::uniq, which passes a single undef through, uniq croaks on any undefined value, reporting the offending argument index (and the array-ref index, when the undef came from inside a reference):
uniq(1, undef, 3); # croaks: undefined value at argument index 1
uniq([1, undef, 3]); # croaks: undefined value at array ref index 1 (argument 0)
This matches the undef-handling of mean and the other functions in Stats::LikeR.
vals
Extract a single column from a data frame as a flat array reference, similar to pandas' to_list
my $ages = vals($df, 'age');
vals accepts all three data-frame shapes and always returns a new arrayref of that column's values:
AoH (array of hashes) -- one value per row, in row order.
HoA (hash of arrays) -- the named column array, copied.
HoH (hash of hashes) -- one value per row, in ascending key order (a HoH has no inherent row order, so keys are sorted as strings).
Arguments
| Position | Name | Description |
|---|---|---|
| 1 | $df |
An AoH (arrayref), or a HoA/HoH (hashref). The shape is auto-detected by peeking the first hash value: a hashref value means HoH, otherwise HoA. |
| 2 | $col |
The column name (must be defined). |
Behavior and notes
The result is a copy. Every value is duplicated, so mutating the returned array never touches
$df, andundefslots are ordinary writable scalars.A missing cell is
undef. For AoH and HoH, a row that lacks the column (or isn't a hashref) yieldsundeffor that row.An absent column is strict only for HoA. Because a HoA column is the structure, asking for a column the hash doesn't have dies. For AoH/HoH the column is per-row, so an entirely-absent column simply yields all-
undef(it is not an error). This asymmetry is deliberate; pass the column name carefully for AoH/HoH, since a typo returnsundefs rather than dying.Empty frames return
[]-- an empty AoH or an empty hash both give a clean empty arrayref.UTF-8 column names and HoH keys are handled correctly (lookups use the key SV; HoH keys sort by Perl string order).
Examples
my $aoh = read_table('patients.csv'); # array of hashes
my $age = vals($aoh, 'Age'); # [ 34, 51, ... ]
my $hoa = read_table('patients.csv', 'output.type' => 'hoa');
my $sex = vals($hoa, 'Sex'); # copy of the Sex column
my $hoh = read_table('patients.csv', 'output.type' => 'hoh');
my $age2 = vals($hoh, 'Age'); # values in sorted row-key order
# feed straight into the numeric routines
my $m = mean( vals($aoh, 'Age') );
value_counts
Count the values in a given data set, return a hash reference showing how many times each particular value is present.
Scalar
$hash = value_counts('c');
returns { c => 1 }
Array reference
value_counts(['a','b','b']);
returns { a => 1, b => 2}
Array
my $value_counts = value_counts('a','b','b');
like an array reference above, returns { a => 1, b => 2}
Array of hashes
my @records = (
{ name => 'Alice', dept => 'Sales' },
{ name => 'Bob', dept => 'Eng' },
{ name => 'Carol', dept => 'Sales' },
);
my $vc = value_counts(\@records, 'dept');
with a key, the value at that key is counted in each hash, so the above returns { Sales => 2, Eng => 1 }. A record that lacks the key is skipped. Passing an array of hashes without a key, or with an element that is not a hash reference, is a fatal error.
Array of arrays
my @rows = (['a', 1], ['b', 1], ['a', 2]);
my $vc = value_counts(\@rows, 0);
when the elements are array references, the key is treated as a numeric column index, so the above returns { a => 2, b => 1 }. A non-numeric index against array-reference elements is a fatal error.
Hash
my $value_counts = value_counts( { A => 'a', B => 'a', C => 'b' } );
returns { a => 2, b => 1}
Hash of array
my $value_counts = value_counts({ 'a' => ['j', 't', 't'], 'b' => ['j', 't', 'v']});
without a key (like above), the occurences of j, t, and v are counted. With a key, like a for above, only values within that hash key are counted:
my $vc = value_counts({ 'a' => ['j', 't', 't'], 'b' => ['j', 't', 'v']}, 'a');
Hash of hash (table)
$hash = value_counts( {
A => {
a => 'x',
b => 'z'
},
B => {
a => 'x'
},
C => {
a => 'y'
}
}, 'a');
the column, or second hash key, that you wish to count, is specified at the command line
The two new subsections (Array of hashes, Array of arrays) are the only additions; everything else is unchanged. They're placed after the array-container forms to keep array inputs grouped, mirroring how Hash of array / Hash of hash sit together. If you'd rather I drop this into a .md file or fold it into POD (=head3 headers, C<> for the inline code) for the actual module docs, say the word.
var
as simple as possible:
var(2, 4, 5, 8, 9)
as of version 0.02, var will die if any undefined values are provided
like min, max, etc., var can accept array references, to make code simpler:
my $ref = \@arr;
var($ref) = var(@arr)
var_test
As described by R: Performs an F test to compare the variances of two samples from normal populations
use Stats::LikeR;
my @x = (2.9, 3.0, 2.5, 2.6, 3.2);
my @y = (3.8, 2.7, 4.0, 2.4);
my $vt = var_test(\@x, \@y);
also, conf_level can be set:
$vt = var_test(\@x, \@y, conf_level => 0.99);
as well as a ratio (from R: the hypothesized ratio of the population variances of x and y:
$test_data = var_test(\@xk, \@yk, ratio => 2);
view
An R-style head for the structures read_table returns. Prints the first few rows of a dataframe as an aligned text table, with numeric columns right-justified, string columns left-justified, and undefined cells shown as NA. Works on all three output.type values:
| `output.type` | Perl structure | What `view` shows |
|---|---|---|
aoh |
array of hash refs | one line per row, sequential row numbers |
hoa |
hash of array refs | values gathered column-wise by row index |
hoh |
hash of hash refs | top-level keys become the row label column |
Synopsis
my $aoh = read_table('all.data.tsv', 'output.type' => 'aoh');
view($aoh); # first 6 rows, like head()
view($aoh, n => 20); # first 20 rows
view($aoh, cols => [qw(id age tt)]); # force a column order
view($aoh, 'row.names' => 'id'); # use column 'id' as the row label
view($aoh, na => '.', max_width => 30);
my $txt = view($aoh, return_only => 1); # capture the string, print nothing
view($aoh, to => \*STDERR); # print somewhere other than STDOUT
Output
# AoH: 7 rows x 3 cols (showing 6)
row_name Testosterone, total (nmol/L) age sex
p1 18.2 41 M
p2 NA 7 F
p3 1.05 33 F
p4 22.9 55 M
p5 14 29 M
p6 NA 62 F
# ... 1 more row
The banner reports the structure type, full dimensions, and how many rows are displayed. A footer appears only when rows are hidden.
Arguments
All arguments after the data reference are optional name/value pairs.
| Argument | Default | Meaning |
|---|---|---|
n |
6 |
Number of rows to show. n greater than the table shows everything. |
rows |
6 |
Number of rows to show. n greater than the table shows everything (synonymous with n) |
cols / columns |
â | Array ref pinning column order (and which columns appear). |
row.names |
â | Column to use as the row label (for aoh/hoa). See ordering note. |
na |
'NA' |
Token printed for undefined cells. |
max_width |
80 |
Truncate any cell wider than this (column names are never truncated). |
ellipsis |
'...' |
Marker appended to truncated cells. |
gap |
2 |
Spaces between columns. |
to |
STDOUT | Filehandle to print to. |
return_only |
0 |
If true, return the string and print nothing. |
view always returns the formatted string, whether or not it also prints.
A note on column order
read_table stores rows as hashes, so the original CSV column order is not preserved. view therefore sorts columns by name for a stable, reproducible layout. Two conveniences soften this:
A column literally named
row_name(the labelread_tableassigns to a leading blank header) is detected automatically and moved to the left as the row label.Pass
cols => [ ... ]to control both the order and the selection of columns shown.
When no label column is present, view numbers the rows 1, 2, 3, …, the way R prints row names for an unnamed data frame.
Edge cases
Empty input (
[]or{}) prints a clean0 rows x 0 colsbanner.Tabs, carriage returns, and newlines inside a cell are escaped (
\t,\r,\n) so one record always stays on one line.A non-reference argument, or a hash whose values are plain scalars, dies with a clear message rather than producing garbled output.
Tests
The behavior above is covered by view.t (run with prove view.t): the three structure types, n boundaries, alignment, NA rendering, truncation, row.names/cols handling, control-character escaping, the return_only and to output paths, empty structures, and the error cases.
wilcox_test
$test_data = wilcox_test(
[1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30],
[0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29]
);
Computes the Wilcoxon rank-sum / Mann-Whitney test (two samples) or the Wilcoxon signed-rank test (one sample or paired), following R's wilcox.test conventions. This is an alternative to the t-test, that does not assume a normal distribution. With two array refs and no paired flag it runs the two-sample rank-sum test; with a single sample, or with paired => 1, it runs the signed-rank test. It calculates exact p-values by default for N < 50 without ties; when ties (or, for the signed-rank case, zero differences) are present it automatically switches to the normal approximation with continuity correction.
Calling conventions
The first one or two array-ref arguments are taken positionally as x and y; everything after that is parsed as key => value pairs. The named forms x => and y => are also accepted and override the positional values. The flat argument list following the positional refs must contain an even number of elements, or the call dies with a usage message.
# positional
wilcox_test(\@x, \@y, paired => 1);
# fully named
wilcox_test(x => \@x, y => \@y, alternative => "greater", exact => 0);
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
x |
ARRAY ref | (required) | The first sample. Passed positionally or as x =>. Non-numeric and undefined elements are silently dropped; an empty or all-missing x is fatal. In the two-sample test mu is subtracted from each x value. |
y |
ARRAY ref | undef |
The second sample. If present and paired is false, a two-sample rank-sum test is run. If paired is true, y is required and must be the same length as x. Omit it for the one-sample signed-rank test. |
paired |
boolean | 0 (false) |
Run a paired signed-rank test on the per-element differences x[i] - y[i] - mu. Requires y of equal length. |
correct |
boolean | 1 (true) |
Apply the continuity correction (±0.5) when using the normal approximation. Ignored when an exact p-value is computed. |
mu |
number | 0.0 |
Null-hypothesis location shift. Subtracted from x (two-sample) or from each difference (one-sample / paired). |
exact |
boolean / undef | undef (auto) |
Tri-state. undef (or absent) selects exact automatically: when both group sizes are < 50 and there are no ties (two-sample), or n < 50 with no ties (signed-rank). A true value forces the exact test, a false value forces the approximation. Exact is impossible with ties â or, for the signed-rank test, with zero differences â and falls back to the approximation with a warning. |
alternative |
string | "two.sided" |
One of "two.sided", "less", or "greater". Selects the tail(s) used for the p-value. |
Output
Returns a hash ref with the following keys:
| Key | Type | Description |
|---|---|---|
statistic |
number | The test statistic. For the two-sample test this is the Mann-Whitney W (the x rank sum minus nx*(nx+1)/2). For the signed-rank test it is V, the sum of the ranks assigned to the positive differences. |
p_value |
number | The p-value for the chosen alternative, capped at 1.0. Two-sided p-values are 2 * min(p_less, p_greater). |
method |
string | A human-readable description of the exact test variant that was run (see below). |
alternative |
string | Echoes the alternative actually used ("two.sided", "less", or "greater"). |
The method string reports which path executed:
Two-sample:
"Wilcoxon rank sum exact test","Wilcoxon rank sum test with continuity correction", or"Wilcoxon rank sum test".One-sample / paired:
"Wilcoxon exact signed rank test","Wilcoxon signed rank test with continuity correction", or"Wilcoxon signed rank test".
Notes and edge cases
Missing data is handled by listwise removal of non-numeric / undefined cells before ranking; in the paired case a pair is dropped if either member is missing. An empty x (or, in the two-sample case, an empty y) after this filtering is fatal.
For the signed-rank test, exact zero differences are discarded before ranking (matching R), and their presence disables the exact computation. Both empty-after-filtering and all-zero-difference inputs are fatal.
Ties are detected during ranking and trigger the tie-corrected variance in the normal approximation; they also rule out the exact p-value. When exact is left on auto, the size thresholds (< 50 per group, or < 50 differences) are what gate the exact vs. approximate decision.
write_table
mimics R's write.table, with data as first argument to subroutine, and output file as second
write_table(\@data_aoh, $tmp_file, sep => "\t", 'row.names' => 1);
You can also precisely filter and reorder which columns are written by passing an array reference to col.names:
write_table(\@data, $tmp_file, sep => "\t", 'col.names' => ['c', 'a']);
undefined variables are printed as NA by default, but can be set as you wish using undef.val
write_table(\%data_hoa, '/tmp/undef.val.tsv', sep => "\t", 'undef.val' => 'nan')
as of version 0.07, write_table determines comma and tab-separated delimiters from the filename, but will override if sep or delim are explicitly set.
Args can also be accepted:
write_table( 'data' => \%flat, 'file' => $f );
Changes
0.19
numerous SSize_t var1 = av_len(var) + 1 are changed to size_t var1 = av_len(var) + 1 as size_t; as the result cannot be negative, in order to expand numerical range
Addition of hoa2hoh, binom_test, chunk, get_union, get_unique, Lonly, Ronly, qcut, and 3 tukey functions
Better warnings when non-array references are given to intersection
view now breaks columns into chunks for very wide data sets, more closely matching R's behavior
0.18
restrict keyword added to numerous places within intersection to decrease CPU time
fix to dist.ini for dependencies
fixed POD rendering
0.17
addition of assign, which adds new columns based on calculations from other columns
addition of hoa2aoh, transforming hash of arrays to array of hashes
addition of predict, using results from aov, glm, and lm
addition of aoh2hoh transforming array of hash into hash of hashes, intersection, uniq, and vals
aov
Bug fixes
size_tunderflow on empty arrays. Three loops were bounded byav_len(...)compared against an unsigned counter;av_lenreturns-1for an empty array, which turnedk <= leninto aSIZE_MAXloop. Thestack()value loop, the.column-expansion loop, and thegroup_statscolumn loop now use a signedSSize_tbound.HoH row count. Row count for hash-of-hashes input was taken from the return value of
hv_iterinit; it now usesHvUSEDKEYS(hv)with a separatehv_iterinit, matchingpredict.Buffer overflow in interaction parsing.
strcpy(right, colon + 1)into a fixedchar right[256]is nowsnprintf(right, sizeof(right), ...).
Performance / memory
Removed the per-row
row_xscratch allocation. Design rows are built directly intoX_mat[valid_n];valid_nsimply does not advance on a rejected row. Interaction columns read their operands from the same in-progress row, so the logic is unchanged.row_namesis no longer dead. Surviving row names are transferred (pointer move, no copy) intosurv_namesto keyfitted.values; rejected rows are freed in place.Dropped a
restrictUB.orig_data_svaliasesdata_sv; therestrictqualifier was removed.
New, predict-compatible output keys
coefficients— OLS estimates recovered by back-substitution on the R factor left inX_matagainst Q'y inY(no re-derivation). Keys are the expanded term names (Intercept, continuous names,base.leveldummies, anda:binteraction products). Aliased columns are reported asNaN, whichpredictdrops.fitted.values—Xbover the non-aliased columns, keyed by surviving row name. Computed from a snapshot of the design (Dsav) taken before the QR overwritesX_mat. Costs one transient copy of the design matrix; negligible for typical ANOVA where the column count is small.xlevels— sorted level list per factor, index 0 = reference, aligned with the contrast coding used to build the dummies.family—"gaussian".
Cleanup-path correctness
xlevels_hv,Dsav, andsurv_namesare freed on both the "0 degrees of freedom" croak and the normal exit. The interaction-main-effects croak in PHASE 3 also freesxlevels_hv.
Known limitations (unchanged)
The intercept-stripping string surgery (
-1,+0,+1, ...) operates on the whole RHS and can still mangleI(x-1)-style transforms; treatI()with arithmetic constants carefully.Top-level keys
coefficients/fitted.values/xlevels/family/group_statsshare the return hash with the ANOVA rows; a predictor literally named one of those would collide.
predict
New: factor-bearing interaction terms
Previously, interaction coefficients such as GroupB:Sexmale or GroupB:x fell through to the continuous evaluate_term path and died on a nonexistent column. They are now handled directly:
dummy_hvstores each dummy's factor base index (anIV) instead of&PL_sv_yes, so a dummy name maps back to its(base, level)in O(1) (level == name + strlen(base)).hv_existslookups are unaffected.During coefficient caching, any
:term with at least one factor-dummy component is routed to a separate list (icopy/ibeta); pure-continuous interactions (e.g.x:z) stay on the existingevaluate_termpath, so prior behavior is preserved.Each routed term is parsed once into flat component arrays. Factor components store a base index and level pointer; continuous components store the term string and get the same up-front column-existence validation as main terms.
Per row, each factor's raw level is read once into
raw_lv[]and reused by both main effects and interactions (no duplicateget_data_string_alloc). An interaction's value is the product of its components: a factor component contributes1.0iff the row's level matches the dummy's level (reference levels give0), continuous components go throughevaluate_term.
This covers factor×factor, factor×continuous, continuous×continuous, and n-way combinations.
Other
HoH row count uses
HvUSEDKEYS(already present).The unseen-factor-level croak now frees every level string already read for the current row, not just the current one.
Tests
aov.t— one-way ANOVA against hand-computed values (Df / Sum Sq / Mean Sq / F / decomposition); identical results across HoA / HoH / AoH / stacked input; simple regression;.expansion; intercept removal (-1); two-way with interaction (Type I SS on a balanced design); NaN listwise deletion; all croak paths; leak checks.predict.t—predict(training) == fitted.valuesround-trips for one-way, regression, factor×factor, factor×continuous, and continuous×continuous models; explicit predicted values; agreement across HoA / AoH / HoH / flat newdata; no-newdata path; binomiallinkvsresponse; gaussian identity link; all croak paths; leak checks.
Leak tests use no_leaks_ok guarded by unless $INC{'Devel/Cover.pm'} and skipped when Test::LeakTrace is absent.
Assumptions worth confirming
The NaN-deletion test relies on
evaluate_termreturningNaNfor a non-finite response value (anInf - InfNaN is fed in deterministically).The continuous×continuous round-trip relies on
evaluate_term("x:z")yieldingx * z— the same assumption the pre-existingpredictcontinuous-interaction path already made. If that path was untested, this round-trip now exercises it.
view
now returns colored output; fixed bug with incorrect widths; undefined values show as undef rather than NA, as in Data::Printer
csort
now accepts Hash of Hashes; addition of restrict which should decrease calculation time
filter
Added hash-of-hashes (HoH) input. In addition to AoH and HoA,
filternow accepts an HoH ({ key => { col => val, ... }, ... }); each inner hash is one row, and matching keys are preserved by default (HoH -> HoH).Added
output.type.filter($df, $pred, 'output.type' => 'aoh'|'hoa')selects the returned shape (aliasesout/output_type; a bare positional type also works). When omitted, the input shape is preserved.hohis not a selectable output, since it would require choosing a key column.col()reworked, not removed. Both predicate forms are kept:col('age') >= 18still works and is the concise/composable option, while a coderef covers everything else. Internallycol()is now pure Perl — an overloaded class that builds a per-row closure — andfilterunwraps that closure socol()and a coderef share one evaluation path. The previous standalone XS predicate evaluator (filt_eval/filt_ctx) is gone; delete it if your tree still has it. One consequence: acol()comparison now costs the same per row as the equivalent coderef (a Perl call), rather than being evaluated in C.Unchanged guarantees: the input frame is never modified;
undef(and, for numeric ops, non-numeric) cells never match acol()comparison; AoH/HoH rows are shared rather than copied where possible; keep-all/keep-none shapes are well defined per output type; Perl 5.10 compatibility is retained. A latentSvTRUE(POPs)double-evaluation in the per-row call helper (which crashed on perls whereSvTRUEis a multi-eval macro) was fixed along the way.
read_table
Added an opt-in auto.row.names argument so read_table can read the file R produces by default from write.table(x, sep="\t").
The problem
R's write.table defaults to row.names=TRUE, col.names=TRUE, which writes the row-names column in every data row but emits no header label for it. So a frame with N columns comes out as N header fields over N+1 data fields — e.g. mtcars gives 11 headers but 12-field rows. By default read_table (correctly) rejects that as ragged:
Alignment error on mtcars.tsv data row 1 (12 fields vs 11 headers).
The change
auto.row.names turns on R's own read.table rule: when, and only when, the header is exactly one field short of the data rows, treat the first field of each row as an (unlabelled) row-names column.
# default: the leading column is named 'row_name'
my $df = read_table('mtcars.tsv', 'auto.row.names' => 1);
# or give it a name
my $df = read_table('mtcars.tsv', 'auto.row.names' => 'model');
The synthesized column behaves like any other first column: it appears in aoh and hoa output, and for hoh it becomes the default key (so rows are keyed by the model name). This also lines up with the existing handling of R's col.names=NA output (a blank leading header), which still produces a row_name column with no flag needed.
What did not change
The strict alignment check is still the default. Without auto.row.names the lopsided file still croaks, and even with it, a row that is off by anything other than exactly one field still croaks — so the corruption guard only relaxes for the one case R itself treats specially.
Tested in t/read_table.2.t (16 assertions, Perl 5.10.1 and 5.38): aoh / hoa / hoh output, custom column name, the already-aligned file (flag is a no-op), the col.names=NA path, and the strict / ragged croak paths.
additional bugfix
# This is a comment
id,name,val
1,Alice,10.5
2,Bob,
3,Charlie,15.2
would not be read correctly using read_table, but now is read correctly
value_counts
now accepts array of hashes
0.16
changes to dist.ini, the minimum Perl version disappeared when I fixed other problems
clarifications between run time and test dependencies
addition of csort function to sort AoH and HoA
addition of aoh2hoa to translate array of hashes into a hash of arrays
fix of long double functions: https://www.cpantesters.org/cpan/report/5d5d9836-6a5f-11f1-aadb-63fd6d8775ea
glm
output residual keys now use names, not integers
lm
Bug fixes
Memory leak on the zero-degrees-of-freedom error path. When valid_n <= p, the cleanup freed the valid_row_names array but not the per-row name strings it held (those had been transferred out of row_names, whose own array was already freed). The strings leaked on every such error. Added the per-entry Safefree loop before freeing the array, matching the normal path.
HoH input validated only the first row. Only the first hash value was checked to be a HASHREF; subsequent values were SvRV'd unconditionally, so a malformed row ({ a => {...}, b => 5 }) dereferenced a non-reference. Every row is now validated, with the partial allocations cleaned up before the croak, mirroring the existing AoH path.
isspace on a possibly-signed char. isspace(*src) is undefined for byte values ≥ 0x80 on platforms where char is signed. Cast to (unsigned char) before the call.
Speed / RAM improvements
Formula buffer is now heap-allocated to fit. char f_cpy[512] silently truncated any longer formula. Replaced with a buffer sized to strlen(formula) + 1, so there is no fixed limit and no truncation.
.-expansion buffer is now a growable heap buffer. char rhs_expanded[2048] silently dropped expanded terms once full. It is now a buffer that doubles on demand. Appends also went from strcat (which rescans from the start every time — O(n²) over many columns) to an O(1) amortised append that tracks the write position.
No more per-row scratch allocation in matrix construction. The original safemalloc'd a row_x buffer, filled it, copied it into X, and freed it for every row — n allocations plus n*p copies. Each candidate row is now written straight into X at its prospective commit slot; a row that fails listwise deletion is simply overwritten by the next candidate. This removes the n allocate/free cycles and the copy loop entirely.
Categorical levels sorted with qsort. The level list used an O(n²) bubble sort; replaced with qsort (relevant only for high-cardinality factors).
Unused tail of X reclaimed after listwise deletion. X is allocated for all n rows up front (valid_n is unknown until rows are scanned). When rows are dropped, X is now Renewed down to valid_n * p, returning the unused tail to the allocator before the OLS phase.
Minor robustness. The argument-parsing index was widened from unsigned short to I32 to match items, and the HoH row count now uses HvUSEDKEYS rather than relying on hv_iterinit's return value.
Known limitations (left unchanged)
A multi-way term such as
a*b*cis split only on the first*, so it yieldsa,b*c, anda:b*crather than a full three-way expansion. Deeper interactions silently fail (the unparsable term evaluates toNaNand the rows are dropped). This matches the documented two-way*support.HoA input takes the row count from the first column; columns shorter than that simply contribute dropped rows rather than raising an error.
oneway_test
Bug fixes
Memory leaks on error paths. Nearly every croak after an allocation leaked memory. croak does a longjmp, so anything allocated but not yet freed is lost. Affected paths:
AoA and hash first-pass errors leaked
sizesand anygnames[]entries allocated so far.Formula-mode "not found as an array ref" errors leaked
lhsandrhs.
All post-allocation errors now route through a single fail: label that frees every pointer unconditionally. Pointers are initialised to NULL and gnames is zero-allocated with Newxz, so the cleanup is always safe to run.
Undefined and non-numeric cells silently coerced to 0.0. The original second pass used (svp && *svp) ? SvNV(*svp) : 0.0, meaning an undef or non-numeric cell was quietly treated as zero, silently corrupting the F-statistic. Each cell is now validated with SvOK and looks_like_number; the call dies naming the group and observation index, consistent with the rest of Stats::LikeR (mean, sum, cor, etc.).
Unsigned wraparound on empty array input. k = (size_t)av_len(in_av) + 1 cast to size_t before adding, so an empty array (av_len returns -1) produced SIZE_MAX rather than 0. Changed to k = (size_t)(av_len(in_av) + 1) so the +1 is done in signed arithmetic before the cast.
Unreliable group count from hv_iterinit. hv_iterinit returns the number of buckets in use rather than the number of keys for tied hashes. Replaced with HvUSEDKEYS, which always returns the correct key count.
Improvements
var.equal accepted as an alias for var_equal. R users write var.equal; the argument parser now accepts both spellings.
Perl memory API used throughout. safemalloc and manual memcpy replaced with Newx, Newxz, savepv, and savepvn. savepvn additionally preserves embedded NUL bytes in group key strings, which the previous strlen-based copies silently truncated.
Known limitations (not changed)
A factor column named
Residualsorgroup_statsin a formula call will collide with reserved top-level keys in the result hash.Group names containing an embedded NUL are stored correctly but are still truncated at
strlenwhen written into the output hash keys.
view
default view shifted to 80 characters to match Linux window length
New features
rowsis accepted as a synonym forn(the number of rows shown). Passing bothnandrowsis an error.Unknown arguments are now rejected.
viewvalidates its argument names against the documented set (n,rows,na,max_width,ellipsis,gap,cols,columns,to,return_only,row.names,row_names) and dies listing any it does not recognise, so a misspelt option (e.g.widht) is caught instead of silently ignored.n/rowsis validated. It must be a non-negative integer;undefor a non-numeric value now dies with a clear message instead of producing warnings and being treated as0.flat/simple hashes are accepted as input
Bug fixes
n => 0now still prints the column header. Column names were collected only from the rows being shown, so requesting zero rows produced an empty header line. At least one row is now scanned (when data exists) so the header always lists the columns.An empty hash (
{}) no longer dies. It was rejected as "neither ARRAY nor HASH"; it is now shown as an empty table (0 rows x 0 cols), matching the handling of an empty array.The
row_namesalias now drives the Hash-of-Hashes label header. The header for the row-label column consulted onlyrow.names, sorow_names => 'id'displayedrow_nameinstead ofid. Both spellings are now honoured consistently.Malformed nested values degrade gracefully. A Hash-of-Arrays column or Hash-of-Hashes row whose value is not actually an array/hash reference now renders as empty cells rather than throwing a dereference error.
Performance
Column gathering no longer sorts once per scanned row. Unique column names are collected across the scanned rows and sorted a single time (same output order), and the ellipsis length is computed once rather than per cell.
Tests
t/view.tis self-contained (theviewimplementation is inlined; it loads no other files) and covers the new argument handling, the bug fixes above, and the existing AoH / HoA / HoH behaviour, alignment, truncation, and output-path handling.
wilcox_test
Corrected four bugs in the wilcox_test XSUB plus a portability fix in its exact signed-rank helper. Behaviour on valid input is unchanged: the R-agreement cases (unpaired W = 58, p = 0.13292; paired one-sided V = 40, p = 0.019531; separated exact W = 0, p = 0.028571) all still match R's wilcox.test.
Bug fixes
Invalid
alternativeis now rejected. Any value other thanlessorgreaterpreviously fell through to the two-sided branch and returned a two-sided result mislabelled with the bad string, so a typo likealternative => "twosided"silently "worked". It now croaks unlessalternativeis one oftwo.sided,less,greater.Zero/negative variance is guarded. When every observation is tied the approximation's variance collapses to 0 and the old code divided by
sqrt(0):wilcox_test([5,5,5], [5,5,5])returnedp = 0(a "significant" difference between identical samples). It now warns and returnsp = 1.Two-sided continuity correction at
z = 0. R usessign(z) * 0.5, so the correction is0when the statistic sits exactly on its mean; the old code used-0.5. Example:wilcox_test([1,4], [2,3], exact => 0)changed fromp = 0.698535top = 1(matches R).expno longer shadows libm. The localexpaccumulator (mean of the statistic) shadowed the C libraryexp(); renamed tomean_w(two-sample) andmean_v(signed-rank). No active miscompute, removed as a latent hazard.
Cosmetic
Collapsed a no-op ternary that assigned the same signed-rank exact method string on both branches; the
methodfield is now simplyWilcoxon signed rank exact test.
Portability (exact signed-rank helper)
exact_psignrankno longer callspowl(). The2^nnormaliser is now built by exact repeated doubling, which has no long-double libm dependency. This fixes anUndefined symbol "powl"load failure reported by a CPAN smoker (FreeBSD, perl 5.20,nvtype=double) whose libm lacks the long-double math functions; the symbol resolved on glibc, which is why local builds passed.long doubleaccumulation in the DP is retained — only thepowlcall was at fault.int→size_tforn,max_v, and the DP loop counters, which also removes asize_t-to-intnarrowing at the call site. Thefloor()result (k) stays signed so its negative-qsentinel still fires, and is cast tosize_tonly after thek < 0check.
Tests
Added
t/wilcox_test.t(flat, no subtests): R-agreement cases, option handling (paired,correct,exact,mu, named/positionalx/y, NA dropping), regressions for all four bug fixes, argument-error andalternative-validation checks, output shape, andno_leaks_okcoverage of the two-sample, exact, and paired allocation paths.
0.15
view function added, similar to R's head
read_table: filter => { 'Testosterone, total (nmol/L)' => sub { defined $_ }, }
was broken by the change in undefined variables in 0.14, but is back to being undef
col2col improvement in sectioning in README
Numerous changes to prevent quadmath/long double CPAN test failures
Minimum Scalar::Util version in dist.ini is now 1.22, see https://www.cpantesters.org/cpan/report/6b682236-6567-11f1-a3bc-a055f9c4ba34
Digest::SHA is no longer needed, and removed as a dependency
read_table
Bug fixes
A comment-prefixed header is now read correctly.
read_tablestrips a leading comment marker from the header line (so a file may begin with#id,val), but that strip was dead code: the XS parser skipped every line beginning with the comment string before the callback ever saw it, so a commented header was silently dropped and the first data row was mistaken for the header. The parser now delivers the first content line even when it begins with the comment marker, and only skips comment lines after the header has been seen.Carriage returns inside quoted fields are preserved. The parser stripped
\runconditionally, so a quoted value such as"x\ry"lost its carriage return and would not survive awrite_table->read_tableround-trip.\ris now stripped only as part of a trailing CRLF line ending and as a stray CR outside quotes; inside quotes it is literal data.Duplicate column names no longer corrupt
hoaoutput. Withoutput.type => 'hoa', a repeated column name pushed the same cell once per occurrence, so the affected columns came out longer than the others and the arrays no longer lined up by row. Columns are now keyed by unique header name (first-seen order preserved, later values win, one warning emitted).A defined non-CODE callback is now an error. Passing a defined argument that was not a CODE reference silently fell through to slurp mode and ignored the argument; it now croaks ("callback must be a CODE reference").
An undefined/empty
hohrow-name now dies instead of keying on"". Withoutput.type => 'hoh', a row whose row-name column was empty/undef was stored under the''key and raised "uninitialized value" warnings. It now dies, naming the column and the offending data row.A numeric filter key past the last column now dies. A 1-based numeric filter key greater than the column count was accepted, then silently extended every row through the
$_write-back. It is now rejected up front with a message naming the column count.sepanddelimtogether now die. Supplying both silently preferreddelim; passing both is now an explicit error (delimremains an alias forsepwhen used alone).The library no longer prints to STDOUT. The unknown-argument path used
sayto dump the offending names to STDOUT before dying; the names are now carried in thediemessage itself.
Better diagnostics
Alignment errors now report which data row is ragged ("Alignment error on FILE data row N (X fields vs Y headers)"), instead of only the field/header counts.
Memory-leak fixes (exception paths)
The parser allocated its working buffers (current_row, field, and — in slurp mode — data) in the XS INIT: block, i.e. before any validation, and freed them only by falling off the end of the function. Any non-local exit therefore leaked:
the open-failure
croakleaked the row buffer and field (and the slurp accumulator);far more commonly, a
diethrown inside the row callback — whichread_tabledoes routinely on alignment errors, bad row names, and filter exceptions — unwound straight out of the XS frame and leaked the field, the current row, the line buffer, the slurp accumulator, and the open file handle.
Allocations now happen in CODE: after every croak-able check, and every long-lived resource (the file handle via SAVEDESTRUCTOR_X, the buffers via SAVEFREESV) is tied to the save stack, which an exception unwinds. Measured with Test::LeakTrace: a die mid-file went from 5 leaked SVs to 0, and an open failure from 2 to 0. This is the likely source of the constant-size leaks seen in CPAN-tester reports for the exception-path tests.
Performance
~2.5x faster parsing (57 -> 145 MB/s on a 100k-row quoted file). The core loop appended one character at a time with
sv_catpvn(field, &ch, 1); it now scans runs of ordinary bytes withmemchr/ a bounded scan and appends each run in a singlesv_catpvn, copying field contents in bulk rather than byte by byte.
Internal / non-behavioral
XS declarations moved from
INIT:toPREINIT:; allocations deferred intoCODE:(see the leak fixes above).The filter loop now aliases the row hash with
local *_ = \%line_hashinstead of copying it withlocal %_ = %line_hash. This removes a full per-row hash copy for every filtered row and fixes a latent staleness bug: after a filter mutated$_and the change was written back,%_still reflected the pre-mutation copy, so a subsequent filter in the same row saw stale values. With aliasing,%_is the row, so write-backs are always visible.
Known limitation (not changed)
undef.valdoes not round-trip back toundef.write_tablerenders anundefcell as an empty field by default, andread_tablemaps an empty field back toundef, so the default round-trip is clean. But if a file is written with a token such as'undef.val' => 'NA',read_tablehas no inverse option and readsNAback as the string'NA'.read_tablealso cannot distinguish a deliberately quoted empty string ("") from a missing value -- both becomeundef. Adding anna.strings-style option toread_table(mapping configurable tokens and/or empty fields toundef) would close this gap.
write_table
Behavior change
undefcells now write as an empty field, not an empty string. A missing orundefvalue renders as nothing between separators (a,,c) rather than a quoted empty string (a,'',c/a,"",c). Supplying'undef.val' => 'NA'(or any other token) still overrides this, exactly as before. This is the only change that can alter the bytes of an existing output file; if you relied on the previous default, pass'undef.val' => ''to keep an explicit empty field, or your chosen placeholder.
Bug fixes
Wide-character / UTF-8 column names and row keys now round-trip. Previously, cells were looked up with the raw bytes of the column name (
hv_fetch(..., SvPV_nolen(name), strlen(name), ...)), which fails to match a UTF-8-flagged hash key: the column header printed correctly but every cell under it came back empty. All lookups now fetch by SV (hv_fetch_ent), header lists are gathered and sorted as SVs (sortsv+sv_cmp, preserving the flag) instead of being round-tripped throughchar *, and therow.namescolumn is matched withsv_eqrather thanstrcmp. Embedded NUL bytes in keys are handled correctly as a side effect.col.names => []no longer loops forever. An emptycol.namesarray madeav_len()return-1, which — compared against an unsignedsize_tloop index — wrapped toSIZE_MAXand ran effectively without end. This was fixed for flat hashes previously; it was still present for hash-of-hashes, hash-of-arrays, and array-of-hashes, plus bothrow.namesheader-filtering loops. All such loops now use a signed index.Tables wider than 65,535 columns no longer hang. One header loop used an
unsigned shortindex that silently wrapped past 65,535 and never terminated. It now usessize_tlike the rest of the code.Flat-hash cells holding a reference now croak. Every other input shape rejects a nested reference with "Cannot write nested reference types to table"; a flat hash instead stringified it (e.g.
ARRAY(0x55...)) into the file. It now croaks consistently.'undef.val' => undefis handled cleanly. It previously calledSvPV_nolenonundef, raising an "uninitialized value" warning and yielding an empty string by accident. It is now treated explicitly as an empty field, with no warning.
Memory-leak fixes (exception paths)
The row-key list gathered for hash-of-hashes input was leaked when the output file could not be opened.
The "Could not get headers" croak on hash-of-arrays input leaked both the already-open filehandle and the headers array.
Internal / non-behavioral
Numeric row labels are now formatted into a reused stack buffer instead of a per-row
savepv()/safefree()allocation (no functional change; removes a cast-away-constand one allocation per row).Several signed/unsigned index types were made consistent (
SSize_tvssize_t) to matchav_len()and silence the conditions behind the loop bugs above.
Tests
t/write_table.texpanded from 17 to 69 assertions. New coverage targets each fix above: the empty-field default andundef.val => undef(no warning),col.names => []termination across all four input shapes, the >65,535-column header loop (gated behindEXTENDED_TESTING=1), in-sequence numeric row labels, nested-reference rejection, CSV quoting corners (carriage return, separators inside column names, multi-character separators), empty input writing no file, and UTF-8 column names and row keys. Two leak assertions cover the exception paths above.
0.14
filter function added for rows
read_table reads undefined values to undef instead of NA, which makes calculations easier
write_table writes undef by default as an empty string ''
hoh2hoa transforms a hash of hashes into an hash of arrays
quantile uses NV instead of double to allow for high-precision 128-bit floats to be used on quadmath machines when available: https://www.cpantesters.org/cpan/report/296f4868-631f-11f1-abba-ff15558d240b
Numerous switches from double to NV for local precision, like above
numerous changes to col2col for ease of use and working with datasets with numerous undefined values
dist.ini now links to math library when compiling: https://www.cpantesters.org/cpan/report/785e26d8-6397-11f1-89c0-dc066e8775ea
fisher_test now should be complete, errors with confidence intervals fixed
0.13
read_table: speed improvements; commented headers are now allowed
write_table: fix for
Attempt to free temp prematurely: SV 0x56417a2ae610 at t/write_table.t line 182.
main::wrote_ok(",age\x{a}Alice,30\x{a}Bob,25\x{a}", "row.names => 'name' uses that column as labels", HASH(0x56417a272250), "row.names", "name") called at t/write_table.t line 203
Attempt to free unreferenced scalar: SV 0x56417a2ae610 at t/write_table.t line 183.
main::wrote_ok(",age\x{a}Alice,30\x{a}Bob,25\x{a}", "row.names => 'name' uses that column as labels", HASH(0x56417a272250), "row.names", "name") called at t/write_table.t line 203
write_table gives better warnings for incorrect types of data given
Numerous changes to dist.ini to improve CPAN testing, especially for Win32
0.12
add_data can also take hash of arrays, and various mixes of data types
ljoin: Addition of restrict keywords in many places; should improve CPU performance
Better POD formatting, correction of output hash for README's add_data
chisq_test can now accept hash of hashes as input
new transpose function for switching 2D hash keys and 2D array indices, and col2col for comparing columns against columns
removed unused function from C helpers
value_counts: addition of restrict keywords in preinit, should improve CPU performance
MANIFEST.skip changed to MANIFEST.SKIP to improve CPAN testing
using is_deeply for tests of transpose, which may or may not work with CPAN testers (experimental)
Added function name to warnings, so I actually know which function is producing the error
write_table can also take file and data as args, in addition to positions
fixed write_table as it could hang if given empty col.names or row.names
Added __EXTENSIONS__ to source XS file for better CPAN testing
0.11
better POD formatting for tables
addition of MANIFEST.skip to get better testing results on CPAN
glm: bugfix for when there is no intercept in the formula, new test cases in t/glm.t
write_table now accepts simple hashes as input, in addition to hash of arrays, hash of hashes, and arrays of hashes
Better documentation for t-test
0.10
changes to compilation for CPAN, trying to get this work on Windows
Addition of prcomp and value_counts
matrix will work without key names, just like in R. Testing for matrix has improved.
0.09
context changes in XS dTHX, pTHX_, and aTHX_ to get better CPAN testing results
restrict keywords added to lm to increase speed
0.08
Speed improvement in summary of hashes.
Addition of add_data, dnorm, group_by, ljoin, and mode functions
Chi-squared function no longer has Perl wrapper, and all code is in XS, which should result in a minor speed increase with 1 less function call.
Compiler changes for GNU source and inclusion of strings.h, to ensure more CPAN testing works better.
read_table now returns hash-of-hash in {row}{column}
0.07
Addition of summary function.
Formulas can now be omitted from aov, resulting in a stacked calculation as R would think.
Addition of oneway_test for multi-group comparisons that does not assume normality like aov does.
read_table and write_table now automatically set separators for .csv files as , and .tsv files as "\t", respectively, so these values no longer need to be specified separately from the file name.
0.06
Changed compiler options so that Solaris will work
signed integers changed to unsigned in glm
Added restrict keywords to power_t_test, and made int to unsigned int
0.05
Leak testing for sample
removal of Data::Printer dependency for easier CPAN testing
switched several unsigned int variable to I32 so that clang doesn't complain
added restrict keyword for sample
0.04
addition of sample function
GNU source, to maximize compatibility and ease installation
removal of JSON dependency to ease installation
0.03
Compatibility back to Perl 5.10
0.02
back-compatible to Perl 5.10, instead of original 5.40, ensuring more people can use it
added var_test
mean, min, sum, median, var, and max die with undefined values, and print the offending indices
"group_stats" added to aov, for TukeyHSD in the future
"cor" dies when given data with standard deviation of 0
write_table now has undef.val option, which shows how undefined values are printed to tables, which is NA by default.