NAME
Statistics::CaseResampling - Efficient resampling and calculation of medians with confidence intervals
SYNOPSIS
use Statistics::CaseResampling ':all';
my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
#my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
# calculate the median of the sample with lower and upper confidence
# limits using resampling/bootstrapping:
my ($lower_cl, $median, $upper_cl)
= median_simple_confidence_limits($sample, $confidence);
# There are many auxiliary functions:
my $resampled = resample($sample);
# $resampled is now a random set of measurements from $sample,
# including potential duplicates
my $medians = resample_medians($sample, $n_resamples);
# $medians is now an array reference containing the medians
# of $n_resamples resample runs
# This is vastly more efficient that doing the same thing with
# repeated resample() calls!
# Analogously:
my $means = resample_means($sample, $n_resamples);
# you can get the cl's from a set of separately resampled medians, too:
my ($lower_cl, $median, $upper_cl)
= simple_confidence_limits_from_samples($median, $medians, $confidence);
# utility functions:
print median([1..5]), "\n"; # prints 3
print mean([1..5]), "\n"; # prints 3, too, surprise!
print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum
DESCRIPTION
The purpose of this (XS) module is to calculate the median (or in principle also other statistics) with confidence intervals on a sample. To do that, it uses a technique called bootstrapping. In a nutshell, it resamples the sample a lot of times and for each resample, it calculates the median. From the distribution of medians, it then calculates the confidence limits.
In order to implement the confidence limit calculation, various other functions had to be implemented efficiently (both algorithmically efficient and done in C). These functions may be useful in their own right and are thus exposed to Perl. Most notably, this exposes a median (and general selection) algorithm that works in linear time as opposed to the trivial implementation that requires O(n*log(n))
.
Random numbers
The resampling involves drawing many random numbers. Therefore, the module comes with an embedded Mersenne twister (taken from Math::Random::MT
).
If you want to change the seed of the RNG, do this:
$Statistics::CaseResampling::Rnd
= Statistics::CaseResampling::RdGen::setup($seed);
or
$Statistics::CaseResampling::Rnd
= Statistics::CaseResampling::RdGen::setup(@seed);
Do not use the embedded random number generator for other purposes. Use Math::Random::MT
instead! At this point, you cannot change the type of RNG.
EXPORT
None by default.
Can export any of the functions that are documented below using standard Exporter
semantics, including the customary :all
group.
FUNCTIONS
This list of functions is loosely sorted from basic to comprehensive because the more complicated functions are usually (under the hood, in C) implemented using the basic building blocks. Unfortunately, that means you may want to read the documentation backwards. :)
Additionally, there is a whole set of general purpose, fast (XS) functions for calculating statistical metrics. They're useful without the bootstrapping related stuff, so they're listed in the "OTHER FUNCTIONS" section below.
All of these functions are written in C for speed.
resample(ARRAYREF)
Returns a reference to an array containing N random elements from the input array, where N is the length of the original array.
This is different from shuffling in that it's random drawing without removing the drawn elements from the sample.
resample_medians(ARRAYREF, NMEDIANS)
Returns a reference to an array containing the medians of NMEDIANS
resamples of the original input sample.
resample_means(ARRAYREF, NMEANS)
Returns a reference to an array containing the means of NMEANS
resamples of the original input sample.
simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES, CONFIDENCE)
Calculates the confidence limits for STATISTIC. Returns the lower confidence limit, the statistic, and the upper confidence limit. STATISTIC_SAMPLES is the output of, for example, resample_means(\@sample)
. CONFIDENCE indicates the fraction of data within the confidence limits (assuming a normal, symmetric distribution of the statistic => simple confidence limits).
For example, to get the 90% confidence (~2 sigma) interval for the mean of your sample, you can do the following:
my $sample = [<numbers>];
my $means = resample_means($sample, $n_resamples);
my ($lower_cl, $mean, $upper_cl)
= simple_confidence_limits_from_samples(mean($sample), $means, 0.90);
If you want to apply this logic to other statistics such as the first or third quartile, you have to manually resample and lose the advantage of doing it in C:
my $sample = [<numbers>];
my $quartiles = [];
foreach (1..1000) {
push @$quartiles, first_quartile(resample($sample));
}
my ($lower_cl, $firstq, $upper_cl)
= simple_confidence_limits_from_samples(
first_quartile($sample), $quartiles, 0.90
);
For a reliable calculation of the confidence limits, you should use at least 1000 samples if not more. Yes. This whole procedure is expensive.
For medians, however, use the following function:
median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
Calculates the confidence limits for the CONFIDENCE
level for the median of SAMPLE. Returns the lower confidence limit, the median, and the upper confidence limit.
In order to calculate the limits, a lot of resampling has to be done. NSAMPLES defaults to 1000
. Try running this a couple of times on your data interactively to see how the limits still vary a little bit at this setting.
OTHER FUNCTIONS
approx_erf($x)
Calculates an approximatation of the error function of x. Implemented after
Winitzki, Sergei (6 February 2008).
"A handy approximation for the error function and its inverse" (PDF).
http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
Quoting: Relative precision better than 1.3e-4
.
approx_erf_inv($x)
Calculates an approximation of the inverse of the error function of x.
Algorithm from the same source as approx_erf
.
Quoting: Relative precision better than 2e-3
.
nsigma_to_alpha($nsigma)
Calculates the probability that a measurement from a normal distribution is further away from the mean than $nsigma
standard deviations.
The confidence level (what you pass as the CONFIDENCE
parameter to some functions in this module) is 1 - nsigma_to_alpha($nsigma)
.
alpha_to_nsigma($alpha)
Inverse of nsigma_to_alpha()
.
mean(ARRAYREF)
Calculates the mean of a sample.
median(ARRAYREF)
Calculates the median (second quartile) of a sample. Works in linear time thanks to using a selection instead of a sort.
Unfortunately, the way this is implemented, the median of an even number of parameters is, here, defined as the n/2-1
th largest number and not the average of the n/2-1
th and the n/2
th number. This shouldn't matter for nontrivial sample sizes!
median_absolute_deviation(ARRAYREF)
Calculates the median absolute deviation (MAD) in what I believe is O(n). Take care to rescale the MAD before using it in place of a standard deviation.
first_quartile(ARRAYREF)
Calculates the first quartile of the sample.
third_quartile(ARRAYREF)
Calculates the third quartile of the sample.
select_kth(ARRAYREF, K)
Selects the kth smallest element from the sample.
This is the general function that implements the median/quartile calculation. You get the median with this definition of K:
my $k = int(@$sample/2) + (@$sample & 1);
my $median = select_kth($sample, $k);
sample_standard_deviation
Given the sample mean and an anonymous array of numbers (the sample), calculates the sample standard deviation.
population_standard_deviation
Same as sample_standard_deviation, but without the correction to N
.
TODO
One could calculate more statistics in C for performance.
SEE ALSO
On the approximation of the error function:
Winitzki, Sergei (6 February 2008).
"A handy approximation for the error function and its inverse" (PDF).
http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
The ~O(n) median implementation is based on C.A.R. Hoare's quickselect algorithm. See http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm. Right now, it does not implement the Median of Medians algorithm that would guarantee linearity.
AUTHOR
Steffen Mueller, <smueller@cpan.org>
Daniel Dragan, <bulk88@hotmail.com>, who supplied MSVC compatibility patches.
COPYRIGHT AND LICENSE
Copyright (C) 2010, 2011, 2012, 2013 by Steffen Mueller
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.0 or, at your option, any later version of Perl 5 you may have available.