lib/Statistics/CaseResampling.pm


            
              1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
—
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
              package Statistics::CaseResampling;
use 5.008001;
use strict;
use warnings;
require Exporter;
our $VERSION = '0.15';
our @ISA = qw(Exporter);
our @EXPORT_OK = qw(
  resample
  resample_medians
  resample_means
  select_kth
  median
  median_absolute_deviation
  first_quartile
  third_quartile
  mean
  simple_confidence_limits_from_samples
  median_simple_confidence_limits
  approx_erf
  approx_erf_inv
  nsigma_to_alpha
  alpha_to_nsigma
  sample_standard_deviation
  population_standard_deviation
);
our @EXPORT = qw();
our %EXPORT_TAGS = ('all' => \@EXPORT_OK);
require XSLoader;
XSLoader::load('Statistics::CaseResampling', $VERSION);
our $Rnd = Statistics::CaseResampling::RdGen::setup(rand());
1;
__END__
=head1 NAME
Statistics::CaseResampling - Efficient resampling and calculation of medians with confidence intervals
=head1 SYNOPSIS
  use Statistics::CaseResampling ':all';
   
  my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
  my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
  #my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
   
  # calculate the median of the sample with lower and upper confidence
  # limits using resampling/bootstrapping:
  my ($lower_cl, $median, $upper_cl)
    = median_simple_confidence_limits($sample, $confidence);
   
  # There are many auxiliary functions:
   
  my $resampled = resample($sample);
  # $resampled is now a random set of measurements from $sample,
  # including potential duplicates
   
  my $medians = resample_medians($sample, $n_resamples);
  # $medians is now an array reference containing the medians
  # of $n_resamples resample runs
  # This is vastly more efficient that doing the same thing with
  # repeated resample() calls!
  # Analogously:
  my $means = resample_means($sample, $n_resamples);
   
  # you can get the cl's from a set of separately resampled medians, too:
  my ($lower_cl, $median, $upper_cl)
    = simple_confidence_limits_from_samples($median, $medians, $confidence);
   
  # utility functions:
  print median([1..5]), "\n"; # prints 3
  print mean([1..5]), "\n"; # prints 3, too, surprise!
  print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum
=head1 DESCRIPTION
The purpose of this (XS) module is to calculate the median (or in principle
also other statistics) with confidence intervals on a sample. To do that,
it uses a technique called bootstrapping. In a nutshell, it resamples the
sample a lot of times and for each resample, it calculates the median.
From the distribution of medians, it then calculates the confidence limits.
In order to implement the confidence limit calculation, various other
functions had to be implemented efficiently (both algorithmically efficient
and done in C). These functions may be useful in their own right and are
thus exposed to Perl. Most notably, this exposes a median (and general selection)
algorithm that works in linear time as opposed to the trivial implementation
that requires C<O(n*log(n))>.
=head2 Random numbers
The resampling involves drawing B<many> random numbers. Therefore,
the module comes with an embedded Mersenne twister (taken from
C<Math::Random::MT>).
If you want to change the seed of the RNG, do this:
  $Statistics::CaseResampling::Rnd
    = Statistics::CaseResampling::RdGen::setup($seed);
  
or
  $Statistics::CaseResampling::Rnd
    = Statistics::CaseResampling::RdGen::setup(@seed);
Do not use the embedded random number generator for other purposes.
Use C<Math::Random::MT> instead! At this point, you cannot change
the type of RNG.
=head2 EXPORT
None by default.
Can export any of the functions that are documented below
using standard C<Exporter> semantics, including the
customary C<:all> group.
=head1 FUNCTIONS
This list of functions is loosely sorted from I<basic>
to I<comprehensive> because the more complicated functions
are usually (under the hood, in C) implemented using the
basic building blocks. Unfortunately, that means you may
want to read the documentation backwards. :)
Additionally, there is a whole set of general purpose, fast (XS)
functions for calculating statistical metrics. They're useful
without the bootstrapping related stuff, so they're listed in
the L</"OTHER FUNCTIONS"> section below.
All of these functions are written in C for speed.
=head2 resample(ARRAYREF)
Returns a reference to an array containing N random elements from the
input array, where N is the length of the original array.
This is different from shuffling in that it's random drawing without
removing the drawn elements from the sample.
=head2 resample_medians(ARRAYREF, NMEDIANS)
Returns a reference to an array containing the medians of
C<NMEDIANS> resamples of the original input sample.
=head2 resample_means(ARRAYREF, NMEANS)
Returns a reference to an array containing the means of
C<NMEANS> resamples of the original input sample.
=head2 simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES, CONFIDENCE)
Calculates the confidence limits for I<STATISTIC>. Returns
the lower confidence limit, the statistic, and the upper confidence
limit. I<STATISTIC_SAMPLES> is the output of, for example, C<resample_means(\@sample)>.
I<CONFIDENCE> indicates the fraction of data within the confidence limits
(assuming a normal, symmetric distribution of the statistic =E<gt> I<simple confidence
limits>).
For example, to get the 90% confidence (~2 sigma) interval for the mean of your sample,
you can do the following:
  my $sample = [<numbers>];
  my $means = resample_means($sample, $n_resamples);
  my ($lower_cl, $mean, $upper_cl)
    = simple_confidence_limits_from_samples(mean($sample), $means, 0.90);
If you want to apply this logic to other statistics such as the first
or third quartile, you have to manually resample and lose the advantage of
doing it in C:
  my $sample = [<numbers>];
  my $quartiles = [];
  foreach (1..1000) {
    push @$quartiles, first_quartile(resample($sample));
  }
  my ($lower_cl, $firstq, $upper_cl)
    = simple_confidence_limits_from_samples(
        first_quartile($sample), $quartiles, 0.90
      );
For a reliable calculation of the confidence limits, you should use at least
1000 samples if not more. Yes. This whole procedure is B<expensive>.
For medians, however, use the following function:
=head2 median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
Calculates the confidence limits for the C<CONFIDENCE> level
for the median of I<SAMPLE>. Returns the lower confidence limit,
the median, and the upper confidence limit.
In order to calculate the limits, a lot of resampling has to be done.
I<NSAMPLES> defaults to C<1000>. Try running this a couple of times
on your data interactively to see how the limits still vary a little
bit at this setting.
=head1 OTHER FUNCTIONS
=head2 approx_erf($x)
Calculates an approximatation of the error function of I<x>.
Implemented after
  Winitzki, Sergei (6 February 2008).
  "A handy approximation for the error function and its inverse" (PDF). 
  http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
Quoting: Relative precision better than C<1.3e-4>.
=head2 approx_erf_inv($x)
Calculates an approximation of the inverse of the
error function of I<x>.
Algorithm from the same source as C<approx_erf>.
Quoting: Relative precision better than C<2e-3>.
=head2 nsigma_to_alpha($nsigma)
Calculates the probability that a measurement from a normal
distribution is further away from the mean than C<$nsigma>
standard deviations.
The confidence level (what you pass as the
C<CONFIDENCE> parameter to some functions in this module)
is C<1 - nsigma_to_alpha($nsigma)>.
=head2 alpha_to_nsigma($alpha)
Inverse of C<nsigma_to_alpha()>.
=head2 mean(ARRAYREF)
Calculates the mean of a sample.
=head2 median(ARRAYREF)
Calculates the median (second quartile)
of a sample. Works in linear time thanks
to using a selection instead of a sort.
Unfortunately, the way
this is implemented, the median of an even number of parameters
is, here, defined as the C<n/2-1>th largest number and not
the average of the C<n/2-1>th and the C<n/2>th number. This shouldn't
matter for nontrivial sample sizes!
=head2 median_absolute_deviation(ARRAYREF)
Calculates the median absolute deviation (MAD) in what I believe is O(n).
Take care to rescale the MAD before using it in place of a standard deviation.
=head2 first_quartile(ARRAYREF)
Calculates the first quartile of the sample.
=head2 third_quartile(ARRAYREF)
Calculates the third quartile of the sample.
=head2 select_kth(ARRAYREF, K)
Selects the kth smallest element from the sample.
This is the general function that implements the median/quartile
calculation. You get the median with this definition of K:
  my $k = int(@$sample/2) + (@$sample & 1);
  my $median = select_kth($sample, $k);
=head2 sample_standard_deviation
Given the sample mean and an anonymous array of numbers (the sample),
calculates the sample standard deviation.
=head2 population_standard_deviation
Same as sample_standard_deviation, but without the correction to C<N>.
=head1 TODO
One could calculate more statistics in C for performance.
=head1 SEE ALSO
L<Math::Random::MT>
On the approximation of the error function:
  Winitzki, Sergei (6 February 2008).
  "A handy approximation for the error function and its inverse" (PDF). 
  http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
The ~O(n) median implementation is based on C.A.R. Hoare's quickselect
algorithm. See
L<http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm>.
Right now, it does not implement the Median of Medians algorithm that would
guarantee linearity.
=head1 AUTHOR
Steffen Mueller, E<lt>smueller@cpan.orgE<gt>
Daniel Dragan, E<lt>bulk88@hotmail.comE<gt>, who supplied MSVC compatibility
patches.
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2010, 2011, 2012, 2013 by Steffen Mueller
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.8.0 or,
at your option, any later version of Perl 5 you may have available.
=cut
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)