NAME
Statistics::FisherPitman - Randomization-based alternative to one-way independent groups ANOVA; unequal variances okay
SYNOPSIS
use Statistics::FisherPitman 0.03;
my @dat1 = (qw/12 12 14 15 12 11 15/);
my @dat2 = (qw/13 14 18 19 22 21 26/);
my $fishpit = Statistics::FisherPitman->new();
$fishpit->load({d1 => \@dat1, d2 => \@dat2});
# Oh, more data just came in:
my @dat3 = (qw/11 7 7 2 19 19/);
$fishpit->add({d3 => \@dat3});
$fishpit->test(resamplings => 1000)->dump(title => "A test");
DESCRIPTION
Tests for a difference between independent samples. It is commonly recommended as an alternative to the one-way independent groups ANOVA when variances are unequal, as its test statistic, T, does not depend on an estimate of variance. As a randomization test, it is "distribution-free": the probability of obtaining the observed value of T is derived from the data themselves.
METHODS
new
$fishpit = Statistics::FisherPitman->new()
Class constructor; expects nothing.
load
$fishpit->load('aname', @data1)
$fishpit->load('aname', \@data1)
$fishpit->load({'aname' => \@data1, 'another_name' => \@data2})
Alias: load_data
Accepts either (1) a single name => value pair of a sample name, and a list (referenced or not) of data; or (2) a hash reference of named array references of data. The data are loaded into the class object by name, within a hash named data, as Statistics::Descriptive::Full objects. So you can easily get at any descriptives for the groups you've loaded - e.g., $fishpit->{'data'}->{'aname'}->mean() - or you could get at the data again by going $fishpit->{'data'}->{'aname'}->get_data(); and so on. The names of the data are up to you.
Each call unloads any previous loads.
Returns the Statistics::FisherPitman object.
add
$fishpit->add('another_name', @data2)
$fishpit->add('another_name', \@data2)
$fishpit->add({'another_name' => \@data2})
Alias: add_data
Same as load except that any previous loads are not unloaded.
unload
$fishpit->unload();
Empties all cached data and any calculations upon them, ensuring these will not be used for testing. This is called automatically with each new load, but, as a safeguard against changes in future development, it could be good practice to call it yourself whenever switching from one dataset to another.
test
$fishpit->test(resamplings => 'non-negative number')
Calculates the T-statistic for the loaded data. If you send a positive value for resamplings, a randomization test is performed: the loaded data are shuffled that many times, and the T-value is calculated for each resampling. The proportion of T-values in these resamplings that are greater than or equal to the T-value of the original data, as loaded, is the p_value on which to base significance considerations.
T is calculated as follows:

T = \sum_{i=1}^{g} n_i \bar{x}_i^2

where n_i is the number of observations in the i-th of the g samples, and

\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}

is the mean of the i-th sample, x_{ij} being the j-th observation in that sample.
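As a minimal sketch of the calculation of T (not the module's own code), the statistic can be computed directly from named groups of raw data. The helper name fisher_pitman_t is hypothetical and used only for illustration; it takes a hash reference of array references, in the same shape that load accepts.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, not part of the Statistics::FisherPitman API:
# T is the sum, over groups, of (group size) * (group mean squared).
sub fisher_pitman_t {
    my ($groups) = @_;    # hash ref: name => array ref of observations
    my $t = 0;
    for my $name (keys %$groups) {
        my @obs  = @{ $groups->{$name} };
        my $n    = scalar @obs;
        my $mean = sum(@obs) / $n;
        $t += $n * $mean**2;
    }
    return $t;
}

my $t = fisher_pitman_t({ a => [1, 2, 3], b => [4, 6] });
print "T = $t\n";    # 3 * 2**2 + 2 * 5**2 = 62
```

Note that T involves only group sizes and group means, which is why no estimate of variance enters the test.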
The randomization test is simply based on pooling all the data and, for each resampling, giving them a Fisher-Yates shuffle and distributing them into as many groups, of the same sample sizes, as in the original dataset.
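The procedure just described might be sketched as follows. This is an illustration under stated assumptions, not the module's internals: the helper names fisher_yates, t_stat and randomization_p are hypothetical, and groups are passed as plain array references rather than through the class object.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helpers for illustration only.

# Fisher-Yates shuffle: returns a shuffled copy of the list.
sub fisher_yates {
    my @a = @_;
    for (my $i = $#a; $i > 0; $i--) {
        my $j = int rand($i + 1);    # random index in 0 .. $i
        @a[$i, $j] = @a[$j, $i];
    }
    return @a;
}

# T = sum over groups of n_i * mean_i**2.
sub t_stat {
    my @groups = @_;                 # list of array refs
    my $t = 0;
    for my $g (@groups) {
        my $n = scalar @$g;
        $t += $n * (sum(@$g) / $n)**2;
    }
    return $t;
}

# Proportion of resampled T-values >= the observed T-value.
sub randomization_p {
    my ($resamplings, @groups) = @_;
    my $observed = t_stat(@groups);
    my @sizes    = map { scalar @$_ } @groups;
    my @pooled   = map { @$_ } @groups;
    my $hits     = 0;
    for (1 .. $resamplings) {
        my @shuffled  = fisher_yates(@pooled);
        # Deal the shuffled pool back out into groups of the original sizes.
        my @resampled = map { [ splice(@shuffled, 0, $_) ] } @sizes;
        $hits++ if t_stat(@resampled) >= $observed;
    }
    return $hits / $resamplings;
}

my $p = randomization_p(
    1000,
    [12, 12, 14, 15, 12, 11, 15],
    [13, 14, 18, 19, 22, 21, 26],
    [11, 7, 7, 2, 19, 19],
);
printf "p = %.3f\n", $p;
```

Because the shuffle is random, the printed p-value varies a little from run to run; more resamplings narrow that variation.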
The class object is fed with the attributes t_value and p_value, and the method returns only the object itself. A 95% confidence interval for the true proportion (p-value) is also calculated and stored as a two-element array named conf_int. So you can get at these values thus:
print "T = $fishpit->{'t_value'}, p = $fishpit->{'p_value'}\n";
print '95% confidence interval for the proportion of Ts greater than or equal to the observed value ranges from ';
print "$fishpit->{'conf_int'}->[0] to $fishpit->{'conf_int'}->[1].\n";
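The documentation does not say how the module derives conf_int. Purely as an assumption, one common choice for a proportion estimated from a number of resamplings is the normal-approximation (Wald) interval, sketched below; the helper name wald_conf_int is hypothetical, and the module may well use a different method.

```perl
use strict;
use warnings;

# Hypothetical helper; an ASSUMED method, not confirmed as the module's own.
# Wald 95% interval for a proportion $p estimated from $n resamplings,
# clipped to the range [0, 1].
sub wald_conf_int {
    my ($p, $n) = @_;
    my $half = 1.96 * sqrt($p * (1 - $p) / $n);
    my $lo = $p - $half;
    my $hi = $p + $half;
    $lo = 0 if $lo < 0;
    $hi = 1 if $hi > 1;
    return [$lo, $hi];
}

my $ci = wald_conf_int(0.014, 10_000);
printf "95%% CI: %.4f, %.4f\n", @$ci;
```

The interval narrows in proportion to the square root of the number of resamplings, which is one way to judge whether enough resamplings were used.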
dump
$fishpit->dump(title => 'A test of something', conf_int => 1|0, p_precision => integer)
Prints a line to STDOUT of the form T = t_value, p = p_value, ending with a line-break, i.e., "\n". If a value is given to the optional title argument, the title is printed above this line. The 95% confidence interval can optionally be included (conf_int => 1), and the precision of the p-value(s) can optionally be set (p_precision).
string
$fishpit->string(conf_int => 1|0, p_precision => integer)
Returns a line of the form T = t_value, p = p_value, to the precision specified (if any), and, optionally, with the confidence-interval for the p-value appended.
EXAMPLE
This example is taken from Berry & Mielke (2002); see ex/fishpit.pl in the installation dist for implementation. The following (real) data are lead (Pb) values (in mg/kg) of soil samples from two districts in New Orleans, one from school grounds, another from surrounding streets. Was there a significant difference in lead levels between the samples? The variances were determined to be unequal, and the Fisher-Pitman test put to the question. As there were over 100 billion possible permutations of the data, a large number of resamplings was used: 10 million.
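The figure of over 100 billion can be checked by counting the distinct ways of allocating N pooled observations to groups of the original sizes, N! / (n_1! ... n_g!). The sketch below assumes that this is the count intended; the helper names binomial and n_allocations are hypothetical and not part of the module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helpers for illustration only.

# Binomial coefficient C(n, k), built up step by step so that every
# intermediate value is itself an integer (exact while below 2**53).
sub binomial {
    my ($n, $k) = @_;
    my $c = 1;
    $c = $c * ($n - $k + $_) / $_ for 1 .. $k;
    return $c;
}

# Number of distinct allocations of the pooled data to groups of the
# given sizes: a product of binomial coefficients (multinomial coefficient).
sub n_allocations {
    my @sizes     = @_;
    my $remaining = sum(@sizes);
    my $count     = 1;
    for my $n (@sizes) {
        $count *= binomial($remaining, $n);
        $remaining -= $n;
    }
    return $count;
}

print n_allocations(20, 20), "\n";    # 137846528820
```

With two samples of 20 observations each, the count is C(40, 20) = 137,846,528,820, which is why exhaustive enumeration is impractical and resampling is used instead.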
The following shows how the test would be performed with the present module; using a smaller number of resamplings produces much the same result. A test of equality of variances is also shown.
my $data = {
dist1 => [qw/16.0 34.3 34.6 57.6 63.1 88.2 94.2 111.8 112.1 139.0 165.6 176.7 216.2 221.1 276.7 362.8 373.4 387.1 442.2 706.0/],
dist2 => [qw/4.7 10.8 35.7 53.1 75.6 105.5 200.4 212.8 212.9 215.2 257.6 347.4 461.9 566.0 984.0 1040.0 1306.0 1908.0 3559.0 21679.0/],
};
# First test equality of variances:
require Statistics::ANOVA;
my $anova = Statistics::ANOVA->new();
$anova->load_data($data);
$anova->levene_test()->dump();
# This prints: F(1, 38) = 4.87100593921132, p = 0.0334251996755789
# As this suggests significantly different variances ...
require Statistics::FisherPitman;
my $fishpit = Statistics::FisherPitman->new();
$fishpit->load_data($data);
$| = 1; # this could take a little while
$fishpit->test(resamplings => 10000)->dump(conf_int => 1, p_precision => 3);
# This prints, e.g.: T = 56062045.0525, p = 0.014 (95% CI: 0.011, 0.016)
Hence a difference is indicated, which can be identified from the means. The data being cached as Statistics::Descriptive objects (see load), the means can be got at thus:
print "District 1 mean = ", $fishpit->{'data'}->{'dist1'}->mean(), "\n"; # 203.935
print "District 2 mean = ", $fishpit->{'data'}->{'dist2'}->mean(), "\n"; # 1661.78
So beware District 2, it seems. The module naturally produces the same T-value as reported by Berry and Mielke, and they obtained p = .0148 from their 10 million resamplings.
Pointing to the value of the test, Berry and Mielke also showed that common alternatives for the unequal variances situation - such as the pooled variance t-test for independent samples, and one-way ANOVA with logarithmic transformation of the data - failed to detect a significant difference between the samples; not a negligible failure given the social health implications.
REFERENCES
Berry, K. J., & Mielke, P. W., Jr., (2002). The Fisher-Pitman permutation test: An attractive alternative to the F test. Psychological Reports, 90, 495-502.
SEE ALSO
Statistics::ANOVA First test your independent groups data with Levene's or O'Brien's equality of variances test in that package to see if they satisfy the assumptions of the ANOVA; if not, happily use Fisher-Pitman instead.
LIMITATIONS/TO DO
Optimisation welcomed.
Automatically choose the number of resamplings based on the number of possible permutations.
Randomization procedure can always be improved.
REVISION HISTORY
See CHANGES in installation distribution for subsequent updates.
AUTHOR/LICENSE
Copyright (c) 2006-2008 Roderick Garton
rgarton AT cpan DOT org
This module is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).
Disclaimer
To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.