NAME

Statistics::FisherPitman - Randomization-based alternative to one-way independent groups ANOVA; unequal variances okay

SYNOPSIS

use Statistics::FisherPitman;

my @dat1 = (qw/12 12 14 15 12 11 15/);
my @dat2 = (qw/13 14 18 19 22 21 26/);

my $fishpit = Statistics::FisherPitman->new();
$fishpit->load({d1 => \@dat1, d2 => \@dat2});

# Oh, more data just came in:
my @dat3 = (qw/11 7 7 2 19 19/);
$fishpit->add({d3 => \@dat3});

$fishpit->test(resamplings => 1000)->dump(title => "A test");

DESCRIPTION

Tests for a difference between independent samples. It is commonly recommended as an alternative to the one-way independent groups ANOVA when variances are unequal, as its test statistic, T, does not depend on an estimate of variance. As a randomization test, it is "distribution-free": the probability of obtaining the observed value of T is derived from the data themselves.

METHODS

load

$fishpit->load('aname', @data1)
$fishpit->load('aname', \@data1)
$fishpit->load({'aname' => \@data1, 'another_name' => \@data2})

Alias: load_data

Accepts either (1) a single name => value pair of a sample name and a list (referenced or not) of data; or (2) a hash reference of named array references of data. The data are loaded into the class object by name, within a hash called data, as Statistics::Descriptive::Full objects. So you could get at the data again, for instance, by calling $fishpit->{'data'}->{'aname'}->get_data(). The names of the data can be arbitrary. Each call unloads any previous loads.

Returns the Statistics::FisherPitman object.

add

$fishpit->add('another_name', @data2)
$fishpit->add('another_name', \@data2)
$fishpit->add({'another_name' => \@data2})

Alias: add_data

Same as load except that any previous loads are not unloaded.

unload

$fishpit->unload();

Empties all cached data and any calculations made upon them, ensuring these will not be used for testing. This is called automatically with each new load but, in case that behaviour changes during development, it could be good practice to call it yourself whenever switching from one dataset for testing to another.

test

$fishpit->test(resamplings => 'non-negative number')

Calculates the T-statistic for the loaded data. If you send a positive value for resamplings, a randomization test is performed: the loaded data are shuffled that many times, and the T-value is calculated for each resampling. The proportion of T-values in these resamplings that are greater than or equal to the T-value of the original data, as loaded, is the p_value on which to base your significance considerations.

T is calculated as follows:

 T = \sum_{i=1}^{g} n_i \bar{x}_i^2

where n_i is the number of observations in each i of g samples, and

 \bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}

(for each j observation in the i sample), that is, the mean of sample i.
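The calculation of T can be sketched in plain Perl, using the first two samples from the SYNOPSIS. This is only an illustration of the formula; the subroutine name t_statistic is hypothetical and is not part of the module's interface.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not the module's own code):
# T = sum, over the samples, of n_i * (mean of sample i) squared.
sub t_statistic {
    my @groups = @_;    # one array reference per sample
    my $t = 0;
    for my $group (@groups) {
        my $n    = scalar @$group;
        my $mean = sum(@$group) / $n;
        $t += $n * $mean**2;
    }
    return $t;
}

my @dat1 = qw/12 12 14 15 12 11 15/;    # n = 7, mean = 13
my @dat2 = qw/13 14 18 19 22 21 26/;    # n = 7, mean = 19
printf "T = %.4f\n", t_statistic(\@dat1, \@dat2);    # prints T = 3710.0000
```

Here each sample contributes 7 times its squared mean: 7 x 169 = 1183 and 7 x 361 = 2527, which sum to 3710.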

Permutation is performed by pooling all the data and, for each resampling, giving them a Fisher-Yates shuffle and then distributing them into the same number of groups, with the same sample sizes, as in the original dataset.
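The resampling procedure just described might look roughly like the following in plain Perl. It is a minimal sketch under stated assumptions, not the module's internals: the subroutine names (fisher_yates_shuffle, t_statistic, resample_p_value) are hypothetical, and the p-value is taken, as described for test, to be the proportion of resampled T-values greater than or equal to the observed one.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: return a shuffled copy of a list (Fisher-Yates).
sub fisher_yates_shuffle {
    my @values = @_;
    for (my $i = $#values; $i > 0; $i--) {
        my $j = int rand($i + 1);    # random index in 0 .. $i
        @values[$i, $j] = @values[$j, $i];
    }
    return @values;
}

# Hypothetical helper: T = sum of n_i * mean_i squared over the samples.
sub t_statistic {
    my $t = 0;
    for my $group (@_) {
        my $n = scalar @$group;
        $t += $n * (sum(@$group) / $n)**2;
    }
    return $t;
}

# Hypothetical helper: proportion of resampled T-values >= the observed T.
sub resample_p_value {
    my ($resamplings, @groups) = @_;
    my $t_observed = t_statistic(@groups);
    my @sizes      = map { scalar @$_ } @groups;    # original sample sizes
    my @pooled     = map { @$_ } @groups;           # all data in one list
    my $n_extreme  = 0;
    for (1 .. $resamplings) {
        my @shuffled = fisher_yates_shuffle(@pooled);
        # Deal the shuffled values back out into groups of the original sizes.
        my @resampled = map { [ splice(@shuffled, 0, $_) ] } @sizes;
        $n_extreme++ if t_statistic(@resampled) >= $t_observed;
    }
    return $n_extreme / $resamplings;
}

my @dat1 = qw/12 12 14 15 12 11 15/;
my @dat2 = qw/13 14 18 19 22 21 26/;
my $p = resample_p_value(1000, \@dat1, \@dat2);
printf "p = %.3f\n", $p;    # varies from run to run
```

Because the shuffling is random, the printed p-value differs slightly between runs; more resamplings give a more stable estimate.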

The attributes t_value and p_value are set in the class object, and the method returns only the object itself.

dump

$fishpit->dump()

Prints a line to STDOUT of the form T = t_value, p = p_value. Above this string, a title can also be printed, by giving a value to the optional title attribute.

EXAMPLE

This example is taken from Berry & Mielke (2002); a script of the same is included in the ex/fishpit.pl file of the installation dist. The following (real) data are lead values (in mg/kg) of soil samples from two districts in New Orleans, one from school grounds, another from surrounding streets. Was there a significant difference in lead levels between the samples? The variances were determined to be unequal, and the Fisher-Pitman test put to the question. As there were over 100 billion possible permutations of the data, a large number of resamplings was used: 10 million.
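As a rough check on the "over 100 billion" figure, the following sketch counts the distinct ways of splitting 40 pooled observations into two groups of 20. It assumes that this is the sense of "possible permutations" intended, and that each sample holds 20 observations, as in the data listed in this example. The subroutine name n_choose_k is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: number of ways to choose $k items from $n.
sub n_choose_k {
    my ($n, $k) = @_;
    my $count = 1;
    # Multiply, then divide, at each step; every intermediate value
    # is itself a binomial coefficient, so it stays a whole number.
    $count = $count * ($n - $k + $_) / $_ for 1 .. $k;
    return $count;
}

printf "%.0f\n", n_choose_k(40, 20);    # prints 137846528820
```

That is about 138 billion splits, far more than could practically be enumerated, which is why a large number of random resamplings is used instead.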

The following shows how the test would be performed with the present module; using a smaller number of resamplings produces much the same result. A test of equality of variances is also shown.

my @dist1 = (qw/16.0 34.3 34.6 57.6 63.1 88.2 94.2 111.8 112.1 139.0 165.6 176.7 216.2 221.1 276.7 362.8 373.4 387.1 442.2 706.0/);
my @dist2 = (qw/4.7 10.8 35.7 53.1 75.6 105.5 200.4 212.8 212.9 215.2 257.6 347.4 461.9 566.0 984.0 1040.0 1306.0 1908.0 3559.0 21679.0/);

# First test equality of variances:
require Statistics::ANOVA;
my $anova = Statistics::ANOVA->new();
$anova->load_data({dist1 => \@dist1, dist2 => \@dist2});
$anova->levene_test()->dump();
# This prints: F(1, 38) = 4.87100593921132, p = 0.0334251996755789
# Being significantly different by this test ...

require Statistics::FisherPitman;
my $fishpit = Statistics::FisherPitman->new();
$fishpit->load_data({dist1 => \@dist1, dist2 => \@dist2});
$| = 1; # this could take a little while
$fishpit->test(resamplings => 10000)->dump();
# This prints, e.g.: T = 56062045.0525, p = 0.0145

Hence a difference is indicated, the direction of which can be determined from the means. As the data are cached as Statistics::Descriptive objects (see load), the means can be got at thus:

print "District 1 mean = ", $fishpit->{'data'}->{'dist1'}->mean(), "\n"; # 203.935
print "District 2 mean = ", $fishpit->{'data'}->{'dist2'}->mean(), "\n"; # 1661.78

So beware District 2, it seems. Berry and Mielke reported the same T-value, and p = .0148 from their 10 million resamplings. They also showed that common alternatives for the unequal variances situation, such as the pooled variance t-test for independent samples and one-way ANOVA with logarithmic transformation of the data, failed to detect a significant difference between the samples; not a negligible failure given the social health implications.

EXPORT

None by default.

REFERENCES

Berry, K. J., & Mielke, P. W., Jr., (2002). The Fisher-Pitman permutation test: An attractive alternative to the F test. Psychological Reports, 90, 495-502.

SEE ALSO

Statistics::ANOVA: First test your independent groups data with the Levene or O'Brien equality-of-variances test in this package to see whether they satisfy the assumptions of the ANOVA; if not, happily use Fisher-Pitman instead.

BUGS/LIMITATIONS

Computational bugs will hopefully be identified with usage over time.

Optimisation welcomed.

Confidence intervals are something to work on.

REVISION HISTORY

v 0.01

June 2008: Initial release via PAUSE.

See CHANGES in installation distribution for subsequent updates.

AUTHOR/LICENSE

rgarton@utas_DOT_edu_DOT_au

This program is free software. It may be modified, used, copied, and redistributed at your own risk, and under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html). Publicly redistributed modified versions must use a different name.

Disclaimer

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.