NAME
Statistics::Sequences - Tests of sequential structure in the form of runs, joins, bunches, etc.
SYNOPSIS
use Statistics::Sequences;
$seq = Statistics::Sequences->new();
$seq->load([1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]); # dichotomous values
$seq->test('runs')->dump(); # or 1st argument to test is 'joins' or 'pot'
# (prints:)
# Runs: expected = 7.00, observed = 7.00, z = -0.30, p = 0.76206
DESCRIPTION
Loading and preparing data for statistical tests of their sequential structure via Statistics::Sequences::Runs, Statistics::Sequences::Joins, and Statistics::Sequences::Pot. Examples of the use of each test are given in these pages.
In general, to access the tests, you use this base module to directly create a Statistics::Sequences object with the new method, load data into it, and then access each test by calling the test method. This approach is useful for running several tests on the same data, as the data are immediately available to each test (of runs, pot and joins). See the SYNOPSIS for a simple example.
If you only want to perform a test of one type (e.g., runs), you might want to simply use the relevant sub-package, create a class object specific to it, and load data specifically for its use; see the SYNOPSIS for the particular test, i.e., Runs, Joins or Pot. You won't be able to access other tests by this approach, unless you create another object for that test, and then specifically pass the data from the earlier object into the new one.
Note also that there are methods to anonymously or nominally cache data, and that data might need to be reduced to a dichotomous format, before a valid test can be run. Several dichotomising methods are provided, once data are loaded, and accessible via the generic or specific class objects, as above.
METHODS
Interface
The package provides an object-oriented interface for performing the Runs-, Joins- and Pot-tests of sequences.
Most methods come with aliases, should you be used to referring to Perl statistics methods by one or another of the conventions. Present methods are mostly based on those used in Juan Yun-Fang's modules, e.g., Statistics::ChisqIndep.
new
$seq = Statistics::Sequences->new();
Returns a new Statistics::Sequences object by which all the methods for caching, dichotomising, and testing data can be accessed, including each of the methods for performing the Runs-, Joins- and Pot-tests. The parameters ccorr, tails and p_precision can be usefully set here, during construction, to be used by all tests.
Any one of the sub-packages, such as Statistics::Sequences::Runs, can be individually imported, and its own new method can be called, e.g.:
use Statistics::Sequences::Runs;
$runs = Statistics::Sequences::Runs->new();
In this case, data are not automatically shared across packages, and only one test (in this case, the Runs-test) can be accessed through the class-object returned by new.
Caching data
load
$seq->load(@data); # Anonymous load
$seq->load(\@data); # Anonymously referenced load
$seq->load(blues => \@blue_scores, reds => \@red_scores); # Named loads
$seq->load({blues => \@blue_scores, reds => \@red_scores}); # Same, but referenced
Aliases: load_data, add_data
Cache an anonymous list of data as an array-reference, or named data-sets as a hash-reference, accessible as $seq->{'data'}, and available over any number of tests. Each call to load removes whatever might have been previously loaded. Sending nothing deletes all loaded data (by undefining $seq->{'data'}); sending another list makes another set of data available for testing.
Anonymous and named loading, and function aliases, are provided given the variety of such methods throughout the Statistics modules. Telling the difference between an unreferenced array (for anonymous loading) and an unreferenced hash (for nominal loading) is simply performed on the basis of the second element: if it's a reference, the list is taken as a hash, otherwise as an array. Inelegant, but accommodating.
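The second-element rule can be sketched in plain Perl as follows. This is only an illustration of the rule as described above, not the module's own code; the helper name classify_load is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Statistics::Sequences): decide whether an
# argument list is an anonymous load or a named load.
sub classify_load {
    my @args = @_;
    return 'empty' unless @args;
    # A single reference is an already-referenced load.
    if (@args == 1 && ref $args[0]) {
        return ref($args[0]) eq 'HASH' ? 'named' : 'anonymous';
    }
    # Otherwise inspect the second element of the flat list:
    # a reference there means name => array-ref pairs.
    return ref $args[1] ? 'named' : 'anonymous';
}

print classify_load(1, 0, 1, 1), "\n";                     # anonymous
print classify_load([1, 0, 1, 1]), "\n";                   # anonymous
print classify_load(blues => [1, 2], reds => [3]), "\n";   # named
print classify_load({ blues => [1, 2] }), "\n";            # named
```

A consequence of the rule is that an anonymous list whose second element happens to be a reference would be taken as a named load.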
Dichotomising data
Both the runs- and joins-tests expect dichotomous data, i.e., data in which only two categories of event occur. Numerical and multi-valued categorical data, once loaded, can be "reduced" to this format by the following methods, namely cut, match, pool and swing; alternatively, supply data that are already in this format. Both the runs- and joins-tests will croak if more (or fewer) than two events are found in the data.
Each method stores the data in the class object as an array-reference named "testdata", accessed so:
print 'dichotomous data: ', @{$seq->{'testdata'}}, "\n";
cut
$seq->cut(point => 'median', equal => 'gt'); # cut anonymously cached data at a central tendency
$seq->cut(point => 23); # cut anonymously cached data at a specific value
$seq->cut(point => 'mean', data => 'blues'); # cut named data at its average
This method is only suitable for numerical data.
Reduce loaded data to two categories by cutting it about a certain value. For example, the following raw data, when cut for values greater than or equal to 5, yield the subsequent dichotomous series.
@raw_data = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
@cut_data = (0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0);
The following options may be specified.
- point => 'mean|median|mode|\d+'
-
Specify the value at which the data will be cut. This could be the mean, median or mode (as calculated by Statistics::Lite), or a numerical value within the range of the data. The default is the median. The cut-value, as specified by point, can be retrieved thus:
print $seq->{'cut_value'};
- equal => 'gt|lt|0'
-
This option specifies how to cut the data should the cut-value (as specified by point) be present in the data. The default value is 0: observations equal to the cut-value are skipped. If equal => 'gt', all data-values greater than or equal to the cut-value will form one group, and all data-values less than the cut-value will form another. To cut with all values less than or equal to the cut-value in one group, and higher values in another, set this parameter to 'lt'.
- data => 'string'
-
Specify which named cached data-set to cut.
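As a rough illustration of how the point and equal options interact, here is a minimal sketch in plain Perl. The helper cut_at is hypothetical and takes the cut-value directly; it does not compute a mean, median or mode, and it is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: cut numerical data at $point.
# $equal is 'gt', 'lt' or 0, following the option described above.
sub cut_at {
    my ($data, $point, $equal) = @_;
    $equal ||= 0;
    my @dicho;
    for my $x (@$data) {
        if    ($x > $point)    { push @dicho, 1 }
        elsif ($x < $point)    { push @dicho, 0 }
        elsif ($equal eq 'gt') { push @dicho, 1 }   # ties join the upper group
        elsif ($equal eq 'lt') { push @dicho, 0 }   # ties join the lower group
        # otherwise (equal => 0) the tied observation is skipped
    }
    return @dicho;
}

my @raw_data = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
print join(' ', cut_at(\@raw_data, 5, 'gt')), "\n";
```

With equal set to 'gt' this prints the same series as @cut_data above; with equal set to 0 the three observations equal to 5 are dropped, leaving 17 values.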
match
$seq->match('data' => ['blues', 'reds']);
Reduce two lists of loaded data to two categories in a single array, according to the match between the elements at each index. Where the data-values are equal at a certain index, they will be represented with a 1; otherwise a 0. Numerical or stringy data can be equated. For example, the following two arrays would be reduced to the third, where a 1 indicates a match of identical values in the two data sources.
@blues = (qw/1 3 3 2 1 5 1 2 4/);
@reds = (qw/4 3 1 2 1 4 2 2 4/);
@dicho = (qw/0 1 0 1 1 0 0 1 1/);
The following options may be specified.
- data => [qw/blues reds/]
-
Specify, as a referenced array, two named data-sets, as previously passed to load. An attempt to match a number of data-sets other than 2 will croak.
- lag => integer OR [integer, loop (boolean)] (where -(number of observations) < integer < number of observations)
-
Match the two data-sets by shifting the first named set ahead or behind the other data-set by lag observations. The default is zero. For example, one data-set might be targets, and another responses to the targets:
targets   = cbbbdacdbd
responses = daadbadcce
Matched as a single sequence of hits (1) and misses (0) where lag = 0 yields (for the match on "a" in the 6th index of both arrays):
0000010000
If lag is set to +1, however, each response is associated with the target one ahead of the trial for which it was meant; i.e., each target is shifted to its +1 index. So the first element in the above responses (d) would be associated with the second element of the targets (b), and so on. Now, matching the two data-sets with a +1 lag gives two hits, of the 4th and 7th elements of the responses to the 5th and 8th elements of the targets, respectively:
000100100
Note that with a lag of zero, there were 3 runs of (a) hit/s or miss/es, but with a lag of 1, there were 5 runs.
Lag values can be negative, so that a -2 lag, for instance, will give a hit/miss series of:
00101010
Here, responses necessarily start at the third element (a), the first hit occurring when the fifth response-element corresponds to the third target element (b).
In the lag +1 example above, the last response (e) could not be used, and the hit/miss sequence has |lag| fewer elements than the original sequences (n - |lag|, where n is the number of observations). This means that the absolute value of lag can be at most one less than the size of the data-sets, or there will be no data.
You can, alternatively, preserve all lagged data by looping any excess to the start or end of the criterion data. The number of observations will then always be the same, regardless of the lag. Matching the data in the example above with a lag of +1, with looping, creates an additional match between the first response and the final target (both d):
1000100100
To effect looping, send a referenced list of the lag and a boolean for the loop, e.g.:
lag => [-3, 1]
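The unlooped lag examples above can be reproduced with a short sketch in plain Perl. The helper match_lagged is hypothetical: it pairs each response with the target lag positions ahead of it and drops pairs that fall outside the target list. Looping is deliberately left out, and the module's own code may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: pair response i with target i + lag, dropping pairs
# that fall outside the target list (no looping).
sub match_lagged {
    my ($targets, $responses, $lag) = @_;
    $lag ||= 0;
    my @hits;
    for my $i (0 .. $#$responses) {
        my $j = $i + $lag;
        next if $j < 0 || $j > $#$targets;
        push @hits, $responses->[$i] eq $targets->[$j] ? 1 : 0;
    }
    return @hits;
}

my @targets   = split //, 'cbbbdacdbd';
my @responses = split //, 'daadbadcce';

for my $lag (0, 1, -2) {
    printf "lag %2d: %s\n", $lag,
        join('', match_lagged(\@targets, \@responses, $lag));
}
```

For lags of 0, +1 and -2 this prints 0000010000, 000100100 and 00101010, the three unlooped series given above.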
pool
$seq->pool('data' => ['blues', 'reds']);
Reduce two sets of cached numerical data to two categories in a single array by pooling the data according to the magnitude of the values at each trial. For example, the following two lists would be reduced to the third by assigning a score for the data-set with the lowest (or equal) value on a trial, where 0 indicates a member of the first data-set, and 1 indicates a member of the second data-set; 8 runs emerge in the final data-set.
@blues = (qw/1 1 2 2 2 3 4/);
@reds = (qw/1 2 2 3 3 4 4/);
@dicho = (qw/0 0 1 0 0 0 1 1 0 1 1 0 1 1/);
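One way to arrive at the pooled series above is a simple merge of the two lists, sketched here in plain Perl. The helper pool_merge is hypothetical; it assumes both lists are already in ascending order (as in the example) and gives ties to the first data-set. Whether the module proceeds exactly like this is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two ascending lists, emitting 0 for each value
# taken from the first set and 1 for each value taken from the second.
# On a tie, the value from the first set is taken first.
sub pool_merge {
    my ($first, $second) = @_;
    my @x = @$first;     # copies, so the callers' arrays are untouched
    my @y = @$second;
    my @dicho;
    while (@x || @y) {
        if (@x && (!@y || $x[0] <= $y[0])) { shift @x; push @dicho, 0 }
        else                               { shift @y; push @dicho, 1 }
    }
    return @dicho;
}

my @blues = qw/1 1 2 2 2 3 4/;
my @reds  = qw/1 2 2 3 3 4 4/;
print join(' ', pool_merge(\@blues, \@reds)), "\n";
```

For the two lists above this prints 0 0 1 0 0 0 1 1 0 1 1 0 1 1, the same series as @dicho.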
swing
$seq->swing();
$seq->swing(data => 'reds'); # if more than one are loaded, or a single one was loaded with a name
This is another transformation that, like the cut method, can be used to produce a dichotomous sequence from a single set of numerical data. You essentially test the degree of consistency of the rises and falls in the data. Each element in the named data-set is subtracted from its successor, and the result is replaced with a 1 if the difference represents an increase, or 0 if it represents a decrease. For example, the following numerical series produces the subsequent dichotomous series.
@values = (qw/3 4 7 6 5 1 2/);
@dicho = (qw/1 1 0 0 0 1/);
Dichotomously, the data can be seen as commencing with an ascending run of length 2, followed by a descending run of length 3, and finishing with a short increase. Note that the number of resulting observations is less than the original number.
Note that the critical region of the distribution lies (only) in the upper-tail; a one-tailed test of significance is appropriate. This is automatically set, but can be specified by sending tails => 1 to test.
- equal => 'gt|lt|rpt|0'
-
The default result when the difference between two successive values is zero is to skip the observation, and move on to the next succession (equal => 0). Alternatively, you may wish to repeat the result for the previous succession, skipping a difference of zero only if it occurs as the first result (equal => 'rpt'). Or, a difference greater than or equal to zero is counted as an increase (equal => 'gt'), or a difference less than or equal to zero is counted as a decrease (equal => 'lt'). For example,
@values = (qw/3 3 7 6 5 2 2/);
@dicho_def = (qw/1 0 0 0/); # First and final results (of 3 - 3, and 2 - 2) are skipped
@dicho_rpt = (qw/1 0 0 0 0/); # First result (of 3 - 3) is skipped, and final result repeats the former
@dicho_gt = (qw/1 1 0 0 0 1/); # Greater than or equal to zero is an increase
@dicho_lt = (qw/0 1 0 0 0 0/); # Less than or equal to zero is a decrease
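The four treatments of a zero difference can be sketched in plain Perl as below. The helper swing_series is hypothetical and only mirrors the behaviour described for the equal option; it is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: code each rise as 1 and each fall as 0.
# $equal is 0, 'rpt', 'gt' or 'lt' and decides what a zero difference becomes.
sub swing_series {
    my ($values, $equal) = @_;
    $equal ||= 0;
    my @dicho;
    for my $i (1 .. $#$values) {
        my $diff = $values->[$i] - $values->[$i - 1];
        if    ($diff > 0)      { push @dicho, 1 }
        elsif ($diff < 0)      { push @dicho, 0 }
        elsif ($equal eq 'gt') { push @dicho, 1 }
        elsif ($equal eq 'lt') { push @dicho, 0 }
        elsif ($equal eq 'rpt' && @dicho) { push @dicho, $dicho[-1] }
        # equal => 0, or 'rpt' with no earlier result: skip the observation
    }
    return @dicho;
}

my @values = qw/3 3 7 6 5 2 2/;
for my $equal (0, 'rpt', 'gt', 'lt') {
    printf "%-3s: %s\n", $equal, join(' ', swing_series(\@values, $equal));
}
```

The four printed lines correspond to @dicho_def, @dicho_rpt, @dicho_gt and @dicho_lt above.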
Testing data
test
$seq->test('runs');
$seq->test('joins');
$seq->test('pot', event => 'a value appearing in the testdata');
$runs->test();
$joins->test(prob => 1/2);
$pot->test(event => 'circle');
Alias: process
When using a Statistics::Sequences class-object, this method requires specification of which test to perform, i.e., runs, joins or pot; the name of the test is simply given as the first argument. This is not required when the class-object already refers to one of the sub-modules, as created by the new method within Statistics::Sequences::Runs, Statistics::Sequences::Joins, and Statistics::Sequences::Pot.
General options
Options to test available to all the sub-package tests are as follows.
- data => 'string'
-
Optionally specify the name of the data to be tested. By default, this is not required: the testdata are those that were last loaded, either anonymously, or as given to one of the dichotomising methods. Otherwise, if the data are already ready for testing in a dichotomous format, data that were previously loaded by name can be individually tested. For example, here are two sets of data that are loaded by name, and then a single test of one of them is performed.
@chimps = (qw/banana banana cheese banana cheese banana banana banana/);
@mice = (qw/banana cheese cheese cheese cheese cheese cheese cheese/);
$seq->load(chimps => \@chimps, mice => \@mice);
$seq->test('runs', data => 'chimps')->dump();
- ccorr => boolean
-
Specify whether or not to perform the continuity-correction on the observed deviation. Default is false. See Statistics::Deviation.
- tails => 1|2
-
Specify whether the z-value is calculated for both sides of the normal distribution (2, the default for most testdata) or only one side (1, the default for data prepared with the swing method).
Test-specific required settings and options
Some sub-package tests need to have parameters defined in the call to test, and/or have specific options, as follows.
Joins : The Joins-test optionally allows the setting of a probability value; see test in the Statistics::Sequences::Joins manpage.
Pot : The Pot-test requires the setting of an event to be tested; see test in the Statistics::Sequences::Pot manpage.
Runs : There are presently no specific requirements or options for the Runs-test.
Series testing (provisional)
A means to aggregate results from multiple tests is supported on an exploratory basis, but only when accessing the tests through the main Sequences package. Three methods are presently used to effect this.
series_init
Clears any already accumulated data from previous tests.
series_update
Called once you have performed a test on a sample. It caches the observed, expectation and variance values from the test.
series_test
Sums the observed, expectation and variance values from all the tests updated to the series since calling series_init, and produces a z_value from these sums. It returns nothing in particular, but the following statement shows how the series values can be accessed.
print "Series summed runs:\n",
  "expected = $seq->{'series'}->{'expected'}\n",
  "observed = $seq->{'series'}->{'observed'}\n",
  "z = $seq->{'series'}->{'z_value'}, $seq->{'tails'}-p = $seq->{'series'}->{'p_value'}\n";
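The summing step can be pictured with the plain-Perl sketch below. The helper series_z is hypothetical, the per-sample values are made up for illustration, and the formula (summed deviation divided by the square root of the summed variance) is an assumption about how the z_value is formed from the sums, since the text above does not spell it out.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the per-test values and form a single z-value.
sub series_z {
    my @tests = @_;   # each: { observed => .., expected => .., variance => .. }
    my %sum = (observed => 0, expected => 0, variance => 0);
    for my $t (@tests) {
        $sum{$_} += $t->{$_} for qw/observed expected variance/;
    }
    # Assumed formula: summed deviation over the root of the summed variance.
    $sum{z_value} = ($sum{observed} - $sum{expected}) / sqrt($sum{variance});
    return \%sum;
}

# Made-up values for two samples, purely for illustration.
my $series = series_z(
    { observed => 5, expected => 7, variance => 3 },
    { observed => 4, expected => 6, variance => 1 },
);
printf "observed = %d, expected = %d, z = %.2f\n",
    @{$series}{qw/observed expected z_value/};
```

With these made-up values the sums are 9 observed, 13 expected and a variance of 4, giving z = -2.00.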
Accessing results
All relevant statistical values are "lumped" into the class-object, and can be retrieved thus:
$seq->{'observed'} # The observed value of the test-statistic (Runs, Joins, Pot)
$seq->{'expected'} # The expected value of the test-statistic (Runs, Joins, Pot)
$seq->{'obs_dev'} # The observed deviation (observed minus expected values), continuity-corrected, if so specified
$seq->{'std_dev'} # The standard deviation
$seq->{'variance'} # Variance
$seq->{'z_value'} # The value of the z-statistic (ratio of the observed deviation to the standard deviation)
$seq->{'p_value'} # The "normal probability" associated with the z-statistic
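To show how these values relate to one another, here is a plain-Perl sketch that computes them for a runs test using the standard normal-approximation formulas for the number of runs. The helper runs_stats is hypothetical, the formulas are the textbook ones rather than anything quoted from the module, and the form of the continuity correction (reducing the absolute deviation by 0.5) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): runs-test statistics for a
# dichotomous sequence, using the usual normal approximation.
sub runs_stats {
    my ($data, $ccorr) = @_;
    my (%count, $prev);
    my $observed = 0;
    for my $x (@$data) {
        $count{$x}++;
        $observed++ if !defined $prev || $x ne $prev;   # a new run starts here
        $prev = $x;
    }
    die "need exactly two kinds of event\n" unless keys(%count) == 2;
    my ($n1, $n2) = values %count;
    my $n        = $n1 + $n2;
    my $expected = 2 * $n1 * $n2 / $n + 1;
    my $variance = 2 * $n1 * $n2 * (2 * $n1 * $n2 - $n) / ($n**2 * ($n - 1));
    my $obs_dev  = $observed - $expected;
    if ($ccorr) {
        # Assumed form of the continuity correction: shrink |deviation| by 0.5.
        my $adj = abs($obs_dev) - 0.5;
        $obs_dev = $obs_dev < 0 ? -$adj : $adj;
    }
    my $std_dev = sqrt($variance);
    return {
        observed => $observed, expected => $expected, obs_dev => $obs_dev,
        variance => $variance, std_dev  => $std_dev,
        z_value  => $obs_dev / $std_dev,
    };
}

my @data = (1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1);
for my $ccorr (0, 1) {
    my $r = runs_stats(\@data, $ccorr);
    printf "ccorr = %d: expected = %.2f, observed = %.2f, z = %.2f\n",
        $ccorr, @{$r}{qw/expected observed z_value/};
}
```

For the SYNOPSIS data this gives expected = 7.00 and observed = 7.00 in both cases, with z = 0.00 uncorrected and z = -0.30 under the assumed correction.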
dump
$seq->dump(data => '1|0', flag => '1|0', text => '0|1|2', s_precision => 'integer', p_precision => 'integer');
Alias: print_summary
Print results of the last-conducted test to STDOUT. By default, if no parameters to dump are passed, a single line of test statistics is printed. Options are as follows.
- data => boolean
-
If true, the tested data are printed in a single line. The default is false.
- flag => boolean
-
If true, the p-value associated with the z-test of significance is appended with a single asterisk if the value is below .05, and with two asterisks if it is below .01.
If false (default), nothing is appended to the p-value.
- text => 0|1|2
-
If set to 1 (the default for an empty call to dump), a single line is printed, beginning with the name of the test, then the observed and expected values of the test-statistic, and the z-value and its associated p-value. The Pot-test, additionally, shows the event tested in parentheses after the test-name. For example:
Joins: expected = 400.00, observed = 360.00, z = -2.83, p = 0.0023389**
Runs: expected = 398.86, observed = 361.00, z = -2.70, p = 0.0070374**
Pot(1): expected = 288.51, observed = 303.63, z = 2.64, p = 0.0082920**
If set to anything greater than 1, more verbose information is printed: each of the above items is printed on a separate line, as well as the observed and standard deviations, and a statement of significance of the z-test, all in English.
If set to zero, no statistics are printed. This setting is useful if, alternatively, you only want to print the testdata (by setting the data parameter to 1).
- s_precision => 'non-negative integer'
-
Precision of the z-statistic.
- p_precision => 'non-negative integer'
-
Specify rounding of the probability associated with the z-test to so many digits. If zero or undefined, you get everything available.
string
$seq->string()
Returns a single line giving the z_value and p_value. Accepts the s_precision, p_precision and flag options, as for dump.
SEE ALSO
Statistics::Burst : Another test of sequences.
TO DO/BUGS
Results are dubious if there are only two observations.
Support for non-z-testing, and using the Poisson distribution for low numbers of observations
Multivariate extension to Runs test, at least
Sort option for pool method ?
REVISION HISTORY
AUTHOR
Roderick Garton, <rgarton@utas_DOT_edu_DOT_au>
COPYRIGHT/LICENSE/DISCLAIMER
Copyright (C) 2007-2008 Roderick Garton
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.