NAME

Statistics::Sequences - Tests of sequential structure in the form of runs, joins, bunches, etc.

SYNOPSIS

use Statistics::Sequences;
$seq = Statistics::Sequences->new();
$seq->load([1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]); # dichotomous values
$seq->test('runs')->dump(); # or 1st argument to test is 'joins' or 'pot'
# (prints:)
# Runs: expected = 7.00, observed = 7.00, z = -0.30, p = 0.76206

DESCRIPTION

This module loads and prepares data for statistical tests of their sequential structure via Statistics::Sequences::Runs, Statistics::Sequences::Joins, and Statistics::Sequences::Pot. Examples of the use of each test are given in the pages for those modules.

In general, to access the tests, you use this base module to directly create a Statistics::Sequences object with the new method, load data into it, and then access each test by calling the test method. This approach is useful for running several tests on the same data, as the data are immediately available to each test (of runs, pot and joins). See the SYNOPSIS for a simple example.

If you only want to perform a test of one type (e.g., runs), you might want to simply use the relevant sub-package, create a class object specific to it, and load data specifically for its use; see the SYNOPSIS for the particular test, i.e., Runs, Joins or Pot. You won't be able to access other tests by this approach, unless you create another object for that test, and then specifically pass the data from the earlier object into the new one.

Note also that data can be cached anonymously or by name, and that they might need to be reduced to a dichotomous format before a valid test can be run. Several dichotomising methods are provided for use once data are loaded; they are accessible via the generic or specific class objects, as above.

METHODS

Interface

The package provides an object-oriented interface for performing the Runs-, Joins- and Pot-tests of sequences.

Most methods are named with aliases, should you be used to referring to Perl statistics methods by one or another of the conventions. Present conventions are mostly based on those used in Juan Yun-Fang's modules, e.g., Statistics::ChisqIndep.

new

$seq = Statistics::Sequences->new();

Returns a new Statistics::Sequences object by which all the methods for caching, dichotomising, and testing data can be accessed, including each of the methods for performing the Runs-, Joins- and Pot-tests. The parameters ccorr, tails and precision_p can be usefully set here, during construction, to be used by all tests.

Any one of the sub-packages, such as Statistics::Sequences::Runs, can be individually imported, and its own new method can be called, e.g.:

use Statistics::Sequences::Runs;
$runs = Statistics::Sequences::Runs->new();

In this case, data are not automatically shared across packages, and only one test (in this case, the Runs-test) can be accessed through the class-object returned by new.

Caching data

load

$seq->load(@data); # Anonymous load
$seq->load(\@data); # Anonymously referenced load
$seq->load(blues => \@blue_scores, reds => \@red_scores); # Named loads
$seq->load({blues => \@blue_scores, reds => \@red_scores}); # Same, but referenced

Aliases: load_data

Cache an anonymous list of data as an array-reference, or named data-sets as a hash reference, accessible as $seq->{'data'}, and available over any number of tests. Each call to load removes whatever might have been previously loaded. Sending nothing deletes all loaded data (by undeffing $seq->{'data'}); sending another list makes another set of data available for testing.

Anonymous and named loading, and function aliases, are provided given the variety of such methods throughout the Statistics modules. Telling the difference between an unreferenced array (for anonymous loading) and an unreferenced hash (for nominal loading) is simply performed on the basis of the second element: if it's a reference, the list is taken as a hash, otherwise as an array. Inelegant, but accommodating.
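The second-element rule described above can be sketched in plain Perl as follows. This is only an illustration of the heuristic, not the module's own code; the function name classify_load_args is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Statistics::Sequences): decide whether
# the arguments to a load-like method are named data-sets or an anonymous list.
sub classify_load_args {
    my @args = @_;
    return 'empty' if !@args;

    # A single reference is taken at face value: hash-ref or array-ref.
    if (@args == 1 && ref $args[0]) {
        return ref $args[0] eq 'HASH' ? 'named' : 'anonymous';
    }

    # A flat list: in "name => \@values" pairs the second element is a reference.
    return ref $args[1] ? 'named' : 'anonymous';
}

print classify_load_args(1, 0, 1, 1), "\n";                     # anonymous
print classify_load_args(blues => [1, 2], reds => [3, 4]), "\n"; # named
```

One consequence of this rule is that an anonymous list whose second element happens to be a reference would be mistaken for named data.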

add

$seq->add_data($char1, $char2)
$seq->add_data([$char1, $char2])
$seq->add_data({ reds => 1})

Aliases: add_data

Just push any value(s) or so along, without clobbering what's already in there (as load_data would).

read

$seq->read()

Alias: get_data

Return the hash of data (just return $seq->{'data'}).

unload

$seq->unload()

Alias: clear_data

Empty, clear, clobber what's in there.

Dichotomising data

Both the runs- and joins-tests expect dichotomous data, i.e., data in which only two states (categories) occur. Numerical and multi-valued categorical data, once loaded, can be "reduced" to this format by the following methods, namely, cut, swing, pool and match. Alternatively, supply data that are already in this format. Both the runs- and joins-tests will croak if more (or fewer) than two states are found in the data.

Each method stores the data in the class object as an array-reference named "testdata", accessed so:

print 'dichotomous data: ', @{$seq->{'testdata'}}, "\n";
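Whether a sequence meets the two-state requirement described above can be checked before testing. The following is a minimal sketch in plain Perl; the helper names distinct_states and is_dichotomous are hypothetical and are not methods of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: the distinct states in a sequence, in order of first appearance.
sub distinct_states {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: true (1) only if exactly two states occur.
sub is_dichotomous {
    my @states = distinct_states(@_);
    return @states == 2 ? 1 : 0;
}

print is_dichotomous(1, 1, 0, 0, 1), "\n";  # 1
print is_dichotomous(qw/a b c a/), "\n";    # 0
```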

Numerical data: Single-sample dichotomisation

cut

$seq->cut(value => 'median', equal => 'gt'); # cut anonymously cached data at a central tendency
$seq->cut(value => 23); # cut anonymously cached data at a specific value
$seq->cut(value => 'mean', data => 'blues'); # cut named data at its average

This method is only suitable for numerical data.

Reduce loaded data to two categories by cutting it about a certain value. For example, the following raw data, when cut for values greater than or equal to 5, yield the subsequent dichotomous series.

@raw_data = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
@cut_data = (0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0);

The following options may be specified.

value => 'mean|median|mode|\d+'

Specify the value at which the data will be cut. This could be the mean, median or mode (as calculated by Statistics::Lite), or a numerical value within the range of the data. The default is the median. The cut-value, as specified by value, can be retrieved thus:

print $seq->{'cut_value'};

equal => 'gt|lt|0'

This option specifies how to cut the data should the cut-value (as specified by value) be present in the data. The default value is 0: observations equal to the cut-value are skipped. If equal => gt: all data-values greater than or equal to the cut-value will form one group, and all data-values less than the cut-value will form another. To cut with all values less than or equal to in one group, and higher values in another, set this parameter to lt.

data => 'string'

Specify which named cached data-set to cut.
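The cutting logic, including the three settings of equal, can be sketched in plain Perl as below. This is an illustration under stated assumptions rather than the module's implementation: the function name cut_at is hypothetical, and coding the higher group as 1 and the lower group as 0 simply follows the example given above.

```perl
use strict;
use warnings;

# Hypothetical helper: dichotomise numerical data about a cut-value.
# $equal is 'gt', 'lt' or 0 (skip observations equal to the cut-value).
sub cut_at {
    my ($value, $equal, @data) = @_;
    my @dicho;
    for my $x (@data) {
        if    ($x > $value)    { push @dicho, 1 }
        elsif ($x < $value)    { push @dicho, 0 }
        elsif ($equal eq 'gt') { push @dicho, 1 }  # equal joins the higher group
        elsif ($equal eq 'lt') { push @dicho, 0 }  # equal joins the lower group
        # otherwise the observation equals the cut-value and is skipped
    }
    return @dicho;
}

my @raw_data = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
print join(' ', cut_at(5, 'gt', @raw_data)), "\n";
# 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1 0 1 0
```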

swing

$seq->swing();
$seq->swing(data => 'reds'); # if more than one are loaded, or a single one was loaded with a name

This is another transformation that, like the cut method, can be used to produce a dichotomous sequence from a single set of numerical data. You essentially test the degree of consistency of the rises and falls in the data. Each element in the named data-set is subtracted from its successor, and the result is replaced with a 1 if the difference represents an increase, or 0 if it represents a decrease. For example, the following numerical series produces the subsequent dichotomous series.

@values = (qw/3 4 7 6 5 1 2/);
@dicho =  (qw/1 1 0 0 0 1/);

Dichotomously, the data can be seen as commencing with an ascending run of length 2, followed by a descending run of length 3, and finishing with a short increase. Note that the number of resulting observations is less than the original number.

Note that the critical region of the distribution lies (only) in the upper-tail; a one-tailed test of significance is appropriate.

equal => 'gt|lt|rpt|0'

By default, when the difference between two successive values is zero, the observation is skipped, and the next succession is examined (equal => 0). Alternatively, you may wish to repeat the result for the previous succession, skipping a difference of zero only if it occurs as the first result (equal => 'rpt'). Or a difference greater than or equal to zero can be counted as an increase (equal => 'gt'), or a difference less than or equal to zero counted as a decrease (equal => 'lt'). For example,

@values =    (qw/3 3 7 6 5 2 2/);
@dicho_def = (qw/1 0 0 0/); # First and final results (of 3 - 3, and 2 - 2) are skipped
@dicho_rpt = (qw/1 0 0 0 0/); # First result (of 3 - 3) is skipped, and final result repeats the former
@dicho_gt =  (qw/1 1 0 0 0 1/); # Greater than or equal to zero is an increase
@dicho_lt =  (qw/0 1 0 0 0 0/); # Less than or equal to zero is a decrease
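The four ways of handling a zero difference can be sketched in plain Perl as follows. The function name swing_seq is hypothetical, and the code is an illustration of the rules described above, not the module's own implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: code each rise as 1 and each fall as 0.
# $equal is 'gt', 'lt', 'rpt' or 0, and decides what a zero difference becomes.
sub swing_seq {
    my ($equal, @values) = @_;
    my @dicho;
    for my $i (1 .. $#values) {
        my $diff = $values[$i] - $values[$i - 1];
        if    ($diff > 0)       { push @dicho, 1 }
        elsif ($diff < 0)       { push @dicho, 0 }
        elsif ($equal eq 'gt')  { push @dicho, 1 }
        elsif ($equal eq 'lt')  { push @dicho, 0 }
        elsif ($equal eq 'rpt') { push @dicho, $dicho[-1] if @dicho }
        # equal => 0: a zero difference is skipped
    }
    return @dicho;
}

my @values = (3, 3, 7, 6, 5, 2, 2);
print join(' ', swing_seq(0,     @values)), "\n";  # 1 0 0 0
print join(' ', swing_seq('rpt', @values)), "\n";  # 1 0 0 0 0
print join(' ', swing_seq('gt',  @values)), "\n";  # 1 1 0 0 0 1
print join(' ', swing_seq('lt',  @values)), "\n";  # 0 1 0 0 0 0
```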

Numerical data: Two-sample dichotomisation

pool

$seq->pool('data' => ['blues', 'reds']);

Reduce two sets of cached numerical data to two categories in a single array by pooling the data according to the magnitude of the values at each trial. This is the typical option when using the Wald-Wolfowitz test for determining a difference between two samples. Specifically, the values from both samples are pooled and ordered from lowest to highest, and then clustered into runs according to the sample from which neighbouring values come. Another run occurs wherever there is a change in the source of the values. A non-random effect of, say, higher or lower values consistently coming from one sample rather than another, would be reflected in fewer runs than expected by chance. See the ex/checks.pl file in the installation distribution for a couple of examples.

See also the methods for categorical data where it is ok to ignore any order and intervals in your numerical data.
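The pooling described for pool can be sketched in plain Perl as below. The function name pool_samples and the sample values are made up for illustration, and breaking ties between equal values in favour of the first sample is an arbitrary assumption; the module may order tied values differently.

```perl
use strict;
use warnings;

# Hypothetical helper: pool two samples, order the values from lowest to
# highest, and return the source of each value (0 = first sample, 1 = second).
sub pool_samples {
    my ($first, $second) = @_;    # two array references
    my @tagged = ((map { [$_, 0] } @$first), (map { [$_, 1] } @$second));
    # Sort by value; ties go to the first sample (an assumption).
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @tagged;
    return map { $_->[1] } @sorted;
}

my @blues = (12, 15, 11, 19);    # illustrative values only
my @reds  = (14, 22, 25, 21);
print join(' ', pool_samples(\@blues, \@reds)), "\n";  # 0 0 1 0 0 1 1 1
```

Here the higher values mostly come from the second sample, so the pooled sequence shows only four runs.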

Categorical data

match

$seq->match('data' => ['blues', 'reds']);

Reduce two lists of loaded data to two categories in a single array, according to the match between the elements at each index. Where the data-values are equal at a certain index, they will be represented with a 1; otherwise a 0. Numerical or stringy data can be equated. For example, the following two arrays would be reduced to the third, where a 1 indicates a match of identical values in the two data sources.

@blues = (qw/1 3 3 2 1 5 1 2 4/);
@reds =  (qw/4 3 1 2 1 4 2 2 4/);
@dicho = (qw/0 1 0 1 1 0 0 1 1/);

The following options may be specified.

data => [qw/blues reds/]

Specify, as a referenced array, two named data-sets, as previously passed to load. An attempt to match a number of data-sets other than 2 will emit a croak.

lag => integer OR [integer, loop (boolean)] (where the absolute value of integer is less than the number of observations)

Match the two data-sets by shifting the first named set ahead or behind the other data-set by lag observations. The default is zero. For example, one data-set might be targets, and another responses to the targets:

targets   = cbbbdacdbd
responses = daadbadcce

Matched as a single sequence of hits (1) and misses (0) where lag = 0 yields (for the match on "a" in the 6th index of both arrays):

0000010000

If lag is set to +1, however, each response is associated with the target one ahead of the trial for which it was observed; i.e., each target is shifted to its +1 index. So the first element in the above responses (d) would be associated with the second element of the targets (b), and so on. Now, matching the two data-sets with a +1 lag gives two hits, of the 4th and 7th elements of the responses to the 5th and 8th elements of the targets, respectively:

000100100

Note that with a lag of zero, there were 3 runs of (a) hit/s or miss/es, but with a lag of 1, there were 5 runs.

Lag values can be negative, so that a -2 lag, for instance, will give a hit/miss series of:

00101010

Here, responses necessarily start at the third element (a), the first hit occurring when the fifth response-element corresponds to the third target element (b).

In the example with a lag of +1, the last response (e) could not be used, and the hit/miss sequence became shorter than the original target sequence by the absolute value of lag (i.e., it has n - |lag| elements, where n is the number of observations). This means that the absolute value of lag can be at most one less than the size of the data-sets, or there will be no data.
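The lagged matching walked through above (without looping) can be sketched in plain Perl as follows. The function name match_lagged is hypothetical, and string comparison with eq is an assumption made for this sketch; it is not taken from the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: compare each response with the target $lag places
# away from it; pairs that fall outside the target list are dropped.
sub match_lagged {
    my ($targets, $responses, $lag) = @_;    # two array references and an integer
    my @dicho;
    for my $i (0 .. $#$responses) {
        my $j = $i + $lag;
        next if $j < 0 || $j > $#$targets;
        push @dicho, $targets->[$j] eq $responses->[$i] ? 1 : 0;
    }
    return @dicho;
}

my @targets   = split //, 'cbbbdacdbd';
my @responses = split //, 'daadbadcce';
print join('', match_lagged(\@targets, \@responses, 0)),  "\n";  # 0000010000
print join('', match_lagged(\@targets, \@responses, 1)),  "\n";  # 000100100
print join('', match_lagged(\@targets, \@responses, -2)), "\n";  # 00101010
```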

You can, alternatively, preserve all lagged data by looping any excess to the start or end of the criterion data. The number of observations will then always be the same, regardless of the lag. Matching the data in the example above with a lag of +1, with looping, creates an additional match between the final response and the first target (d):

1000100100

To effect looping, send a referenced list of the lag and a boolean for the loop, e.g.:

lag => [-3, 1]

Testing data

test

$seq->test('runs');
$seq->test('joins');
$seq->test('pot', state => 'a value appearing in the testdata');

$runs->test();
$joins->test(prob => 1/2);
$pot->test(state => 'circle');

Alias: process

When using a Statistics::Sequences class-object, this method requires specification of which test to perform, i.e., runs, joins or pot; the name of the test is simply given as the first argument. This is not required when the class-object already refers to one of the sub-modules, as created by the new method within Statistics::Sequences::Runs, Statistics::Sequences::Joins, and Statistics::Sequences::Pot.

General options

Options to test available to all the sub-package tests are as follows.

data => 'string'

Optionally specify the name of the data to be tested. By default, this is not required: the testdata are those that were last loaded, either anonymously, or as given to one of the dichotomising methods. Otherwise, if the data are already ready for testing in a dichotomous format, data that were previously loaded by name can be individually tested. For example, here are two sets of data that are loaded by name, and then a single test of one of them is performed.

@chimps = (qw/banana banana cheese banana cheese banana banana banana/);
@mice = (qw/banana cheese cheese cheese cheese cheese cheese cheese/);
$seq->load(chimps => \@chimps, mice => \@mice);
$seq->test('runs', data => 'chimps')->dump();

ccorr => boolean

Specify whether or not to perform the continuity-correction on the observed deviation. Default is false. See Statistics::Zed.

tails => 1|2

Specify whether the z-value is calculated for both sides of the normal distribution (2, the default for most testdata) or only one side (1, the default for data prepared with the swing method).

Test-specific required settings and options

Some sub-package tests need to have parameters defined in the call to test, and/or have specific options, as follows.

Joins : The Joins-test optionally allows the setting of a probability value; see test in the Statistics::Sequences::Joins manpage.

Pot : The Pot-test requires the setting of a state to be tested; see test in the Statistics::Sequences::Pot manpage.

Runs : There are presently no specific requirements nor options for the Runs-test.

Accessing results

All relevant statistical values are "lumped" into the class-object, and can be retrieved thus:

$seq->{'observed'} # The observed value of the test-statistic (Runs, Joins, Pot)
$seq->{'expected'} # The expected value of the test-statistic (Runs, Joins, Pot)
$seq->{'obs_dev'} # The observed deviation (observed minus expected values), continuity-corrected, if so specified
$seq->{'std_dev'} # The standard deviation
$seq->{'variance'} # Variance
$seq->{'z_value'} # The value of the z-statistic (ratio of the observed deviation to the standard deviation)
$seq->{'p_value'} # The "normal probability" associated with the z-statistic
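How these values relate to one another can be sketched for the runs-test, using the standard Wald-Wolfowitz expectation and variance. This is a plain-Perl illustration, not the module's code: the function name runs_stats is hypothetical, the continuity correction is omitted, and no p-value is computed.

```perl
use strict;
use warnings;

# Hypothetical helper: runs statistics for a dichotomous sequence.
# Returns a hash-ref keyed like the values listed above (no p_value).
sub runs_stats {
    my @seq = @_;
    my %count;
    $count{$_}++ for @seq;
    my @states = keys %count;
    die "Exactly two states are needed\n" if @states != 2;
    my ($n1, $n2) = @count{@states};
    my $n = $n1 + $n2;

    # One run to begin with, plus one for every change of state.
    my $observed = 1;
    for my $i (1 .. $#seq) {
        $observed++ if $seq[$i] ne $seq[$i - 1];
    }

    my $expected = 2 * $n1 * $n2 / $n + 1;
    my $variance = 2 * $n1 * $n2 * (2 * $n1 * $n2 - $n) / ($n**2 * ($n - 1));
    my $std_dev  = sqrt $variance;
    my $obs_dev  = $observed - $expected;    # uncorrected deviation
    my $z_value  = $std_dev > 0 ? $obs_dev / $std_dev : undef;

    return {
        observed => $observed, expected => $expected, obs_dev => $obs_dev,
        std_dev  => $std_dev,  variance => $variance, z_value => $z_value,
    };
}

my $res = runs_stats(1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1);
printf "expected = %.2f, observed = %.2f\n", $res->{expected}, $res->{observed};
# expected = 7.00, observed = 7.00
```

With only two observations, one of each state, the variance is zero and z_value is left undefined in this sketch.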

dump

$seq->dump(flag => '1|0', text => '0|1|2', precision_s => 'integer', precision_p => 'integer');

Alias: print_summary

Print results of the last-conducted test to STDOUT. By default, if no parameters to dump are passed, a single line of test statistics is printed. Options are as follows.

flag => boolean

If true, the p-value associated with the z-value is appended with a single asterisk if the value is below .05, and with two asterisks if it is below .01.

If false (default), nothing is appended to the p-value.

text => 0|1|2

If set to 1 (the default for an empty call to dump), a single line is printed, beginning with the name of the test, then the observed and expected values of the test-statistic, and the z-value and its associated p-value. The Pot-test, additionally, shows the state tested in parentheses after the test-name. For example:

Joins: expected = 400.00, observed = 360.00, z = -2.83, p = 0.0023389**
Runs: expected = 398.86, observed = 361.00, z = -2.70, p = 0.0070374**
Pot(1): expected = 288.51, observed = 303.63, z = 2.64, p = 0.0082920**

If set to anything greater than 1, more verbose info is printed: each of the above bits of info is printed on a separate line, as well as the observed and standard deviations.

If set to zero, no statistics are printed ... This was useful at one point in development, and might become so again ...

precision_s => 'non-negative integer'

Precision of the z-statistic.

precision_p => 'non-negative integer'

Specify rounding of the probability associated with the z-value to so many digits. If zero or undefined, you get everything available.
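The way the flag and precision options shape the single-line summary can be sketched in plain Perl as below. The function name summary_line and its argument names are hypothetical, and the default of two decimal places for the z-value is an assumption based on the example lines above, not a documented default.

```perl
use strict;
use warnings;

# Hypothetical helper: build a one-line summary like those shown above.
sub summary_line {
    my %args = @_;
    # Assumed default: two decimal places for the z-value.
    my $prec_s = defined $args{precision_s} ? $args{precision_s} : 2;
    # If precision_p is zero or undefined, the p-value is left unrounded.
    my $p = $args{precision_p}
        ? sprintf('%.*f', $args{precision_p}, $args{p_value})
        : $args{p_value};
    my $stars = '';
    if ($args{flag}) {
        $stars = $args{p_value} < .01 ? '**' : $args{p_value} < .05 ? '*' : '';
    }
    return sprintf '%s: expected = %.2f, observed = %.2f, z = %.*f, p = %s%s',
        $args{name}, $args{expected}, $args{observed},
        $prec_s, $args{z_value}, $p, $stars;
}

print summary_line(
    name => 'Runs', expected => 398.86, observed => 361,
    z_value => -2.70, p_value => 0.0070374, flag => 1, precision_p => 7,
), "\n";
# Runs: expected = 398.86, observed = 361.00, z = -2.70, p = 0.0070374**
```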

dump_data

$seq->dump_data(delim => "\n")

Prints to STDOUT a space-separated line of the testdata - as dichotomised and put to test. Optionally, give a value for delim to specify how the datapoints should be separated.

string

$seq->string()

Returns a single line giving the z-value and p-value. Accepts the precision_s, precision_p and flag options, as for dump.

REFERENCES

Burdick, D. S., & Kelly, E. F. (1977). Statistical methods in parapsychological research. In B. B. Wolman (Ed.), Handbook of Parapsychology (pp. 81-130). New York, NY, US: Van Nostrand Reinhold. [Description of joins-test, with comparison to runs-test.]

Kelly, E. F. (1982). On grouping of hits in some exceptional psi performers. Journal of the American Society for Psychical Research, 76, 101-142. [Application of runs-test, with discussion of normality issue.]

Schmidt, H. (2000). A proposed measure for psi-induced bunching of randomly spaced events. Journal of Parapsychology, 64, 301-316. [Describes the pot-test.]

Swed, F., & Eisenhart, C. (1943). Tables for testing randomness of grouping in a sequence of alternatives. Annals of Mathematical Statistics, 14, 66-87. [Look in ex/checks.pl in the installation dist for a few examples from this paper for testing.]

Wald, A., & Wolfowitz, J. (1940). On a test whether two samples are from the same population. Annals of Mathematical Statistics, 11, 147-162. [Describes the runs-test.]

Wishart, J., & Hirshfeld, H. O. (1936). A theorem concerning the distribution of joins between line segments. Journal of the London Mathematical Society, 11, 227. [Describes the joins-test.]

Wolfowitz, J. (1943). On the theory of runs with some applications to quality control. Annals of Mathematical Statistics, 14, 280-288. [Suggests some ways in which data may be dichotomised for testing runs; implemented here.]

SEE ALSO

Statistics::Burst : Another test of sequences.

TO DO/BUGS

Results are dubious if there are only two observations.

Testing not by z-scores, and/or using the Poisson distribution for a low number of observations

Fu's Markovian solution

Multivariate extensions

Sort option for pool method ?

REVISIONS

The series testing methods (series_init, series_update and series_test) have been moved to Statistics::Zed as of v0.03.

See CHANGES file in installation dist.

AUTHOR/LICENSE

rgarton AT cpan DOT org

This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).

Disclaimer

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.