NAME

Statistics::RankCorrelation - Compute the rank correlation between two vectors

SYNOPSIS

use Statistics::RankCorrelation;

$x = [ 8, 7, 6, 5, 4, 3, 2, 1 ];
$y = [ 2, 1, 5, 3, 4, 7, 8, 6 ];

$c = Statistics::RankCorrelation->new( $x, $y, sorted => 1 );

$n = $c->spearman;
$t = $c->kendall;
$m = $c->csim;

$s = $c->size;
$xd = $c->x_data;
$yd = $c->y_data;
$xr = $c->x_rank;
$yr = $c->y_rank;
$xt = $c->x_ties;
$yt = $c->y_ties;

DESCRIPTION

This module computes rank correlation coefficient measures between two sample vectors.

Examples can be found in the distribution eg/ directory and the test file. The FUNCTIONS section below describes helper functions that are useful when computing sorted rank coefficients by hand.

* IMPORTANT NOTE *

This module does not compute correct results for Kendall's Tau with tied ranks. I am working on this and have failing tests to prove it.

METHODS

new

$c = Statistics::RankCorrelation->new( \@u, \@v );

This method constructs a new Statistics::RankCorrelation object.

If given two numeric vectors (as array references), the object is initialized by computing the statistical ranks of the vectors. If they are of different cardinality, the shorter vector is first padded with trailing zeros.

If the sorted flag is set, the bivariate data set is sorted by the first (x) vector.

x_data

$c->x_data( $x );
$x = $c->x_data;

Return and set the one-dimensional data as an array reference. This is the "unit" array, used as a reference for size and iteration.

y_data

$c->y_data( $y );
$y = $c->y_data;

Return and set the one-dimensional data as an array reference. This vector is dependent on the x vector.

size

$c->size( $s );
$s = $c->size;

Return and set the number of array elements.

x_rank

$c->x_rank( $rx );
$rx = $c->x_rank;

Return and set the ranks as an array reference.

y_rank

$ry = $c->y_rank;
$c->y_rank( $ry );

Return and set the ranks as an array reference.

x_ties

$xt = $c->x_ties;
$c->x_ties( $xt );

Return and set the ties as a hash reference.

y_ties

$yt = $c->y_ties;
$c->y_ties( $yt );

Return and set the ties as a hash reference.

spearman

$n = $c->spearman;

Spearman's rho rank-order correlation is a nonparametric measure of association based on the rank of the data values and is a special case of the Pearson product-moment correlation.

    6 * sum( (xi - yi)^2 )
1 - --------------------------
           n^3 - n

Where x and y are the two rank vectors, n is the number of samples, and i is an index from 1 to n.
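The formula above can be checked by hand with a few lines of Perl. This is only a minimal sketch, assuming two equal-length rank vectors without ties; spearman_rho is a hypothetical helper name and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): Spearman's rho from two
# equal-length rank vectors, assuming there are no tied ranks.
sub spearman_rho {
    my ( $x, $y ) = @_;
    my $n   = @$x;
    my $sum = 0;
    for my $i ( 0 .. $n - 1 ) {
        $sum += ( $x->[$i] - $y->[$i] ) ** 2;    # squared rank difference
    }
    return 1 - ( 6 * $sum ) / ( $n ** 3 - $n );
}

# The SYNOPSIS vectors hold the distinct values 1..8, so each value is
# also its own rank.
my $rho = spearman_rho( [ 8, 7, 6, 5, 4, 3, 2, 1 ], [ 2, 1, 5, 3, 4, 7, 8, 6 ] );
printf "%.4f\n", $rho;    # prints -0.8333
```

Here the squared differences sum to 154, so the result is 1 - 924/504.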

kendall

$t = $c->kendall;

       c - d
t = -------------
    n (n - 1) / 2

Where c and d are the number of concordant and discordant pairs and n is the number of samples. If there are tied pairs, a different (more complicated) denominator is used.
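For untied data, the concordant and discordant counts can be tallied directly. The following is a sketch under that assumption only; kendall_tau_no_ties is a hypothetical helper name, it expects at least two samples, and it does not attempt the tied-pairs denominator.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): Kendall's tau for two
# equal-length vectors of at least two samples, assuming no ties.
sub kendall_tau_no_ties {
    my ( $x, $y ) = @_;
    my $n = @$x;
    my ( $c, $d ) = ( 0, 0 );
    for my $i ( 0 .. $n - 2 ) {
        for my $j ( $i + 1 .. $n - 1 ) {
            # Positive when both vectors move the same way for this pair,
            # negative when they move in opposite directions.
            my $s = ( $x->[$i] <=> $x->[$j] ) * ( $y->[$i] <=> $y->[$j] );
            $c++ if $s > 0;
            $d++ if $s < 0;
        }
    }
    return ( $c - $d ) / ( $n * ( $n - 1 ) / 2 );
}

my $tau = kendall_tau_no_ties( [ 8, 7, 6, 5, 4, 3, 2, 1 ], [ 2, 1, 5, 3, 4, 7, 8, 6 ] );
printf "%.4f\n", $tau;    # prints -0.6429
```

With the SYNOPSIS data there are 5 concordant and 23 discordant pairs out of 28, giving -18/28.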

csim

$n = $c->csim;

Return the contour similarity index measure. This is a single dimensional measure of the similarity between two vectors.

This returns a measure in the (inclusive) range [-1..1] and is computed using matrices of binary data representing "higher or lower" values in the original vectors.

This measure has been studied in musical contour analysis.
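To illustrate the general idea of comparing "higher or lower" matrices, here is a rough sketch. It is an assumption about how such a measure can be built, not a copy of this module's code: it compares only the cells above the diagonal and returns the fraction that agree, so this particular sketch yields values between 0 and 1. The names higher_lower_matrix and contour_agreement are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: cell [i][j] is 1 when element i is lower than
# element j, and 0 otherwise.
sub higher_lower_matrix {
    my ($v) = @_;
    my $last = $#$v;
    return [
        map {
            my $i = $_;
            [ map { $v->[$i] < $v->[$_] ? 1 : 0 } 0 .. $last ]
        } 0 .. $last
    ];
}

# Hypothetical helper: fraction of cells above the diagonal on which the
# two matrices agree. Assumes equal-length vectors of at least two samples.
sub contour_agreement {
    my ( $x, $y ) = @_;
    my ( $p, $q ) = ( higher_lower_matrix($x), higher_lower_matrix($y) );
    my ( $same, $total ) = ( 0, 0 );
    for my $i ( 0 .. $#$x - 1 ) {
        for my $j ( $i + 1 .. $#$x ) {
            $total++;
            $same++ if $p->[$i][$j] == $q->[$i][$j];
        }
    }
    return $same / $total;
}

print contour_agreement( [ 1, 2, 3 ], [ 1, 2, 3 ] ), "\n";    # prints 1
print contour_agreement( [ 1, 2, 3 ], [ 3, 2, 1 ] ), "\n";    # prints 0
```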

FUNCTIONS

rank

$v = [qw(1 3.2 2.1 3.2 3.2 4.3)];
$ranks = rank($v);
# [1, 4, 2, 4, 4, 6]
my( $ranks, $ties ) = rank($v);
# [1, 4, 2, 4, 4, 6], { 1=>[], 3.2=>[]}

Return a list containing an array reference of the ordinal ranks and a hash reference of the tied data.

In the case of a tie in the data (identical values) the rank numbers are averaged. An example will elucidate:

sorted data:    [ 1.0, 2.1, 3.2, 3.2, 3.2, 4.3 ]
ranks:          [ 1,   2,   3,   4,   5,   6   ]
tied ranks:     3, 4, and 5
tied average:   (3 + 4 + 5) / 3 == 4
averaged ranks: [ 1,   2,   4,   4,   4,   6   ]
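The averaging step above can be sketched in plain Perl as follows. This is an illustrative sketch only; average_ranks is a hypothetical name, and it returns just the ranks (in the original data order), not the hash of ties.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's rank function): ordinal ranks
# with tied values replaced by the average of the ranks they span.
sub average_ranks {
    my ($v) = @_;

    # Indices of the data, ordered by ascending value.
    my @order = sort { $v->[$a] <=> $v->[$b] } 0 .. $#$v;

    my @ranks;
    my $i = 0;
    while ( $i < @order ) {
        # Extend $j to the end of the run of equal values starting at $i.
        my $j = $i;
        $j++ while $j + 1 < @order && $v->[ $order[ $j + 1 ] ] == $v->[ $order[$i] ];

        # Sorted positions $i..$j hold ordinal ranks $i+1..$j+1; the mean
        # of consecutive integers is the mean of the first and last.
        my $avg = ( ( $i + 1 ) + ( $j + 1 ) ) / 2;
        $ranks[ $order[$_] ] = $avg for $i .. $j;

        $i = $j + 1;
    }
    return \@ranks;
}

my $ranks = average_ranks( [ 1, 3.2, 2.1, 3.2, 3.2, 4.3 ] );
print join( ', ', @$ranks ), "\n";    # prints 1, 4, 2, 4, 4, 6
```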

pad_vectors

( $u, $v ) = pad_vectors( [ 1, 2, 3, 4 ], [ 9, 8 ] );
# [1, 2, 3, 4], [9, 8, 0, 0]

Append zeros to either input vector for all values in the other that do not have a corresponding value. That is, "pad" the tail of the shorter vector with zero values.
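The padding behaviour described above amounts to something like the following sketch. The name pad_with_zeros is hypothetical, and the sketch copies its inputs rather than modifying them, which is an assumption rather than a statement about this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's pad_vectors): return copies of
# both vectors, with zeros appended to whichever one is shorter.
sub pad_with_zeros {
    my ( $u, $v ) = @_;
    my @u = @$u;
    my @v = @$v;
    push @u, 0 while @u < @v;
    push @v, 0 while @v < @u;
    return ( \@u, \@v );
}

my ( $u, $v ) = pad_with_zeros( [ 1, 2, 3, 4 ], [ 9, 8 ] );
print join( ', ', @$u ), "\n";    # prints 1, 2, 3, 4
print join( ', ', @$v ), "\n";    # prints 9, 8, 0, 0
```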

co_sort

( $u, $v ) = co_sort( $u, $v );

Sort the vectors as two-dimensional data-point pairs with u values sorted first.
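One way to sort paired data is to sort the shared indices and then slice both vectors with them. The sketch below assumes ascending order and uses the v value as a tie-breaker; both of those choices, and the name sort_pairs, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's co_sort): reorder both vectors so
# that the (u, v) pairs are ascending by u, then by v. Assumes the two
# vectors have the same length.
sub sort_pairs {
    my ( $u, $v ) = @_;
    my @idx = sort { $u->[$a] <=> $u->[$b] || $v->[$a] <=> $v->[$b] } 0 .. $#$u;
    return ( [ @{$u}[@idx] ], [ @{$v}[@idx] ] );
}

my ( $u, $v ) = sort_pairs( [ 3, 1, 2 ], [ 30, 10, 20 ] );
print join( ', ', @$u ), "\n";    # prints 1, 2, 3
print join( ', ', @$v ), "\n";    # prints 10, 20, 30
```

Sorting indices rather than values keeps each v value attached to its original u partner.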

correlation_matrix

$matrix = correlation_matrix( $u );

Return the correlation matrix for a single vector.

This function builds a square, binary matrix that represents "higher or lower" value within the vector itself.
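As a concrete picture of such a matrix, the sketch below prints one for a three-element vector. The convention used here (a cell is 1 when the row element is lower than the column element) is an assumption, and binary_matrix is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's correlation_matrix): build a
# square matrix whose cell [i][j] is 1 when element i is lower than
# element j, and 0 otherwise.
sub binary_matrix {
    my ($u) = @_;
    my @m;
    for my $i ( 0 .. $#$u ) {
        for my $j ( 0 .. $#$u ) {
            $m[$i][$j] = $u->[$i] < $u->[$j] ? 1 : 0;
        }
    }
    return \@m;
}

my $m = binary_matrix( [ 1, 3, 2 ] );
print join( ' ', @$_ ), "\n" for @$m;
# prints:
# 0 1 1
# 0 0 0
# 0 1 0
```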

sign

Return 0, 1 or -1 depending on whether the given number is zero, positive or negative.
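The same result can be obtained with Perl's numeric comparison operator. This is a small sketch with a hypothetical name, sign_of, and is not necessarily how the module implements it.

```perl
use strict;
use warnings;

# Hypothetical helper: <=> compares numerically and yields -1, 0 or 1.
sub sign_of {
    my ($x) = @_;
    return $x <=> 0;
}

print join( ', ', map { sign_of($_) } -2.5, 0, 7 ), "\n";    # prints -1, 0, 1
```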

TO DO

Handle any number of vectors instead of just two.

Implement other rank correlation measures that are out there...

SEE ALSO

For the csim method:

http://www2.mdanderson.org/app/ilya/Publications/JNMRcontour.pdf

For the spearman and kendall methods:

http://mathworld.wolfram.com/SpearmanRankCorrelationCoefficient.html

http://en.wikipedia.org/wiki/Kendall's_tau

THANK YOU

Thomas Breslin <thomas@thep.lu.se>, Jerome <jerome.hert@free.fr>, Jon Schutz <Jon.Schutz@youramigo.com> and Andy Lee <yikes2000@yahoo.com>

AUTHOR AND COPYRIGHT

Gene Boggs <gene@cpan.org>

Copyright 2009, Gene Boggs, All Rights Reserved.

LICENSE

This program is free software; you can redistribute or modify it under the same terms as Perl itself.