NAME
Statistics::Descriptive::Weighted - Module of basic descriptive statistical functions for weighted variates.
SYNOPSIS
use Statistics::Descriptive::Weighted;
$stat = Statistics::Descriptive::Weighted::Full->new();
$stat->add_data([1,2,3,4],[0.1,1,10,100]); ## weights are in second argument
$mean = $stat->mean(); ## weighted mean
$var = $stat->variance(); ## weighted sample variance
$stat->add_data([3],[10]); ## statistics are updated as variates are added
$vwt = $stat->weight(3); ## returns 20, the weight of 3
$wt = $stat->weight(); ## returns sum of weights, 121.1
$ct = $stat->count(3); ## returns 2, the number of times 3 was observed
$ct = $stat->count(); ## returns 5, the total number of observations
$med = $stat->median(); ## weighted sample median
$mode = $stat->mode(); ## returns 4, value with the most weight
$ptl = $stat->quantile(.01); ## returns 3, smallest value with cdf >= 1st %ile
$ptl = $stat->percentile(1); ## returns about 2.09, obtained by interpolation
$cdf = $stat->cdf(3); ## returns ECDF of 3 (about 17.4%)
$cdf = $stat->cdf(3.5); ## returns ECDF of 3.5 (about 17.4%, same as ECDF of 3)
$sf = $stat->survival(3); ## returns complement of ECDF(3) (about 82.6%)
$pval = $stat->rtp(4); ## returns right tail probability of 4 (100 / 121.1)
$min = $stat->min(); ## returns 1, the minimum
$max = $stat->max(); ## returns 4, the maximum
$unweighted = Statistics::Descriptive::Full->new();
$weighted = Statistics::Descriptive::Weighted::Full->new();
$unweighted->add_data(1,1,1,1,7,7,7,7);
$weighted->add_data([1,7],[4,4]);
$ct = $unweighted->count(); ## returns 8
$ct = $weighted->count(); ## returns 2
print "false, variances unequal!\n" unless
( abs($unweighted->variance() - $weighted->variance()) < 1e-12 );
## the above statement will print, the variances are truly unequal
## both variances correct for sample-size based on their variate counts
$data = $weighted->get_data();
## the above statement returns a hashref with keys:
## 'vars','weights','uniqvars','counts','sumweights','cdfs','rtps','percentiles','order'
$weighted->print();
## prints the hashref above with Data::Dumper
DESCRIPTION
This module partially extends the module Statistics::Descriptive to handle weighted variates. Like that module, this module has an object-oriented design and supports two different types of data storage and calculation objects: sparse and full. With the sparse object representation, none of the data is stored and only a few statistical measures are available. Using the full object representation, complete information about the dataset (including order of observation) is retained and additional functions are available.
This module represents numbers in the same way Perl does on your architecture, relying on Perl's own warnings and assertions regarding underflow and overflow errors, division by zero, etc. The constant $Statistics::Descriptive::Tolerance
is not used. Caveat programmor.
As in Statistics::Descriptive, the methods (both Sparse and Full) cache values so that subsequent calls with the same arguments are faster.
Be warned that this is not a drop-in replacement for Statistics::Descriptive. The interfaces are different for adding data, and also for retrieving data with get_data. Certain functions from Statistics::Descriptive have been dropped, specifically:
- Statistics::Descriptive::Sparse::mindex()
- Statistics::Descriptive::Sparse::maxdex()
- Statistics::Descriptive::Full::sort_data()
- Statistics::Descriptive::Full::presorted()
- Statistics::Descriptive::Full::harmonic_mean()
- Statistics::Descriptive::Full::geometric_mean()
- Statistics::Descriptive::Full::trimmed_mean()
- Statistics::Descriptive::Full::frequency_distribution()
- Statistics::Descriptive::Full::least_squares_fit()
Calling these functions on Statistics::Descriptive::Weighted objects will cause programs to die with a stack backtrace.
With this module you can recover the data sorted from get_data(). Data is sorted automatically on insertion.
The main extension and focus of this module was to implement a cumulative distribution function and a right-tail probability function with efficient search performance, even if the data added is already sorted. This is achieved using a partially randomized self-balancing tree to store data. The implementation uses Tree::Treap v. 0.02 written by Andrew Johnson.
METHODS
Sparse Methods
- $stat = Statistics::Descriptive::Weighted::Sparse->new();
Create a new sparse statistics object.
- $stat->add_data([1,2,3],[11,9,2]);
Adds data to the statistics object. The cached statistical values are updated automatically.
This function expects two array references: the first points to variates and the second to their corresponding weights. The referenced arrays must be of equal lengths. The weights are expected to all be positive. If any weights are not positive, the module will carp (complain to standard error) and the corresponding variates will be skipped over.
Variates may be added to statistics objects over multiple calls; their summaries are calculated "on-line," that is, updated incrementally as data is added.
- $stat->count();
Returns the number of variates that have been added.
- $stat->weight();
Returns the sum of the weights of the variates.
- $stat->sum();
Returns the sum of the variates multiplied by their weights.
- $stat->mean();
Returns the weighted mean of the data. This is the sum of the weighted data divided by the sum of weights.
- $stat->variance();
Returns the weighted variance of the data. The "on-line" weighted incremental algorithm for variance is based on D. H. D. West (1979). Communications of the ACM, 22, 9, 532-535: Updating Mean and Variance Estimates: An Improved Method. Division by n-1 is used, where "n" refers to the count, or number of variates added. See SYNOPSIS above.
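The on-line update can be illustrated with a short, language-agnostic sketch. The following Python code follows West's recurrence; the final n/(n-1) correction is an assumption inferred from the documented behavior and the SYNOPSIS comparison, not taken from the module's source:

```python
def weighted_mean_var(pairs):
    """West (1979) incremental weighted mean and variance.

    pairs: iterable of (variate, weight) with positive weights.
    Returns (mean, variance). The n/(n-1) correction, with n the
    number of variates added, is an assumed normalization chosen
    to reproduce the SYNOPSIS example.
    """
    sumw = 0.0   # running sum of weights
    mean = 0.0   # running weighted mean
    s = 0.0      # running sum of w * (x - mean)^2
    n = 0        # count of variates
    for x, w in pairs:
        n += 1
        old_sumw = sumw
        sumw += w
        delta = x - mean
        r = delta * w / sumw
        mean += r
        s += old_sumw * delta * r
    var = (s / sumw) * n / (n - 1) if n > 1 else 0.0
    return mean, var

# The SYNOPSIS comparison: same values, different counts.
m1, v1 = weighted_mean_var([(x, 1) for x in (1, 1, 1, 1, 7, 7, 7, 7)])
m2, v2 = weighted_mean_var([(1, 4), (7, 4)])
# The means agree (4.0), but v1 = 72/7 while v2 = 18: the sample-size
# correction sees n = 8 in one case and n = 2 in the other.
```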
- $stat->standard_deviation();
Returns the weighted standard deviation of the data. Division by n-1 is used.
- $stat->min();
Returns the minimum value of the data set.
- $stat->max();
Returns the maximum value of the data set.
- $stat->sample_range();
Returns the sample range (max - min) of the data set.
Full Methods
As with the Sparse methods above, any Full method caches its result when called so that it does not have to be recalculated.
- $stat = Statistics::Descriptive::Weighted::Full->new();
Create a new statistics object that inherits from Statistics::Descriptive::Weighted::Sparse so that it contains all the methods described above.
- $stat->add_data([1,2,4,5],[2,2,2,5]);
Adds weighted data to the statistics object. All of the sparse statistical values are updated and cached. Cached values from Full methods are deleted since they are no longer valid.
Note: Calling add_data with an empty array will delete all of your Full method cached values! Cached values for the sparse methods are not changed.
- $stat->mode();
Returns the data value with the most weight. If a data value is observed multiple times, its successive weights are summed.
- $stat->maxweight();
The weight of the mode.
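In terms of the definitions above, the mode is simply the value with the largest summed weight. A minimal Python sketch (an illustration of the definition, not the module's treap-based implementation):

```python
from collections import defaultdict

def weighted_mode(pairs):
    """Return (mode, maxweight): the value with the largest total
    weight, and that total. Repeated values have weights summed."""
    totals = defaultdict(float)
    for x, w in pairs:
        totals[x] += w
    mode = max(totals, key=totals.get)
    return mode, totals[mode]

# The SYNOPSIS data: 3 is observed twice (weights 10 and 10), but 4,
# with weight 100, still carries the most weight.
mode, maxweight = weighted_mode([(1, 0.1), (2, 1), (3, 10), (4, 100), (3, 10)])
# mode == 4, maxweight == 100
```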
- $stat->count(10);
The number of observations of a particular data value.
- $stat->weight(10);
The total weight of a particular data value.
- $x = $stat->cdf(4);
Returns the weighted empirical cumulative distribution function (ECDF).
For example, given the 6 measurements:
-2, 7, 7, 4, 18, -5
with weights:
2, 1, 1, 2, 2, 2
Let F(x) be the ECDF of x, which is defined as the sum of all normalized weights of all observed variates less than or equal to x.
Then F(-8) = 0, F(-5.0001) = 0, F(-5) = 1/5, F(-4.999) = 1/5, F(7) = 4/5, F(18) = 1, F(239) = 1.
Note that we can recover the different measured values and how many times each occurred from F(x) -- no information regarding the range in values is lost. Summarizing measurements using histograms, on the other hand, in general loses information about the different values observed, so the ECDF is preferred.
Using either the ECDF or a histogram, however, we do lose information regarding the order in which the values were observed. Whether this loss is potentially significant will depend on the metric being measured.
(Modified from: pod from Statistics::Descriptive, itself taken from RFC2330 - Framework for IP Performance Metrics, Section 11.3. Defining Statistical Distributions. RFC2330 is available from: http://www.cis.ohio-state.edu/htbin/rfc/rfc2330.html.)
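The definition translates directly into a (deliberately naive, linear-time) Python sketch; the module itself answers such queries efficiently via its balanced tree:

```python
def ecdf(pairs, x):
    """Weighted ECDF: normalized total weight of variates <= x."""
    total = sum(w for _, w in pairs)
    return sum(w for v, w in pairs if v <= x) / total

# The six measurements and weights from the example above.
data = [(-2, 2), (7, 1), (7, 1), (4, 2), (18, 2), (-5, 2)]
# F(-8) = 0, F(-5) = 1/5, F(7) = 4/5, F(18) = 1, as in the text.
```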
- $x = $stat->survival(8);
Complement of the weighted cdf function, also known as the weighted survival function. The weighted survival function S(x) is the sum of all normalized weights of all observed variates greater than x.
For example, given the 6 measurements:
-2, 7, 7, 4, 18, -5
with weights:
2, 1, 1, 2, 2, 2
Then S(-8) = 1, S(-5.0001) = 1, S(-5) = 4/5, S(-4.999) = 4/5, S(7) = 1/5, S(18) = 0, S(239) = 0.
- $x = $stat->rtp(8);
The weighted right tail probability function. The weighted right tail probability function P(x) is the sum of all normalized weights of all observed variates greater than or equal to x. This may be useful for Monte Carlo estimation of P-values.
For example, given the 6 measurements:
-2, 7, 7, 4, 18, -5
with weights:
2, 1, 1, 2, 2, 2
Then P(-8) = 1, P(-5.0001) = 1, P(-5) = 1, P(-4.999) = 4/5, P(7) = 2/5, P(18) = 1/5, P(239) = 0.
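survival() and rtp() differ only in the strictness of the inequality, as a sketch following the two definitions makes plain:

```python
def survival(pairs, x):
    """Normalized total weight of variates strictly greater than x."""
    total = sum(w for _, w in pairs)
    return sum(w for v, w in pairs if v > x) / total

def rtp(pairs, x):
    """Right tail probability: normalized total weight of variates >= x."""
    total = sum(w for _, w in pairs)
    return sum(w for v, w in pairs if v >= x) / total

# The six measurements and weights from the example above.
data = [(-2, 2), (7, 1), (7, 1), (4, 2), (18, 2), (-5, 2)]
# S(7) = 1/5 but P(7) = 2/5: the two weight-1 observations of 7 are
# excluded by the strict inequality and included by rtp.
```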
- $x = $stat->quantile(0.25);
Returns the weighted quantile. This is the inverse of the weighted ECDF function. It is only defined for arguments between 0 and 1 inclusive. If F(x) is the ECDF, then the weighted quantile function G(y) returns the smallest variate x whose weighted ECDF F(x) is greater than or equal to y.
For example, given the 6 measurements:
-2, 7, 7, 4, 18, -5
with weights:
2, 1, 1, 2, 2, 2
Then G(0) = -5, G(0.1) = -5, G(0.2) = -5, G(0.25) = -2, G(0.4) = -2, G(0.8) = 7, G(1) = 18.
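The generalized-inverse definition admits a direct linear-scan sketch in Python (again, not the module's tree-based search):

```python
def quantile(pairs, y):
    """Smallest variate whose weighted ECDF is >= y, for 0 <= y <= 1."""
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in sorted(pairs):
        cum += w
        if cum / total >= y:
            return v
    return max(v for v, _ in pairs)  # guard against rounding at y == 1

# The six measurements and weights from the example above.
data = [(-2, 2), (7, 1), (7, 1), (4, 2), (18, 2), (-5, 2)]
# G(0.25) = -2: F(-5) = 0.2 < 0.25, but F(-2) = 0.4 >= 0.25.
```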
- $x = $stat->percentile(25);
Returns the weighted percentile. It is only defined for arguments between 0 and 100 inclusive. Unlike the quantile function above, the percentile function performs weighted linear interpolation between variates unless the argument exactly equals the computed percentile of one of the variates.
Define p_n to be the percentile of the nth sorted variate, written v_n, like so:
p_n = 100/S_N * (S_n - (w_n / 2)),
where S_N is the sum of all weights, S_n is the partial sum of weights up to and including the nth variate, and w_n is the weight of the nth variate.
Given a percent value 0 <= y <= 100, find an integer k such that:
p_k <= y <= p_(k+1).
The percentile function P(y) may now be defined:
P(y) = v_k + {[(y - p_k) / (p_(k+1) - p_k)] * (v_(k+1) - v_k)}
This definition of weighted percentile was taken from: http://en.wikipedia.org/wiki/Percentile on Dec 15, 2008.
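The interpolation can be sketched in Python as follows. The handling of arguments below p_1 or above p_N (clamping to the extreme variates) is an assumption not spelled out in the definition above, and the sketch assumes pre-aggregated unique variates:

```python
def percentile(pairs, y):
    """Weighted percentile with linear interpolation, following
    p_n = 100/S_N * (S_n - w_n/2) as defined above.

    pairs holds unique (variate, weight) entries. Arguments outside
    [p_1, p_N] are clamped to the extreme variates (a boundary-handling
    assumption, not part of the quoted definition)."""
    pts = sorted(pairs)
    total = sum(w for _, w in pts)
    ps, vs, cum = [], [], 0.0
    for v, w in pts:
        cum += w
        ps.append(100.0 / total * (cum - w / 2.0))
        vs.append(v)
    if y <= ps[0]:
        return vs[0]
    if y >= ps[-1]:
        return vs[-1]
    for k in range(len(ps) - 1):
        if ps[k] <= y <= ps[k + 1]:
            frac = (y - ps[k]) / (ps[k + 1] - ps[k])
            return vs[k] + frac * (vs[k + 1] - vs[k])

# The example data above, with the two observations of 7 aggregated.
data = [(-5, 2), (-2, 2), (4, 2), (7, 2), (18, 2)]
# Here p_n = 10, 30, 50, 70, 90. percentile(25) interpolates between
# v_1 = -5 and v_2 = -2: -5 + (15/20)*3 = -2.75. percentile(50) lands
# exactly on p_3 and returns 4, which is also the median.
```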
- $stat->median();
This is calculated as $stat->percentile(50) and cached as necessary.
- $stat->get_data();
Returns a data structure that reconstitutes the original data added to the object, supplemented by some of the distributional summaries. Returns a reference to a hash, with the following keys, each pointing to a reference to an array containing the indicated data.
- vars
The observed variates, sorted.
- weights
The weights of the variates (in corresponding order to the value of 'vars').
- order
The order of addition of the variates (in corresponding order to the value of 'vars').
- uniqvars
The uniquely observed variates, sorted.
- counts
The numbers of times each variate was observed (in corresponding order to the value of 'uniqvars').
- sumweights
The total weight of each unique variate (in corresponding order to the value of 'uniqvars').
- cdfs
The cdf of each unique variate (in corresponding order to the value of 'uniqvars').
- rtps
The right tail probabilities of each unique variate (in corresponding order to the value of 'uniqvars').
- percentiles
The percentiles of each unique variate (see "percentile" above for definition, given in corresponding order to the value of 'uniqvars').
- $stat->print();
Prints a Data::Dumper dump of the hashref returned by get_data().
REPORTING ERRORS
When reporting errors, please include the following to help me out:
- Your version of perl. This can be obtained by typing perl -v at the command line.
- Which versions of Statistics::Descriptive and Statistics::Descriptive::Weighted you're using.
- Details about what the error is. Try to narrow down the scope of the problem and send me code that I can run to verify and track it down.
AUTHOR
David H. Ardell
dhard@cpan.org (or Google works).
REFERENCES
RFC2330, Framework for IP Performance Metrics
D. H. D. West (1979). Communications of the ACM, 22, 9, 532-535: Updating Mean and Variance Estimates: An Improved Method.
Tree::Treap Copyright 2002-2005 Andrew Johnson. http://stuff.siaris.net
COPYRIGHT
Copyright (c) 2008 David H. Ardell. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Portions of this code are from Statistics::Descriptive, which is under the following copyrights.
Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Copyright (c) 1994,1995 Jason Kastner. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.