NAME

PDLx::Bin1D - One Dimensional Binning Utilities

VERSION

version 0.24

SYNOPSIS

use PDL;
use PDLx::Bin1D;

DESCRIPTION

One dimensional binning functions,

  • binning up to a given S/N

  • optimized one-pass robust statistics

All functions are made available in the PDLx::Bin1D namespace.

SUBROUTINES

bin_adaptive_snr

%hash = bin_adaptive_snr( %options  );

Adaptively bin a data set to achieve a minimum signal to noise ratio in each bin.

This routine ignores data with bad values or with errors that have bad values.

bin_adaptive_snr groups data into bins such that each bin meets one or more conditions:

  • a minimum signal to noise ratio (S/N).

  • A minimum number of data elements (optional).

  • A maximum number of data elements (optional).

  • A maximum data width (see below) (optional).

  • A minimum data width (see below) (optional).

The data are typically dependent values (e.g. flux as a function of energy or counts as a function of radius). The data should be sorted by the independent variable (e.g. energy or radius).

Calculation of the S/N requires an estimate of the error associated with each datum. The error may be provided or may be estimated from the population using either the number of data elements in a bin (e.g. Poisson errors) or the standard deviation of the signal in a bin. If errors are provided, they may be used to weight the population standard deviation or may be added in quadrature.

Binning begins at the start of the signal vector. Data are accumulated into a bin until one or more of the possible criteria is met. If the final bin does not meet the required criteria, it may optionally be successively folded into preceding bins until the final bin passes the criteria or there are no bins left.

Each datum may be assigned an extra parameter, its width, which is summed for each bin, and can be used as an additional constraint on bin membership.

Parameters

bin_adaptive_snr is passed a hash or a reference to a hash containing its parameters. The available parameters are:

signal

A piddle containing the signal data. This is required.

error

A piddle with the error for signal datum. Optional.

width

A piddle with the width of each element of the signal. Optional.

error_algo

A string indicating how the error is to be handled or calculated. It may be have one of the following values:

  • poisson

    Poisson errors will be calculated based upon the number of elements in a bin,

    error**2 = N

    Any input errors are ignored.

  • sdev

    The error is the population standard deviation of the signal in a bin.

    error**2 = Sum [ ( signal - mean ) **2 ] / ( N - 1 )

    If errors are provided, they are used to calculated the weighted population standard deviation.

    error**2 = ( Sum [ (signal/error)**2 ] / Sum [ 1/error**2 ] - mean**2 )
               * N / ( N - 1 )
  • rss

    Errors must be provided; the errors of elements in a bin are added in quadrature.

min_snr

The minimum signal to noise ratio to be achieved in each bin. Required.

min_nelem
max_nelem

The minimum and/or maximum number of elements to be achieved in each bin. Optional

min_width
max_width

The minimum and/or maximum width of the elements to be achieved in each bin. Optional.

fold boolean

If true, the last bin may be folded into the preceding bin in order to ensure that the last bin meets one or more of the criteria. It defaults to false.

Results

bin_adaptive_snr returns a hashref with the following entries:

index

A piddle containing the bin indices for the elements in the input data piddle. Data which were skipped because of bad values will have their index set to the bad value.

nbins

A piddle containing the number of bins which spanned the range of the input data.

signal

A piddle containing the sum of the data values in each bin. Only indices 0 through nbins -1 are valid.

nelem

A piddle containing the number of data elements in each bin. Only indices 0 through nbins -1 are valid.

error

A piddle containing the errors in each bin, calculated using the algorithm specified via error_algo. Only indices 0 through nbins -1 are valid.

mean

A piddle containing the weighted mean of the signal in each bin. Only indices 0 through nbins -1 are valid.

ifirst

A piddle containing the index into the input data piddle of the first data value in a bin. Only indices 0 through nbins -1 are valid.

ilast

A piddle containing the index into the input data piddle of the last data value in a bin. Only indices 0 through nbins -1 are valid.

rc

A piddle containing a results code for each output bin. Only indices 0 through nbins -1 are valid. The code is the bitwise "or" of the following constants (available in the PDLx::Bin1D namespace)

BIN_RC_OK

The bin met the minimum S/N, data element count and weight requirements

BIN_RC_GEWMAX

The bin weight was greater or equal to that requested.

BIN_RC_GENMAX

The number of data elements was greater or equal to that requested.

BIN_RC_FOLDED

The bin is the result of folding bins at the end of the bin vector to achieve a minimum S/N.

BIN_RC_GTMINSN

The bin accumulated more data elements than was necessary to meet the S/N requirements. This results from constraints on the minimum number of data elements or bin weight.

bin_on_index

$hashref = bin_on_index( %pars  );

Calculate basic statistics on optionally weighted, binned data using a provided index. The input data are assumed to be one-dimensional; extra dimensions are threaded over.

This routine ignores data with bad values or with weights that have bad values.

Description

When generating statistics for multiple-component data binned on a common component, it's more efficient to calculate a bin index for the common component and then use it to generate statistics for each component.

For example, if a time dependent stream of events is binned into time intervals, statistics of the events' properties (such as position or energy) must be evaluated on data binned on the intervals.

Some statistics (such as the summed squared deviation from the mean) may be calculated in two passes over the data, but may suffer from numerical inaccuracies depending upon the magnitude of the intermediate values.

bin_on_index uses numerically stable algorithms to calculate various statistics on binned data in a single pass. It uses an index piddle to assign data to bins and calculates basic statistics on the data in those bins. Data may have associated weights, and the index piddle need not be sorted. It by default ignores out-of-bounds data, but can be directed to operate on them.

The statistics which are returned for each bin are:

  • The number of elements, N

  • The (weighted) sum of the data, Sum(x_i), Sum(w_i * x_i).

  • The (weighted) mean of the data, Sum(x_i) / N, Sum(w_i * x_i) / Sum(w_i)

  • The sum of the weights, Sum(w_i)

  • The sum of the square of the weights, Sum(w_i^2)

  • The sum of the (weighted) squared deviation of the data from mean, Sum( (x_i-u)^2 ), Sum( w_i(x_i-u)^2 ). These are not normalized!.

  • The minimum and maximum data values

The sum of the squared deviations are not normalized, to allow the user to handle it according to their needs.

For unweighted data, the typical normalization factor is N-1, while for weighted data, the normalization factor varies depending upon whether the weights represent errors or quality weights. In the former case, where w_i = 1 / sigma^2, the normalization factor is typically N / (N - 1) / Sum(w_i), while for the latter the normalization is typically Sum(w_i) / ( Sum(w_i)^2 - Sum(w_i^2). See "[Robinson]" and "[Bevington]".

The algorithms used are chosen for their numerical stability. Sums are computed using Kahan summation and the mean and squared deviations are calculated with stable updating algorithms. See "[Chan83]", "[West79]", "[Higham]".

Threading

The parameters "data", "index", "weight", "imin", "imax", and "nbins" are threaded over. This keeps things quite flexible, as one can specialize things for complex datasets or keep them simple by specifying the minimum information for a given parameter to have it remain constant over the threaded dimensions. (Note that data must have the same shape or be a superset of the other parameters).

Unless otherwise specified, the term extents refers to those of the core (non-threaded) one-dimensional slices of the input and output piddles.

For some parameters there is value in applying an algorithm to either the entire dataset (including threaded dimensions) or just the core one-dimensional data. For example, the "range" option can indicate that the in-bounds bins are defined by the range in each one-dimensional slice of "index" or in "index" as a whole.

Minimum Bin Index, Number of Bins, Out-Of-Bounds Data

By default, the minimum bin index and the number of bins are determined from the values in the "index" piddle, treating it as if it were one-dimensional. Statistics for data with the minimum index are stored in the results piddles at index 0.

The caller may use the options "imin", "imax", and "nbins", and "range" to change how the index range is mapped onto the results piddles, and whether the range should be specific to each one-dimensional slice of "index". If none of these are specified, the default is equivalent to setting "range" to flat,minmax. To most efficiently store the statistics, set "range" to slice,minmax.

Data with indices outside of the range [$imin, $imin + $nbins - 1] are treated as out-of-bounds data, and by default are ignored. If the user wishes to accumulate statistics on these data, the "oob" option may be used to indicate where in the results piddles the statistics should be stored.

Parameters

bin_on_index is passed a hash or a reference to a hash containing its parameters. The possible parameters are:

data piddle

The data. Required

index piddle

The bin index. It should be of type indx, but will be converted to that if not. It must be thread compatible with "data". Required

weight piddle

Data weight. It must be thread compatible with "data". Optional

nbins integer | piddle

The number of bins. It must be thread compatible with "data".

If nbins is set and neither of imin or imax are set, imin is set to 0.

Use "range" for more control over automatic determination of the range.

imin integer | piddle

The index value associated with the first in-bounds index in the result piddle. It must be thread compatible with "data".

If imin is set and neither of nbins or imax are set, imax is set to $index->max.

Use "range" for more control over automatic determination of the range.

imax integer | piddle

The index value associated with the last in-bounds index in the result piddle. It must be thread compatible with "data".

If imax is set and neither of nbins or imin are set, imin is set to 0.

Use "range" for more control over automatic determination of the range.

range "spec,spec,..."

Determine the in-bounds range of indices from "index". The value is a string containing a list of comma separated specifications.

The first element must be one of the following values:

flat

Treat "index" as one-dimensional, and determine a single range which covers it.

slice

Determine a separate range for each one-dimensional slice of "index".

It may optionally be followed by one of the following

minmax

Determine the full range (i.e. both minimum and maximum) from "index". This is the default. Do not specify "nbins", "imax", or "imin".

min

Determine the minimum of the range from "index". Specify only one of "nbins" or "imax".

max

Determine the maximum of the range from "index". Specify only one of "nbins" or "imin".

oob

An index is out-of-bounds if it is not within the range

[ $imin, $imin + $nbins )

By default, it will be ignored.

This option specifies where in the results piddles the out-of-bound data statistics should be stored. It may be one of:

start-end

The extent of the results piddle is increased by two, and out-of-bound statistics are written as follows:

$index - $imin < 0          ==> $result->at(0)
$index - $imin >= $nbins    ==> $result->at(-1)
start-nbins

The extent of the results piddle is increased by two, and out-of-bound statistics are written as follows:

$index - $imin < 0          ==> $result->at(0)
$index - $imin >= $nbins    ==> $result->at($nbins)

This differs from start-end if $nbins is different for each one-dimensional slice.

end

The extent of the results piddle is increased by two, and out-of-bound statistics are written as follows:

$index - $imin < 0          ==> $result->at(-2)
$index - $imin >=  $nbins   ==> $result->at(-1)
nbins

The extent of the results piddle is increased by two, and out-of-bound statistics are written as follows:

$index - $imin < 0          ==> $result->at($nbins-1)
$index - $imin >=  $nbins   ==> $result->at($nbins)

This differs from end if $nbins is different for each one-dimensional slice.

boolean

If false (the default) out-of-bounds data are ignored. If true, it is equivalent to specifying "start-end"

want_sum_weight boolean [false]

if true, the sum of the bins' weights are calculated.

want_sum_weight2 boolean [false]

if true, the sum of square of the bins' weights are calculated.

Results

bin_on_index returns a reference which may be used either a hash or object reference.

The keys (or methods) and their values are as follows:

data

A piddle containing the (possibly weighted) sum of the data in each bin.

count

A piddle containing the number of data elements in each bin.

weight

A piddle containing the sum of the weights in each bin (only if "weight" was provided and "want_sum_weight" specified).

weight2

A piddle containing the sum of the square of the weights in each bin (only if "weight" was provided and "want_sum_weight2" specified)..

mean

A piddle containing the (possibly weighted) mean of the data in each bin.

dmin
dmax

Piddles containing the minimum and maximum values of the data in each bin.

imin
nbins

The index value associated with the first in-bounds index in the statistics piddles, and the number of in-bounds bins.

References

[Chan83]

Chan, Tony F., Golub, Gene H., and LeVeque, Randall J., Algorithms for Computing the Sample Variance: Analysis and Recommendations, The American Statistician, August 1983, Vol. 37, No.3, p 242.

[West79]

D. H. D. West. 1979. Updating mean and variance estimates: an improved method. Commun. ACM 22, 9 (September 1979), 532-535.

[Higham]

Higham, N. (1996). Accuracy and stability of numerical algorithms. Philadelphia: Society for Industrial and Applied Mathematics.

[Robinson]

Robinson, E.L., Data Analysis for Scientists and Engineers, 2016, Princeton University Press. ISBN 978-0-691-16992-7.

[Bevington]

Bevington, P.R., Robinson, D.K., Data Reduction and Error Analysis for the Physical Sciences, Second Edition, 1992, McGraw-Hill, ISBN 0-07-91243-9.

BUGS AND LIMITATIONS

No bugs have been reported.

AUTHOR

Diab Jerius, <djerius@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2008 Smithsonian Astrophysical Observatory

PDLx::Bin1D is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

SUPPORT

Bugs

Please report any bugs or feature requests to bug-pdlx-bin1d@rt.cpan.org or through the web interface at: https://rt.cpan.org/Public/Dist/Display.html?Name=PDLx-Bin1D

Source

Source is available at

https://gitlab.com/djerius/pdlx-bin1d

and may be cloned from

https://gitlab.com/djerius/pdlx-bin1d.git

SEE ALSO

Please see those modules/websites for more information related to this module.

AUTHOR

Diab Jerius <djerius@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2021 by Smithsonian Astrophysical Observatory.

This is free software, licensed under:

The GNU General Public License, Version 3, June 2007