NAME

Catmandu::Exporter::Stat - a statistical export

SYNOPSIS

# Calculate statistics on the availability of the ISBN fields in the dataset
cat data.json | catmandu convert -v JSON to Stat --fields isbn

# Calculate statistics on the uniqueness of ISBN numbers in the dataset
cat data.json | catmandu convert -v JSON to Stat --fields isbn --values 1

# Export the statistics as YAML
cat data.json | catmandu convert -v JSON to Stat --fields isbn --values 1 --as YAML

DESCRIPTION

The Catmandu::Stat package can be used to calculate statistics on the availability of fields in a data file. Use this exporter to count how often fields occur or how many duplicate values they contain. For each field the exporter calculates the following statistics:

* name    : the name of a field
* count   : the number of non-zero occurrences of a field in all records
* zeros   : the number of records without a field
* zeros%  : the percentage of records without a field
* min     : the minimum number of occurrences of a field in any record
* max     : the maximum number of occurrences of a field in any record
* mean    : the mean number of occurrences of a field in all records
* median  : the median number of occurrences of a field in all records
* mode    : the most common number of occurrences of a field in all records
* variance : the variance of the number of occurrences of a field
* stdev   : the standard deviation of the number of occurrences of a field
* uniq    : the number of unique values
* entropy : the minimum and maximum entropy in the field values
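
As a rough illustration of the per-field statistics listed above, the following Python sketch summarises a list of per-record occurrence counts for one field. It is not the exporter's implementation: the function name field_stats is hypothetical, 'count' is read here as the number of records in which the field occurs at least once, and the use of population variance (dividing by the number of records) is an assumption.

```python
import statistics

def field_stats(counts):
    """Summarise how often one field occurs in each record.

    counts holds one integer per record (0 when the field is absent).
    Hypothetical helper; population variance is an assumption.
    """
    total = len(counts)
    zeros = sum(1 for c in counts if c == 0)
    return {
        "count": total - zeros,            # records that contain the field
        "zeros": zeros,                    # records without the field
        "zeros%": 100.0 * zeros / total,   # share of records without it
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "mode": statistics.mode(counts),
        "variance": statistics.pvariance(counts),
        "stdev": statistics.pstdev(counts),
    }

# Four records in which the field occurs 1, 0, 2 and 1 times
stats = field_stats([1, 0, 2, 1])
```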

In case of values:

* count   : the number of non-zero values found in all records
* zeros   : the number of values which are null or undefined
* zeros%  : the percentage of values which are undefined
* min     : the minimum number of occurrences of a value in all records
* max     : the maximum number of occurrences of a value in all records
* mean    : the mean number of occurrences of a value in all records
* median  : the median number of occurrences of a value in all records
* variance : the variance of the number of occurrences of a value
* stdev   : the standard deviation of the number of occurrences of a value
* uniq    : the number of unique values
* entropy : the minimum and maximum entropy in the field values

Details:

* entropy is an indication of the variation in field values (are some values more unique than others)
* entropy values are displayed as : minimum/maximum entropy
* when the minimum entropy = 0, then all the field values are equal
* when the minimum and maximum entropy are equal, then all the field values are different
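
One reading that is consistent with the details above is that the first number is the Shannon entropy of the observed values and the second is the largest entropy possible for that many values (log2 of the number of values). The Python sketch below follows that reading. It is an assumption about what the exporter reports, not taken from its source, and the function name entropy_range is hypothetical.

```python
import math
from collections import Counter

def entropy_range(values):
    """Return (entropy, maximum entropy) in bits for a list of values.

    Assumed interpretation: Shannon entropy of the value distribution,
    and log2(n) as the upper bound reached when all n values differ.
    """
    n = len(values)
    if n == 0:
        return 0.0, 0.0
    counts = Counter(values)
    # Sum of p * log2(1/p) over the distinct values
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    h_max = math.log2(n)
    return h, h_max

same = entropy_range(["a", "a", "a", "a"])       # all values equal
distinct = entropy_range(["a", "b", "c", "d"])   # all values different
```

Under this reading, the first call gives an entropy of 0 and the second gives an entropy equal to its maximum, matching the two special cases described above.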

CONFIGURATION

v

Verbose output. Show the processing speed.

fix FIX

A fix or a fix file containing one or more fixes applied to the input data before the statistics are calculated.

fields KEY[,KEY,...]

One or more fields in the data for which statistics need to be calculated. Deeply nested fields are not allowed. The exporter collects statistics on the availability of each field in all records. For instance, the following record contains one 'title' field, zero 'isbn' fields and 3 'author' fields:

---
title: ABCDEF
author: 
    - Davis, Miles
    - Parker, Charly
    - Mingus, Charles
year: 1950
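
The counting rule described above can be sketched in Python as follows. This is only an illustration under the assumption that a list value counts once per element and a scalar value counts once; the helper name occurrences is hypothetical and not part of the exporter.

```python
def occurrences(record, field):
    """Count how many times a top-level field occurs in one record."""
    value = record.get(field)
    if value is None:
        return 0              # field is absent
    if isinstance(value, list):
        return len(value)     # repeated field: one occurrence per element
    return 1                  # single scalar value

record = {
    "title": "ABCDEF",
    "author": ["Davis, Miles", "Parker, Charly", "Mingus, Charles"],
    "year": 1950,
}

counts = {f: occurrences(record, f) for f in ("title", "isbn", "author")}
```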

Examples of operation:

# Calculate statistics on the number of records that contain a 'title' 
cat data.json | catmandu convert JSON to Stat --fields title

# Calculate statistics on the number of records that contain a 'title', 'isbn' or 'subject' fields 
cat data.json | catmandu convert JSON to Stat --fields title,isbn,subject

# The next example will not work: no deeply nested fields allowed
cat data.json | catmandu convert JSON to Stat --fields foo.bar.x.y

When no fields parameter is given, all fields are read from the first input record.

values 0 | 1

When the values option is activated, the statistics are calculated on the contents of the fields instead of on their availability. Use this option to calculate statistics on duplicate field values. For instance, in the following example the title field has 2 duplicates and the author field has zero duplicates. The year field is available in 2 out of 3 records, but it contains a value in only one record (33%).

---
title: ABC
author: 
    - Test
    - Test2
---
title: ABC
author:
    - Test3
year: ''
---
title: DEF
year: 1980
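
To make the example concrete, the Python sketch below tallies how often each value occurs per field for the three records above. It is an illustration only: the helper name value_counts is hypothetical, and treating an empty string as "no value" is an assumption chosen to match the description of the year field.

```python
from collections import Counter

records = [
    {"title": "ABC", "author": ["Test", "Test2"]},
    {"title": "ABC", "author": ["Test3"], "year": ""},
    {"title": "DEF", "year": 1980},
]

def value_counts(records, field):
    """Tally the non-empty values of one field over all records."""
    counter = Counter()
    for rec in records:
        value = rec.get(field)
        values = value if isinstance(value, list) else [value]
        for v in values:
            # Skip missing and empty values (assumption, see lead-in)
            if v is not None and v != "":
                counter[v] += 1
    return counter

titles = value_counts(records, "title")    # 'ABC' occurs twice
authors = value_counts(records, "author")  # every author occurs once
years = value_counts(records, "year")      # only 1980 counts as a value
```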

Examples of operation:

# Calculate statistics on the uniqueness of ISBN numbers in the dataset
cat data.json | catmandu convert JSON to Stat --fields isbn --values 1

as Table | CSV | YAML | JSON | ...

By default the statistics are exported in a CSV format. Use the 'as' option to change the export format.

SEE ALSO

Catmandu::Exporter, Statistics::Basic