NAME

GHCN::StationTable - collect station objects and weather data

SYNOPSIS

use v5.10;    # enables say
use GHCN::StationTable;

my $ghcn = GHCN::StationTable->new;

my ($opt, @errors) = $ghcn->set_options(
  user_options => {
      country     => 'US',
      state       => 'NY',
      location    => 'New York',
      report      => 'yearly',
  },
);
die @errors if @errors;

$ghcn->load_stations;

# generate a list of the stations that were selected
say $ghcn->get_stations( kept => 1 );

if ($opt->report) {
    say $ghcn->get_header;

    $ghcn->load_data();
    $ghcn->summarize_data;

    say $ghcn->get_summary_data;
    say $ghcn->get_footer;
}

DESCRIPTION

The GHCN::StationTable module provides a class that is used to fetch station information from the NOAA Global Historical Climatology Network database, along with temperature and/or precipitation records from the daily historical records.

For a more comprehensive example than the above Synopsis, see the section EXAMPLE PROGRAM.

Caveat emptor: incompatible interface changes may occur on releases prior to v1.00.000. (See VERSIONING and COMPATIBILITY.)

The module is primarily for use by the GHCN::Fetch module.

FIELD ACCESSORS

opt_obj

Returns a reference to the Options object created by set_options.

opt_href

Returns a reference to a hash of the Options created by set_options.

config_file

Returns the name of the configuration file, if one was passed to set_options.

config_href

Returns a reference to a hash containing the configuration options set by set_options (if any).

stn_count

Returns a count of the total number of stations found in the station list.

stn_selected_count

Returns a count of the number of stations that were selected for processing.

stn_filtered_count

Returns a count of the number of stations that were selected for processing, excluding those rejected due to errors or other criteria.

missing_href

Returns a reference to a hash of the missing months and days for the selected data.

METHODS

new ()

Create a new StationTable object.

export_kml( list => 0 )

Output the coordinates of the station collection as a KML file, for import into Google Earth as placemarks. The active range of each station will be included as timespans so that you can view the placemarks across time.

argument: list

If the argument list contains the 'list' keyword and a true value, then export_kml will return a string with the kml output as lines of text rather than writing it to the file specified by the kml option.

option: kml <filespec>

Write the kml output to the file designated by <filespec>. If <filespec> is an empty string, no file is written.

option: color <str>

A color name, one of blue, green, azure, purple, red, white or yellow. Only the first character is recognized, so 'b' and 'bob' both result in blue. All colors are given an opacity of 50 (the range is 00 to ff).
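
The first-character rule can be illustrated with a small lookup. This is only a sketch: the helper name color_name is hypothetical, and it does not reproduce the module's actual code or the KML color values it emits.

```perl
use strict;
use warnings;

# Color names listed for the color option, keyed by their first character.
my %COLOR_FOR = (
    b => 'blue',   g => 'green', a => 'azure', p => 'purple',
    r => 'red',    w => 'white', y => 'yellow',
);

# Hypothetical helper: resolve a user-supplied string to a color name by
# looking only at its first character. Returns undef if unrecognized.
sub color_name {
    my ($str) = @_;
    return unless defined $str && length $str;
    return $COLOR_FOR{ substr( $str, 0, 1 ) };
}

for my $input (qw(b bob yellow x)) {
    my $name = color_name($input) // 'unrecognized';
    print "$input => $name\n";
}
```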

flag_counts ()

The load_stations() and load_data() methods may reject a station or a particular data entry due to quality or other issues. These decisions are kept in a hash field, and a reference to that hash is returned by this method. The caller can then report the values.

get_flag_statistics ( list => 0, no_header => 0 )

Gets a header row and summary table of data points that were kept and rejected, along with counts of QFLAGS (quality flags). Returns tab-separated text, or a list if the list argument is true. A heading line is provided unless no_header is true.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.
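
When the default tab-separated text is returned, a caller may want to split it back into rows and columns. The following is a minimal sketch of one way to do that; the helper name parse_tsv and the sample text are made up for illustration and are not output produced by the module.

```perl
use strict;
use warnings;

# Split a block of tab-separated text into a header row and data rows.
# Each row is returned as an array reference of column values.
sub parse_tsv {
    my ($text) = @_;
    my @rows   = map { [ split /\t/, $_, -1 ] }
                 grep { length } split /\n/, $text;
    my $header = shift @rows;
    return ( $header, @rows );
}

# Made-up sample standing in for a method's return value.
my $text = "Flag\tCount\nKept\t120\nRejected\t8\n";

my ( $header, @rows ) = parse_tsv($text);
print join( ' | ', @{$header} ), "\n";
print join( ' | ', @{$_} ), "\n" for @rows;
```

If no_header is true, the first row returned by such a helper would be data rather than headings.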

get_footer( list => 0 )

Get a footing section with explanatory notes about the output data produced by detail and summary reports.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

get_hash_stats ( list => 0, no_header => 0 )

Gets the hash sizes collected during the execution of StationTable methods, notably load_stations and load_data, as tab-separated lines of text.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

get_header ( list => 0 )

The weather data obtained by the load_data() method is essentially a table. Which columns are returned depends on various options. For example, if report => 'monthly' is given, then the key columns will be year and month -- no day. If the precip option is given, then extra columns are included for precipitation values.

This variability makes it difficult for a consumer of these modules to emit a heading that matches the underlying columns. The purpose of this method is to return a set of column headings that will match the data. The value returned is a tab-separated string.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

get_missing_data_ranges( list => 0, no_header => 0 )

Gets a list, by station id and year, of any months or day ranges when data was found to be missing. Missing data can lead to incorrect interpretation and can cause a station to be rejected if the percent of found data does not meet the -quality threshold (normally 90%).

Returns a heading line followed by lines of tab-separated strings.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list of lists (stations containing years) is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

option: report <daily|monthly|yearly|id>

Determines the number and content of heading values.
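
The quality threshold mentioned above is a simple percentage test. The sketch below shows the arithmetic under the assumption that the percentage is computed as days found over days expected; the function name meets_quality is hypothetical and the module's own calculation may differ.

```perl
use strict;
use warnings;

# Return 1 if the percentage of found data meets the threshold, else 0.
# The threshold defaults to 90, the normal value of the quality option.
sub meets_quality {
    my ( $found, $expected, $threshold_pct ) = @_;
    $threshold_pct //= 90;
    return 0 if !$expected;    # nothing expected, nothing to accept
    return ( 100 * $found / $expected ) >= $threshold_pct ? 1 : 0;
}

for my $found ( 330, 300 ) {
    printf "%d of 365 days found: %s\n",
        $found, meets_quality( $found, 365 ) ? 'kept' : 'rejected';
}
```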

datarow_as_hash ( $row_aref )

This is a convenience method that may be used to convert table rows returned by the row_sub callback subroutine of load_data from a Perl list into a hash. It automatically calls get_header to get the headers for the table data. When you pass it a reference to a data row (obtained via the row_sub callback routine given to load_data), it combines the elements of the data row list with the column headings and returns a hash.
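
The combination of headings and row values amounts to a hash slice. The following sketch shows the idea in isolation; the function name row_as_hash and the column names are hypothetical, and the real headings depend on the options in effect.

```perl
use strict;
use warnings;

# Pair each column heading with the value in the same position of the row.
sub row_as_hash {
    my ( $header_aref, $row_aref ) = @_;
    my %row;
    @row{ @{$header_aref} } = @{$row_aref};
    return %row;
}

# Hypothetical headings and a made-up data row.
my @header = qw(StationId Year Month Day TMAX TMIN);
my @row    = ( 'CA006106000', 2020, 1, 15, -5.2, -14.8 );

my %data = row_as_hash( \@header, \@row );
print "$data{StationId} $data{Year}: TMAX=$data{TMAX} TMIN=$data{TMIN}\n";
```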

get_missing_rows( list => 0 )

In support of a -nogaps option, to generate detail output that does not have any gaps due to missing data, this method gets a list of rows for the months and days that had missing data for a given station id in a given year.

Returns lines of tab-separated strings.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

option: nogaps

Emits extra rows after the detail data rows to make up for missing months or days. This is primarily so that if the data is charted by date, then the x-axis will have all the dates from start to finish. Otherwise, the chart and any trends that are projected on it will be distorted by the missing data.
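
To illustrate the idea for monthly data, the sketch below generates one filler row for each month that has no value. It is a simplified illustration with made-up data and a hypothetical function name (missing_month_rows), not the module's implementation.

```perl
use strict;
use warnings;

# Given a year and a hash of month => value, return a [year, month] row
# for every month of that year that has no entry in the hash.
sub missing_month_rows {
    my ( $year, $value_for ) = @_;
    return map { [ $year, $_ ] }
           grep { !exists $value_for->{$_} } 1 .. 12;
}

my %tavg = ( 1 => -10.3, 2 => -8.1, 4 => 6.0 );    # made-up data

# These rows would follow the detail rows so a chart covers every month.
for my $row ( missing_month_rows( 2020, \%tavg ) ) {
    print join( "\t", @{$row} ), "\n";
}
```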

get_options ( list => 0, no_header => 0 )

Get text which shows the options that were in effect for this processing run, in a Getopt style. Includes a heading and a footing with explanatory notes. If argument 'list' is true, returns the lines as a list. Line [1] contains the options string.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line or the explanatory footing notes. Default is false.

get_stations ( list => 0, kept => 1, no_header => 0 )

Return lines of text with tab-separated columns describing each of the stations that were found to meet the filtering criteria specified in the user options.

argument: kept => <bool>

If the argument kept => 0 is specified, and load_data has already been invoked, then the stations which were rejected due to quality flags or missing data will be returned. If kept => 1 is specified, then the stations that were kept will be returned.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

get_station_note_list ()

Return a list consisting of tab-separated code/description pairs that rejected stations were flagged with; i.e. the reasons for their rejection.

get_summary_data ( list => 0 )

Gets the temperature or precipitation data, summarized by day, month or year depending on the report option.

Returns undef if the report option is 'id'.

The actual columns that are returned are dictated by the report option and by the tavg and precip options provided when the object was instantiated by new().

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

option: report <daily|monthly|yearly>

Determines the level of summarization.

option: range <rangelist>

If the range option is provided, the output rows are restricted to those years that are within the specified range(s).
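
Restricting rows to a set of year ranges can be sketched as follows. The rangelist syntax assumed here (comma-separated years or from-to pairs such as 1950-1960,2000) is an assumption made for illustration, as are the function names; check ghcn_fetch.pl -help for the format the module actually accepts.

```perl
use strict;
use warnings;

# Parse an assumed rangelist format such as '1950-1960,2000' into a list
# of [from, to] pairs. A single year becomes a range of one year.
sub parse_rangelist {
    my ($rangelist) = @_;
    my @ranges;
    for my $part ( split /,/, $rangelist ) {
        my ( $from, $to ) = $part =~ /\A\s*(\d{4})(?:\s*-\s*(\d{4}))?\s*\z/
            or die "invalid range '$part'\n";
        push @ranges, [ $from, $to // $from ];
    }
    return @ranges;
}

# Return 1 if the year falls inside any of the ranges, else 0.
sub year_in_ranges {
    my ( $year, @ranges ) = @_;
    for my $range (@ranges) {
        return 1 if $year >= $range->[0] && $year <= $range->[1];
    }
    return 0;
}

my @ranges = parse_rangelist('1950-1960,2000');
for my $year ( 1955, 1961, 2000 ) {
    printf "%d: %s\n", $year, year_in_ranges( $year, @ranges ) ? 'in' : 'out';
}
```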

get_timing_stats ( list => 0 )

Get a list of the timers, with durations and notes, in alphabetical order by timer label.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

has_missing_data ()

Returns true if any missing data was detected amongst the stations that were processed. The calling script can use this to decide whether to issue a warning to the user. A list of missing data specifics can be sent to the output by calling method get_missing_data_ranges.

load_data ( progress_sub => undef, row_sub => sub { say @_ } )

Load the daily weather data for each of the stations that were loaded into the collection. Print the data if option report id is given. Otherwise cache the data so it can be aggregated at a later step.

argument: progress_sub => undef

As fetching and parsing each daily data page can take some time, an optional callback hook is provided so the caller can emit a progress message before each station's data is loaded; e.g. progress_sub => sub { say {*STDERR} @_ }.

argument: row_sub => sub { say @_ }

Optional callback hook to allow the caller to provide their own subroutine for printing (or collecting in a list, or both) the row-level station data that is fetched when the report option is 'id'. Defaults to printing via the 'say' operator.

option: report <id|daily|monthly|yearly>

When report id is specified, the weather data for each station is printed immediately (via the row_sub callback hook).

For all other report options, the data is fetched from each station and kept in a cache so that it can be aggregated by invoking summarize_data(). The row_sub hook is not invoked.

load_stations ()

Read the GHCN stations list and the stations inventory list and create a hash of Station objects, keyed on station id, filtered according to the options provided in set_options().

Returns a hash of GHCN::Station objects, keyed on station id.

option: country <str>

Selects only those stations that match the 2-character GEC (formerly FIPS) country code or that uniquely match the name or partial name given in <str>.

option: state <code>

Selects only those stations that match a US state or Canadian province code.

option: location <str>

Selects only those stations with a name that matches the specified pattern, which can be either a station id, or a comma-separated list of station id's, or a regex. If a regex, then it is anchored on the left and whitespace is NOT ignored.
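
The three forms of the location pattern can be sketched as a function that builds a matcher. This is an illustration only: the function name location_matcher, the assumed 11-character station id format, the case-sensitive matching and the station names are all assumptions, not the module's code.

```perl
use strict;
use warnings;

# Assumption: station ids are 11 characters, like CA006106000.
my $STNID_RE = qr/[A-Z]{2}[0-9A-Z]{9}/;

# Build a matcher that takes ($id, $name) and returns 1 or 0.
sub location_matcher {
    my ($pattern) = @_;

    # One station id, or a comma-separated list of station ids.
    if ( $pattern =~ /\A$STNID_RE(?:,$STNID_RE)*\z/ ) {
        my %wanted = map { $_ => 1 } split /,/, $pattern;
        return sub { my ($id) = @_; return exists $wanted{$id} ? 1 : 0 };
    }

    # Otherwise treat it as a regex, anchored on the left.
    # Whitespace in the pattern is significant.
    my $re = qr/\A(?:$pattern)/;
    return sub { my ( $id, $name ) = @_; return $name =~ $re ? 1 : 0 };
}

my $by_name = location_matcher('OTTAWA');
print $by_name->( 'CA006106000', 'OTTAWA INTL A' ) ? "match\n" : "no match\n";
print $by_name->( 'XX000000000', 'NORTH OTTAWA' )  ? "match\n" : "no match\n";
```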

option: gps <latitude,longitude>

This option selects stations within a certain radius of the designated latitude and longitude, expressed as positive and negative numbers (not using N, S, W, E designators).

option: radius <int>

In conjunction with the gps option, determines the radius in kilometers for the search area. Defaults to 25 km.
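
A radius search of this kind needs a distance between two latitude/longitude pairs. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; whether the module uses this formula is an assumption, and the function names and coordinates are for illustration only.

```perl
use strict;
use warnings;

use constant EARTH_RADIUS_KM => 6371;
use constant PI              => 4 * atan2( 1, 1 );

# Great-circle distance in km between two points given in signed
# decimal degrees (negative for south and west), by the haversine formula.
sub distance_km {
    my ( $lat1, $lon1, $lat2, $lon2 ) = @_;
    my ( $phi1, $phi2 ) = map { $_ * PI / 180 } $lat1, $lat2;
    my $dlat = ( $lat2 - $lat1 ) * PI / 180;
    my $dlon = ( $lon2 - $lon1 ) * PI / 180;
    my $h = sin( $dlat / 2 )**2
          + cos($phi1) * cos($phi2) * sin( $dlon / 2 )**2;
    $h = 1 if $h > 1;    # guard against rounding error
    return 2 * EARTH_RADIUS_KM * atan2( sqrt($h), sqrt( 1 - $h ) );
}

# Return 1 if the station lies within the radius (default 25 km), else 0.
sub within_radius {
    my ( $gps_lat, $gps_lon, $stn_lat, $stn_lon, $radius_km ) = @_;
    $radius_km //= 25;
    return distance_km( $gps_lat, $gps_lon, $stn_lat, $stn_lon )
        <= $radius_km ? 1 : 0;
}

# Approximate city-centre coordinates for Ottawa and Montreal.
printf "%.1f km\n", distance_km( 45.4215, -75.6972, 45.5017, -73.5673 );
```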

option: gsn

Select only GCOS Surface Network stations, which is a baseline network comprising a subset of about 1000 stations chosen mainly to give a fairly uniform spatial coverage from places where there is a good length and quality of data record. See https://www.ncdc.noaa.gov/gosic/global-climate-observing-system-gcos/gcos-surface-network-gsn-program-overview

($opt, @errors) = set_options ( %args )

Set various options for this StationTable instance. These options will affect the processing and output by subsequent method calls.

Returns an Options object and a list of errors. It is advised that you check @errors after calling set_options and cease processing if it is not empty; e.g. die @errors if @errors.

You may want to set up a file-scoped lexical variable to hold the options object. That way it is accessible throughout your code. The typical calling pattern would look like this:

my $Opt;  # a file-scope lexical

sub run {
    my $ghcn = GHCN::StationTable->new;

    my @errors;
    ($Opt, @errors) = $ghcn->set_options(...);
    die @errors if @errors;
    ...
}

user_options => \%user_options

This optional argument provides a reference to a hash that contains a set of options that will control the filtering, processing and output of the GHCN modules. This hash is typically created by the caller using Getopt::Long.

The options provided can be any subset of the supported options. Any option not provided will be added with an appropriate default value. The resulting combined option collection will be available in the instance both as a hash reference and, via methods, as a Hash::Wrap object reference.

If user_options is empty or undef, a list of all stations in the GHCN database will be generated, so it's best to provide at least some country or station id filtering. Providing options is also necessary in order to produce other output such as daily or monthly weather data (by specifying -report).

See USER OPTIONS for a list of the options available.

config_file => $config_filespec

This optional argument specifies a file which will be used to set the configuration options. The file must contain YAML specifications that describe the hash structure defined in section CONFIGURATION OPTIONS.

This option is an alternative to config_options. (If both options are specified, then config_options will take precedence.)

If config_filespec is an empty string, then the filespec will default to $HOME\ghcn_fetch.yaml (%UserProfile% on Windows).

If config_filespec is undef, then an empty configuration will be used; i.e. there will be no cache and no aliases.

config_options => \%config_options

This optional argument is a reference to a hash containing configuration options as described in section CONFIGURATION OPTIONS. Alternatively, config_file can be used to specify a file containing the configuration specification in YAML format.

stnid_filter => \%stnid_filter

This optional argument should be a reference to a hash whose keys are the specific station id's which are to be fetched and processed. When this is used, many filtering options via %opt will be overridden (e.g. -country).

timing_stats => $TimingStats_obj

This optional argument should point to a TimingStats object that was created by the caller and will be used to collect timing statistics.

hash_stats => \%hash_stats

This optional argument should be a reference to a hash that was created by the caller and will be used to collect performance and memory statistics.

return_list => <bool>

By default, get methods return a tab-separated string of results. If return_list is set to true, then these methods will return a list (or list of lists).

summarize_data ()

Aggregate the daily weather data for the stations that were loaded, according to the report option.

option: report <id|daily|monthly|yearly>

When the report option is 'id', no summarization is needed and the method immediately returns undef.

tstats ()

Provides access to the TimingStats object so the caller can start and stop script-level timers.

DOES

Defined by Object::Pad. Included for Pod::Coverage.

META

Defined by Object::Pad. Included for Pod::Coverage.

EXAMPLE PROGRAM

use v5.24;    # enables say; postfix dereferencing needs 5.24 or later
use GHCN::StationTable;

my $ghcn = GHCN::StationTable->new;

my ($opt, @errors) = $ghcn->set_options(
  user_options => {
      country     => 'US',
      state       => 'NY',
      location    => 'New York',
      active      => '2000-2022',
      report      => 'yearly',
      nonetwork   => -1,      # refresh cache if stale this year
  },
  config_options => {
      cache => {
          root => 'c:/ghcn_cache',
          namespace => 'ghcn',
      },
  },
);

die @errors if @errors;

$ghcn->load_stations;

my @rows;
if ($opt->report) {
    say $ghcn->get_header;

    # this also prints detailed station data if $opt->report eq 'id'
    $ghcn->load_data(
      # set a callback routine for printing progress messages
      progress_sub => sub { say {*STDERR} @_ },
      # set a callback routine for capturing data rows when report => 'id'
      row_sub      => sub { push @rows, $_[0] },
    );

    # these only do something when $opt->report ne 'id'
    $ghcn->summarize_data;
    say $ghcn->get_summary_data;

    say '';
    say $ghcn->get_footer;

    say '';
    say $ghcn->get_flag_statistics;
}

# print data rows collected by row_sub callback (when report => 'id')
foreach my $row_aref (@rows) {
    say join "\t", $row_aref->@*;
}

say '';
say $ghcn->get_stations( kept => 1 );

say '';
say 'Stations that failed to meet range or quality criteria:';
say $ghcn->get_stations( kept => 0, no_header => 1 );

if ( $ghcn->has_missing_data ) {
    warn "*W* some data was missing for the stations and date range processed\n";
    say '';
    say $ghcn->get_missing_data_ranges;
}

say $ghcn->get_options;

say $ghcn->get_timing_stats;

say $ghcn->get_hash_stats;

$ghcn->export_kml if $opt->kml;

CONFIGURATION OPTIONS

StationTable supports two kinds of options: user and configuration. The main difference between the two is that configuration options are more suited to persistence; i.e. you'll most likely put them in a file that is used at every execution of StationTable.

Cache

Cache options are used internally by StationTable when it calls URI::Fetch to get pages of data from the GHCN web repository.

root

This defines a path to a folder which will be used to cache web pages. See the nonetwork user option for ways to control caching.

namespace

This defines the subfolder of root within which the cache files will reside.

Aliases

Aliases are a convenience feature that allows you to define mnemonic shortcuts for specific stations. GHCN station id's (like CA006106000) are difficult to remember and type, as are GHCN station names. Frequently used station id's can be given easier alias names that can be used in the -location option for precise and reliable data retrieval.

The entries within the aliases hash are simply keyword/value pairs that represent the mnemonic alias name and the station id (or id's) that are to be retrieved when that alias is used in -location.
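
Resolving an alias is essentially a hash lookup followed by a split on commas. The sketch below uses the alias entries from the examples in this section; the function name resolve_location is hypothetical and the module's own handling may differ.

```perl
use strict;
use warnings;

# Alias entries as they might appear in the aliases hash.
my %aliases = (
    yow => 'CA006106000,CA006106001',    # Ottawa airport
    cda => 'CA006105976,CA006105978',    # Ottawa (CDA and CDA RCS)
);

# If the location is a known alias, return its station ids as a list.
# Otherwise return the location unchanged.
sub resolve_location {
    my ( $location, $aliases_href ) = @_;
    return split /,/, $aliases_href->{$location}
        if exists $aliases_href->{$location};
    return ($location);
}

print join( ' ', resolve_location( 'yow', \%aliases ) ), "\n";
```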

YAML Example

This is what the YAML content for a typical configuration file would look like:

---
cache:
    root: C:/ghcn_cache_new
    namespace: ghcn_new

aliases:
    yow: CA006106000,CA006106001    # Ottawa airport
    cda: CA006105976,CA006105978    # Ottawa (CDA and CDA RCS)

Hash Example

Here's what the typical config file would look like as a Perl hash structure:

config_options => {
    cache => {
        root        => 'C:/ghcn_cache_new',
        namespace   => 'ghcn_new',
    },
    aliases => {
        yow => 'CA006106000,CA006106001',    # Ottawa airport
        cda => 'CA006105976,CA006105978',    # Ottawa (CDA and CDA RCS)
    }
}

USER OPTIONS

See ghcn_fetch.pl -help for a list of all user options in Getopt::Long format. Simply translate to a hash key/value pair. For example, -report id becomes report => 'id'.

VERSIONING and COMPATIBILITY

The version number scheme used for this module consists of a 3-part dot-delimited string such as v0.22.365. This format was chosen for compatibility with Dist::Zilla version support, so that all modules in GHCN will get the same version number upon release. See also https://metacpan.org/pod/version.

The first part of the string is the major release number. With the exception of v0 releases, which should be considered experimental pre-production versions, the interface is intended to be upward compatible within a set of releases sharing the same major release number. If an incompatible change becomes necessary, the major release number will be incremented.

The other two parts are essentially the date of the release, in the format YY.DDD, where YY is the year of the century and DDD is the day number within the year.
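
As an illustration of the scheme, the sketch below splits a version string into its major release number and release date. It assumes the year belongs to the 21st century, and the function name parse_version is hypothetical.

```perl
use strict;
use warnings;

use POSIX       qw(strftime);
use Time::Local qw(timegm);

# Split a version string such as v0.22.365 into its major release number
# and its release date. Returns an empty list if the string does not match.
sub parse_version {
    my ($version) = @_;
    my ( $major, $yy, $ddd ) = $version =~ /\Av(\d+)\.(\d{1,2})\.(\d{1,3})\z/
        or return;
    my $year  = 2000 + $yy;    # assumption: 21st century
    my $epoch = timegm( 0, 0, 0, 1, 0, $year ) + ( $ddd - 1 ) * 86_400;
    return (
        major => $major,
        date  => strftime( '%Y-%m-%d', gmtime $epoch ),
    );
}

my %v = parse_version('v0.22.365');
print "major $v{major}, released $v{date}\n";
```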

AUTHOR

Gary Puckering (jgpuckering@rogers.com)

LICENSE AND COPYRIGHT

Copyright 2022, Gary Puckering