NAME

Parse::File::Taxonomy - Validate a file for use as a taxonomy

SYNOPSIS

use Parse::File::Taxonomy;

$source = "./t/data/alpha.csv";
$obj = Parse::File::Taxonomy->new( {
    file    => $source,
} );

$hashified_taxonomy = $obj->hashify_taxonomy();

DESCRIPTION

This module takes as input a plain-text file, verifies that it can be used as a taxonomy, then creates a Perl data structure representing that taxonomy.

This is an ALPHA release.

Taxonomy: definition

For the purpose of this module, a taxonomy is defined as a tree-like data structure with a root node, zero or more branch (child) nodes, and one or more leaf nodes. The root node and each branch node must have at least one child node, but leaf nodes have no child nodes. The number of branches between a leaf node and the root node is variable.

Diagram 1:

                           Root
                            |
              ----------------------------------------------------
              |                            |            |        |
           Branch                       Branch       Branch     Leaf
              |                            |            |
   -------------------------         ------------       |
   |                       |         |          |       |
Branch                  Branch     Leaf       Leaf   Branch
   |                       |                            |
   |                 ------------                       |
   |                 |          |                       |
 Leaf              Leaf       Leaf                    Leaf

Taxonomy File: definition

For the purpose of this module, a taxonomy file is a CSV file in which (a) one column holds data on the position of each record within the taxonomy and (b) each node in the tree other than the root node is uniquely represented by a record within the file.

CSV

"CSV", strictly speaking, refers to comma-separated values:

path,nationality,gender,age,income,id_no

For the purpose of this module, however, column separators in a taxonomy file may be any user-specified character handled by the Text-CSV library. Formats frequently observed are tab-separated values:

path	nationality	gender	age	income	id_no

and pipe-separated values:

path|nationality|gender|age|income|id_no

The documentation for Text-CSV comments that the CSV format could "... perhaps better [be] called ASV (anything separated values)", but we shall for convenience use "CSV" herein regardless of the specific delimiter.

Since it is often the case that the characters used as column separators may occur within the data recorded in the columns as well, it is customary to quote either all columns:

"path","nationality","gender","age","income","id_no"

or, at the very least, all columns which can hold data other than pure integers or floating-point numbers:

"path","nationality","gender",age,income,id_no
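
As a rough illustration of why quoting matters, the following sketch splits one quoted record in two ways. It uses the core Text::ParseWords module purely as a stand-in so that the example runs without Text-CSV installed; the record and its values are invented.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical record: the second field contains the column separator itself.
my $record = q{"|Alpha","Smith, Jones","",34,50000,1001};

# A naive split breaks the quoted field in two.
my @naive = split /,/, $record;

# parse_line() honors the double quotes; a false second argument
# strips the quotes from the returned fields.
my @fields = parse_line(',', 0, $record);

printf "naive split: %d fields\n", scalar @naive;
printf "quote-aware: %d fields\n", scalar @fields;
print  "second field: $fields[1]\n";
```

A taxonomy file whose data may contain the column separator should therefore be read with a quote-aware parser rather than a bare split.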

Tree structure

To qualify as a taxonomy file, it is not sufficient for a file to be in CSV format. In each non-header record in that file, one column must hold data capable of exactly specifying the record's position in the taxonomy, i.e., the route or path from the root node to the node being represented by that record. That data must itself be in delimiter-separated format. Each non-root node in the taxonomy must have exactly one record holding its path data. Within that path column the value corresponding to the root node need not be specified, i.e., may be represented by an empty string.

Let's rewrite Diagram 1 with values to make this clear.

Diagram 2:

                           ""
                            |
              ----------------------------------------------------
              |                            |            |        |
            Alpha                        Beta         Gamma    Delta
              |                            |            |
   -------------------------         ------------       |
   |                       |         |          |       |
Epsilon                  Zeta       Eta       Theta   Iota
   |                       |                            |
   |                 ------------                       |
   |                 |          |                       |
 Kappa            Lambda        Mu                      Nu

Let us suppose that our taxonomy file holds comma-separated, quoted records. Let us further suppose that the column holding taxonomy paths is, not surprisingly, called path and that the separator within the path column is a pipe (|) character. Finally, let us suppose that for now we are not concerned with the data in any column other than path, so that, for purposes of illustration, those columns will hold empty (albeit quoted) strings.

Then the taxonomy file describing the tree in Diagram 2 would look like this:

"path","nationality","gender","age","income","id_no"
"|Alpha","","","","",""
"|Alpha|Epsilon","","","","",""
"|Alpha|Epsilon|Kappa","","","","",""
"|Alpha|Zeta","","","","",""
"|Alpha|Zeta|Lambda","","","","",""
"|Alpha|Zeta|Mu","","","","",""
"|Beta","","","","",""
"|Beta|Eta","","","","",""
"|Beta|Theta","","","","",""
"|Gamma","","","","",""
"|Gamma|Iota","","","","",""
"|Gamma|Iota|Nu","","","","",""
"|Delta","","","","",""

Note that while in the |Gamma branch we ultimately have only one leaf node, |Gamma|Iota|Nu, we require separate records in the taxonomy file for |Gamma and |Gamma|Iota. To put this another way, the existence of a |Gamma|Iota|Nu leaf must not be taken to "auto-vivify" |Gamma and |Gamma|Iota nodes. Each non-root node must be explicitly represented in the taxonomy file for the file to be considered valid.
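
The requirement can be sketched as a check that the parent of every record's path has a record of its own. This is an illustration only, not the module's implementation; missing_parents() is a hypothetical name and the pipe separator is assumed.

```perl
use strict;
use warnings;

# Return the paths whose parent node has no record of its own.
# $sep is the separator used inside the path column (assumed: pipe).
sub missing_parents {
    my ($paths, $sep) = @_;
    my %seen = map { $_ => 1 } @$paths;
    my @orphans;
    for my $path (@$paths) {
        my @parts = split /\Q$sep\E/, $path, -1;
        pop @parts;                        # drop the node itself
        next if @parts <= 1;               # parent is the (empty) root
        my $parent = join $sep, @parts;
        push @orphans, $path unless $seen{$parent};
    }
    return @orphans;
}

my @orphans = missing_parents(
    [ '|Gamma', '|Gamma|Iota|Nu' ],        # no record for |Gamma|Iota
    '|',
);
print "no parent record for: $_\n" for @orphans;
```

Here |Gamma|Iota|Nu is reported because |Gamma|Iota was never recorded; adding that record makes the list of orphans empty.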

Note further that there is no restriction on the values of the components of the path across records. It is only the full path that must be unique. Let us illustrate that by modifying the data in Diagram 2:

Diagram 3:

                           ""
                            |
              ----------------------------------------------------
              |                            |            |        |
            Alpha                        Beta         Gamma    Delta
              |                            |            |
   -------------------------         ------------       |
   |                       |         |          |       |
Epsilon                  Zeta       Eta       Theta   Iota
   |                       |                            |
   |                 ------------                       |
   |                 |          |                       |
 Kappa            Lambda        Mu                    Delta

Here we have two leaf nodes each named Delta. However, we follow different paths from the root node to get to each of them. The taxonomy file representing this tree would look like this:

"path","nationality","gender","age","income","id_no"
"|Alpha","","","","",""
"|Alpha|Epsilon","","","","",""
"|Alpha|Epsilon|Kappa","","","","",""
"|Alpha|Zeta","","","","",""
"|Alpha|Zeta|Lambda","","","","",""
"|Alpha|Zeta|Mu","","","","",""
"|Beta","","","","",""
"|Beta|Eta","","","","",""
"|Beta|Theta","","","","",""
"|Gamma","","","","",""
"|Gamma|Iota","","","","",""
"|Gamma|Iota|Delta","","","","",""
"|Delta","","","","",""
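
The distinction between a repeated component and a repeated full path can be sketched as follows. duplicate_paths() is a hypothetical helper written for this illustration and is not part of the module.

```perl
use strict;
use warnings;

# Return every full path that occurs more than once, in sorted order.
sub duplicate_paths {
    my @paths = @_;
    my %count;
    $count{$_}++ for @paths;
    my @dups = grep { $count{$_} > 1 } keys %count;
    return sort { $a cmp $b } @dups;
}

# "Delta" is the last component of two different paths: acceptable.
my @ok  = duplicate_paths('|Delta', '|Gamma|Iota|Delta');

# The same full path recorded twice: not acceptable.
my @bad = duplicate_paths('|Delta', '|Gamma', '|Delta');

print "duplicates in first list:  ", scalar(@ok), "\n";
print "duplicates in second list: @bad\n";
```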

Taxonomy Validation

The Parse::File::Taxonomy constructor, new(), will probe a taxonomy file provided to it as an argument to determine whether it can be considered a valid taxonomy according to the description provided above.

TODO: Parse::File::Taxonomy->new() should also be able to accept a reference to an array of CSV records already held in memory.

TODO: What would it mean for Parse::File::Taxonomy->new() to accept a filehandle as an argument, rather than a file? Would that be difficult to implement?

TODO: The user of this library, however, must be permitted to write additional user-specified validation rules which will be applied to a taxonomy by means of a local_validate() method called on a Parse::File::Taxonomy object. Should the file fail to meet those rules, the user may choose not to proceed further even though the taxonomy meets the basic validation criteria implemented in the constructor. This method will take a reference to an array of subroutine references as its argument. Each such code reference will be a user-defined rule which the taxonomy must obey. The method will apply each code reference to the taxonomy in sequence and will return a true value if and only if all the individual criteria return true as well.
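
Since local_validate() is not yet implemented, the following is only a sketch of the behavior proposed above. apply_rules() is a hypothetical stand-alone function, and both the rules and the shape of the records (path, id_no) are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: apply each rule to the
# data records and succeed only if every rule returns a true value.
sub apply_rules {
    my ($records, $rules) = @_;
    for my $rule (@$rules) {
        return 0 unless $rule->($records);
    }
    return 1;
}

# Two invented rules over records shaped like [ path, id_no ].
my @rules = (
    # every path starts at the root
    sub { my $recs = shift; !grep { $_->[0] !~ /^\|/ } @$recs },
    # every id_no is a string of digits
    sub { my $recs = shift; !grep { $_->[1] !~ /^\d+$/ } @$recs },
);

my $records = [ [ '|Alpha', 1 ], [ '|Alpha|Epsilon', 2 ] ];
print apply_rules($records, \@rules) ? "all rules passed\n" : "a rule failed\n";
```

The rules are applied in sequence and evaluation stops at the first failure, which is one way of meeting the "if and only if all return true" requirement.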

METHODS

new()

  • Purpose

    Parse::File::Taxonomy constructor.

  • Arguments

    $source = "./t/data/alpha.csv";
    $obj = Parse::File::Taxonomy->new( {
        file    => $source,
    } );

    Single hash reference. Elements in that hash are keyed on:

    • file

      Absolute or relative path to the incoming taxonomy file. Currently required (but this may change if we implement the ability to use a list of CSV strings instead of a file).

    • path_col_idx

      If the column to be used as the "path" column in the incoming taxonomy file is not the first column, this option must be set to the integer representing the "path" column's index position (count starts at 0). Optional; defaults to 0.

    • path_col_sep

      If the string used to distinguish components of the path in the path column in the incoming taxonomy file is not a pipe (|), this option must be set. Optional; defaults to |.

    • Text::CSV options

      Any other options which could normally be passed to Text::CSV->new() will be passed through to that module's constructor. On the recommendation of the Text::CSV documentation, binary is always set to a true value.

  • Return Value

    Parse::File::Taxonomy object.

  • Comment

    new() will throw an exception under any of the following conditions:

    • Argument to new() is not a reference.

    • Argument to new() is not a hash reference.

    • Unable to locate the file which is the value of the file element.

    • Argument to path_col_idx element is not an integer.

    • Argument to path_col_idx is greater than the index number of the last element in the header row of the incoming taxonomy file, i.e., the path_col_idx is wrong.

    • The same field is found more than once in the header row of the incoming taxonomy file.

    • Unable to open or close the incoming taxonomy file for reading.

    • In the column designated as the "path" column, the same value is observed more than once.

    • A non-root node's parent node cannot be located in the incoming taxonomy file.

    • A data row has a number of fields different from the number of fields in the header row.
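
Three of the header-related conditions above can be sketched in plain Perl as follows. check_header() is a hypothetical function written for illustration; the module's own checks and error messages may differ, and the sketch additionally assumes that path_col_idx is non-negative.

```perl
use strict;
use warnings;

# Sketch of three of the checks listed above; not the module's own code.
# Dies with a message on the first problem found.
sub check_header {
    my ($fields, $path_col_idx) = @_;
    # Assumption: only non-negative integers are accepted here.
    die "path_col_idx must be an integer\n"
        unless defined $path_col_idx && $path_col_idx =~ /^\d+$/;
    die "path_col_idx $path_col_idx exceeds last index $#{$fields}\n"
        if $path_col_idx > $#{$fields};
    my %seen;
    for my $field (@$fields) {
        die "field '$field' appears more than once in header\n"
            if $seen{$field}++;
    }
    return 1;
}

my @header = qw(path nationality gender age income id_no);
for my $idx (0, 6) {
    my $ok = eval { check_header(\@header, $idx) };
    print "path_col_idx $idx: ", ($ok ? "ok\n" : "rejected: $@");
}
```

Because new() signals these conditions by throwing exceptions, calling code would normally wrap the constructor in eval in the same way.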

fields()

  • Purpose

    Identify the names of the columns in the taxonomy.

  • Arguments

    my $fields = $self->fields();

    No arguments; the information is already inside the object.

  • Return Value

    Reference to an array holding a list of the columns as they appear in the header row of the incoming taxonomy file.

  • Comment

    Read-only.

path_col_idx()

  • Purpose

    Identify the index position (count starts at 0) of the column in the incoming taxonomy file which serves as the path column.

  • Arguments

    my $path_col_idx = $self->path_col_idx;

    No arguments; the information is already inside the object.

  • Return Value

    Integer in the range from 0 to 1 less than the number of columns in the header row.

  • Comment

    Read-only.

path_col()

  • Purpose

    Identify the name of the column in the incoming taxonomy which serves as the path column.

  • Arguments

    my $path_col = $self->path_col;

    No arguments; the information is already inside the object.

  • Return Value

    String.

  • Comment

    Read-only.

path_col_sep()

  • Purpose

    Identify the string used to separate path components once the taxonomy has been created. This is just a "getter" and is logically distinct from the option to new() which is, in effect, a "setter."

  • Arguments

    my $path_col_sep = $self->path_col_sep;

    No arguments; the information is already inside the object.

  • Return Value

    String.

  • Comment

    Read-only.

data_records()

  • Purpose

    Once the taxonomy has been validated, get a list of its data rows as a Perl data structure.

  • Arguments

    $data_records = $self->data_records;

    None.

  • Return Value

    Reference to array of array references. The array will hold the data records found in the incoming taxonomy file in their order in that file.

  • Comment

    Does not contain any information about the fields in the taxonomy, so you should probably either (a) use in conjunction with fields() method above; or (b) use fields_and_data_records().

fields_and_data_records()

  • Purpose

    Once the taxonomy has been validated, get a list of its header and data rows as a Perl data structure.

  • Arguments

    $data_records = $self->fields_and_data_records;

    None.

  • Return Value

    Reference to array of array references. The first element in the array will hold the header row (same as output of fields()). The remaining elements will hold the data records found in the incoming taxonomy file in their order in that file.
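
One use for this structure is to pair each data row with the header. The sketch below uses an invented stand-in for the return value so that it runs without the module.

```perl
use strict;
use warnings;

# Stand-in for the return value described above: header row first,
# then data rows. The values are invented for illustration.
my $fields_and_records = [
    [ 'path',           'nationality', 'id_no' ],
    [ '|Alpha',         'FR',          1       ],
    [ '|Alpha|Epsilon', 'DE',          2       ],
];

my ($fields, @records) = @$fields_and_records;

# Pair each data row with the header via a hash slice.
my @rows = map { my %row; @row{@$fields} = @$_; \%row } @records;

print "$_->{path} => $_->{nationality}\n" for @rows;
```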

data_records_path_components()

  • Purpose

    Once the taxonomy has been validated, get a list of its data rows as a Perl data structure. In each element of this list, the path is now represented as an array reference rather than a string.

  • Arguments

    $data_records_path_components = $self->data_records_path_components;

    None.

  • Return Value

    Reference to array of array references. The array will hold the data records found in the incoming taxonomy file in their order in that file.

  • Comment

    Does not contain any information about the fields in the taxonomy, so you should probably either (a) use in conjunction with fields() method above; or (b) use fields_and_data_records_path_components().
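
The relationship between the two representations can be sketched as a split on the path separator. The data rows are invented, and whether the module keeps the leading empty component for the root is an assumption of this sketch.

```perl
use strict;
use warnings;

my $path_col_idx = 0;       # assumed position of the path column
my $path_col_sep = '|';     # assumed separator within the path column

# Invented data rows in the shape returned by data_records().
my @data_records = (
    [ '|Alpha',               'FR' ],
    [ '|Alpha|Epsilon|Kappa', 'DE' ],
);

# Copy each row, replacing the path string with a reference to an
# array of its components. The limit of -1 keeps empty components.
my @with_components = map {
    my @row = @$_;
    $row[$path_col_idx] = [ split /\Q$path_col_sep\E/, $row[$path_col_idx], -1 ];
    \@row;
} @data_records;

print join(', ', map { qq{"$_"} } @{ $_->[$path_col_idx] }), "\n"
    for @with_components;
```

In this sketch the first component of every path is the empty string, which stands for the unnamed root node.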

fields_and_data_records_path_components()

  • Purpose

    Once the taxonomy has been validated, get a list of its data rows as a Perl data structure. The first element in this list is an array reference holding the header row. In each data element of this list, the path is now represented as an array reference rather than a string.

  • Arguments

    $fields_and_data_records_path_components = $self->fields_and_data_records_path_components;

    None.

  • Return Value

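    Reference to array of array references. The first element in the array will hold the header row. The remaining elements will hold the data records found in the incoming taxonomy file in their order in that file, with the path in each represented as an array reference.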

child_counts()

  • Purpose

    Display the number of descendant (multi-generational) nodes each node in the taxonomy has.

  • Arguments

    $child_counts = $self->child_counts();

    None.

  • Return Value

    Reference to hash in which each element is keyed on the value of the path column in the incoming taxonomy file. The value of each element is the number of descendant nodes of that node.

get_child_count()

  • Purpose

    Get the total number of descendant nodes for one specific node in a validated taxonomy.

  • Arguments

    $child_count = $self->get_child_count('|Path|To|Node');

    String containing node's path as spelled in the taxonomy.

  • Return Value

    Unsigned integer >= 0. Any node whose child count is 0 is by definition a leaf node.

  • Comment

    Will throw an exception if the node does not exist or is misspelled.
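
The counts returned by child_counts() and get_child_count() can be understood through the following sketch, which derives them from the paths alone. count_descendants() is a hypothetical function, not the module's implementation, and it assumes validated, pipe-separated paths.

```perl
use strict;
use warnings;

# Count, for every path, how many other paths lie beneath it.
# Assumes the paths are already validated.
sub count_descendants {
    my ($paths, $sep) = @_;
    my %counts = map { $_ => 0 } @$paths;
    for my $path (@$paths) {
        my @parts = split /\Q$sep\E/, $path, -1;
        pop @parts;                              # start with the parent
        while (@parts > 1) {                     # stop at the empty root
            my $ancestor = join $sep, @parts;
            $counts{$ancestor}++ if exists $counts{$ancestor};
            pop @parts;
        }
    }
    return \%counts;
}

my $counts = count_descendants(
    [ '|Gamma', '|Gamma|Iota', '|Gamma|Iota|Nu', '|Delta' ],
    '|',
);
printf "%-16s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Each node credits every one of its ancestors, so |Gamma is counted as having two descendants while the leaves |Gamma|Iota|Nu and |Delta have none.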

hashify_taxonomy()

  • Purpose

    Turn a validated taxonomy into a Perl hash keyed on the column designated as the path column.

  • Arguments

    $hashref = $self->hashify_taxonomy();

    Takes an optional hashref holding a list of any of the following elements:

    • remove_leading_path_col_sep

      Boolean, defaulting to 0. By default, hashify_taxonomy() will spell the key of the hash exactly as the value of the path column is spelled in the taxonomy -- which in turn is the way it was spelled in the incoming file. That is, a path in the taxonomy spelled |Alpha|Beta|Gamma will be spelled as a key in exactly the same way.

      However, since in many cases (including the example above) the root node of the taxonomy will be empty, the user may wish to remove the first instance of path_col_sep. The user would do so by setting remove_leading_path_col_sep to a true value.

      $hashref = $self->hashify_taxonomy( {
          remove_leading_path_col_sep => 1,
      } );

      In that case the key would now be spelled: Alpha|Beta|Gamma.

      Note further that if the root_str switch is set to a true value, any setting to remove_leading_path_col_sep will be ignored.

    • key_delim

      A string which will be used in composing the key of the hashref returned by this method. The user may set this option if she does not want to use the separator found in the incoming CSV file (which by default is the pipe character (|) and which may be overridden with the path_col_sep argument to new()).

      $hashref = $self->hashify_taxonomy( {
          key_delim   => q{ - },
      } );

      In the above variant, a path that in the incoming taxonomy file was represented by |Alpha|Beta|Gamma will in $hashref be represented by - Alpha - Beta - Gamma.

    • root_str

      A string which will be used in composing the key of the hashref returned by this method. The user will set this switch if she wishes to have the root node explicitly represented. Using this switch will automatically cause remove_leading_path_col_sep to be ignored.

      Suppose the user wished to have All Suppliers be the text for the root node. Suppose further that the user wanted to use the string - as the delimiter within the key.

      $hashref = $self->hashify_taxonomy( {
          root_str    => q{All Suppliers},
          key_delim   => q{ - },
      } );

      Then incoming path |Alpha|Beta|Gamma would be keyed as:

      All Suppliers - Alpha - Beta - Gamma

  • Return Value

    Hash reference. The number of elements in this hash should be equal to the number of non-header records in the taxonomy.
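
The interaction of the three options can be sketched as follows. compose_key() is a hypothetical helper that reproduces the examples given above for a single path; it is not the module's implementation, and it assumes that the first path component is the empty root.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash key from a path as stored in the
# taxonomy, following the options described above.
sub compose_key {
    my ($path, $opts) = @_;
    my $sep   = defined $opts->{path_col_sep} ? $opts->{path_col_sep} : '|';
    my $delim = defined $opts->{key_delim}    ? $opts->{key_delim}    : $sep;
    my @parts = split /\Q$sep\E/, $path, -1;

    if (defined $opts->{root_str} && length $opts->{root_str}) {
        # Assumes the first component is the empty root, as in |Alpha|Beta.
        # root_str takes precedence over remove_leading_path_col_sep.
        $parts[0] = $opts->{root_str};
    }
    elsif ($opts->{remove_leading_path_col_sep} && @parts && $parts[0] eq '') {
        shift @parts;
    }
    return join $delim, @parts;
}

print compose_key('|Alpha|Beta|Gamma', {}), "\n";
print compose_key('|Alpha|Beta|Gamma', { remove_leading_path_col_sep => 1 }), "\n";
print compose_key('|Alpha|Beta|Gamma',
    { root_str => 'All Suppliers', key_delim => ' - ' }), "\n";
```

The three calls print |Alpha|Beta|Gamma, Alpha|Beta|Gamma and All Suppliers - Alpha - Beta - Gamma respectively.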