NAME
Parse::Taxonomy - Validate hierarchical data stored in CSV format
VERSION
This document refers to version 0.16 of Parse::Taxonomy. This version was released October 18 2015.
SYNOPSIS
use Parse::Taxonomy;
DESCRIPTION
This module is the base class for the Parse-Taxonomy extension to the Perl 5 programming language. You will not instantiate objects of this class; rather, you will instantiate objects of subclasses, of which Parse::Taxonomy::MaterializedPath and Parse::Taxonomy::AdjacentList are the first.
This is an BETA release. The documented interfaces are expected to remain stable but are not guaranteed to remain so.
Taxonomy: definition
For the purpose of this library, a taxonomy is defined as a tree-like data structure with a root node, zero or more branch (child) nodes, and one or more leaf nodes. The root node and each branch node must have at least one child node, but leaf nodes have no child nodes. The number of branches between a leaf node and the root node is variable.
Diagram 1:
Root
|
----------------------------------------------------
| | | |
Branch Branch Branch Leaf
| | |
------------------------- ------------ |
| | | | |
Branch Branch Leaf Leaf Branch
| | |
| ------------ |
| | | |
Leaf Leaf Leaf Leaf
Taxonomy File: definition
For the purpose of this module, a taxonomy file is a CSV file in which (a) certain columns hold data from which the position of each record within the taxonomy can be derived; and (b) each node in the tree (with the possible exception of the root node) is uniquely represented by a record within the file.
CSV
"CSV", strictly speaking, refers to comma-separated values:
path,nationality,gender,age,income,id_no
For the purpose of this module, however, the column separators in a taxonomy file may be any user-specified character handled by the Text-CSV_XS library on CPAN. Formats frequently observed are tab-separated values:
path nationality gender age income id_no
and pipe-separated values:
path|nationality|gender|age|income|id_no
The documentation for Text-CSV comments that the CSV format could "... perhaps better [be] called ASV (anything separated values)", but we shall for convenience use "CSV" herein regardless of the specific delimiter.
Since it is often the case that the characters used as column separators may occur within the data recorded in the columns as well, it is customary to quote either all columns:
"path","nationality","gender","age","income","id_no"
... or, at the very least, all columns which can hold data other than pure integers or floating-point numbers:
"path","nationality","gender",age,income,id_no
Tree structure
To qualify as a taxonomy file, it is not sufficient for a file to be in CSV format. In each non-header record in that file, there must be one or more columns which hold data capable of exactly specifying the record's position in the taxonomy, i.e., the route from the root node to the node being represented by that record.
The precise way in which certain columns are used to determine the path from the root node to a given node is what differentiates various types of taxonomy files from one another. In Parse-Taxonomy we identify two different flavors of taxonomy files and provide a class for the construction of each.
Taxonomy-by-materialized-path
A taxonomy-by-materialized-path is one in which a single column -- which we will refer to as the path column -- serves as a materialized path. A materialized path represents the route from the root to the given record as a series of strings joined by separator characters. Within that path column the value corresponding to the root node need not be specified, i.e., may be represented by an empty string.
Let's rewrite Diagram 1 with values to make this clear.
Diagram 2:
""
|
----------------------------------------------------
| | | |
Alpha Beta Gamma Delta
| | |
------------------------- ------------ |
| | | | |
Epsilon Zeta Eta Theta Iota
| | |
| ------------ |
| | | |
Kappa Lambda Mu Nu
Let us suppose that our taxonomy file held comma-separated, quoted records. Let us further supposed that the column holding taxonomy paths was, not surprisingly, called path
and that the separator within the path
column was a pipe (|
) character. Let us further suppose that for now we are not concerned with the data in any columns other than path
so that, for purpose of illustration, they will hold empty (albeit quoted) strings.
Then the taxonomy file describing the tree in Diagram 2 would look like this:
"path","nationality","gender","age","income","id_no"
"|Alpha","","","","",""
"|Alpha|Epsilon","","","","",""
"|Alpha|Epsilon|Kappa","","","","",""
"|Alpha|Zeta","","","","",""
"|Alpha|Zeta|Lambda","","","","",""
"|Alpha|Zeta|Mu","","","","",""
"|Beta","","","","",""
"|Beta|Eta","","","","",""
"|Beta|Theta","","","","",""
"|Gamma","","","","",""
"|Gamma|Iota","","","","",""
"|Gamma|Iota|Nu","","","","",""
"|Delta","","","","",""
Note that while in the |Gamma
branch we ultimately have only one leaf node, |Gamma|Iota|Nu
, we require separate records in the taxonomy file for |Gamma
and |Gamma|Iota
. To put this another way, the existence of a Gamma|Iota|Nu
leaf must not be assumed to "auto-vivify" |Gamma
and |Gamma|Iota
nodes. Each non-root node must be explicitly represented in the taxonomy file for the file to be considered valid.
Note further that there is no restriction on the values of the components of the path
across records. It only the full path that must be unique. Let us illustrate that by modifying the data in Diagram 2:
Diagram 3:
""
|
----------------------------------------------------
| | | |
Alpha Beta Gamma Delta
| | |
------------------------- ------------ |
| | | | |
Epsilon Zeta Eta Theta Iota
| | |
| ------------ |
| | | |
Kappa Lambda Mu Delta
Here we have two leaf nodes each named Delta
. However, we follow different paths from the root node to get to each of them. The taxonomy file representing this tree would look like this:
"path","nationality","gender","age","income","id_no"
"|Alpha","","","","",""
"|Alpha|Epsilon","","","","",""
"|Alpha|Epsilon|Kappa","","","","",""
"|Alpha|Zeta","","","","",""
"|Alpha|Zeta|Lambda","","","","",""
"|Alpha|Zeta|Mu","","","","",""
"|Beta","","","","",""
"|Beta|Eta","","","","",""
"|Beta|Theta","","","","",""
"|Gamma","","","","",""
"|Gamma|Iota","","","","",""
"|Gamma|Iota|Delta","","","","",""
"|Delta","","","","",""
Taxonomy-by-adjacent-list
A taxonomy-by-adjacent-list is one in which each record has a column with a unique identifier (id) and another column holding the unique identifier of the record representing the next higher node in the hierarchy (parent_id). The record must also have a column which holds a datum that is unique among all records having the same parent node.
Let's make this clearer by rewriting the taxonomy-by-materialized-path above for Example 3 as a taxonomy-by-adjacent-list.
"id","parent_id","name","nationality","gender","age","income","id_no"
1,,"Alpha","","","","",""
2,1,"Epsilon","","","","",""
3,2,"Kappa","","","","",""
4,1,"Zeta","","","","",""
5,4,"Lambda","","","","",""
6,4,"Mu","","","","",""
7,,"Beta","","","","",""
8,7,"Eta","","","","",""
9,7,"Theta","","","","",""
10,,"Gamma","","","","",""
11,10,"Iota","","","","",""
12,11,"Delta","","","","",""
13,,"Delta","","","","",""
In the above taxonomy-by-adjacent-list, the records with id
s 1
, 7
, 10
, and 13
are top-level nodes. They have no parents, so the value of their parent_id
column is null or, in Perl terms, an empty string. The records with id
s 2
and 4
are children of the record with id
of 1
. The record with id 3
is, in turn, a child of the record with id 2
.
In the above taxonomy-by-adjacent-list, close inspection will show that no two records with the same parent_id
share the same name
. The property of uniqueness of sibling names means that we can construct a non-indexed version of the path from the root to a given node by using the parent_id
column in a given record to look up the name
of the record with the id
value identical to the child's parent_id
.
Via index: 3 2 1
Via name: Kappa Epsilon Alpha
We go from id 3
to its parent_id
, <2>, then to 2
's parent_id
, <1>. Putting names to this, we go from Kappa
to Epsilon
to Alpha
.
Now, reverse the order of those name
s, throw a pipe delimiter before each of them and join them into a single string, and you get:
|Alpha|Epsilon|Kappa
... which is the value of the path
column in the third record in the taxonomy-by-materialized-path displayed previously.
With correct data, a given hierarchy of data can therefore be represented either by a taxonomy-by-materialized-path or by a taxonomy-by-adjacent-list. This permits us to describe these two taxonomies as equivalent to each other.
Taxonomy Validation
Each Parse::Taxonomy
subclass will have a constructor, new()
, whose principal interface will take the name of a taxonomy file as an argument. We will call this interface the file interface to the constructor. The purpose of the constructor will be to determine whether the taxonomy file holds a valid taxonomy according to the description provided above. The arguments needed for such a constructor will be found in the documentation of the subclass.
The constructor of a Parse::Taxonomy
subclass may, if desired, accept a different set of arguments. Suppose you have already read a CSV file and parsed it into one array reference holding its header row -- a list of its columns -- and a second array reference, this one being an array of arrays where each element holds the data in one record in the CSV file. You have the same components needed to validate the taxonomy that you would get by parsing the CSV file, so your subclass may implement a components interface as well as a file interface.
You should now proceed to read the documentation for Parse::Taxonomy::MaterializedPath and Parse::Taxonomy::AdjacentList.
BUGS
There are no bug reports outstanding on Parse::Taxonomy as of the most recent CPAN upload date of this distribution.
SUPPORT
Please report any bugs by mail to bug-Parse-Taxonomy@rt.cpan.org
or through the web interface at http://rt.cpan.org.
AUTHOR
James E. Keenan (jkeenan@cpan.org). When sending correspondence, please include 'Parse::Taxonomy' or 'Parse-Taxonomy' in your subject line.
Creation date: May 24 2015. Last modification date: October 18 2015.
Development repository: https://github.com/jkeenan/parse-taxonomy
REFERENCES
DBIx-Class-MaterializedPath by Arthur Axel "fREW" Schmidt
DBIx-Tree-MaterializedPath by Larry Leszczynski
Trees in SQL: Nested Sets and Materialized Path by Vadim Tropashko
DBIx-Tree, now maintained by Ron Savage.
COPYRIGHT
Copyright (c) 2002-15 James E. Keenan. United States. All rights reserved. This is free software and may be distributed under the same terms as Perl itself.
DISCLAIMER OF WARRANTY
BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE ''AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.