NAME
Parse::File::Taxonomy::Index - Extract a taxonomy from a hierarchy inside a CSV file
SYNOPSIS
    use Parse::File::Taxonomy::Index;
    $source = "./t/data/alpha.csv";
    $obj = Parse::File::Taxonomy::Index->new( {
        file => $source,
    } );
METHODS
new()
Purpose
Parse::File::Taxonomy::Index constructor.
Arguments
Single hash reference. There are two possible interfaces: file and components.
1. file interface
    $source = "./t/data/delta.csv";
    $obj = Parse::File::Taxonomy::Index->new( {
        file => $source,
    } );
Elements in the hash reference are keyed on:
file
Absolute or relative path to the incoming taxonomy file. Required for this interface.
id_col
The name of the column in the header row under which each data record's unique ID can be found. Defaults to id.
parent_id_col
The name of the column in the header row under which each data record's parent ID can be found. (Will be empty in the case of top-level nodes, as they have no parent.) Defaults to parent_id.
component_col
The name of the column in the header row under which, in each data record, there is found a string which differentiates that record from all other records with the same parent ID. Defaults to name.
Text::CSV options
Any other options which could normally be passed to Text::CSV->new() will be passed through to that module's constructor. On the recommendation of the Text::CSV documentation, binary is always set to a true value.
2. components interface
    $obj = Parse::File::Taxonomy::Index->new( {
        components => {
            fields       => $fields,
            data_records => $data_records,
        }
    } );
Elements in this hash are keyed on:
components
This element is required for the components interface. The value of this element is a hash reference with two keys, fields and data_records. fields is a reference to an array holding the field or column names for the data set. data_records is a reference to an array of array references, each of the latter arrayrefs holding one record or row from the data set.
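As a minimal sketch of the shape the components interface expects, the following builds fields and data_records from a few in-memory lines. The sample rows and the naive comma split are assumptions made for illustration; quoted or otherwise non-trivial CSV should go through a real parser.

```perl
use strict;
use warnings;

# Hypothetical in-memory data; column names follow the documented defaults.
my @lines = (
    'id,parent_id,name,is_actionable',
    '1,,Alpha,0',
    '2,,Beta,0',
    '3,1,Epsilon,0',
);

# Naive split on commas: adequate only for unquoted sample data like this.
# The -1 limit keeps trailing empty fields.
my ($header, @rows) = @lines;
my $fields       = [ split /,/, $header, -1 ];
my $data_records = [ map { [ split /,/, $_, -1 ] } @rows ];

# The shape passed under the 'components' key.
my $args = {
    components => {
        fields       => $fields,
        data_records => $data_records,
    },
};
# With the module installed, one would then call:
#   my $obj = Parse::File::Taxonomy::Index->new($args);

printf "%d fields, %d records\n", scalar @$fields, scalar @$data_records;
```

Top-level records carry an empty string in the parent_id position, which is why the split keeps empty fields.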
Return Value
Parse::File::Taxonomy::Index object.
Exceptions
new() will throw an exception under any of the following conditions:
- Argument to new() is not a reference.
- Argument to new() is not a hash reference.
- Argument to new() must have either 'file' or 'components' element but not both.
- Lack columns in header row to match requirements.
- Non-numeric entry in id or parent_id column.
- Duplicate entries in id column.
- Number of fields in a data record does not match number in header row.
- Empty string in a component column of a record.
- Unable to locate a record whose id is the parent_id of a different record.
- No records with same parent_id may share value of component column.
file interface
- In the file interface, unable to locate the file which is the value of the file element.
- The same field is found more than once in the header row of the incoming taxonomy file.
- Unable to open or close the incoming taxonomy file for reading.
components interface
- In the components interface, components element must be a hash reference with fields and data_records elements.
- fields element must be array reference.
- data_records element must be reference to array of array references.
- No duplicate fields in fields element's array reference.
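Several of the record-level conditions above can be pictured with a small standalone check. This is only a sketch: check_records is a hypothetical helper, not the module's code, it assumes records laid out as [ id, parent_id, name ], and it collects problems in a list where the constructor would throw an exception.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: returns a list of problems
# found in records laid out as [ id, parent_id, name, ... ].
sub check_records {
    my ($records) = @_;
    my (@problems, %seen_id, %seen_sibling);

    for my $rec (@$records) {
        my ($id, $parent_id, $name) = @$rec;
        push @problems, "non-numeric id '$id'"
            unless $id =~ /^\d+$/;
        push @problems, "non-numeric parent_id '$parent_id'"
            unless $parent_id eq '' || $parent_id =~ /^\d+$/;
        push @problems, "duplicate id '$id'"
            if $seen_id{$id}++;
        push @problems, "empty component for id '$id'"
            if $name eq '';
        push @problems, "siblings under '$parent_id' share name '$name'"
            if $seen_sibling{$parent_id}{$name}++;
    }
    # Every non-empty parent_id must match the id of some record.
    for my $rec (@$records) {
        my $parent_id = $rec->[1];
        push @problems, "no record with id '$parent_id'"
            if $parent_id ne '' && !exists $seen_id{$parent_id};
    }
    return @problems;
}

my @good = ( [1, '', 'Alpha'], [2, 1, 'Epsilon'] );
my @bad  = ( [1, '', 'Alpha'], [1, 9, 'Alpha'] );
print "good: ", scalar(check_records(\@good)), " problem(s)\n";
print "$_\n" for check_records(\@bad);
```

The second pass is needed because a parent may appear later in the data than its child, so parent lookups only make sense once every id has been seen.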
fields()
Purpose
Identify the names of the columns in the taxonomy.
Arguments
my $fields = $self->fields();
No arguments; the information is already inside the object.
Return Value
Reference to an array holding a list of the columns as they appear in the header row of the incoming taxonomy file.
Comment
Read-only.
# Implemented in lib/Parse/File/Taxonomy.pm
data_records()
Purpose
Once the taxonomy has been validated, get a list of its data rows as a Perl data structure.
Arguments
$data_records = $self->data_records;
None.
Return Value
Reference to array of array references. The array will hold the data records found in the incoming taxonomy file in their order in that file.
Comment
Does not contain any information about the fields in the taxonomy, so you should probably either (a) use in conjunction with the fields() method above; or (b) use fields_and_data_records().
# Implemented in lib/Parse/File/Taxonomy.pm
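Because the records carry no column names, a common pattern is to pair each record with the field list. The sketch below uses literal stand-ins for the values that fields() and data_records() would return; the sample values are assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-ins for the values returned by fields() and data_records().
my $fields       = [ qw(id parent_id name is_actionable) ];
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 3, 1,  'Epsilon', 0 ],
];

# Pair each record with the field names via a hash slice.
my @rows;
for my $rec (@$data_records) {
    my %row;
    @row{@$fields} = @$rec;
    push @rows, \%row;
}

print "$_->{id}: $_->{name}\n" for @rows;
```

Each element of @rows is then a hash reference keyed on column name, which is often easier to work with than positional access.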
get_field_position()
Purpose
Identify the index position of a given field within the header row.
Arguments
$index = $obj->get_field_position('income');
Takes a single string holding the name of one of the fields (column names).
Return Value
Integer representing the index position (counting from 0) of the field provided as argument. Throws exception if the argument is not actually a field.
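The lookup described here can be sketched in plain Perl. field_position below is a hypothetical stand-in written for illustration, not the module's implementation; the field names and the error message are likewise assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_field_position(), not the module's code.
sub field_position {
    my ($fields, $name) = @_;
    my %position_of;
    @position_of{@$fields} = (0 .. $#$fields);
    die "'$name' is not a field in this taxonomy\n"
        unless exists $position_of{$name};
    return $position_of{$name};
}

my $fields = [ qw(id parent_id name income) ];
print field_position($fields, 'income'), "\n";

# An unknown field name raises an exception, which eval traps.
my $ok = eval { field_position($fields, 'salary'); 1 };
print "caught: $@" unless $ok;
```

Callers of the real method can trap its exception with eval in the same way.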
Accessors
The following methods provide information about key columns in a Parse::File::Taxonomy::Index object. The key columns are those which hold the ID, parent ID and component information. They take no arguments. The methods whose names end in _idx return integers, as they return the index position of the column in the header row. The other methods return strings.
$index_of_id_column = $self->id_col_idx;
$name_of_id_column = $self->id_col;
$index_of_parent_id_column = $self->parent_id_col_idx;
$name_of_parent_id_column = $self->parent_id_col;
$index_of_component_column = $self->component_col_idx;
$name_of_component_column = $self->component_col;
pathify()
Purpose
Generate a new Perl data structure which holds the same information as a Parse::File::Taxonomy::Index object but which expresses the route from the root node to a given branch or leaf node as either a separator-delimited string (as in the path column of a Parse::File::Taxonomy::Path object) or as an array reference holding the list of names which delineate that route.
Another way of expressing this: Transform a taxonomy-by-index to a taxonomy-by-path.
Example: Suppose we have a CSV file which serves as a taxonomy-by-index for this data:
    "id","parent_id","name","is_actionable"
    "1","","Alpha","0"
    "2","","Beta","0"
    "3","1","Epsilon","0"
    "4","3","Kappa","1"
    "5","1","Zeta","0"
    "6","5","Lambda","1"
    "7","5","Mu","0"
    "8","2","Eta","1"
    "9","2","Theta","1"
Instead of having the route from the root node to a given node be represented implicitly by following parent_ids up the tree, suppose we want that route to be represented by a string. Assuming that we work with default column names, that would mean representing the information currently spread out among the id, parent_id and name columns in a single path column which, by default, would hold an array reference.
    $source = "./t/data/theta.csv";
    $obj = Parse::File::Taxonomy::Index->new( {
        file => $source,
    } );
    $taxonomy_with_path_as_array = $obj->pathify;
Yielding:
    [
        ["path", "is_actionable"],
        [["", "Alpha"], 0],
        [["", "Beta"], 0],
        [["", "Alpha", "Epsilon"], 0],
        [["", "Alpha", "Epsilon", "Kappa"], 1],
        [["", "Alpha", "Zeta"], 0],
        [["", "Alpha", "Zeta", "Lambda"], 1],
        [["", "Alpha", "Zeta", "Mu"], 0],
        [["", "Beta", "Eta"], 1],
        [["", "Beta", "Theta"], 1],
    ]
If we wanted the path information represented as a string rather than an array reference, we would say:
$taxonomy_with_path_as_string = $obj->pathify( { as_string => 1 } );
Yielding:
    [
        ["path", "is_actionable"],
        ["|Alpha", 0],
        ["|Beta", 0],
        ["|Alpha|Epsilon", 0],
        ["|Alpha|Epsilon|Kappa", 1],
        ["|Alpha|Zeta", 0],
        ["|Alpha|Zeta|Lambda", 1],
        ["|Alpha|Zeta|Mu", 0],
        ["|Beta|Eta", 1],
        ["|Beta|Theta", 1],
    ]
If we are providing a true value to the as_string key, we also get to choose what character to use as the separator in the path column.
    $taxonomy_with_path_as_string_different_path_col_sep = $obj->pathify( {
        as_string    => 1,
        path_col_sep => '~~',
    } );
Yields:
    [
        ["path", "is_actionable"],
        ["~~Alpha", 0],
        ["~~Beta", 0],
        ["~~Alpha~~Epsilon", 0],
        ["~~Alpha~~Epsilon~~Kappa", 1],
        ["~~Alpha~~Zeta", 0],
        ["~~Alpha~~Zeta~~Lambda", 1],
        ["~~Alpha~~Zeta~~Mu", 0],
        ["~~Beta~~Eta", 1],
        ["~~Beta~~Theta", 1],
    ]
Finally, should we want the path column in the returned arrayref to be named something other than path, we can provide a value to the path_col key. With path_col set to foo, the result is:
    [
        ["foo", "is_actionable"],
        [["", "Alpha"], 0],
        [["", "Beta"], 0],
        [["", "Alpha", "Epsilon"], 0],
        [["", "Alpha", "Epsilon", "Kappa"], 1],
        [["", "Alpha", "Zeta"], 0],
        [["", "Alpha", "Zeta", "Lambda"], 1],
        [["", "Alpha", "Zeta", "Mu"], 0],
        [["", "Beta", "Eta"], 1],
        [["", "Beta", "Theta"], 1],
    ]
Arguments
Optional single hash reference. If provided, the following keys may be used:
path_col
User-supplied name for column holding path information in the returned array reference. Defaults to path.
as_string
Boolean. If supplied with a true value, path information will be represented as a separator-delimited string rather than an array reference.
path_col_sep
User-supplied string to be used to separate the parts of the route when as_string is called with a true value. Not meaningful unless as_string is true.
Return Value
Reference to an array of array references. The first element in the array will be a reference to an array of field names. Each succeeding element will be a reference to an array holding data for one record in the original taxonomy. The path data will be represented, by default, as an array reference built up from the component (name) column in the original taxonomy, but if as_string is selected, the path data in all non-header elements will be a separator-delimited string.
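To make the transformation concrete, here is a minimal sketch of how such a structure could be built by walking parent IDs upward. pathify_sketch is a hypothetical re-implementation written for illustration, not the module's code. It assumes records laid out as [ id, parent_id, name, is_actionable ], assumes the data has already been validated (every parent exists, no cycles), and takes the '|' default separator from the example output above.

```perl
use strict;
use warnings;

# Hypothetical re-implementation for illustration; records are laid out as
# [ id, parent_id, name, is_actionable ], matching the example above.
sub pathify_sketch {
    my ($records, $args) = @_;
    $args ||= {};
    my $path_col = defined $args->{path_col}     ? $args->{path_col}     : 'path';
    my $sep      = defined $args->{path_col_sep} ? $args->{path_col_sep} : '|';

    # Index parent_id and name by id so each route can be walked upward.
    my (%parent_of, %name_of);
    for my $rec (@$records) {
        my ($id, $parent_id, $name) = @$rec;
        $parent_of{$id} = $parent_id;
        $name_of{$id}   = $name;
    }

    my @out = ( [ $path_col, 'is_actionable' ] );
    for my $rec (@$records) {
        my ($id, undef, undef, $is_actionable) = @$rec;
        my @route;
        # Climb from this node to a top-level node, prepending names.
        for (my $cur = $id; $cur ne ''; $cur = $parent_of{$cur}) {
            unshift @route, $name_of{$cur};
        }
        unshift @route, '';    # leading empty element marks the root
        my $path = $args->{as_string} ? join($sep, @route) : \@route;
        push @out, [ $path, $is_actionable ];
    }
    return \@out;
}

my @records = (
    [ 1, '', 'Alpha', 0 ],   [ 2, '', 'Beta', 0 ],
    [ 3, 1,  'Epsilon', 0 ], [ 4, 3,  'Kappa', 1 ],
);
print "$_->[0]\n" for @{ pathify_sketch(\@records, { as_string => 1 }) };
```

The leading empty element is what produces the separator at the start of each string path, as in the documented output.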