NAME
Parse::Taxonomy::AdjacentList - Extract a taxonomy from a hierarchy inside a CSV file
SYNOPSIS
use Parse::Taxonomy::AdjacentList;
$source = "./t/data/alpha.csv";
$self = Parse::Taxonomy::AdjacentList->new( {
file => $source,
} );
METHODS
new()
Purpose
Parse::Taxonomy::AdjacentList constructor.
Arguments
Single hash reference. There are two possible interfaces: file and components.

1. file interface

    $source = "./t/data/delta.csv";
    $self = Parse::Taxonomy::AdjacentList->new( {
        file => $source,
    } );

Elements in the hash reference are keyed on:
file
Absolute or relative path to the incoming taxonomy file. Required for this interface.
id_col
The name of the column in the header row under which each data record's unique ID can be found. Defaults to id.
parent_id_col
The name of the column in the header row under which each data record's parent ID can be found. (Will be empty in the case of top-level nodes, as they have no parent.) Defaults to parent_id.
leaf_col
The name of the column in the header row under which, in each data record, there is found a string which differentiates that record from all other records with the same parent ID. Defaults to name.
Text::CSV_XS options
Any other options which could normally be passed to Text::CSV_XS->new() will be passed through to that module's constructor. On the recommendation of the Text::CSV documentation, binary is always set to a true value.

2. components interface

    $self = Parse::Taxonomy::AdjacentList->new( {
        components => {
            fields       => $fields,
            data_records => $data_records,
        }
    } );

Elements in this hash are keyed on:
components
This element is required for the components interface. The value of this element is a hash reference with two keys, fields and data_records. fields is a reference to an array holding the field or column names for the data set. data_records is a reference to an array of array references, each of the latter arrayrefs holding one record or row from the data set.
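As a rough illustration of the shape the components interface expects, the following sketch builds fields and data_records by hand, using a few rows of the sample data that appears later in this document. The width check is only an assumed pre-flight test, not the module's own validation code, and the constructor call is left commented out so that the sketch runs without the module installed.

```perl
use strict;
use warnings;

# Column names, in header-row order.
my $fields = [ qw( id parent_id name is_actionable ) ];

# One array reference per record; top-level nodes have an empty parent_id.
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 2, '', 'Beta',    0 ],
    [ 3, 1,  'Epsilon', 0 ],
    [ 4, 3,  'Kappa',   1 ],
];

# Hypothetical pre-flight check: each record must be as wide as the header.
my @bad = grep { scalar(@$_) != scalar(@$fields) } @$data_records;
print scalar(@bad), " malformed record(s)\n";

# With the module installed, the structures would be passed like this:
# my $self = Parse::Taxonomy::AdjacentList->new( {
#     components => { fields => $fields, data_records => $data_records },
# } );
```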
Return Value
Parse::Taxonomy::AdjacentList object.
Exceptions
new() will throw an exception under any of the following conditions:
    Argument to new() is not a reference.
    Argument to new() is not a hash reference.
    Argument to new() must have either 'file' or 'components' element but not both.
    Lack columns in header row to match requirements.
    Non-numeric entry in id or parent_id column.
    Duplicate entries in id column.
    Number of fields in a data record does not match number in header row.
    Empty string in a component column of a record.
    Unable to locate a record whose id is the parent_id of a different record.
    No records with same parent_id may share value of component column.
file interface
    In the file interface, unable to locate the file which is the value of the file element.
    The same field is found more than once in the header row of the incoming taxonomy file.
    Unable to open or close the incoming taxonomy file for reading.
components interface
    In the components interface, components element must be a hash reference with fields and data_records elements.
    fields element must be array reference.
    data_records element must be reference to array of array references.
    No duplicate fields in fields element's array reference.
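To make a few of the data-level conditions above concrete, here is a minimal sketch of how such checks could be written. It is not the module's implementation. The function name check_records, the error messages, the fixed column order (id, parent_id, name) and the reading of "numeric" as "non-negative integer" are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical checker; assumes columns are ordered id, parent_id, name.
sub check_records {
    my ($data_records) = @_;
    my ( %seen_id, %seen_sibling );
    for my $rec (@$data_records) {
        my ( $id, $parent_id, $name ) = @$rec;
        die "Non-numeric id '$id'\n" unless $id =~ /\A\d+\z/;
        # An empty parent_id is allowed: it marks a top-level node.
        die "Non-numeric parent_id '$parent_id'\n"
            unless $parent_id =~ /\A\d*\z/;
        die "Duplicate id '$id'\n" if $seen_id{$id}++;
        die "Empty name in record '$id'\n" unless length($name);
        die "Duplicate name '$name' under parent '$parent_id'\n"
            if $seen_sibling{$parent_id}{$name}++;
    }
    # Second pass: every non-empty parent_id must be some record's id.
    for my $rec (@$data_records) {
        my $parent_id = $rec->[1];
        die "No record with id '$parent_id'\n"
            if length($parent_id) and not $seen_id{$parent_id};
    }
    return 1;
}

my $ok = eval { check_records( [ [ 1, '', 'Alpha' ], [ 2, 1, 'Beta' ] ] ) };
print "valid\n" if $ok;

# Two records sharing an id trigger an exception, which eval traps.
eval { check_records( [ [ 1, '', 'Alpha' ], [ 1, '', 'Beta' ] ] ) };
print "caught: $@";
```

The same eval pattern applies when calling the real constructor: wrap the call to new() in eval and inspect $@ afterwards.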
fields()
Purpose
Identify the names of the columns in the taxonomy.
Arguments

    my $fields = $self->fields();

No arguments; the information is already inside the object.
Return Value
Reference to an array holding a list of the columns as they appear in the header row of the incoming taxonomy file.
Comment
Read-only.
data_records()
Purpose
Once the taxonomy has been validated, get a list of its data rows as a Perl data structure.
Arguments

    $data_records = $self->data_records;

None.
Return Value
Reference to array of array references. The array will hold the data records found in the incoming taxonomy file in their order in that file.
Comment
Does not contain any information about the fields in the taxonomy, so you should probably either (a) use in conjunction with the fields() method above; or (b) use fields_and_data_records().
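Option (a) can be sketched as follows. The two array references below are hand-written stand-ins for what fields() and data_records() would return, so the snippet runs without the module; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Stand-ins for the return values of $self->fields and $self->data_records.
my $fields       = [ qw( id parent_id name is_actionable ) ];
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 3, 1,  'Epsilon', 0 ],
];

# Pair each value with its column name using a hash slice.
my @rows;
for my $rec (@$data_records) {
    my %row;
    @row{ @$fields } = @$rec;
    push @rows, \%row;
}

print "$rows[1]{name} has parent $rows[1]{parent_id}\n";
```

Turning each record into a hash keyed on field name avoids hard-coding column positions in later code.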
get_field_position()
Purpose
Identify the index position of a given field within the header row.
Arguments

    $index = $self->get_field_position('income');

Takes a single string holding the name of one of the fields (column names).
Return Value
Integer representing the index position (counting from 0) of the field provided as argument. Throws exception if the argument is not actually a field.
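The behavior described above amounts to a lookup in the list of field names. A minimal sketch, assuming a hypothetical stand-alone helper named field_position (not part of the module) and an invented error message:

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based index of a field name,
# or throw an exception if the name is not among the fields.
sub field_position {
    my ( $fields, $wanted ) = @_;
    for my $i ( 0 .. $#{$fields} ) {
        return $i if $fields->[$i] eq $wanted;
    }
    die "'$wanted' is not a field in this taxonomy\n";
}

my $fields = [ qw( id parent_id name is_actionable ) ];
print field_position( $fields, 'name' ), "\n";

# An unknown field raises an exception, which eval can trap.
my $idx = eval { field_position( $fields, 'income' ) };
print "error: $@" unless defined $idx;
```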
Accessors
The following methods provide information about key columns in a Parse::Taxonomy::AdjacentList object. The key columns are those which hold the ID, parent ID and component information. The methods take no arguments. Those whose names end in _idx return integers, as they return the index position of the column in the header row. The other methods return strings.
$index_of_id_column = $self->id_col_idx;
$name_of_id_column = $self->id_col;
$index_of_parent_id_column = $self->parent_id_col_idx;
$name_of_parent_id_column = $self->parent_id_col;
$index_of_leaf_column = $self->leaf_col_idx;
$name_of_leaf_column = $self->leaf_col;
pathify()
Purpose
Generate a new Perl data structure which holds the same information as a Parse::Taxonomy::AdjacentList object but which expresses the route from the root node to a given branch or leaf node as either a separator-delimited string (as in the path column of a Parse::Taxonomy::MaterializedPath object) or as an array reference holding the list of names which delineate that route.
Another way of expressing this: Transform a taxonomy-by-adjacent-list to a taxonomy-by-materialized-path.
Example: Suppose we have a CSV file which serves as a taxonomy-by-adjacent-list for this data:

    "id","parent_id","name","is_actionable"
    "1","","Alpha","0"
    "2","","Beta","0"
    "3","1","Epsilon","0"
    "4","3","Kappa","1"
    "5","1","Zeta","0"
    "6","5","Lambda","1"
    "7","5","Mu","0"
    "8","2","Eta","1"
    "9","2","Theta","1"

Instead of having the route from the root node to a given node be represented implicitly by following parent_ids up the tree, suppose we want that route to be represented explicitly. Assuming that we work with default column names, that would mean representing the information currently spread out among the id, parent_id and name columns in a single path column which, by default, would hold an array reference.

    $source = "./t/data/theta.csv";
    $self = Parse::Taxonomy::AdjacentList->new( {
        file => $source,
    } );
    $taxonomy_with_path_as_array = $self->pathify;

Yielding:

    [
      ["path", "is_actionable"],
      [["", "Alpha"], 0],
      [["", "Beta"], 0],
      [["", "Alpha", "Epsilon"], 0],
      [["", "Alpha", "Epsilon", "Kappa"], 1],
      [["", "Alpha", "Zeta"], 0],
      [["", "Alpha", "Zeta", "Lambda"], 1],
      [["", "Alpha", "Zeta", "Mu"], 0],
      [["", "Beta", "Eta"], 1],
      [["", "Beta", "Theta"], 1],
    ]

If we wanted the path information represented as a string rather than an array reference, we would say:

    $taxonomy_with_path_as_string = $self->pathify( { as_string => 1 } );

Yielding:

    [
      ["path", "is_actionable"],
      ["|Alpha", 0],
      ["|Beta", 0],
      ["|Alpha|Epsilon", 0],
      ["|Alpha|Epsilon|Kappa", 1],
      ["|Alpha|Zeta", 0],
      ["|Alpha|Zeta|Lambda", 1],
      ["|Alpha|Zeta|Mu", 0],
      ["|Beta|Eta", 1],
      ["|Beta|Theta", 1],
    ]

If we are providing a true value to the as_string key, we also get to choose what character to use as the separator in the path column.

    $taxonomy_with_path_as_string_different_path_col_sep = $self->pathify( {
        as_string    => 1,
        path_col_sep => '~~',
    } );

Yields:

    [
      ["path", "is_actionable"],
      ["~~Alpha", 0],
      ["~~Beta", 0],
      ["~~Alpha~~Epsilon", 0],
      ["~~Alpha~~Epsilon~~Kappa", 1],
      ["~~Alpha~~Zeta", 0],
      ["~~Alpha~~Zeta~~Lambda", 1],
      ["~~Alpha~~Zeta~~Mu", 0],
      ["~~Beta~~Eta", 1],
      ["~~Beta~~Theta", 1],
    ]

Finally, should we want the path column in the returned arrayref to be named something other than path, we can provide a value to the path_col key. With path_col set to foo, this yields:

    [
      ["foo", "is_actionable"],
      [["", "Alpha"], 0],
      [["", "Beta"], 0],
      [["", "Alpha", "Epsilon"], 0],
      [["", "Alpha", "Epsilon", "Kappa"], 1],
      [["", "Alpha", "Zeta"], 0],
      [["", "Alpha", "Zeta", "Lambda"], 1],
      [["", "Alpha", "Zeta", "Mu"], 0],
      [["", "Beta", "Eta"], 1],
      [["", "Beta", "Theta"], 1],
    ]

Arguments
Optional single hash reference. If provided, the following keys may be used:
path_col
User-supplied name for column holding path information in the returned array reference. Defaults to path.
as_string
Boolean. If supplied with a true value, path information will be represented as a separator-delimited string rather than an array reference.
path_col_sep
User-supplied string to be used to separate the parts of the route when as_string is called with a true value. Not meaningful unless as_string is true.
Return Value
Reference to an array of array references. The first element in the array will be a reference to an array of field names. Each succeeding element will be a reference to an array holding data for one record in the original taxonomy. The path data will be represented, by default, as an array reference built up from the component (name) column in the original taxonomy, but if as_string is selected, the path data in all non-header elements will be a separator-delimited string.
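The transformation can be sketched in plain Perl by following parent_id links from each record up to its root. This is an illustrative stand-in, not the module's code: the function name pathify_sketch is hypothetical, it assumes the columns are ordered id, parent_id, name followed by any remaining columns, it assumes the data has already been validated (no cycles, no missing parents), and the default separator of '|' is inferred from the example output in this document.

```perl
use strict;
use warnings;

# Hypothetical stand-in for pathify(); see the assumptions stated above.
sub pathify_sketch {
    my ( $fields, $data_records, $args ) = @_;
    $args ||= {};
    my $sep = defined $args->{path_col_sep} ? $args->{path_col_sep} : '|';

    # Index each record by its id so parents can be looked up directly.
    my %record_for;
    $record_for{ $_->[0] } = $_ for @$data_records;

    # Header row: the path column replaces id, parent_id and name.
    my @rows = ( [ ( $args->{path_col} || 'path' ), @{$fields}[ 3 .. $#{$fields} ] ] );

    for my $rec (@$data_records) {
        # Walk parent_id links up to the root, collecting names on the way.
        my @route;
        my $node = $rec;
        while ($node) {
            unshift @route, $node->[2];
            my $parent_id = $node->[1];
            $node = length($parent_id) ? $record_for{$parent_id} : undef;
        }
        unshift @route, '';    # leading empty element marks the root
        my $path = $args->{as_string} ? join( $sep, @route ) : \@route;
        push @rows, [ $path, @{$rec}[ 3 .. $#{$rec} ] ];
    }
    return \@rows;
}

my $fields       = [ qw( id parent_id name is_actionable ) ];
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 3, 1,  'Epsilon', 0 ],
    [ 4, 3,  'Kappa',   1 ],
];
my $pathified = pathify_sketch( $fields, $data_records, { as_string => 1 } );
print "$_->[0]\n" for @{$pathified}[ 1 .. $#{$pathified} ];
```

The leading empty string in each route is what produces the leading separator when the route is joined into a string.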
write_pathified_to_csv()
Purpose
Create a CSV-formatted file holding the data returned by pathify().
Arguments

    $csv_file = $self->write_pathified_to_csv( {
        pathified => $pathified,                   # output of pathify()
        csvfile   => './t/data/taxonomy_out5.csv',
    } );

Single hash reference. That hash is keyed on:
pathified
Required. Its value must be the reference to an array of array references returned by the pathify() method.
csvfile
Optional. Path to the location where a CSV-formatted text file holding the taxonomy-by-materialized-path will be written. Defaults to a file called taxonomy_out.csv in the current working directory.
Text::CSV_XS options
You can also pass through any key-value pairs normally accepted by Text::CSV_XS.
Return Value
Returns path to CSV-formatted text file just created.
Example
Suppose we have a CSV-formatted file holding the following taxonomy-by-adjacent-list:

    "id","parent_id","name","is_actionable"
    "1","","Alpha","0"
    "2","","Beta","0"
    "3","1","Epsilon","0"
    "4","3","Kappa","1"
    "5","1","Zeta","0"
    "6","5","Lambda","1"
    "7","5","Mu","0"
    "8","2","Eta","1"
    "9","2","Theta","1"

After running this file through new(), pathify() and write_pathified_to_csv() we will have a new CSV-formatted file holding this taxonomy-by-materialized-path:

    path,is_actionable
    |Alpha,0
    |Beta,0
    |Alpha|Epsilon,0
    |Alpha|Epsilon|Kappa,1
    |Alpha|Zeta,0
    |Alpha|Zeta|Lambda,1
    |Alpha|Zeta|Mu,0
    |Beta|Eta,1
    |Beta|Theta,1

Note that the id, parent_id and name columns have been replaced by the path column.
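For readers who want to see the shape of the output without the module, the following sketch writes an already-pathified structure to a temporary file and reads it back. It is deliberately naive and is not how write_pathified_to_csv() works internally: it uses a plain join rather than Text::CSV_XS, so it assumes no value contains a comma, quote or newline, and it assumes the path column already holds strings.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Stand-in for the output of pathify( { as_string => 1 } ).
my $pathified = [
    [ 'path', 'is_actionable' ],
    [ '|Alpha', 0 ],
    [ '|Alpha|Epsilon', 0 ],
];

# Naive writer: plain join, so values must not need CSV quoting.
my ( $fh, $csvfile ) = tempfile( SUFFIX => '.csv', UNLINK => 1 );
print {$fh} join( ',', @$_ ), "\n" for @$pathified;
close $fh or die "Unable to close $csvfile: $!";

# Read the file back to confirm what was written.
open my $in, '<', $csvfile or die "Unable to open $csvfile: $!";
my @lines = <$in>;
close $in;
print @lines;
```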