NAME

Hash::Type - restricted, ordered hashes as arrays tied to a "type" (shared list of keys)

SYNOPSIS

use Hash::Type;

# create a Hash::Type
my $person_type = Hash::Type->new(qw/firstname lastname city/);

# create and populate some hashes tied to $person_type
tie my(%wolfgang), $person_type, "wolfgang amadeus", "mozart", "salzburg";
my $ludwig = $person_type->new("ludwig", "van beethoven", "vienna");
my $jsb    = $person_type->new;
$jsb->{city} = "leipzig";
@{$jsb}{qw/firstname lastname/} = ("johann sebastian", "bach");

# add fields dynamically
$person_type->add("birth", "death") or die "fields not added";
$wolfgang{birth} = 1750;

# get back ordered names or values
my @fields = $person_type->names;              # same as: keys %wolfgang
my @vals   = $person_type->values(\%wolfgang); # same as: values %wolfgang

# More complete example : read a flat file with headers on first line
my ($headerline, @datalines) = map {chomp; $_} <F>;
my $ht = Hash::Type->new(split /\t/, $headerline);
foreach my $line (@datalines) {
  my $data = $ht->new(split /\t/, $line);
  work_with($data->{some_field}, $data->{some_other_field});
}

# get many lines from a DBI database
my $sth = $dbh->prepare($sql);
$sth->execute;
my $ht = Hash::Type->new(@{$sth->{NAME}});
while (my $r = $sth->fetchrow_arrayref) {
  my $row = $ht->new(@$r);
  work_with($row);
}

# an alternative to Time::gmtime and Time::localtime
my $time_type  = Hash::Type->new(qw/sec min hour mday mon year wday yday/);
my $localtime  = $time_type->new(localtime);
my $gmtime     = $time_type->new(gmtime);
print $localtime->{hour} - $gmtime->{hour}, " hours difference to GMT";

# an alternative to File::Stat or File::stat or File::Stat::OO
my $stat_type  = Hash::Type->new(qw/dev ino mode nlink uid gid rdev
                                    size atime mtime ctime blksize blocks/);
my $stat       = $stat_type->new(stat $my_file);
print "$my_file has $stat->{size} bytes and $stat->{blocks} blocks";


# comparison functions
my $by_age         = $person_type->cmp("birth : -num, lastname, firstname");
my $by_name_length = $person_type->cmp(
  lastname  => {length($b) <=> length($a)},
  lastname  => 'alpha',
  firstname => 'alpha',
);
show_person($_) foreach (sort $by_age         @people);
show_person($_) foreach (sort $by_name_length @people);

# special comparisons : dates
my $US_date_cmp         = $my_hash_type->cmp("some_date_field : m/d/y");
my $FR_inverse_date_cmp = $my_hash_type->cmp("some_date_field : -d.m.y");

DESCRIPTION

An instance of Hash::Type encapsulates a collection of field names, and is used to generate tied hashes, implemented internally as arrayrefs, and sharing the common list of fields.

The original motivation for this design was to spare memory, since the field names are shared. As it turns out, benchmarks show that this goal is not attained : memory usage is about 35% higher than Perl native hashes. However, this module also implements restricted hashes (hashes with a fixed set of keys that cannot be expanded) and of ordered hashes (the list of keys or list values are returned in a fixed order); and for those two functionalities, the performances of Hash::Type are very competitive with respect to those of other similar modules, both in terms of CPU and memory usage (see the "BENCHMARKS" section at the end of the documentation). In addition, Hash::Type offers an API for convenient and very efficient sorting of lists of tied hashes, and alternative methods for keys, values and each, faster than the "perltie" API.

In conclusion, this module is well suited for any need of restricted and/or ordered hashes, and for situations dealing with large collections of homogeneous hashes, like for example data rows coming from spreadsheets or databases.

METHODS

new

The new() method is polymorphic : it can be used both as a class and as an instance method.

new() as a class method

my $h_type = Hash::Type->new(@names);

Creates a new instance which holds a collection of names and associated indices (technically, this is a hash reference blessed in package Hash::Type). This instance can then be used to generate tied hashes. The list of @names is optional ; names can be added later through the add method.

new() as an instance method

my $tied_hashref = $h_type->new(@vals);

Creates a new tied hash associated to the Hash::Type class and containing a reference to the $h_type instance. Internally, the tied hash is implemented as an array reference.

TIE interface

Tied hash creation

The tie syntax is an alternative to the new() method seen above :

tie my(%hash), $h_type, @vals;

This is a bit unusual since the official syntax for the tie function is tie VARIABLE, CLASSNAME, LIST; here the second argument is not a classname, but an object. It works well, however, because the TIEHASH call merely passes the second argument to the implementation, without any check that this is a real scalar classname.

Accessing names in tied hashes

Access to $hash{name} works like for a regular Perl hash. It is equivalent to writing

tied(%hash)->[$h_type->{name}]

where $h_type is the Hash::Type instance. That instance can be retrieved through the special, reserved name $hash{'Hash::Type'}; you may need it for example to generate a comparison function, as explained below. The reserved name $hash{'Hash::Type'} can be read but it does not belong to keys %hash.

The operation delete $hash{name} is forbidden since it would break the consistency of the internal arrayref. Any attempt to delete a name will generate an exception. Of course it is always possible to set a field to undef.

Iterating on keys

Standard Perl operations keys, values and each on tied hashes preserve the order in which names were added to the Hash::Type instance.

The same behavior can be obtained through object-oriented method calls, described below.

add

$h_type->add(@new_names);

Adds @new_names in $h_type and gives them new indices. Does nothing for names that were already present. Returns the number of names actually added.

Existing hashes already tied to $h_type are not touched; their internal arrayrefs are not expanded, but the new fields will spring into existence as soon as they are assigned a value, thanks to Perl's auto-vivification mechanism.

names

my @headers = $h_type->names;

Returns the list of defined names, in index order. This is the same list as keys %hash, but computed faster because it directly returns a copy of the underlying name array instead of having to iterate through the keys.

values

my @vals = $h_type->values(\%hash);

Returns the list of values in %hash, in index order. This is the same list as values %hash, but computed faster.

each

my $iterator = $h_type->each(\%hash);
while (my ($key, $value) = $iterator->()) {
  # ...
}

Returns an iterator function that yields pairs of ($key, $value) at each call, until reaching an empty list at the end of the tied hash. This is the same as calling each %hash, but works faster.

cmp

first syntax : one single argument

my $cmp = $h_type->cmp("f1 : cmp1, f2 : cmp2 , ...")

Returns a reference to an anonymous sub which successively compares the given field names, applying the given operators, and returns a positive, negative or zero value. This sub can then be fed to sort. 'f1', 'f2', etc are field names, 'cmp1', 'cmp2' are comparison operators written as :

[+|-] [alpha|num|cmp|<=>|d.m.y|d/m/y|y-m-d|...]

The sign is '+' for ascending order, '-' for descending; default is '+'. Operator 'alpha' is synonym to 'cmp' and 'num' is synonym to '<=>'; operators 'd.m.y', 'd/m/y', etc. are for dates in various formats; default is 'alpha'. So for example

my $cmp = $h_type->cmp("foo : -alpha, bar : num");

will sort alphabetically on foo in descending order; in case of identical values for the foo field, it will sort numerically on the bar field.

If all you want is alphabetic ascending order, just write the field names :

my $cmp = $person_type->cmp('lastname, firstname');

Note : sort will not accept something like

sort $person_type->cmp('lastname, firstname') @people;

so you have to store it in a variable first :

my $cmp = $person_type->cmp('lastname', 'firstname');
sort $cmp @people;

For date comparisons, values are parsed into day/month/year, according to the shape specified (for example 'd.m.y') will take '.' as a separator. Day, month or year need not be several digits, so '1.1.1' will be interpreted as '01.01.2001'. Years of 2 or 1 digits are mapped to 2000 or 1900, with pivot at 33 (so 32 becomes 2032 and 33 becomes 1933).

second syntax : pairs of (field_name => comparison_specification)

my $cmp = $h_type->cmp(f1 => cmp1, f2 => cmp2, ...);

This second syntax, with pairs of field names and operators, is a bit more verbose but gives you more flexibility, as you can write your own comparison functions using $a and $b :

my $by_name_length = $person_type->cmp(
  lastname  => {length($b) <=> length($a)},
  lastname  => 'alpha',
  firstname => 'alpha',
);

Note : the resulting closure is bound to the special variables $a and $b. Since those are different in each package, you cannot pass the comparison function to another package : the call to sort has to be done in the package where the comparison function was compiled.

INTERNALS

A Hash::Type instance is a blessed hashref, in which each declared name is associated with a corresponding index (starting at 1).

In addition, the hashref contains a private key \0HTkeys\0 holding an arrayref to the ordered list of names, in order to provide a fast implementation for the NEXTKEY operation. In the unlikely case that this private key would cause a conflict, it can be changed by setting $Hash::Type::reserved_keys_field to a different value.

Tied hashes are implemented as arrayrefs, in which slot 0 points to the Hash::Type instance, and slots 1 to n are the values for the named fields.

The particular thing about this module is that it has two different families of instances blessed into the same class :

  • blessed hashrefs are "types", holding a collection of field names, answering to object-oriented methods of the public interface (i.e. methods "add", "names", etc.)

  • blessed arrayrefs are implementations of tied hashes, answering to implicit method calls of the "perltie" API (i.e methods FIRSTKEY, NEXTKEY, EXISTS, etc.).

SEE ALSO

The 'pseudo-hashes' documented in perlref were very similar, but were deprecated since Perl 5.8.0.

For other ways to restrict the keys of a hash to a fixed set, see "lock_keys" in Hash::Util, Tie::Hash::FixedKeys, Tie::StrictHash.

For other ways to implement ordered hashes, see Tie::IxHash, Hash::Ordered, Tie::Hash::Indexed, and many others.

The Sort::Fields module in CPAN uses techniques similar to the present "cmp" method for dynamically building sorting criterias according to field positions; but it is intended for numbered fields, not for named fields, and has no support for caller-supplied comparison operators. The design is also a bit different : fieldsort does everything at once (splitting, comparing and sorting), whereas Hash::Type::cmp only compares, and leaves it to the caller to do the rest.

Hash::Type was primarily designed as a core element for implementing rows of data in File::Tabular.

BENCHMARKS

A benchmark program is supplied within the distribution to compare performances of Hash::Type to some other solutions. Compared operations were hash creation, data access, data update, sorting, and deletion, varying the number of tied hashes, the number of keys, and also the length of hash keys (surprisingly, this can make a difference!).

To give an idea, here are just a few results :

200000 records with 10 keys of  1 chars
 create  update  access    sort  delete   memory
======= ======= ======= ======= ======= ========
  4.181   2.246   0.812   0.062   0.016  132.5MB (perl core hashes)
  2.683   5.366   1.357   0.110   0.078  227.4MB (Hash::Type v2.00)
  3.386   5.740   1.358   5.194   0.110  524.0MB (Hash::Ordered v0.010)
  5.756   7.005   1.451   5.818   0.109  539.8MB (Tie::IxHash v1.23)
  3.167   4.196   1.279   5.258   0.171  508.2MB (Tie::Hash::Indexed v0.05)

200000 records with  5 keys of 20 chars
 create  update  access    sort  delete   memory
======= ======= ======= ======= ======= ========
  1.482   1.201   0.811   0.078   0.031  132.5MB (perl core hashes)
  1.669   2.933   1.326   0.125   0.063  180.0MB (Hash::Type v2.00)
  2.200   3.120   1.341   5.148   0.109  524.0MB (Hash::Ordered v0.010)
  3.541   3.744   1.498   5.257   0.141  539.8MB (Tie::IxHash v1.23)
  1.826   2.215   1.279   5.164   0.172  506.2MB (Tie::Hash::Indexed v0.05)

100000 records with 50 keys of  2 chars
 create  update  access    sort  delete   memory
======= ======= ======= ======= ======= ========
  6.349   5.382   0.406   0.062   0.171  384.5MB (perl core hashes)
  5.429  13.384   0.702   0.063   0.110  432.0MB (Hash::Type v2.00)
  9.048  14.211   0.687   1.232   0.577 1180.7MB (Hash::Ordered v0.010)
 14.508  17.113   0.765   1.513   0.702 1354.6MB (Tie::IxHash v1.23)
  7.893  11.997   0.670   1.358   1.107 1542.4MB (Tie::Hash::Indexed v0.05)

The conclusions (confirmed by other results not displayed here) are that Hash::Type is quite reasonable in CPU and memory usage, with respect to other modules implementing ordered hashes. It even sometimes outperforms perl core hashes on creation time. It is especially good at sorting, but this due to its API allowing to compile a sorting function before applying it to a list of hashes.

Among the competitors Tie::Hash::Indexed is the fastest, which is not suprising as it is implemented in XS, while all other are implemented in pure Perl; notice however that this module is quite costly in terms of memory.

Tie::IxHash, which was among the first modules on CPAN for ordered hashes, is definitely the slowest and is quite greedy for memory.

Hash::Ordered, a more recent proposal, sits in the middle; however, it should be noted that the benchmark was based on its "perltie" API, which is probably not the most efficient way to use this module.

AUTHOR

Laurent Dami, <laurent.dami AT cpan dot org>

COPYRIGHT AND LICENSE

Copyright 2005, 2016 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.