NAME

Tie::CacheHash - Maintains sorted lists of top entries

SYNOPSIS

use Tie::CacheHash;
tie %hash1, 'Tie::CacheHash', 10, 100;
tie %hash2, 'Tie::CacheHash', '5%', '10%';

DESCRIPTION

Of course you can get the "top 100" entries of any perl hash:

@top_keys = (sort my_sort_func keys %my_hash)[0..99];

But if your hash has more than a few thousand entries, that sort operation may take several seconds. And if you have tens of thousands of entries, the sort may take many minutes.

(If you are reading this documentation past the expiration date on the bottom of the carton, please adjust the numbers accordingly. Sorting is always problematic for sufficiently large n.)

Many programs will need to keep track of a "top 100" (or "bottom 100") to perform such operations as expiring the oldest items out of a cache. Sorting the entire array and skimming off the top items is not always an acceptable algorithm. Tie::CacheHash provides a simple and reasonably efficient solution. Its primary design goal is reasonable responsiveness on every operation, i.e. no unpredictable long delays, and it achieves this goal by avoiding the sorting of huge arrays.

The two parameters you pass after the classname are the minimum and maximum allowable size for the cache. The largest array the module will ever have to sort will be somewhat above the maximum (how much depends on the distribution of your data), so picking a good 'max' will help control the maximum delay you will experience.

A 'min' of 0 means it is OK for the cache to run dry and never replenish itself (###I think###), so you probably want a minimum of at least 1. A minimum/maximum of a very large integer (try 2**30) means to keep the whole hash in the cache.

Duplicate values are allowed; if you don't specify your own sort function, they will be secondarily sorted by key.

If you pass in a subhash, you MUST NOT alter its data directly: only through the CacheHash.

BUGS

There should be a way to store or delete large amounts of data at once, without cache overhead between each entry. For now, it should work to munge the {h} array directly and then call cache_rebuild, but that's an ugly workaround.

There should be a way to set cache_pos's secondary (key) sort function, instead of forcing "cmp". (This would mean a tertiary sort to ensure predictable sort order in case the user screws up and returns 0 for nonequal keys.)

Not sure yet whether the undefined value is handled properly in all cases. Should be tested with a subhash of a type known to choke and die horribly on undef (ideally, Tie::CacheHash would not provoke such behavior).

The percent-style min/max arguments don't work yet (but they're an awfully cool idea aren't they?).

Because it's not a full 1.0 release yet, "make test" still does a tremendous amount of randomized stress testing. This takes longer than it really should (typically 15-30 seconds, sometimes more). When it gets closer to release, this will be backed-off.

The Monte Carlo algorithm has a number of places where its performance could be further optimized. In some cases the impact is significant. These are marked in the code as "BUG".

The cmp() method probably has room for performance improvement if we make the very fair assumption that sort_func and sort_rev will not change during a sort!

The different DEBUG levels are neither well-thought-out nor documented.

Haven't tested setting cache minimum to 0.

Haven't tested setting cache maximum to 2**30.

AUTHOR

Jamie McCarthy <jamie@mccarthy.org>.