NAME

Data::CountMinSketch::Shared - shared-memory Count-Min sketch for Linux

SYNOPSIS

use Data::CountMinSketch::Shared;

# epsilon (error factor) 0.1%, delta (failure prob) 0.1%, anonymous mapping
my $cms = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001);

$cms->add("alice");                 # count "alice" once
$cms->add("bob", 5);                # count "bob" five times

$cms->estimate("alice");            # 1  (never less than the true count)
$cms->estimate("bob");              # 5
$cms->estimate("carol");            # 0  (never added)

# bulk add in a single lock acquisition (each element counted once)
$cms->add_many([ map { "user-$_" } 1 .. 1000 ]);

# merge another sketch of identical geometry (cellwise add -> summed streams)
my $other = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001);
$other->add_many([ map { "user-$_" } 500 .. 1500 ]);
$cms->merge($other);

# share across processes via a backing file
my $shared = Data::CountMinSketch::Shared->new("/tmp/freq.cms", 0.001, 0.001);

DESCRIPTION

A Count-Min sketch in shared memory: a compact, fixed-size structure for approximate frequency estimation over a stream. You add items (optionally with a count), then ask for the estimated number of times any item has been added. Memory is proportional to the configured error parameters, not to the number of distinct items or the size of the items; the sketch never stores the items themselves, only a small matrix of counters.

The estimate has a one-sided guarantee: it never underestimates the true count, and overestimates by at most epsilon * total with probability at least 1 - delta, where total is the sum of all increments. (An item never added estimates as 0 unless hash collisions with other items inflate every one of its cells.) This makes the sketch ideal for finding heavy hitters and approximate counts in a stream that is too large to count exactly.

Each item is hashed once with XXH3 (128-bit); the two 64-bit halves drive one column per row (d-row double hashing) into a d x w matrix of 64-bit counters, with w a power of two. add increments the d cells of the item (one per row); estimate returns the minimum of those d cells -- since every collision only ever adds to a cell, the smallest cell is the tightest upper bound, and is exact when at least one of the item's cells suffered no collision. The matrix width w and depth d are derived from the epsilon and delta you request.

Because the matrix lives in a shared mapping, several processes share one sketch: any process that opens the same backing file, inherits the anonymous mapping across fork, or reopens a passed memfd, sees the others' additions and contributes its own. A write-preferring futex rwlock with dead-process recovery guards mutation, so many processes may add and estimate concurrently. Two sketches of identical geometry can be combined with merge (cellwise add), which yields a sketch whose counts are the sum of the two input streams -- the merged estimate of any item equals the sum of its estimates in the two inputs.

Items are added and queried by their byte content; wide-character strings (any codepoint above 255) cause a "Wide character" croak -- encode such strings to bytes first (for example with Encode::encode_utf8). Linux-only. Requires 64-bit Perl.

METHODS

Constructors

my $cms = Data::CountMinSketch::Shared->new($path, $epsilon, $delta);
my $cms = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001);   # defaults
my $cms = Data::CountMinSketch::Shared->new_memfd($name, $epsilon, $delta);
my $cms = Data::CountMinSketch::Shared->new_from_fd($fd);

$path is the backing file (undef or omitted for an anonymous mapping). $epsilon is the target error factor and $delta the target failure probability; both are optional, default to 0.001, and must be strictly between 0 and 1. new and new_memfd croak if $epsilon or $delta is out of range.

From $epsilon and $delta the sketch derives its geometry: a width of w = next_power_of_two(ceil(e / epsilon)) columns (with a floor of 2 columns) and a depth of d = ceil(ln(1 / delta)) rows (clamped to the range 1..32). Rounding the width up to a power of two means the realised error factor at any given total is typically at or below the configured target. When reopening an existing file or memfd, the stored geometry wins and the caller's $epsilon/$delta arguments are ignored. new_memfd creates a Linux memfd (transferable via its memfd descriptor); new_from_fd reopens one in another process.

This is the standard Count-Min sketch (plain cell increments); it does not use the conservative-update variant, which would make merge unsound. The plain construction is what guarantees that merging two sketches is exactly equivalent to having counted both streams into one.

Adding and estimating

my $total = $cms->add($item);           # add 1; returns the new grand total
my $total = $cms->add($item, $n);       # add $n; returns the new grand total
my $count = $cms->add_many(\@items);     # add 1 per element; returns how many added
my $est   = $cms->estimate($item);       # estimated count of $item (>= true count)
$cms->clear;                             # reset every counter (and total) to 0

add hashes $item (taken by its bytes; wide characters croak, encode first) and increments its d cells by $n (default 1), returning the new grand total -- the running sum of all increments across all items. $n is an unsigned integer. add_many takes an array reference and adds each element once under a single write lock, returning the number of elements added (the array's length).

estimate returns the estimated number of times $item has been added: the minimum of its d cells. This value never underestimates the true count, and exceeds it by at most epsilon * total with probability at least 1 - delta. An item that was never added estimates as 0 unless every one of its cells happens to collide with other items.

Merging

$cms->merge($other);

Folds $other's counter matrix into $cms by cellwise addition, so $cms then estimates, for every item, the sum of that item's counts in the two sketches; $cms's total likewise becomes the sum of the two totals. Both sketches must have identical geometry -- the same width and depth, which follows from constructing both with the same $epsilon and $delta (merge croaks on a mismatch). $other is read under its own lock into a private snapshot first, so merging is deadlock-free even if two processes merge each other concurrently; $other is not modified. Cells that would overflow a 64-bit counter saturate at the maximum value.

Introspection and lifecycle

$cms->total; $cms->width; $cms->depth; $cms->cells; $cms->stats;
$cms->path; $cms->memfd; $cms->sync; $cms->unlink;   # or Class->unlink($path)

total is the running sum of all increments; width is the column count w (a power of two); depth is the row count d; cells is width * depth, the number of counters. sync flushes the mapping to its backing store (a no-op for anonymous and memfd sketches, which have none); unlink removes the backing file (also callable as Class->unlink($path)); path returns the backing path (undef for anonymous, memfd, or fd-reopened sketches) and memfd the backing descriptor -- the memfd of a new_memfd sketch or the dup'd fd of a new_from_fd sketch, and -1 for file-backed or anonymous sketches.

STATS

stats() returns a hashref describing the sketch:

  • width -- the column count w (a power of two).

  • depth -- the row count d.

  • total -- the running sum of all increments.

  • cells -- width * depth, the number of 64-bit counters.

  • epsilon -- the achieved error factor, e / width. The per-item overestimate is bounded by epsilon * total (with probability 1 - delta); a smaller value is a tighter bound.

  • delta -- the achieved failure probability, exp(-depth). This is the chance that the overestimate exceeds the epsilon * total bound for a given item; a smaller value is a stronger guarantee.

  • ops -- running count of mutating operations (add, add_many, merge, clear).

  • mmap_size -- bytes of the shared mapping.

SHARING ACROSS PROCESSES

The sketch lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls new($path, ...) on the same path with matching epsilon and delta), an anonymous mapping inherited across fork, or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via SCM_RIGHTS, or via /proc/$pid/fd/$n) and reopened with new_from_fd($fd). Because the mapping is shared, every process adds into and estimates against the same counter matrix, so the counts reflect the combined stream all of them have added.

# producer and consumer share one sketch with no coordination
my $cms = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001);   # before fork
unless (fork) { $cms->add("ev-500") for 1 .. 10; exit }
wait;
print $cms->estimate("ev-500"), "\n";   # >= 10 -- the child's adds

SECURITY

The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes.

CRASH SAFETY

Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Each cell increment is a single word store, so a crash leaves the sketch consistent up to the last completed add. Limitation: PID reuse is not detected (very unlikely in practice).

SEE ALSO

Data::BloomFilter::Shared, Data::HyperLogLog::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the Data::*::Shared family.

AUTHOR

vividsnow

LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.