NAME

Data::SortedSet::Shared - shared-memory sorted set (ZSET) for Linux

SYNOPSIS

use Data::SortedSet::Shared;

# up to 1M members, anonymous shared mapping
my $z = Data::SortedSet::Shared->new(undef, 1_000_000);

$z->add(42, 1500);              # member 42 with score 1500
$z->add(7,  1500);              # ties broken by member id
$z->incr(42, 50);               # 42 -> 1550 (returns the new score)

my @top   = $z->rev_range_by_rank(0, 9);    # top 10 members (highest score)
my $rank  = $z->rank(42);                    # 0-based rank (lowest score = 0)
my $score = $z->score(42);
my @near  = $z->range_by_score(1400, 1600);  # members scored in [1400, 1600]

my ($m, $s) = $z->pop_min;       # remove + return the lowest

$z->each(sub { my ($member, $score) = @_; ... });   # in score order

DESCRIPTION

An ordered set in shared memory, in the spirit of a Redis sorted set: each member is a 64-bit integer carrying a double score, and members are kept in score order. It is backed by an order-statistics B+tree -- so rank, range, and pop are O(log n) and ranges scan sequentially through doubly-linked leaves -- paired with a member-to-score hash index, so score and exists are O(1).

The total order is (score, member): members with equal scores are ordered by member id, which gives a well-defined rank and a deterministic pop_min/pop_max.

Multiple processes can map the same set and read and write it concurrently; access is serialized by a write-preferring futex rwlock that recovers automatically if a lock holder dies (see "CRASH SAFETY").

Members are 64-bit integers. For string-keyed sets, see "String-keyed sets" and Data::SortedSet::Shared::Strings (bundled). Scores must not be NaN. Linux-only. Requires 64-bit Perl.

METHODS

Constructors

my $z = Data::SortedSet::Shared->new($path, $max);
my $z = Data::SortedSet::Shared->new(undef, $max);        # anonymous
my $z = Data::SortedSet::Shared->new_memfd($name, $max);
my $z = Data::SortedSet::Shared->new_from_fd($fd);

$path is the backing file (undef for an anonymous mapping); $max is the maximum number of members. When reopening an existing file or memfd, the stored header wins and the caller's $max is ignored. new_memfd creates a Linux memfd (transferable via its memfd descriptor); new_from_fd reopens one in another process.

String-keyed sets

my $z = Data::SortedSet::Shared->new_strings(max => 1_000_000);
$z->add("alice", 1500);
my @top = $z->rev_range_by_rank(0, 9);    # ("alice", ...)

new_strings returns a Data::SortedSet::Shared::Strings -- the same API as this class but with string members. Keys are interned to dense ids via Data::Intern::Shared (a prerequisite of this distribution), so the set is still shared across processes by id. Ties among equal scores break by interning id, not lexicographically. See Data::SortedSet::Shared::Strings for the full options (set/keys backing paths, max_keys, arena), and the separate wrap constructor that adopts two existing objects.

Mutators

$z->add($member, $score);     # 1 new, 0 existing (score updated), undef if full
$z->incr($member, $delta);    # add to the score (creating at $delta); returns new score
$z->remove($member);          # true if removed, false if absent
my $n = $z->add_many([ [$m1,$s1], [$m2,$s2], ... ]);   # bulk; returns count of new
$z->clear;

add inserts a new member or updates an existing member's score, returning 1 or 0 respectively, or undef if the pool is full and the member is new. $score may be any finite or infinite value but not NaN (croaks). incr creates an absent member at $delta (like Redis ZINCRBY) and croaks if the result would be NaN, or if the pool is full and the member is new.

add_many applies a whole batch under a single lock; each row is an [member, score] arrayref, malformed or NaN-scored rows are skipped, and it stops at $max. It returns the number of members newly inserted, which can be fewer than the number of new rows if the pool fills mid-batch.

Lookup and count

$z->score($member);           # the score, or undef if absent
$z->exists($member);
$z->count;                    # number of members
$z->rank($member);            # 0-based rank (lowest score = 0), or undef
$z->rev_rank($member);        # rank from the top, or undef
$z->count_in_score($min, $max);   # members with score in [min, max] (inclusive)

Rank and range

$z->at_rank($r);              # member at rank $r (negative counts from the end), or undef
my @m = $z->range_by_rank($start, $stop);       # members in [start .. stop] by rank
my @m = $z->rev_range_by_rank(0, 9);            # top 10 (highest scores first)
my @m = $z->range_by_score($min, $max, %opts);  # members scored in [min, max], ascending
my @m = $z->rev_range_by_score($max, $min, %opts);

Rank indices are 0-based and may be negative (counting from the end, like Perl slices); range_by_* bounds are inclusive. range_by_score / rev_range_by_score accept limit => $n and offset => $k (a negative offset is treated as 0). All range and rank methods return members; pass withscores => 1 for a flat (member, score, ...) list instead.

Pop and peek

my ($member, $score) = $z->pop_min;    # remove + return the lowest, or () if empty
my ($member, $score) = $z->pop_max;
my ($member, $score) = $z->peek_min;   # without removing
my ($member, $score) = $z->peek_max;

Iteration

$z->each(sub { my ($member, $score) = @_; ... });

each snapshots all members under the read lock, then invokes the callback once per member in score order after the lock is released, so the callback may safely call back into the set.

Introspection and lifecycle

$z->count; $z->max_entries; $z->stats;     # see STATS
$z->path; $z->memfd; $z->sync; $z->unlink;     # or Class->unlink($path)
$z->eventfd; $z->fileno; $z->notify; $z->eventfd_consume;

sync flushes the mapping to its backing store and unlink removes the backing file (also callable as Class->unlink($path)). path returns the backing file path (undef for an anonymous or memfd-backed set) and memfd returns the descriptor of a new_memfd set (-1 otherwise). The eventfd methods let another process wait for updates: eventfd lazily creates an eventfd and returns its descriptor (croaks on failure; calling it again returns the same fd), fileno returns the current eventfd descriptor or -1, notify writes a wakeup (returning false if no eventfd is attached), and eventfd_consume reads and resets the counter, returning it as an integer or undef when nothing is pending.

SHARING ACROSS PROCESSES

The set lives in a shared mapping, so several processes operate on the same data with no serialization layer in between. There are three ways to share it:

  • A backing file -- every process calls new($path, $max) on the same path. The first to arrive creates and sizes the file (serialized by an exclusive lock); the rest map it.

  • An anonymous mapping inherited across fork -- create with new(undef, $max) before forking; the parent and its children then share the one mapping.

  • A memfd -- create with new_memfd($name, $max) and hand its memfd descriptor to an unrelated process (over a UNIX socket with SCM_RIGHTS, or while the creator is alive via /proc/$pid/fd/$n), which reopens it with new_from_fd($fd).

# children populate a fork-shared set; the parent reads the result
my $z = Data::SortedSet::Shared->new(undef, 1_000_000);
for my $k (1 .. 4) {
    unless (fork) {                                  # child
        $z->add($k * 1_000_000 + $_, rand) for 1 .. 1000;
        exit;
    }
}
1 while wait != -1;                                  # reap children
print $z->count, "\n";                               # 4000

Every operation is serialized by the rwlock, so concurrent writers do not corrupt the tree. A writer can wake readers blocked in other processes through the eventfd interface: it calls notify after a batch, and a reader selects on fileno then drains the count with eventfd_consume.

COMPLEXITY

score/exists/peek_* are O(1); add/remove/incr/rank/ at_rank/pop_* and locating a range bound are O(log n); a range or iteration of k members is O(log n + k), scanning sequentially through the linked leaves.

STATS

stats() returns a hashref with keys: count, max_entries, height (B+tree height), node_capacity, nodes_used, index_slots, index_load (occupied fraction of the member index), ops (running count of write-path calls, whether or not they changed the set), and mmap_size (bytes).

SECURITY

The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes.

CRASH SAFETY

The write lock is a futex-based rwlock with PID-encoded ownership; if a writer dies while holding it, the next writer detects the dead owner and recovers. Reader slots are reclaimed similarly. Limitation: PID reuse is not detected, which is very unlikely in practice but cannot be ruled out.

SEE ALSO

Data::SortedSet::Shared::Strings (string-keyed variant, bundled with this distribution), Data::Intern::Shared, Data::SpatialHash::Shared, Data::HashMap::Shared, Data::Heap::Shared, Data::Graph::Shared, and the rest of the Data::*::Shared family.

AUTHOR

vividsnow

LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.