NAME
Data::BloomFilter::Shared - shared-memory Bloom filter for Linux
SYNOPSIS
use Data::BloomFilter::Shared;
# sized for 1_000_000 items at a 1% false-positive rate, anonymous mapping
my $bf = Data::BloomFilter::Shared->new(undef, 1_000_000, 0.01);
$bf->add("alice");
$bf->add("bob");
$bf->contains("alice"); # 1 (probably present)
$bf->contains("carol"); # 0 (definitely absent)
# "have I seen this before?" -- add returns 1 the first time, 0 after
my $first = $bf->add("event-42"); # 1: probably new
my $again = $bf->add("event-42"); # 0: all its bits were already set
# bulk add in a single lock acquisition
my $new = $bf->add_many([ map { "user-$_" } 1 .. 1000 ]);
# merge another filter of identical geometry (bitwise OR -> union)
my $other = Data::BloomFilter::Shared->new(undef, 1_000_000, 0.01);
$other->add_many([ map { "user-$_" } 500 .. 1500 ]);
$bf->merge($other);
# share across processes via a backing file
my $shared = Data::BloomFilter::Shared->new("/tmp/seen.bloom", 1_000_000, 0.01);
DESCRIPTION
A Bloom filter in shared memory: a compact, fixed-size structure for probabilistic set membership. You add items to it, then ask whether an item is in the set. The answer is either "definitely not present" or "probably present": the filter has no false negatives (if you added it, contains always returns true) but a tunable rate of false positives (it may occasionally report an item as present that was never added). It never stores the items themselves, only a bit array, so memory is proportional to the configured capacity and false-positive rate, not to the size of the items.
Each item is hashed once with XXH3 (128-bit); the two 64-bit halves drive k probe positions into a power-of-two bit array via Kirsch-Mitzenmacher double hashing. add sets those k bits; contains reports present only if all k are set. The number of hashes k and the bit-array size are derived from the capacity and fp_rate you request.
Because the bit array lives in a shared mapping, several processes share one filter: any process that opens the same backing file, inherits the anonymous mapping across fork, or reopens a passed memfd, sees the others' additions and contributes its own. A write-preferring futex rwlock with dead-process recovery guards mutation, so many processes may add and contains concurrently. Two filters of identical geometry can be combined with merge (bitwise OR), which yields a filter whose membership is the union of the two input sets.
Items are added and tested by their byte content; wide-character strings (any codepoint above 255) cause a "Wide character" croak -- encode such strings to bytes first (for example with Encode::encode_utf8). Linux-only. Requires 64-bit Perl.
METHODS
Constructors
my $bf = Data::BloomFilter::Shared->new($path, $capacity, $fp_rate);
my $bf = Data::BloomFilter::Shared->new(undef, 1_000_000); # fp_rate 0.01
my $bf = Data::BloomFilter::Shared->new_memfd($name, $capacity, $fp_rate);
my $bf = Data::BloomFilter::Shared->new_from_fd($fd);
$path is the backing file (undef or omitted for an anonymous mapping). $capacity is the number of items you expect to add; it must be at least 1. $fp_rate is the target false-positive rate at that capacity; it is optional, defaults to 0.01 (1%), and must be strictly between 0 and 1. new and new_memfd croak if $capacity is less than 1 or $fp_rate is out of range.
From $capacity and $fp_rate the filter derives its geometry: k = round(-log2(fp_rate)) hashes (clamped to the range 1..32), and a bit array of m = next_power_of_two(ceil(capacity * k / ln 2)) bits (with a floor of 64 bits). Rounding the bit count up to a power of two means the realised false-positive rate at capacity is typically at or below the configured target. When reopening an existing file or memfd, the stored geometry wins and the caller's $capacity/$fp_rate arguments are ignored. new_memfd creates a Linux memfd (transferable via its memfd descriptor); new_from_fd reopens one in another process.
Adding and testing
my $new = $bf->add($item); # 1 if probably new, else 0
my $added = $bf->add_many(\@items); # count of items that were probably new
my $in = $bf->contains($item); # 1 if probably present, 0 if definitely absent
$bf->clear; # reset to empty (all bits 0)
add hashes $item (taken by its bytes; wide characters croak, encode first) and sets its k bits, returning 1 if the item was probably new -- that is, if at least one of its bits was previously unset -- and 0 if all k bits were already set (the item was probably added before). A return of 0 is therefore a "probably seen this already" signal, subject to the same false-positive rate as contains. add_many takes an array reference and does the whole batch under a single write lock, returning how many of its elements were probably new.
contains returns 1 if the item is probably present and 0 if it is definitely absent. The contract is asymmetric and is the whole point of a Bloom filter: a 0 is exact (the item was never added), while a 1 may be a false positive (some other items happened to set all of this item's bits). There are never false negatives: any item you have added always reports as present.
Merging
$bf->merge($other);
Folds $other's bit array into $bf by bitwise OR, so $bf then reports as present the union of the two filters' item sets (still with no false negatives). Both filters must have identical geometry -- the same number of bits and the same number of hashes, which follows from constructing both with the same $capacity and $fp_rate (merge croaks on a mismatch). $other is read under its own lock into a private snapshot first, so merging is deadlock-free even if two processes merge each other concurrently; $other is not modified.
Introspection and lifecycle
$bf->capacity; $bf->bits; $bf->hashes; $bf->fp_rate; $bf->count; $bf->stats;
$bf->path; $bf->memfd; $bf->sync; $bf->unlink; # or Class->unlink($path)
capacity is the configured item capacity; bits is the bit-array size in bits (a power of two); hashes is the number of hash probes k; fp_rate is the configured target false-positive rate. count returns an estimate of the number of distinct items added, computed from the fraction of bits set (accurate while the filter is not saturated, and capped at capacity once it is). sync flushes the mapping to its backing store (a no-op for anonymous and memfd filters, which have none); unlink removes the backing file (also callable as Class->unlink($path)); path returns the backing path (undef for anonymous, memfd, or fd-reopened filters) and memfd the backing descriptor -- the memfd of a new_memfd filter or the dup'd fd of a new_from_fd filter, and -1 for file-backed or anonymous filters.
STATS
stats() returns a hashref describing the filter:
capacity-- the configured item capacity.fp_rate-- the configured target false-positive rate.bits-- the bit-array size in bits (a power of two).hashes-- the number of hash probeskper item.bits_set-- how many bits are currently set.count-- the estimated number of distinct items added.fill_ratio--bits_set / bits, between 0 and 1. As this approaches 0.5 the filter is near its designed capacity and the realised false-positive rate approaches the configured target; well above that the rate climbs.ops-- running count of mutating operations (add,add_many,merge,clear).mmap_size-- bytes of the shared mapping.
SHARING ACROSS PROCESSES
The filter lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls new($path, ...) on the same path with matching capacity and rate), an anonymous mapping inherited across fork, or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via SCM_RIGHTS, or via /proc/$pid/fd/$n) and reopened with new_from_fd($fd). Because the mapping is shared, every process adds into and tests against the same bit array, so membership reflects the union of what all of them have added.
# producer and consumer share one filter with no coordination
my $bf = Data::BloomFilter::Shared->new(undef, 100_000, 0.01); # before fork
unless (fork) { $bf->add_many([ map { "ev-$_" } 1 .. 1000 ]); exit }
wait;
print $bf->contains("ev-500") ? "seen\n" : "no\n"; # seen -- the child's add
SECURITY
The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes.
CRASH SAFETY
Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Each bit set is a single word store, so a crash leaves the filter consistent up to the last completed add. Limitation: PID reuse is not detected (very unlikely in practice).
SEE ALSO
Data::HyperLogLog::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the Data::*::Shared family.
AUTHOR
vividsnow
LICENSE
This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.