NAME
Bit::Set::DB - Perl interface for bitset containers from the Bit
C library
VERSION
version 0.04
SYNOPSIS
use Bit::Set::DB;
use Bit::Set;
# Create a new bitset database
my $db = BitDB_new(1024, 10);
# Create a bitset and add it to the database
my $set = Bit::Set::Bit_new(1024);
Bit::Set::Bit_bset($set, 42);
BitDB_put_at($db, 0, $set);
# Get population count at index
my $count = BitDB_count_at($db, 0);
# Free the database and bitset
BitDB_free(\$db);
Bit::Set::Bit_free(\$set);
DESCRIPTION
This module provides a procedural Perl interface to the C library Bit
, for creating and manipulating containers of bitsets (BitDB). It uses FFI::Platypus
to wrap the C functions and Alien::Bit
to locate and link to the C library. The main purpose of this library is to provide multithreaded and hardware accelerated (e.g. GPU) versions of container operations e.g. forming the population count of the intersection of two containers of bitsets.
The API is a direct mapping of the C functions. For detailed semantics of each function, please refer to the bit.h
header file documentation.
Runtime checks on arguments are performed if the DEBUG
environment variable is set to a true value.
GPU offloading is disabled if you set up the NOGPU
environment variable.
FUNCTIONS
Creation and Destruction
- BitDB_new(length, num_of_bitsets)
-
Creates a new bitset container for
num_of_bitsets
bitsets, each oflength
. - BitDB_free(db_ref)
-
Frees the memory associated with the bitset container. Expects a reference to the scalar holding the DB object.
Properties
- BitDB_length(set)
-
Returns the length of bitsets in the container.
- BitDB_nelem(set)
-
Returns the number of bitsets in the container.
- BitDB_count_at(set, index)
-
Returns the population count of the bitset at the given
index
. - BitDB_count(set)
-
Returns a pointer to an array of population counts for all bitsets in the container.
Manipulation
- BitDB_get_from(set, index)
-
Returns a bitset from the container at the given
index
. - BitDB_put_at(set, index, bitset)
-
Puts a
bitset
into the container at the givenindex
. - BitDB_extract_from(set, index, buffer)
-
Extracts a bitset from the container at
index
into abuffer
. - BitDB_replace_at(set, index, buffer)
-
Replaces a bitset in the container at
index
with the contents of abuffer
. - BitDB_clear(set)
-
Clears all bitsets in the container.
- BitDB_clear_at(set, index)
-
Clears the bitset at a given
index
in the container.
Set Operation Counts
These functions perform set operations between two bitset containers. The opts
parameter is an object of type Bit::Set::DB::SETOP_COUNT_OPTS
.
Example for opts
:
my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new(
num_cpu_threads => 4,
device_id => 0,
# ... other flags
);
Perform the respective set operation count on the CPU:
- BitDB_inter_count_cpu(db1, db2, opts)
- BitDB_union_count_cpu(db1, db2, opts)
- BitDB_diff_count_cpu(db1, db2, opts)
- BitDB_minus_count_cpu(db1, db2, opts)
Perform the respective set operation count on the GPU:
- BitDB_inter_count_gpu(db1, db2, opts)
- BitDB_union_count_gpu(db1, db2, opts)
- BitDB_diff_count_gpu(db1, db2, opts)
- BitDB_minus_count_gpu(db1, db2, opts)
Perform the respective set operation count on the CPU and store results in buffer
:
- BitDB_inter_count_store_cpu(db1, db2, buffer, opts)
- BitDB_union_count_store_cpu(db1, db2, buffer, opts)
- BitDB_diff_count_store_cpu(db1, db2, buffer, opts)
- BitDB_minus_count_store_cpu(db1, db2, buffer, opts)
Perform the respective set operation count on the GPU and store results in buffer
:
- BitDB_inter_count_store_gpu(db1, db2, buffer, opts)
- BitDB_union_count_store_gpu(db1, db2, buffer, opts)
- BitDB_diff_count_store_gpu(db1, db2, buffer, opts)
- BitDB_minus_count_store_gpu(db1, db2, buffer, opts)
EXAMPLES
Examples of the use of the Bit::Set::DB
module that emphasize performance characteristics and nuances of using raw memory buffers that are returned by the interface. Some of these examples are meant to be run as sequential units
- Example 1: Creating and initializing containers
-
In this example, we will create two Perl arrays of
Bit::Set
and then load them toBit::Set::DB
containers.use strict; use warnings; use Bit::Set qw(:all); use Bit::Set::DB qw(:all); my $size = 1024; my $num_of_bits = 3; my $num_of_ref_bits = 5; my @bits; my @bitsets; # Initializing and setting the values of the bitsets for my $i ( 0 .. $num_of_bits - 1 ) { $bits[$i] = Bit_new($size); my $end = int( $size / 2 ) + $i; $end = ( $end > $size - 1 ) ? $size - 1 : $end; Bit_set( $bits[$i], int( $size / 2 ), $end ); } for my $i ( 0 .. $num_of_ref_bits - 1 ) { $bitsets[$i] = Bit_new($size); my $end = int( $size / 2 ) + $i; $end = ( $end > $size - 1 ) ? $size - 1 : $end; Bit_set( $bitsets[$i], int( $size / 2 ), $end ); } Bit_set( $bits[0], int( $size / 2 ) - 1, int( $size / 2 ) + 5 ); Bit_set( $bitsets[0], int( $size / 2 ), int( $size / 2 ) + 5 ); # Create BitDB containers my $db1 = BitDB_new( $size, $num_of_bits ); my $db2 = BitDB_new( $size, $num_of_ref_bits ); # Now put the bitsets into the containers for my $i ( 0 .. $num_of_bits - 1 ) { BitDB_put_at( $db1, $i, $bits[$i] ); } for my $i ( 0 .. $num_of_ref_bits - 1 ) { BitDB_put_at( $db2, $i, $bitsets[$i] ); }
- Example 2: Obtaining the counts of bitset operations using containers
-
This example continues Example 1 by performing the intersection count in two different ways: 1) iterating over the Perl arrays of bitsets and 2) using the BitDB containers directly. A major benefit of these containerized operations is that they can leverage multi-threading in the CPU and hardware acceleration in GPUs (and TPUs in the near future). When we use the interface over containers, we will need to interface the integer array returned by the
Bit::Set::DB
interface function to Perl arrays. This is one possible way of doing so using theFFI::Platypus::Buffer
andFFI::Platypus::Buffer
modules.use FFI::Platypus::Buffer; use FFI::Platypus::Memory; use Config; # to get the size of int my $num_threads = 4; my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new( num_cpu_threads => $num_threads, device_id => 0, upd_1st_operand => 0, upd_2nd_operand => 0, release_1st_operand => 0, release_2nd_operand => 0, release_counts => 0 ); my $nelem = BitDB_nelem($db1) * BitDB_nelem($db2); # Method 1: Using Perl arrays of Bit::Set my @cpu_set_counts; for my $i ( 0 .. $num_of_bits - 1 ) { for my $j ( 0 .. $num_of_ref_bits - 1 ) { my $count = Bit_inter_count( $bits[$i], $bitsets[$j] ); push @cpu_set_counts, $count; } } # Method 2: Using Bit::Set::DB containers my $cpu_DB_counts_ptr = BitDB_inter_count_cpu( $db1, $db2, $opts ); my $scalar = buffer_to_scalar $cpu_DB_counts_ptr, $nelem*$Config{intsize}; my @cpu_DB_counts = unpack( "i[$nelem]", $scalar ); free $cpu_DB_counts_ptr; print "$_ " for @cpu_set_counts; print "\n"; print "$_ " for @cpu_DB_counts; print "\n";
Note the access pattern for the results returned by
BitDB_inter_count_cpu
: First we obtain the pointer to the results buffer,<$cpu_DB_counts_ptr
>. Second We convert this buffer into a scalar value usingbuffer_to_scalar
, specifying the size of the buffer in bytes. Third, we unpack the scalar into a Perl array usingunpack
. Finally, having copied the result into a Perl array, we now free the pointer to prevent memory leaks.If you are running this code inside a block, you should consider using the
defer
keyword (Perl versions 5.36 and newer) to automate memory de-allocation as explained in the documentation of FFI::Platypus::Memory - Example 3: Containerized Operations in the GPU
-
This example continues Example 2 by performing the intersection count using the GPU accelerated functions in the
Bit::Set::DB
module. The code is rather similar to the CPU version, with the major changes being the different set of options passed. Importantly, access patterns for the results is the same as the CPU interface. Memory allocations to and from the device are handled by the C library, though the user has absolute control over what is de-allocated and when as explained in the documentation of Bit. In particular the options argument:my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new( num_cpu_threads => ... , # number of CPU threads device_id => ... , # GPU device ID, ignored for CPU upd_1st_operand => ... , # if true, update the first container in the GPU upd_2nd_operand => ... , # if true, update the second container in the GPU release_1st_operand => ... , # if true, release the first container in the GPU release_2nd_operand => ... , # if true, release the second container in the GPU release_counts => ... , # if true, release the counts buffer in the GPU );
provide a way to update (replace the value of the containers) and release the memory used by the containers in the GPU when no longer needed. This way one can undertake batch operations more efficiently by preventing updating and de- allocation of the buffers when they are no longer needed.
For example, if the memory of the GPU can accomodate both containers and the intermediate results, we would ue the following code that continues Example 2:
my $gpu_opts = Bit::Set::DB::SETOP_COUNT_OPTS->new( num_cpu_threads => $num_threads, device_id => 0, upd_1st_operand => 0, upd_2nd_operand => 0, release_1st_operand => 1, release_2nd_operand => 1, release_counts => 1 ); # Method 3: Using Bit::Set::DB containers in the GPU my $gpu_DB_counts_ptr = BitDB_inter_count_gpu( $db1, $db2, $gpu_opts ); my $gpu_scalar = buffer_to_scalar $gpu_DB_counts_ptr, $nelem*$Config{intsize}; my @gpu_DB_counts = unpack( "i[$nelem]", $gpu_scalar ); free $gpu_DB_counts_ptr; print "$_ " for @gpu_DB_counts;
- Example 4: Perl Data Language (PDL) access pattern for
Bit::Set::DB
results -
Having the result in a C array, one can avoid the overhead of mapping them to a Perl scalar and then unpack them to an array, by wrapping them into a PDL ndarray. The pattern for wrapping external buffers in PDL is decribed in the PDL::API, and was also presented in my talk for Perl Community Conference Winter 2024. To use this access pattern, we will need to import various
PDL
andInline
packages and the PDL datatypes we will like to use. Continuing after Example 3, we now have:use PDL::LiteF; use PDL::Types '$PDL_L'; # Long which is equivalent to int in 64bit systems use Inline with => 'PDL'; use Inline C => 'DATA';
and write an
Inline
block to wrap around the buffer returned by the containerized functions inBit::Set::DB
. This inline block defines the functionmkndarray
and the relevant C code (reproduced below for completeness) was modified from PDL::API:__DATA__ __C__ #include <stdio.h> #include <stdlib.h> #include <stdint.h> #define IsSVValidPtr(sv) do { \ if (!SvOK((sv))) { \ croak("Pointer is not defined"); \ } \ if (!SvIOK((sv))) { \ croak("Pointer does not contain an integer"); \ } \ IV value = SvIV((sv)); \ if (value <= 0) { \ croak("Pointer is negative or zero"); \ } \ } while(0) #define DeclTypedPtr(type, ptr,sv) type *ptr; \ ptr = (type *) SvIV((sv)) void generate_random_double_array(SV *sv, size_t num_elements) { IsSVValidPtr(sv); DeclTypedPtr(double, array, sv); for (size_t i = 0; i < num_elements; ++i) { array[i] = ((double)rand() / RAND_MAX) * 10.0 - 5.0; } } double sum_array_C(SV *sv, size_t length) { IsSVValidPtr(sv); double sum = 0.0; DeclTypedPtr(double, array, sv); for (size_t i = 0; i < length; i++) { sum += array[i]; } return sum; } // Start of material from the PDL::API documentation void delete_mydata(pdl* pdl, int param) { pdl->data = 0; } typedef void (*DelMagic)(pdl *, int param); static void default_magic(pdl *p, int pa) { p->data = 0; } static pdl* pdl_wrap(void *data, int datatype, PDL_Indx dims[], int ndims, DelMagic delete_magic, int delparam) { pdl* p = PDL->pdlnew(); /* get the empty container */ if (!p) return p; pdl_error err = PDL->setdims(p, dims, ndims); /* set dims */ if (err.error) { PDL->destroy(p); return NULL; } p->datatype = datatype; /* and data type */ p->data = data; /* point it to your data */ /* make sure the core doesn't meddle with your data */ p->state |= PDL_DONTTOUCHDATA | PDL_ALLOCATED; if (delete_magic != NULL) PDL->add_deletedata_magic(p, delete_magic, delparam); else PDL->add_deletedata_magic(p, default_magic, 0); return p; } // End of material straight from the PDL::API documentation // modified from the PDL::API documentation pdl *mkndarray(SV *external_buffer, int datatype, AV *dims, size_t ndims) { IsSVValidPtr(external_buffer); size_t len = av_len(dims) + 1; if (ndims != len) { croak("Number of dimensions does not match the number of elements in dims"); } pdl *p; PDL_Indx dimensions[ndims]; for (int i = 0; i < len; i++) { SV **elem = av_fetch_simple(dims, i, 0); // perl 5.36 and above dimensions[i] = SvUV(*elem); } DeclTypedPtr(void, mydata, external_buffer); p = pdl_wrap(mydata, datatype, dimensions, ndims, delete_mydata, 0); return p; }
The function
mkndarray
creates a new PDL (Perl Data Language) object from an external buffer, a data type (exported from the "/metacpan.org/pod/PDL::Types" in PDL::Typeshttps:), a set of dimensions (provided as a Perl array reference), and the number of dimensions. The actual access pattern in Perl is rather straightforward:my $gpu_DB_counts_ptr = BitDB_inter_count_gpu( $db1, $db2, $gpu_opts ); my $pdl = mkndarray( $gpu_DB_counts_ptr, $PDL_L, [$nelem], 1 ); print $pdl->index($_)," " for 0 .. $nelem - 1; print "\n"; undef $pdl; free $gpu_DB_counts_ptr;
Note that when managing memory, undefining the PDL object does not lead to the release of the allocated buffer, which must be manually released. Note the order of de-allocations: first the PDL object, then the external buffer. One should strongly consider using the
defer
keyword to properly sequence the de-allocations and avoid segmentation faults. - Example 5: Profiling access patterns for results
-
This example provides a self-contained code profiler for an intersection count between two groups of bitsets. We will profile the performance of population counts of the intersection of these bitsets to find the maximum popcount via
- Nested Perl loops to over Perl arrays of
Bit::Set
- Multi-threaded container operations in the CPU using Perl arrays to find the maximum popcount
- Container operations in the GPU using Perl arrays to find the maximum popcount
- Multi-threaded container operations in the GPU using PDL to find the maximum popcount
- Multi-threaded container operations in the GPU using PDL to find the maximum popcount
In the Xeon E-2697v4 I used for this work, I obtained the following benchmarks:
- Test Description | Time (ns) | Searches/sec | Threads | Result | Speedup |
- -------------------------------------|---------------|--------------|---------|--------|---------|
- Bit::Set operations - Rep1 | 388,479,000 | 2.57 | 1 | 512 | 1.00 |
- Bit::Set operations - Rep2 | 389,512,000 | 2.57 | 1 | 512 | 1.00 |
- Bit::Set operations - Rep3 | 389,775,000 | 2.57 | 1 | 512 | 1.00 |
- Container - CPU | 82,856,000 | 12.07 | 1 | 512 | 4.69 |
- Container - CPU | 62,179,000 | 16.08 | 2 | 512 | 6.25 |
- Container - CPU | 59,269,000 | 16.87 | 3 | 512 | 6.55 |
- Container - CPU | 61,368,000 | 16.30 | 4 | 512 | 6.33 |
- Container - GPU | 261,441,000 | 3.82 | GPU | 512 | 1.49 |
- Container - GPU | 62,523,000 | 15.99 | GPU | 512 | 6.21 |
- Container - GPU | 61,467,000 | 16.27 | GPU | 512 | 6.32 |
- Container - CPU - PDL | 12,559,000 | 79.62 | 1 | 512 | 30.93 |
- Container - CPU - PDL | 9,313,000 | 107.38 | 2 | 512 | 41.71 |
- Container - CPU - PDL | 5,441,000 | 183.79 | 3 | 512 | 71.40 |
- Container - CPU - PDL | 4,457,000 | 224.37 | 4 | 512 | 87.16 |
- Container - GPU with PDL | 10,763,000 | 92.91 | GPU | 512 | 36.09 |
- Container - GPU with PDL | 8,662,000 | 115.45 | GPU | 512 | 44.85 |
- Container - GPU with PDL | 8,247,000 | 121.26 | GPU | 512 | 47.11 |
The table clearly illustrates the significant speed up of the containerized operations over the
Bit::Set
, but also the overhead of using Perl arrays to process the results versus the PDL data access patternThe code for the benchmark runs as a commandline command script and is the following:
use strict; use warnings; use Time::HiRes qw(gettimeofday tv_interval); use Bit::Set qw(:all); use Bit::Set::DB qw(:all); use Carp qw(croak); use FFI::Platypus::Buffer; use FFI::Platypus::Memory; use Config; # to get the size of int use PDL::LiteF; use PDL::Types '$PDL_L'; # Long which is equivalent to int in 64bit systems use Inline with => 'PDL'; use Inline C => 'DATA'; # Constants use constant MAX_THREADS => 1024; use constant MIN_SIZE => 128; # Parse command line arguments my ( $size, $num_of_bits, $num_of_ref_bits, $max_threads ); sub usage { my $prog = $0; print STDERR "Usage: $prog <size> <number of bitsets> <number of reference bitsets> <max threads>\n"; print STDERR "Example: $prog 1024 1000 1000000 4\n"; print STDERR "This will create 1000 bitsets size 1024, do an intersection count\n"; print STDERR "against another 1000000 bitsets, and run the test for 1-4 threads.\n"; print STDERR "Please ensure that the size is a positive integer.\n"; exit 1; } # Simple argument parsing usage() unless @ARGV == 4; ( $size, $num_of_bits, $num_of_ref_bits, $max_threads ) = @ARGV; # Validate arguments for my $arg ( $size, $num_of_bits, $num_of_ref_bits, $max_threads ) { unless ( $arg =~ /^\d+$/ && $arg > 0 ) { print STDERR "Error: size, number of bits, number of ref bits, and max threads must be positive integers.\n"; exit 1; } } # Apply limits if ( $max_threads > MAX_THREADS ) { print STDERR "Warning: max threads capped to " . MAX_THREADS . "\n"; $max_threads = MAX_THREADS; } if ( $size < MIN_SIZE ) { print STDERR "Warning: size increased to " . MIN_SIZE . "\n"; $size = MIN_SIZE; } print "Starting Perl benchmarks \n"; # Allocate the bitsets print "Allocating bitsets...\n"; my @bits; my @bitsets; for my $i ( 0 .. $num_of_bits - 1 ) { $bits[$i] = Bit_new($size); Bit_set( $bits[$i], int( $size / 2 ), $size - 1 ); } for my $i ( 0 .. $num_of_ref_bits - 1 ) { $bitsets[$i] = Bit_new($size); Bit_set( $bitsets[$i], int( $size / 2 ), $size - 1 ); } # Set some specific patterns for testing Bit_set( $bits[0], int( $size / 2 ) - 1, int( $size / 2 ) + 5 ); Bit_set( $bitsets[0], int( $size / 2 ), int( $size / 2 ) + 5 ); print "Finished allocating bitsets\n"; # Create BitDB containers my $db1 = BitDB_new( $size, $num_of_bits ); my $db2 = BitDB_new( $size, $num_of_ref_bits ); for my $i ( 0 .. $num_of_bits - 1 ) { BitDB_put_at( $db1, $i, $bits[$i] ); } for my $i ( 0 .. $num_of_ref_bits - 1 ) { BitDB_put_at( $db2, $i, $bitsets[$i] ); } print "Finished allocating BitDB\n"; # Benchmark functions sub database_match { my ( $bits_ref, $bitsets_ref ) = @_; my $max = 0; my @counts; $#counts = ( scalar(@$bits_ref) * scalar(@$bitsets_ref) ) - 1; for my $i ( 0 .. @$bits_ref - 1 ) { for my $j ( 0 .. @$bitsets_ref - 1 ) { my $count = Bit_inter_count( $bits_ref->[$i], $bitsets_ref->[$j] ); $counts[ $i * @$bitsets_ref + $j ] = $count; } } $max = ( ( $_ > $max ) ? $_ : $max ) for (@counts); return $max; } sub database_match_container_cpu { my ( $db1, $db2, $num_threads ) = @_; # Create the options structure for CPU processing my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new( num_cpu_threads => $num_threads, device_id => 0, upd_1st_operand => 0, upd_2nd_operand => 0, release_1st_operand => 0, release_2nd_operand => 0, release_counts => 0 ); my $nelem = BitDB_nelem($db1) * BitDB_nelem($db2); my $results = BitDB_inter_count_cpu( $db1, $db2, $opts ); my $scalar = buffer_to_scalar $results, $nelem * $Config{intsize}; my @DB_counts = unpack( "i[$nelem]", $scalar ); free $results; my $max = 0; $max = ( ( $_ > $max ) ? $_ : $max ) for (@DB_counts); return $max; } sub database_match_container_cpu_pdl { my ( $db1, $db2, $num_threads ) = @_; # Create the options structure for CPU processing my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new( num_cpu_threads => $num_threads, device_id => 0, upd_1st_operand => 0, upd_2nd_operand => 0, release_1st_operand => 0, release_2nd_operand => 0, release_counts => 0 ); my $nelem = BitDB_nelem($db1) * BitDB_nelem($db2); my $results = BitDB_inter_count_cpu( $db1, $db2, $opts ); my $pdl = mkndarray( $results, $PDL_L, [$nelem], 1 ); my $max = $pdl->max(); undef $pdl; free $results; return $max; } sub database_match_container_gpu { my ( $db1, $db2, $opts_ref ) = @_; my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new(@$opts_ref); my $results = BitDB_inter_count_gpu( $db1, $db2, $opts ); my $nelem = BitDB_nelem($db1) * BitDB_nelem($db2); my $scalar = buffer_to_scalar $results, $nelem * $Config{intsize}; my @DB_counts = unpack( "i[$nelem]", $scalar ); free $results; my $max = 0; $max = ( ( $_ > $max ) ? $_ : $max ) for (@DB_counts); return $max; } sub database_match_container_gpu_pdl { my ( $db1, $db2, $opts_ref ) = @_; my $opts = Bit::Set::DB::SETOP_COUNT_OPTS->new(@$opts_ref); my $results = BitDB_inter_count_gpu( $db1, $db2, $opts ); my $nelem = BitDB_nelem($db1) * BitDB_nelem($db2); my $pdl = mkndarray( $results, $PDL_L, [$nelem], 1 ); my $max = $pdl->max(); undef $pdl; free $results; return $max; } sub summarize_results { my ( $test, $time_elapsed, $num_of_threads, $result, $speedup ) = @_; my $searches_per_sec = 1_000_000_000 / $time_elapsed; my $thread_info = $num_of_threads > 0 ? sprintf( "%3d", $num_of_threads ) : "GPU"; printf "Total time for %-35s: %15.0f ns\t", $test, $time_elapsed; printf "Searches per second : %0.2f\t", $searches_per_sec; printf "Number of threads: %s \t", $thread_info; printf "Result: %d\t", $result; printf "Speedup factor: %.2f\n", $speedup; } # Storage for timings and results my @timings; my @results; print "Running benchmarks...\n"; # Warm up the processor my $max = database_match( \@bits, \@bitsets ); # Single-threaded runs (3 repetitions) for my $rep ( 1 .. 3 ) { my $t0 = [gettimeofday]; $max = database_match( \@bits, \@bitsets ); my $t1 = [gettimeofday]; my $total_time = tv_interval( $t0, $t1 ); push @timings, $total_time * 1_000_000_000; # Convert to nanoseconds push @results, $max; } print "Finished single-threaded match\n"; # We'll simulate the container-based operations instead print "Multi-threaded container operations in the CPU ...\n"; # Simulate container CPU operations for different thread counts for my $threads ( 1 .. $max_threads ) { my $t0 = [gettimeofday]; # In a real implementation, this would use the actual container functions $max = database_match_container_cpu( $db1, $db2, $threads ); my $t1 = [gettimeofday]; my $total_time = tv_interval( $t0, $t1 ); push @timings, $total_time * 1_000_000_000; push @results, $max; } print "Finished multi-threaded match in the CPU\n"; # Simulate GPU operations print "GPU container operations...\n"; my @gpu_configs = ( device_id => 0, upd_1st_operand => 1, upd_2nd_operand => 0, release_1st_operand => 0, release_2nd_operand => 0, release_counts => 0 ); for my $rep ( 1 .. 3 ) { my $t0 = [gettimeofday]; $max = database_match_container_gpu( $db1, $db2, \@gpu_configs ); my $t1 = [gettimeofday]; my $total_time = tv_interval( $t0, $t1 ); push @timings, $total_time * 1_000_000_000; push @results, $max; } print "Finished GPU container operations\n"; # do one more run to release the GPU resources my @gpu_configs_purge = ( device_id => 0, upd_1st_operand => 1, upd_2nd_operand => 1, release_1st_operand => 1, release_2nd_operand => 1, release_counts => 1 ); $max = database_match_container_gpu( $db1, $db2, \@gpu_configs_purge ); print "CPU container operations with PDL memory access ...\n"; for my $threads ( 1 .. $max_threads ) { my $t0 = [gettimeofday]; $max = database_match_container_cpu_pdl( $db1, $db2, $threads ); my $t1 = [gettimeofday]; my $total_time = tv_interval( $t0, $t1 ); push @timings, $total_time * 1_000_000_000; push @results, $max; } print "Finished CPU container operations with PDL memory access\n"; print "GPU container operations with PDL memory access ...\n"; for my $rep ( 1 .. 3 ) { my $t0 = [gettimeofday]; $max = database_match_container_gpu_pdl( $db1, $db2, \@gpu_configs ); my $t1 = [gettimeofday]; my $total_time = tv_interval( $t0, $t1 ); push @timings, $total_time * 1_000_000_000; push @results, $max; } print "Finished GPU container operations with PDL memory access\n"; # Print results print "\nResults:\n"; # Single-threaded results for my $rep ( 1 .. 3 ) { my $idx = $rep - 1; my $speedup = $idx == 0 ? 1.0 : $timings[0] / $timings[$idx]; summarize_results( "Bit::Set operations - Rep$rep", $timings[$idx], 1, $results[$idx], $speedup ); } # Multi-threaded results for my $threads ( 1 .. $max_threads ) { my $idx = 2 + $threads; my $speedup = $timings[0] / $timings[$idx]; summarize_results( "Container - CPU", $timings[$idx], $threads, $results[$idx], $speedup ); } # GPU results for my $gpu_run ( 1 .. 3 ) { my $idx = 2 + $max_threads + $gpu_run; my $speedup = $timings[0] / $timings[$idx]; summarize_results( "Container - GPU", $timings[$idx], -1, $results[$idx], $speedup ); } for my $threads ( 1 .. $max_threads ) { my $idx = 5 + $max_threads + $threads; my $speedup = $timings[0] / $timings[$idx]; summarize_results( "Container - CPU - PDL", $timings[$idx], $threads, $results[$idx], $speedup ); } for my $gpu_run ( 1 .. 3 ) { my $idx = 5 + 2*$max_threads + $gpu_run; my $speedup = $timings[0] / $timings[$idx]; summarize_results( "Container - GPU with PDL", $timings[$idx], -1, $results[$idx], $speedup ); } # Helper function sub min { my ( $a, $b ) = @_; return $a < $b ? $a : $b; } # Cleanup print "\nCleaning up...\n"; for my $bit (@bits) { Bit_free( \$bit ); } for my $bitset (@bitsets) { Bit_free( \$bitset ); } BitDB_free( \$db1 ); BitDB_free( \$db2 ); print "Benchmark completed!\n"; __DATA__ __C__ #include <stdio.h> #include <stdlib.h> #include <stdint.h> #define IsSVValidPtr(sv) do { \ if (!SvOK((sv))) { \ croak("Pointer is not defined"); \ } \ if (!SvIOK((sv))) { \ croak("Pointer does not contain an integer"); \ } \ IV value = SvIV((sv)); \ if (value <= 0) { \ croak("Pointer is negative or zero"); \ } \ } while(0) #define DeclTypedPtr(type, ptr,sv) type *ptr; \ ptr = (type *) SvIV((sv)) void generate_random_double_array(SV *sv, size_t num_elements) { IsSVValidPtr(sv); DeclTypedPtr(double, array, sv); for (size_t i = 0; i < num_elements; ++i) { array[i] = ((double)rand() / RAND_MAX) * 10.0 - 5.0; } } double sum_array_C(SV *sv, size_t length) { IsSVValidPtr(sv); double sum = 0.0; DeclTypedPtr(double, array, sv); for (size_t i = 0; i < length; i++) { sum += array[i]; } return sum; } // Start of material from the PDL::API documentation void delete_mydata(pdl* pdl, int param) { pdl->data = 0; } typedef void (*DelMagic)(pdl *, int param); static void default_magic(pdl *p, int pa) { p->data = 0; } static pdl* pdl_wrap(void *data, int datatype, PDL_Indx dims[], int ndims, DelMagic delete_magic, int delparam) { pdl* p = PDL->pdlnew(); /* get the empty container */ if (!p) return p; pdl_error err = PDL->setdims(p, dims, ndims); /* set dims */ if (err.error) { PDL->destroy(p); return NULL; } p->datatype = datatype; /* and data type */ p->data = data; /* point it to your data */ /* make sure the core doesn't meddle with your data */ p->state |= PDL_DONTTOUCHDATA | PDL_ALLOCATED; if (delete_magic != NULL) PDL->add_deletedata_magic(p, delete_magic, delparam); else PDL->add_deletedata_magic(p, default_magic, 0); return p; } // End of material straight from the PDL::API documentation // modified from the PDL::API documentation pdl *mkndarray(SV *external_buffer, int datatype, AV *dims, size_t ndims) { IsSVValidPtr(external_buffer); size_t len = av_len(dims) + 1; if (ndims != len) { croak("Number of dimensions does not match the number of elements in dims"); } pdl *p; PDL_Indx dimensions[ndims]; for (int i = 0; i < len; i++) { SV **elem = av_fetch_simple(dims, i, 0); // perl 5.36 and above dimensions[i] = SvUV(*elem); } DeclTypedPtr(void, mydata, external_buffer); p = pdl_wrap(mydata, datatype, dimensions, ndims, delete_mydata, 0); return p; }
- Nested Perl loops to over Perl arrays of
SEE ALSO
- Bit::Set
-
Bit::Set
is a Perl module that provides a high-level interface for working with bitsets. It is built on top of the Bit library and offers a more user-friendly API for common bitset operations. It is the parent module ofBit::Set::DB
and provides further details about the vibecoding of theBit::Set::DB
module. - Bit
-
Bit is a high-performance, uncompressed bitset implementation in C, optimized for modern architectures. The library provides an efficient way to create, manipulate, and query bitsets with a focus on performance and memory alignment. The API and the interface is largely based on David Hanson's Bit_T library discussed in Chapter 13 of "C Interfaces and Implementations", Addison-Wesley ISBN 0-201-49841-3 extended to incorporate additional operations (such as counts on unions/differences/intersections of sets), fast population counts using the libpocnt library and GPU operations for packed containers of (collections) of Bit(sets).
- Alien::Bit
-
This distribution provides the library Bit so that it can be used by other Perl distributions that are on CPAN. It will download Bit from Github and will build the (static and dynamic) versions of the library for use by other Perl modules.
AUTHOR
GitHub Copilot (Claude Sonnet 4), guided by Christos Argyropoulos.
COPYRIGHT AND LICENSE
This software is copyright (c) 2025 by Christos Argyropoulos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.