NAME
Data::Bvec - a module to manipulate integer arrays as bit vectors and "compressed bit strings" (using a simple RLE).
VERSION
VERSION: 1.01
SYNOPSIS
use Data::Bvec;
my $bv = Data::Bvec::->new( nums=>[1,2,3] );
my $vec = $bv->get_bvec(); # 01110000
my $bstr = $bv->get_bstr(); # '-134'
my $nums = $bv->get_nums(); # [1,2,3]
----
use Data::Bvec qw( :all );
my $vec = num2bit( [1,2,3] ); # 01110000
set_bit( $vec, 4, 1 ); # 01111000
my $bstr = compress bit2str $vec; # '-143'
my $nums = bit2num str2bit uncompress $bstr; # [1,2,3,4]
DESCRIPTION/DISCUSSION
This module encapsulates some simple routines for manipulating Perl bit vectors (putting values in; getting values out), but its main goal is to implement a simple run-length encoding scheme for bit vectors that compresses them into relatively human-readable and flat-file-storable strings.
My use case was wanting to prototype a data indexing system, and I wanted to ease debugging by plopping the bitstrings in a flat file that I could examine directly. (Each bit in a vector represents a record in the database -- true or false whether the term is in that record in the field being indexed.) It has worked well enough that I haven't felt the need to change how the bitstrings are stored (just where they're stored).
The initial version of the module used a different set of base-62 digits. In writing Math::Int2Base, I decided to normalize all the bases from 2 to 62 to use 0-9,A-Z,a-z. It makes the numbers sort correctly (ascii-betically == numerically), and it let me say that A base-16 == A base-36 == A base-62.
So now I'm rewriting this module to use those base conversion routines.
EXPORTS
Nothing is exported by default. The following may be exported individually; all of them may be exported using the :all
tag:
- set_bit
- howmany
- bit2str
- str2bit
- bit2num
- num2bit
- compress
- uncompress
Examples:
use Data::Bvec qw( :all );
use Data::Bvec qw( bit2str str2bit compress uncompress );
However, if you only use the object methods, nothing would need to be exported. See below.
SUBROUTINES
set_bit( $vec, $num, $zero_or_one )
This is a shallow wrapper around Perl's vec() that simply provides the third parameter (1) to that routine that says we're working with a bit vector.
Normally returns $num, if you care.
Parameters:
$vec
A Perl bit vector stored in the scalar.
$num
The number whose bit you want to target in the bit vector.
$zero_or_one
The value you want to set the bit to: 0 or 1. If not defined, 1 is assumed.
Examples:
my $vec = ""; # empty vector
set_bit $vec, 1, 1; # 01000000
set_bit $vec, 2, 1; # 01100000
set_bit $vec, 3; # 01110000
set_bit $vec, 1, 0; # 00110000
bit2str( $vec )
This routine is a shallow wrapper around unpack() that unpacks a bit vector into a string of '0's and '1's, in preparation for compression.
Parameters:
$vec
A Perl bit vector.
Example:
my $vec = "";
set_bit $vec, 4, 1; # 00001000
my $str = bit2str $vec; # '00001000'
str2bit( $str )
This routine is a shallow wrapper around pack() that packs a string of '0's and '1's (following uncompression) into a bit vector.
Parameters:
$str
A string of '0's and '1's, e.g., "00001000".
Example:
my $vec = str2bit '00001000';
num2bit( \@integers )
This routine accepts an array ref of integers and returns a bit vector with those integer's bits turned on.
Parameters:
\@integers
A reference to an array of integers.
Examples:
my $vec = num2bit [1,2,3]; # 01110000
my $vec = num2bit [3,2,1]; # 01110000
The second example is intended to make clear that the order of the integers in the array is not retained (for obvious reasons), and calling bit2num( $vec ) will always return the integers in ascending order (see bit2num() below).
bit2num( $vec, $beg, $cnt )
This routine accepts a bit vector and returns an array of integers represented by the 1 bits.
The parameters $beg and $cnt are to support retrieving subsets of integers from a large vector -- in essence, to support "paging" through the set.
In scalar context, returns a reference to the array.
Parameters:
$vec
A bit vector.
$beg
The first integer (where the bit is 1) to return. Unlike array subscripts, the $beg positions start with 1, not 0.
$cnt
The maximum number of integers (including the first) to return.
Examples:
# 0----+----1----+----2----+----3-
my $vec = str2bit '01110011110001111100001111110001';
my $set1 = bit2num $vec, 1, 5; # [ 1, 2, 3, 6, 7 ]
my $set2 = bit2num $vec, 6, 5; # [ 8, 9, 13, 14, 15 ]
my $set3 = bit2num $vec, 11, 5; # [ 16, 17, 22, 23, 24 ]
my $set4 = bit2num $vec, 16, 5; # [ 25, 26, 27, 31 ]
compress( $str )
This routine takes a string of '0's and '1's and compresses it using a simple run-length encoding (RLE). It returns this "compressed bit string".
Parameters:
$str
A string of '0's and '1's, e.g., "01110".
Note: the length of the string need not be a multiple of 8.
Example:
my $bstr;
$bstr = compress '01110000'; # '-134'
my $str = ('1'x100).('0'x30).('1'x6);
$bstr = compress $str; # '+@1cU6'
Compression Scheme
The compression scheme counts the number of consecutive '0's and '1's and concatenates that count (in base-62) to the compressed bit string.
If the first bit is '0', the compressed bit string begins with '-'. If the first bit is '1', it begins with '+'. The digit following that represents that many of those bits. The next digit represents that many of the "other" bits, and so on. (A "digit" matches /[0-9A-Za-z]/.)
So in the first example, '-134' means 1 '0' bit, then 3 '1' bits, then 4 '0' bits, i.e., '01110000'.
The second example includes a 2-digit number, 1c base-62 (100 decimal, as defined by Math::Int2base).
Any multi-digit number is preceded by a non-digit:
'@' for a 2-digit number
'#' for 3 digits
'$' for 4 digits
'%' for 5 digits, and
'^' for 6 digits
(Mnemonic: look above the numbers on a qwerty keyboard. A 6-digit number will accommodate 32,590,299,105 consecutive bits. If you need more than that, let me know.)
So '+@1cU6' means 1c (100) '1' bits, then U (30) '0' bits, then 6 '1' bits.
uncompress( $bstr )
This routine uncompresses a compressed bit string (which would have been compressed by the compress() routine above).
It returns a string of '0's and '1's. This string will (normally) then be converted to a bit vector using str2bit() above.
Parameters:
$bstr
A compressed bit string (see compress() above).
Example:
my $bstr = '-134';
my $str = uncompress $bstr; # '01110000'
howmany( $vec, $zero_or_one )
This routine returns a count of the 0 or 1 bits in a bit vector.
Parameters:
$vec
A bit vector.
$zero_or_one
The value you want a count of: 0 or 1. Defaults to 1 if not given.
Examples:
my $vec = str2bit '01010010';
my $ones_count = howmany $vec; # 3
my $zeros_count = howmany $vec, 0; # 5
Note that howmany( $vec, 0 ) will include trailing zero bits.
METHODS
new()
This constructs a Data::Bvec object. Each object represents a single array of integers stored either as a bit vector, a compressed bit string, or an array.
Parameters:
All parameters to new() are named.
bvec=>$bit_vector
Stores a Perl bit vector in the object.
my $vec = str2bit '01110011110001111100001111110001';
my $bv = Data::Bvec::->new( bvec => $vec );
bstr=>$compressed_bit_string
Stores a compressed bit string in the object.
my $bstr = compress bit2str $vec;
my $bv = Data::Bvec::->new( bstr => $bstr );
nums=>\@integers
Stores an array of integers in the object. The order of the array is retained when stored.
my $nums = bit2num $vec;
my $bv = Data::Bvec::->new( nums => $nums );
bvec2nums=>$bit_vector
Accepts a bit vector and stores it as an array of integers (as $self->{nums}).
my $bv = Data::Bvec::->new( bvec2nums => $vec );
nums2bvec=>\@integers
Accepts an array of integers and stores it as a bit vector (as $self->{bvec}). The order of the array is not retained.
my $bv = Data::Bvec::->new( nums2bvec => $nums );
bvec2bstr=>$bit_vector
Accepts a bit vector and stores it as a compressed bit string (as $self->{bstr}).
my $bv = Data::Bvec::->new( bvec2bstr => $vec );
bstr2bvec=>$compressed_bit_string
Accepts a compressed bit string and stores it as a bit vector (as $self->{bvec}).
my $bv = Data::Bvec::->new( bstr2bvec => $bstr );
bstr2nums=>$compressed_bit_string
Accepts a compressed bit string and stores it as an array of integers (as $self->{nums}).
my $bv = Data::Bvec::->new( bstr2nums => $bstr );
nums2bstr=>\@integers
Accepts an array of integers and stores it as a compressed bit string (as $self->{bstr}). The order of the array is not retained.
my $bv = Data::Bvec::->new( nums2bstr => $nums );
get_bvec()
This routine takes no parameters. It returns a bit vector, regardless how the integers are stored. The object is not changed.
my $vec = $bv->get_bvec();
get_bstr()
This routine takes no parameters. It returns a compressed bit string, regardless how the integers are stored. The object is not changed.
my $bstr = $bv->get_bstr();
get_nums( $beg, $cnt )
This routine returns an array of integers, regardless how the integers are stored. The object is not changed.
Note that if the integers are stored as 'nums' (an array), get_nums() will return them in the same order as the array. If they are stored another way, they will be returned in ascending order.
my @integers = $bv->get_nums(); # list returned in list context
my $ints = $bv->get_nums(); # aref returned in scalar context
Parameters:
$beg
The first integer to return. Unlike array subscripts, the $beg positions start with 1, not 0. If no $beg is given, 1 is assumed.
$cnt
The maximum number of integers (including the first) to return. If no $cnt is given, the rest of the integers are returned.