NAME

Data::Bvec - a module to manipulate integer arrays as bit vectors and "compressed bit strings" (using a simple RLE).

VERSION

VERSION: 1.01

SYNOPSIS

use Data::Bvec;

my $bv = Data::Bvec::->new( nums=>[1,2,3] );

my $vec  = $bv->get_bvec();  # 01110000
my $bstr = $bv->get_bstr();  # '-134'
my $nums = $bv->get_nums();  # [1,2,3]

----

use Data::Bvec qw( :all );

my $vec  = num2bit( [1,2,3] );                # 01110000
set_bit( $vec, 4, 1 );                        # 01111000
my $bstr = compress bit2str $vec;             # '-143'
my $nums = bit2num str2bit uncompress $bstr;  # [1,2,3,4]

DESCRIPTION/DISCUSSION

This module encapsulates some simple routines for manipulating Perl bit vectors (putting values in; getting values out), but its main goal is to implement a simple run-length encoding scheme for bit vectors that compresses them into relatively human-readable and flat-file-storable strings.

My use case was wanting to prototype a data indexing system, and I wanted to ease debugging by plopping the bitstrings in a flat file that I could examine directly. (Each bit in a vector represents a record in the database -- true or false whether the term is in that record in the field being indexed.) It has worked well enough that I haven't felt the need to change how the bitstrings are stored (just where they're stored).

The initial version of the module used a different set of base-62 digits. In writing Math::Int2Base, I decided to normalize all the bases from 2 to 62 to use 0-9,A-Z,a-z. It makes the numbers sort correctly (ascii-betically == numerically), and it let me say that A base-16 == A base-36 == A base-62.

So now I'm rewriting this module to use those base conversion routines.

EXPORTS

Nothing is exported by default. The following may be exported individually; all of them may be exported using the :all tag:

- set_bit
- howmany
- bit2str
- str2bit
- bit2num
- num2bit
- compress
- uncompress

Examples:

use Data::Bvec qw( :all );
use Data::Bvec qw( bit2str str2bit compress uncompress );

However, if you only use the object methods, nothing needs to be exported. See METHODS below.

SUBROUTINES

set_bit( $vec, $num, $zero_or_one )

This is a shallow wrapper around Perl's vec(). It simply supplies vec()'s third parameter (1), which says that we're working with a bit vector, i.e., a vector whose elements are one bit wide.

Normally returns $num, if you care.

Parameters:

$vec

A Perl bit vector stored in the scalar.

$num

The number whose bit you want to target in the bit vector.

$zero_or_one

The value you want to set the bit to: 0 or 1. If not defined, 1 is assumed.

Examples:

my $vec = "";  # empty vector

set_bit $vec, 1, 1; # 01000000
set_bit $vec, 2, 1; # 01100000
set_bit $vec, 3;    # 01110000
set_bit $vec, 1, 0; # 00110000
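
For illustration, a wrapper of this kind could look like the sketch below. The name my_set_bit is hypothetical, and this is not necessarily how the module implements set_bit(); it only shows vec() being called with a bit width of 1 and the "default to 1" behavior described above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for set_bit(); not the module's actual code.
sub my_set_bit {
    # $_[0] is used directly below so that the caller's scalar is modified.
    my ( undef, $num, $val ) = @_;
    $val = 1 unless defined $val;    # assume 1 when no value is given
    vec( $_[0], $num, 1 ) = $val;    # third argument 1 = one bit per element
    return $num;
}

my $vec = "";
my_set_bit( $vec, 1 );
my_set_bit( $vec, 3 );
my $shown = unpack "b*", $vec;
print "$shown\n";                    # prints 01010000
```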

bit2str( $vec )

This routine is a shallow wrapper around unpack() that unpacks a bit vector into a string of '0's and '1's, in preparation for compression.

Parameters:

$vec

A Perl bit vector.

Example:

my $vec = "";
set_bit $vec, 4, 1;      #  00001000
my $str = bit2str $vec;  # '00001000'

str2bit( $str )

This routine is a shallow wrapper around pack() that packs a string of '0's and '1's (following uncompression) into a bit vector.

Parameters:

$str

A string of '0's and '1's, e.g., "00001000".

Example:

my $vec = str2bit '00001000';
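
The round trip between a bit vector and its string form can be sketched with core Perl alone. This assumes the "b*" template (bits in ascending order within each byte), which is the ordering that matches the examples above; the module's own code may differ in detail.

```perl
use strict;
use warnings;

# Assumption: the "b*" template, which numbers bits the same way vec() does.
my $vec = "";
vec( $vec, 4, 1 ) = 1;             # turn on bit 4

my $str  = unpack "b*", $vec;      # bit vector -> string of 0s and 1s
my $back = pack "b*", $str;        # string of 0s and 1s -> bit vector

print "$str\n";                                   # prints 00001000
print $back eq $vec ? "same\n" : "different\n";   # prints same
```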

num2bit( \@integers )

This routine accepts an array ref of integers and returns a bit vector with those integers' bits turned on.

Parameters:

\@integers

A reference to an array of integers.

Examples:

my $vec = num2bit [1,2,3];  # 01110000
my $vec = num2bit [3,2,1];  # 01110000

The second example is intended to make clear that the order of the integers in the array is not retained (for obvious reasons), and calling bit2num( $vec ) will always return the integers in ascending order (see bit2num() below).
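
A minimal sketch of the idea, using a hypothetical my_num2bit built on vec() (not the module's actual code), shows why the input order cannot matter: each integer just turns on its own bit.

```perl
use strict;
use warnings;

# Hypothetical stand-in for num2bit(); not the module's actual code.
sub my_num2bit {
    my ($nums) = @_;
    my $vec = "";
    vec( $vec, $_, 1 ) = 1 for @$nums;    # one vec() assignment per integer
    return $vec;
}

my $v1 = my_num2bit( [ 1, 2, 3 ] );
my $v2 = my_num2bit( [ 3, 2, 1 ] );
my $shown = unpack "b*", $v1;
print "$shown\n";                                # prints 01110000
print $v1 eq $v2 ? "equal\n" : "not equal\n";    # prints equal
```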

bit2num( $vec, $beg, $cnt )

This routine accepts a bit vector and returns an array of integers represented by the 1 bits.

The parameters $beg and $cnt are to support retrieving subsets of integers from a large vector -- in essence, to support "paging" through the set.

In scalar context, returns a reference to the array.

Parameters:

$vec

A bit vector.

$beg

The first integer (where the bit is 1) to return. Unlike array subscripts, the $beg positions start with 1, not 0.

$cnt

The maximum number of integers (including the first) to return.

Examples:

#                   0----+----1----+----2----+----3-
my $vec  = str2bit '01110011110001111100001111110001';

my $set1 = bit2num $vec,  1, 5;  # [  1,  2,  3,  6,  7 ]
my $set2 = bit2num $vec,  6, 5;  # [  8,  9, 13, 14, 15 ]
my $set3 = bit2num $vec, 11, 5;  # [ 16, 17, 22, 23, 24 ]
my $set4 = bit2num $vec, 16, 5;  # [ 25, 26, 27, 31     ]
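
The paging behavior can be sketched as follows. my_bit2num is a hypothetical name and a deliberately simple bit-by-bit loop, assuming $beg counts 1 bits starting from 1 as described above; the module may do this differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for bit2num(); not the module's actual code.
sub my_bit2num {
    my ( $vec, $beg, $cnt ) = @_;
    $beg = 1 unless defined $beg;          # positions are 1-based
    my @nums;
    my $seen = 0;                          # number of 1 bits passed so far
    for my $i ( 0 .. 8 * length($vec) - 1 ) {
        next unless vec( $vec, $i, 1 );
        next if ++$seen < $beg;            # skip until the $beg-th 1 bit
        push @nums, $i;
        last if defined $cnt && @nums >= $cnt;
    }
    return wantarray ? @nums : \@nums;     # list, or aref in scalar context
}

my $vec  = pack "b*", '01110011110001111100001111110001';
my $page = my_bit2num( $vec, 6, 5 );
print "@$page\n";                          # prints 8 9 13 14 15
```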

compress( $str )

This routine takes a string of '0's and '1's and compresses it using a simple run-length encoding (RLE). It returns this "compressed bit string".

Parameters:

$str

A string of '0's and '1's, e.g., "01110".

Note: the length of the string need not be a multiple of 8.

Example:

my $bstr;
$bstr = compress '01110000';  # '-134'
my $str = ('1'x100).('0'x30).('1'x6);
$bstr = compress $str;        # '+@1cU6'

Compression Scheme

The compression scheme counts the number of consecutive '0's and '1's and concatenates that count (in base-62) to the compressed bit string.

If the first bit is '0', the compressed bit string begins with '-'. If the first bit is '1', it begins with '+'. The digit following that represents that many of those bits. The next digit represents that many of the "other" bits, and so on. (A "digit" matches /[0-9A-Za-z]/.)

So in the first example, '-134' means 1 '0' bit, then 3 '1' bits, then 4 '0' bits, i.e., '01110000'.

The second example includes a 2-digit number, 1c base-62 (100 decimal, as defined by Math::Int2Base).

Any multi-digit number is preceded by a non-digit:

'@' for a 2-digit number
'#' for 3 digits
'$' for 4 digits
'%' for 5 digits, and
'^' for 6 digits

(Mnemonic: look above the numbers on a qwerty keyboard. A 6-digit number will accommodate 56,800,235,583 consecutive bits, i.e., 62**6 - 1. If you need more than that, let me know.)

So '+@1cU6' means 1c (100) '1' bits, then U (30) '0' bits, then 6 '1' bits.
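
The scheme described above can be sketched in plain Perl. The names to_base62 and my_compress are hypothetical, the base-62 conversion is written out by hand rather than taken from Math::Int2Base, and returning an empty string for empty input is an assumption; this is an illustration of the encoding, not the module's actual code.

```perl
use strict;
use warnings;

my @DIGITS = ( 0 .. 9, 'A' .. 'Z', 'a' .. 'z' );    # base-62 digits
my %PREFIX = ( 2 => '@', 3 => '#', 4 => '$', 5 => '%', 6 => '^' );

# Encode a non-negative integer in base 62.
sub to_base62 {
    my ($n) = @_;
    my $out = '';
    do {
        $out = $DIGITS[ $n % 62 ] . $out;
        $n   = int( $n / 62 );
    } while $n > 0;
    return $out;
}

# Hypothetical stand-in for compress(); not the module's actual code.
sub my_compress {
    my ($str) = @_;
    return '' unless length $str;             # assumption: empty in, empty out
    my $bstr = substr( $str, 0, 1 ) eq '1' ? '+' : '-';
    while ( $str =~ /\G(0+|1+)/g ) {          # one match per run of equal bits
        my $count = to_base62( length $1 );
        my $len   = length $count;
        die "run too long\n" if $len > 6;
        $bstr .= $PREFIX{$len} if $len > 1;   # flag multi-digit counts
        $bstr .= $count;
    }
    return $bstr;
}

my $short = my_compress('01110000');
my $long  = my_compress( ( '1' x 100 ) . ( '0' x 30 ) . ( '1' x 6 ) );
print "$short\n";    # prints -134
print "$long\n";     # prints +@1cU6
```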

uncompress( $bstr )

This routine uncompresses a compressed bit string (which would have been compressed by the compress() routine above).

It returns a string of '0's and '1's. This string will (normally) then be converted to a bit vector using str2bit() above.

Parameters:

$bstr

A compressed bit string (see compress() above).

Example:

my $bstr = '-134';
my $str  = uncompress $bstr;  # '01110000'
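
Decoding reverses the steps of the compression scheme. The sketch below uses a hypothetical my_uncompress and assumes well-formed input (a leading '+' or '-', then counts in the format described above); it is not the module's actual code.

```perl
use strict;
use warnings;

my %VALUE;                                            # base-62 digit => value
@VALUE{ 0 .. 9, 'A' .. 'Z', 'a' .. 'z' } = 0 .. 61;
my %WIDTH = ( '@' => 2, '#' => 3, '$' => 4, '%' => 5, '^' => 6 );

# Hypothetical stand-in for uncompress(); not the module's actual code.
sub my_uncompress {
    my ($bstr) = @_;
    return '' unless length $bstr;
    my $bit = substr( $bstr, 0, 1 ) eq '+' ? 1 : 0;   # sign gives the first bit
    my $str = '';
    my $i   = 1;                                      # skip the sign
    while ( $i < length $bstr ) {
        my $c      = substr( $bstr, $i++, 1 );
        my $width  = $WIDTH{$c} || 0;                 # 0 = single-digit count
        my $digits = $width ? substr( $bstr, $i, $width ) : $c;
        $i += $width;
        my $n = 0;
        $n = $n * 62 + $VALUE{$_} for split //, $digits;
        $str .= $bit x $n;                            # emit the run
        $bit = 1 - $bit;                              # runs alternate
    }
    return $str;
}

my $str = my_uncompress('-134');
print "$str\n";                                       # prints 01110000
```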

howmany( $vec, $zero_or_one )

This routine returns a count of the 0 or 1 bits in a bit vector.

Parameters:

$vec

A bit vector.

$zero_or_one

The value you want a count of: 0 or 1. Defaults to 1 if not given.

Examples:

my $vec = str2bit '01010010';
my $ones_count  = howmany $vec;     # 3
my $zeros_count = howmany $vec, 0;  # 5

Note that howmany( $vec, 0 ) will include trailing zero bits, for example the bits that pad the vector out to a whole number of bytes.
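
One way to count bits in core Perl is the unpack checksum idiom shown in the sketch below. my_howmany is a hypothetical name, and the module may count differently. The "%32" checksum is 32 bits wide, so this particular sketch is only suitable for vectors with fewer than 2**32 one bits.

```perl
use strict;
use warnings;

# Hypothetical stand-in for howmany(); not the module's actual code.
sub my_howmany {
    my ( $vec, $zero_or_one ) = @_;
    $zero_or_one = 1 unless defined $zero_or_one;
    my $ones = unpack "%32b*", $vec;       # checksum over the bits = count of 1 bits
    return $zero_or_one ? $ones : 8 * length($vec) - $ones;
}

my $vec   = pack "b*", '01010010';
my $ones  = my_howmany($vec);
my $zeros = my_howmany( $vec, 0 );
print "$ones $zeros\n";                    # prints 3 5

my $short   = pack "b*", '1' x 10;         # 10 bits occupy 2 bytes (16 bits)
my $padding = my_howmany( $short, 0 );
print "$padding\n";                        # prints 6: the padding bits are counted
```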

METHODS

new()

This constructs a Data::Bvec object. Each object represents a single array of integers stored either as a bit vector, a compressed bit string, or an array.

Parameters:

All parameters to new() are named.

bvec=>$bit_vector

Stores a Perl bit vector in the object.

my $vec = str2bit '01110011110001111100001111110001';
my $bv  = Data::Bvec::->new( bvec => $vec );

bstr=>$compressed_bit_string

Stores a compressed bit string in the object.

my $bstr = compress bit2str $vec;
my $bv   = Data::Bvec::->new( bstr => $bstr );

nums=>\@integers

Stores an array of integers in the object. The order of the array is retained when stored.

my $nums = bit2num $vec;
my $bv   = Data::Bvec::->new( nums => $nums );

bvec2nums=>$bit_vector

Accepts a bit vector and stores it as an array of integers (as $self->{nums}).

my $bv = Data::Bvec::->new( bvec2nums => $vec );

nums2bvec=>\@integers

Accepts an array of integers and stores it as a bit vector (as $self->{bvec}). The order of the array is not retained.

my $bv = Data::Bvec::->new( nums2bvec => $nums );

bvec2bstr=>$bit_vector

Accepts a bit vector and stores it as a compressed bit string (as $self->{bstr}).

my $bv = Data::Bvec::->new( bvec2bstr => $vec );

bstr2bvec=>$compressed_bit_string

Accepts a compressed bit string and stores it as a bit vector (as $self->{bvec}).

my $bv = Data::Bvec::->new( bstr2bvec => $bstr );

bstr2nums=>$compressed_bit_string

Accepts a compressed bit string and stores it as an array of integers (as $self->{nums}).

my $bv = Data::Bvec::->new( bstr2nums => $bstr );

nums2bstr=>\@integers

Accepts an array of integers and stores it as a compressed bit string (as $self->{bstr}). The order of the array is not retained.

my $bv = Data::Bvec::->new( nums2bstr => $nums );

get_bvec()

This routine takes no parameters. It returns a bit vector, regardless of how the integers are stored. The object is not changed.

my $vec = $bv->get_bvec();

get_bstr()

This routine takes no parameters. It returns a compressed bit string, regardless of how the integers are stored. The object is not changed.

my $bstr = $bv->get_bstr();

get_nums( $beg, $cnt )

This routine returns an array of integers, regardless of how the integers are stored. The object is not changed.

Note that if the integers are stored as 'nums' (an array), get_nums() will return them in the same order as the array. If they are stored another way, they will be returned in ascending order.

my @integers = $bv->get_nums();  # list returned in list context
my $ints     = $bv->get_nums();  # aref returned in scalar context

Parameters:

$beg

The first integer to return. Unlike array subscripts, the $beg positions start with 1, not 0. If no $beg is given, 1 is assumed.

$cnt

The maximum number of integers (including the first) to return. If no $cnt is given, the rest of the integers are returned.