NAME
Data::BitStream - A bit stream class including integer coding methods
VERSION
version 0.07
SYNOPSIS
use Data::BitStream;
my $stream = Data::BitStream->new;
$stream->put_gamma($_) for (1 .. 20);
$stream->rewind_for_read;
my @values = $stream->get_gamma(-1);
See the examples for more uses.
DESCRIPTION
A Moo class providing read/write access to bit streams. This includes many integer coding methods as well as straightforward ways to implement new codes.
Bit streams are often used in data compression and in embedded products where memory is at a premium. While this Perl implementation may not be appropriate for many of these applications (speed and Perl), it can be very useful for prototyping and experimenting with different codes. A future implementation using XS for internals may resolve some performance concerns.
EXAMPLES
Display bit patterns for some codes
use Data::BitStream;
sub string_of { my $stream = Data::BitStream->new;
$_[0]->($stream);
return $stream->to_string; }
my @codes = qw(Gamma Delta Omega Fib);
printf "%5s " . (" %-11s" x scalar @codes) . "\n", 'N', @codes;
foreach my $n (0 .. 20) {
printf "%5d ", $n;
printf " %-11s", string_of(sub{shift->put_gamma($n)});
printf " %-11s", string_of(sub{shift->put_delta($n)});
printf " %-11s", string_of(sub{shift->put_omega($n)});
printf " %-11s", string_of(sub{shift->put_fib($n)});
print "\n";
}
A simple predictor/encoder compression snippit
use Data::BitStream;
my $stream = Data::BitStream->new;
# Loop over the data: characters, pixels, table entries, etc.
foreach my $v (@values) {
# predict the current value using your subroutine. This routine
# will use one or more previous values to estimate the current one.
my $p = predict($v);
# determine the signed difference.
my $diff = $v - $p;
# Turn this into an absolute difference suitable for coding.
my $error = ($diff < 0) ? -2*$diff : 2*$diff-1;
# Encode this using gamma encoding (or whichever works best for you).
$stream->put_gamma($error);
}
# Nicely packed up compressed data.
my $compressed_data = $stream->to_raw;
This is a classic prediction-coding style compression method, used in many applications. Most lossless image compressors use this method, though often with some extra steps further reduce the error term. JPEG-LS, for example, uses a very simple predictor, and puts its effort into relatively complex bias estimations and adaptive determination of the parameter for Rice coding.
Convert Elias Delta encoded strings into Fibonacci encoded strings
#/usr/bin/perl
use Data::BitStream;
my $d = Data::BitStream->new;
my $f = Data::BitStream->new;
while (<>) {
chomp;
$d->from_string($_);
$f->erase_for_write;
$f->put_fib( $d->get_delta(-1) );
print scalar $f->to_string, "\n";
}
Using a custom encoding method
use Data::BitStream;
use Data::BitStream::Code::Baer;
use Data::BitStream::Code::BoldiVigna;
my $stream = Data::BitStream->new;
# Moose:
Data::BitStream::Code::Baer->meta->apply($stream);
Data::BitStream::Code::BoldiVigna->meta->apply($stream);
# Moo:
Moo::Role->apply_roles_to_object($stream,
qw/Data::BitStream::Code::Baer Data::BitStream::Code::BoldiVigna/);
$stream->put_baer(-1, 14); # put 14 as a Baer c-1 code
$stream->put_boldivigna(2, 7); # put 7 as a Zeta(2) code
$stream->rewind_for_read;
my $v1 = $stream->get_baer(-1);
my $v2 = $stream->get_boldivigna(2);
Not all codes are included by default, including the power-law codes of Michael Baer, the Zeta codes of Boldi and Vigna, and Escape codes. These, and any other codes write or acquire, can be incorporated using Moo::Role or the Moose MOP as shown above.
METHODS
CLASS METHODS
- new
-
Creates a new object. By default it has no associated file and is mode RW. An optional hash of arguments may be supplied. Examples:
$stream = Data::BitStream->new( mode => 'ro' );
The stream is opened as a read-only stream. Attempts to open it for write will fail, hence all write / put methods will also fail. This is most useful for opening a file for read, which will ensure no changes are made.
Possible modes include
'read' / 'r'
,'readonly' / 'ro'
,'write' / 'w'
,'writeonly' / 'wo'
,'append' / 'a'
, and'readwrite' / 'rw' / 'rdwr'
.$stream = Data::BitStream->new( file => "c.bsc", fheader => "HEADER $foo $bar", mode => 'w' );
A file is associated with the stream. Upon closing the file, going out of scope, or otherwise being destroyed, the stream will be written to the file, with the given header string written first. While the current implementation writes at close time, later implementations may write as the stream is written to.
$stream = Data::BitStream->new( file => "c.bsc", fheaderlines => 1, mode => 'ro' );
A file is associated with the stream. The contents of the file will be slurped into the stream. The given number of header lines will be skipped at the start. While the current implementation slurps the contents, later implementations may read from the file as the stream is read.
- maxbits
-
Returns the number of bits in a word, which is the largest allowed size of the
bits
argument toread
andwrite
. This will be either 32 or 64. - code_is_supported
-
Returns a hash of information about a code if it is known, and
undef
otherwise.The argument is a text name, such as
'Gamma'
,'Rice(2)'
, etc.This method may be exported if requested.
- code_is_universal
-
Returns
undef
if the code is not known,0
if the code is non-universal, and a non-zero integer if it is universal.The argument is a text name, such as
'Gamma'
,'Rice(2)'
, etc.A code is universal if there exists a constant
C
such thatC
plus the length of the code is less than the optimal code length, for all values. What this typically means for us in practical terms is that non-universal codes are fine for small numbers, but their size increases rapidly, making them inappropriate when large values are possible (no matter how rare). A classic non-universal code is Unary coding, which takesk+1
bits to store valuek
. This is very good if most values are 0 or near zero. If we have rare values in the tens of thousands, it's not so great. It is likely to be fatal if we ever come across a value of 2 billion.This method may be exported if requested.
- add_code
-
Used for the dispatch table methods
code_put
andcode_get
as well as other helper methods likecode_is_universal
andcode_is_supported
. This is typically handled internally, but can be used to register a new code or variant. An example of an Omega-Golomb code:Data::BitStream::XS::add_code( { package => __PACKAGE__, name => 'OmegaGolomb', universal => 1, params => 1, encodesub => sub {shift->put_golomb( sub {shift->put_omega(@_)}, @_ )}, decodesub => sub {shift->get_golomb( sub {shift->get_omega(@_)}, @_ )}, } );
which registers the name
OmegaGolomb
as a new universal code that takes one parameter. Given a stream$stream
, this is now allowed:$stream->erase_for_write; $stream->code_put("OmegaGolomb(5)", 477); $stream->rewind_for_read; my $value = $stream->code_get("OmegaGolomb(5)"); die unless $value == 477;
OBJECT METHODS (reading)
These methods are only valid while the stream is in reading state.
- rewind
-
Moves the position to the stream beginning.
- exhausted
-
Returns true is the stream is at the end. Rarely used.
- read($bits [, 'readahead'])
-
Reads
$bits
from the stream and returns the value.$bits
must be between1
andmaxbits
.The position is advanced unless the second argument is the string 'readahead'.
Attempting to read past the end of the stream is a fatal error. However, readahead is allowed as it is speculative. All positions past the end of the stream will always be filled with zero bits.
- skip($bits)
-
Advances the position
$bits
bits. Used in conjunction withreadahead
.Attempting to skip past the end of the stream is a fatal error.
- read_string($bits)
-
Reads
$bits
bits from the stream and returns them as a binary string, such as'0011011'
. Attempting to read past the end of the stream is a fatal error.
OBJECT METHODS (writing)
These methods are only valid while the stream is in writing state.
- write($bits, $value)
-
Writes
$value
to the stream using$bits
bits.$bits
must be between1
andmaxbits
, unlessvalue
is 0 or 1, in which casebits
may be larger thanmaxbits
.The stream length will be increased by
$bits
bits. Regardless of the contents of$value
, exactly$bits
bits will be used. If$value
has more non-zero bits than$bits
, the lower bits are written. In other words,$value
will be effectively masked before writing. - put_string(@strings)
-
Takes one or more binary strings (e.g.
'1001101'
,'001100'
) and writes them to the stream. The number of bits used for each value is equal to the string length. - put_stream($source_stream)
-
Writes the contents of
$source_stream
to the stream. This is a helper method that might be more efficient than doing it in one of the many other possible ways. The default implementation uses:$self->put_string( $source_stream->to_string );
OBJECT METHODS (conversion)
These methods may be called at any time, and will adjust the state of the stream.
- to_string
-
Returns the stream as a binary string, e.g. '00110101'.
- to_raw
-
Returns the stream as packed big-endian data. This form is portable to any other implementation on any architecture.
- to_store
-
Returns the stream as some scalar holding the data in some implementation specific way. This may be portable or not, but it can always be read by the same implementation. It might be more efficient than the raw format.
- from_string($string)
-
The stream will be set to the binary string
$string
. - from_raw($packed [, $bits])
-
The stream is set to the packed big-endian vector
$packed
which has$bits
bits of data. If$bits
is not present, thenlength($packed)
will be used as the byte-length. It is recommended that you include$bits
. - from_store($blob [, $bits])
-
Similar to
from_raw
, but using the value returned byto_store
.
OBJECT METHODS (other)
- pos
-
A read-only non-negative integer indicating the current position in a read stream. It is advanced by
read
,get
, andskip
methods, as well as changed byto
,from
,rewind
, anderase
methods. - len
-
A read-only non-negative integer indicating the current length of the stream in bits. It is advanced by
write
andput
methods, as well as changed byfrom
anderase
methods. - writing
-
A read-only boolean indicating whether the stream is open for writing or reading. Methods for read such as
read
,get
,skip
,rewind
,skip
, andexhausted
are not allowed while writing. Methods for write such aswrite
andput
are not allowed while reading.The
write_open
anderase_for_write
methods will set writing to true. Thewrite_close
andrewind_for_read
methods will set writing to false.The read/write distinction allows implementations more freedom in internal caching of data. For instance, they can gather writes into blocks. It also can be helpful in catching mistakes such as reading from a target stream.
- erase
-
Erases all the data, while the writing state is left unchanged. The position and length will both be 0 after this is finished.
- write_open
-
Changes the state to writing with no other API-visible changes.
- write_close
-
Changes the state to reading, and the position is set to the end of the stream. No other API-visible changes happen.
- erase_for_write
-
A helper function that performs
erase
followed bywrite_open
. - rewind_for_read
-
A helper function that performs
write_close
followed byrewind
.
OBJECT METHODS (coding)
All coding methods are biased to 0. This means values from 0 to 2^maxbits-1 (for universal codes) may be encoded, even if the original code as published starts with 1.
All get_
methods take an optional count as the last argument. If $count
is 1
or not supplied, a single value will be read. If $count
is positive, that many values will be read. If $count
is negative, values are read until the end of the stream.
get_
methods called in list context will return a list of all values read. Called in scalar context they return the last value read.
put_
methods take one or more values as input after any optional parameters and write them to the stream. All values must be non-negative integers that do not exceed the maximum encodable value (typically ~0, but may be lower for some codes depending on parameter, and non-universal codes will be practically limited to smaller values).
- get_unary([$count])
- put_unary(@values)
-
Reads/writes one or more values from the stream in
0000...1
unary coding. Unary coding is only appropriate for relatively small numbers, as it uses$value + 1
bits per value. - get_unary1([$count])
- put_unary1(@values)
-
Reads/writes one or more values from the stream in
1111...0
unary coding. - get_binword($bits, [$count])
- put_binword($bits, @values)
-
Reads/writes one or more values from the stream as fixed-length binary numbers, each using
$bits
bits. - get_gamma([$count])
- put_gamma(@values)
-
Reads/writes one or more values from the stream in Elias Gamma coding.
- get_delta([$count])
- put_delta(@values)
-
Reads/writes one or more values from the stream in Elias Delta coding.
- get_omega([$count])
- put_omega(@values)
-
Reads/writes one or more values from the stream in Elias Omega coding.
- get_levenstein([$count])
- put_levenstein(@values)
-
Reads/writes one or more values from the stream in Levenstein coding (sometimes called Levenshtein or Левенште́йн coding).
- get_evenrodeh([$count])
- put_evenrodeh(@values)
-
Reads/writes one or more values from the stream in Even-Rodeh coding.
- get_goldbach_g1([$count])
- put_goldbach_g1(@values)
-
Reads/writes one or more values from the stream in Goldbach G1 coding.
- get_goldbach_g2([$count])
- put_goldbach_g2(@values)
-
Reads/writes one or more values from the stream in Goldbach G2 coding.
- get_fib([$count])
- put_fib(@values)
-
Reads/writes one or more values from the stream in Fibonacci coding. Specifically, the order
m=2
C1 codes of Fraenkel and Klein. - get_fibgen($m [, $count])
- put_fibgen($m, @values)
-
Reads/writes one or more values from the stream in generalized Fibonacci coding. The order
m
should be between 2 and 16. These codes are described in Klein and Ben-Nissan (2004). Form=2
the results are identical to the standard C1 form. - get_fib_c2([$count])
- put_fib_c2(@values)
-
Reads/writes one or more values from the stream in Fibonacci C2 coding. Specifically, the order
m=2
C2 codes of Fraenkel and Klein. Note that these codes are not prefix-free, hence they will not mix well with other codes in the same stream. - get_comma($bits [, $count])
- put_comma($bits, @values)
-
Reads/writes one or more values from the stream in Comma coding. The number of bits
bits
should be between 1 and 16.bits=1
implies Unary coding.bits=2
is the ternary comma code. No leading zeros are used. - get_blocktaboo($taboo [, $count])
- put_blocktaboo($taboo, @values)
-
Reads/writes one or more values from the stream in block-based Taboo coding. The parameter
taboo
is the binary string of the taboo code to use, such as'00'
.taboo='1'
implies Unary coding.taboo='0'
implies Unary1 coding. No more than 16 bits of taboo code may be given. These codes are a more efficient version of comma codes, as they allow leading zeros. - get_golomb($m [, $count])
- put_golomb($m, @values)
-
Reads/writes one or more values from the stream in Golomb coding.
- get_golomb(sub { ... }, $m [, $count])
- put_golomb(sub { ... }, $m, @values)
-
Reads/writes one or more values from the stream in Golomb coding using the supplied subroutine instead of unary coding, which can make them work with large outliers. For example to use Fibonacci coding for the base:
$stream->put_golomb( sub {shift->put_fib(@_)}, $m, $value); $value = $stream->get_golomb( sub {shift->get_fib(@_)}, $m);
- get_rice($k [, $count])
- put_rice($k, @values)
-
Reads/writes one or more values from the stream in Rice coding, which is the time efficient case where
m = 2^k
. - get_rice(sub { ... }, $k [, $count])
- put_rice(sub { ... }, $k, @values)
-
Reads/writes one or more values from the stream in Rice coding using the supplied subroutine instead of unary coding, which can make them work with large outliers. For example to use Omega coding for the base:
$stream->put_rice( sub {shift->put_omega(@_)}, $k, $value); $value = $stream->get_rice( sub {shift->get_omega(@_)}, $k);
- get_gammagolomb($m [, $count])
- put_gammagolomb($m, @values)
-
Reads/writes one or more values from the stream in Golomb coding using Elias Gamma codes for the base. This is a convenience since they are common.
- get_expgolomb($k [, $count])
- put_expgolomb($k, @values)
-
Reads/writes one or more values from the stream in Rice coding using Elias Gamma codes for the base. This is a convenience since they are common.
- get_baer($k [, $count])
- put_baer($k, @values)
-
Reads/writes one or more values from the stream in Baer c_k coding. The parameter
k
must be between-32
and32
. - get_boldivigna($k [, $count])
- put_boldivigna($k, @values)
-
Reads/writes one or more values from the stream in the Zeta coding of Paolo Boldi and Sebastiano Vigna. The parameter
k
must be between1
andmaxbits
(32
or64
). Typical values fork
are between2
and6
. - get_arice(sub { ... }, $k [, $count])
- put_arice(sub { ... }, $k, @values)
-
Reads/writes one or more values from the stream in Adaptive Rice coding using the supplied subroutine instead of Elias Gamma coding to encode the base. The value of $k will adapt to better fit the values. This interface will likely change to make
$k
a reference. - get_startstop(\@m [, $count])
- put_startstop(\@m, @values)
-
Reads/writes one or more values using Start/Stop codes. The parameter is an array reference which can be an anonymous array, for example:
$stream->put_startstop( [0,3,2,0], @array ); my @array2 = $stream->get_startstop( [0,3,2,0], -1);
- get_startstepstop(\@m [, $count])
- put_startstepstop(\@m, @values)
-
Reads/writes one or more values using Start-Step-Stop codes. The parameter is an array reference which can be an anonymous array, for example:
$stream->put_startstepstop( [3,2,9], @array ); my @array3 = $stream->get_startstepstop( [3,2,9], -1);
- code_get($code, [, $count])
- code_put($code, @values )
-
These methods wrap up all the previous encoding and decoding methods in an internal dispatch table.
code
is a text name of the code, such as'Gamma'
,'Fibonacci'
, etc. Codes with parameters are called as'Rice(2)'
,'StartStop(0-0-2-4-14)'
, etc.# $use_rice and $k obtained from options, parameters, or wherever. my $code = $use_rice ? "Rice($k)" : "Delta"; my $nvalues = scalar @values; $stream->code_put($code, @values); # .... my @vals = $stream->code_get($code, $nvalues); print "Read $nvalues values with code '$code': ", join(',', @vals), "\n";
SEE ALSO
The Data::Buffer module has some similarities, and may be easier to use if your structure maps directly to typical C structs. The main feature it has that isn't replicated here is the template functionality. The primary difference is Data::BitStream allows arbitrary bit lengths (it isn't byte oriented), and of course all the different codes. It also allows direct storage of 64-bit integers, and bigints (using binary strings).
- Data::BitStream::Base
- Data::BitStream::WordVec
- Data::BitStream::Code::Gamma
- Data::BitStream::Code::Delta
- Data::BitStream::Code::Omega
- Data::BitStream::Code::Levenstein
- Data::BitStream::Code::EvenRodeh
- Data::BitStream::Code::Fibonacci
- Data::BitStream::Code::Additive
- Data::BitStream::Code::Golomb
- Data::BitStream::Code::Rice
- Data::BitStream::Code::GammaGolomb
- Data::BitStream::Code::ExponentialGolomb
- Data::BitStream::Code::StartStop
- Data::BitStream::Code::Baer
- Data::BitStream::Code::BoldiVigna
- Data::BitStream::Code::Comma
- Data::BitStream::Code::Taboo
- Data::BitStream::Code::ARice
AUTHORS
Dana Jacobsen <dana@acm.org>
COPYRIGHT
Copyright 2011-2013 by Dana Jacobsen <dana@acm.org>
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.