NAME

Compress::Huffman - Huffman encode a symbol table

SYNOPSIS

# Turn an alphabet in the form of a hash from symbols to
# probabilities into a binary Huffman table.

use Compress::Huffman;
my $cf = Compress::Huffman->new ();
my %symbols = (
    a => 0.5,
    b => 0.25,
    c => 0.125,
    d => 0.125,
);
$cf->symbols (\%symbols);
my $table = $cf->table ();
for my $k (sort keys %symbols) {
    print "$k: <$table->{$k}> ";
}
print "\n";

# Turn an alphabet in the form of a hash from symbols to weights
# into a tertiary Huffman table.

$cf->symbols (\%symbols, size => 3, notprob => 1);

my $table3 = $cf->table ();
for my $k (sort keys %symbols) {
    print "$k: <$table3->{$k}> ";
}
print "\n";

produces output

a: <1> b: <01> c: <000> d: <001> 
a: <2> b: <1> c: <02> d: <01> 

(This example is included as synopsis.pl in the distribution.)

VERSION

This documents version 0.08 of Compress-Huffman corresponding to git commit 2232ae39b5afa66e9e150706e07747447cd72ea2 released on Thu Jan 10 15:25:30 2019 +0900.

DESCRIPTION

Compress::Huffman converts a table of symbols and probabilities to a Huffman encoded form. The Huffman encoding is output as strings using a restricted alphabet, so binary encoding is a string of the form 0101 rather than genuine binary.

METHODS

new

my $cf = Compress::Huffman->new ();

This returns a new object.

symbols

$cf->symbols (\%symbols);

This converts a table of symbols into a Huffman code. The keys of %symbols are any set of symbols you want to encode. The values of %symbols must be numeric (this is checked with "looks_like_number" in Scalar::Util). Usually a Huffman code is used to compress a set of symbols associated with probability values, so usually the numerical values of %symbols should sum to one. If the values of %symbols do not sum to one, use the "notprob" option.

This method takes the following options, supplied as a hash after the initial hash reference:

size
$cf->symbols (\%symbols, size => 3);

This specifies the size of the output alphabet. If the size is not specified, the default is binary output, in other words the default value for size is two. For values of size up to 10, the numerals 0 to 9 are used to encode the output. If size is greater than 10, specify an alphabet with the "alphabet" option.

notprob
my %symbols = (a => 1, b => 2, c => 3);
$cf->symbols (\%symbols, notprob => 1);

This specifies that the values of %symbols are not a probability distribution. The function then calculates the probabilities of each symbol by summing the values of %symbols and dividing each value by the calculated sum. In either case, the values of %symbols must be numeric.

alphabet
$cf->symbols (\%symbols, alphabet => ['a', 'b', 'c']);

The default behaviour of "symbols" is to form a Huffman code using digits. The default Huffman code with "size" equal to two (binary Huffman encoding) encodes %symbols using a string consisting of 0's and 1's. For a different encoding, or if you set size to a value more than ten, specify the alphabet of symbols using alphabet => \@alphabet, where @alphabet is the list of symbols you want to use.

verbose
$cf->symbols (\%symbols, verbose => 1);

Any true value turns on debugging messages.

xl

my $expected_length = $cf->xl ();

This returns the expected length of the Huffman encoding, which is the sum of the probability of each symbol multiplied by the length of the code associated with that symbol. Please see, for example, "Elements of Information Theory" for details.

encode_array

my $huffout = $cf->encode (\@string);

Encode the string of symbols in @string into a Huffman-encoded format $huffout. The return value is an array reference.

If @string contains a symbol which is not in the symbol table, a warning is printed and the symbol is skipped over.

encode

my $huffout = $cf->encode (\@string);

Encode the symbols in @string into a Huffman-encoded format $huffout. The behaviour is identical to "encode_array" except that the return value is a string.

use utf8;
use Compress::Huffman;
my $ch = Compress::Huffman->new ();
my %symbols = (あ => 100, い => 200, う => 300, え => 400, お => 1000);
$ch->symbols (\%symbols, notprob => 1);
my $msg = $ch->encode ([qw/あ い う え お/]);
print "Huffman encoding is $msg\n";
my $recovered = $ch->decode ($msg);
print "Recovered @$recovered\n";

produces output

Huffman encoding is 01100111010001
Recovered あ い う え お

decode

my $string = $cf->decode ($huffout);

Decode the huffman-encoded $huffout into symbols. The return value, $string in the example, is an array reference containing the symbols supplied in symbols. Please see "encode" for an example.

table

my $table = $cf->table ();

The return value is a hash reference which is the symbol-to-Huffman code table within $cf. This is not a copy, so it shouldn't be altered by the user.

save

my $saved = $n->save ();

Save the Huffman table in JSON format to $saved.

This was added in version 0.04.

load

my $n = Compress::Huffman->new ();
$n->load ($saved);

Load a Huffman table in JSON format from $saved.

This was added in version 0.04.

ALGORITHM

(This is copied from the documentation of Algorithm::Huffman)

Here is an example illustrating the algorithm. Given the characters and occurences

a(15) b(7) c(6) d(6) e(5),

in the first step e and d are the rarest symbols, so we create this new heap and tree structure:

a(15) de(11) b(7) c(6)

      de
     /  \
 "0"/    \"1"
   d      e

Now combine the next two rarest characters, b and c:

a(15) bc(13) de(11)

      de                bc
     /  \              /  \
 "0"/    \"1"      "0"/    \"1"
   d      e          b      c

Now combine the next two rarest characters, bc and de.

a(15) bcde(24)

              bcde
            /      \
       "0"/          \"1"
        /              \
      de                bc
     /  \              /  \
 "0"/    \"1"      "0"/    \"1"
   d      e          b      c

The next step unifies all the steps:

                          Huffman-Table
                             /    \
                       "0"/          \"1"
                      /                  \
                  /                          \
             bcde                              a
           /      \
      "0"/          \"1"
       /              \
     de                bc
    /  \              /  \
"0"/    \"1"      "0"/    \"1"
  d      e          b      c

Finally this encoding table is created from the labels on the branches:

a    1
b    010
c    011
d    000
e    001

Huffman coding is ambiguous, and there is no rule defining what element in the tree is ordered to the left or right, so it's also possible to reverse the 0s and 1s to get:

a    0
b    100
c    101
d    110
e    111

In the case of non-binary Huffman encodings, dummy elements may also have to be added to the tree.

Please see "References" for more information about Huffman encodings.

DIAGNOSTICS

Bad size $size for Huffman table, must be integer >= 2

(Fatal) The user attempted to make a Huffman table which isn't possible.

Use $o->symbols (\%t, alphabet => ['a', 'b',...]) for table sizes bigger than 10

(Fatal) The user must supply an alphabet for the Huffman codes if size > 10

Symbol table has no entries

(Fatal) User supplied an empty hash reference in "symbols".

Non-numerical value '$s->{$k}' for key '$k'

(Fatal) User supplied a hash reference with a non-numerical value in "symbols".

Input values don't sum to 1.0; use $o->symbols (\%s, notprob => 1) if not a probability distribution

(Fatal) User supplied a non-probability distribution hash table to "symbols".

Symbol 'X' is not in the symbol table

(Warning) The user tried to encode a list of symbols with "encode" which contained a symbol X which was not in the symbol table.

Input starting from XYZ was not Huffman encoded using this table

(Warning) The user tried to decode a Huffman-encoded string with "decode", but it doesn't seem to have been encoded using the table associated with the object.

Bad object

(Fatal) A user-supplied object didn't contain the correct tables.

Value N for symbol X is not a probability

(Fatal) A user-supplied value for symbol X was either greater than 1 or less than 0, and the user did not specify "notprob".

Negative weight N for symbol X

(Fatal) The weighting for symbol X was negative.

BUGS

The module is slow for large symbol tables. For example a hash with 100,000 entries might take several minutes to finish building the table.

Huffman encoding itself does not achieve any compression when there are, for example, two symbols A and B with probabilities 0.001 and 0.999, since every symbol takes at least one bit of information to encode. Arithmetic encoding and asymmetric numeral systems are two forms of encoding designed to overcome this issue.

DEPENDENCIES

JSON::Parse and JSON::Create are used by "save" and "load" for storing the Huffman table. The module also uses the core modules Carp, "looks_like_number" in Scalar::Util for detecting numbers, and "ceil" in POSIX to calculate the size of tables needed in encoding.

SEE ALSO

CPAN modules

Algorithm::Huffman

Perl extension that implements the Huffman algorithm

Unfortunately this is 17 years old and seems to fail all its tests on most platforms.

Other software

Rosetta code examples

This includes a Perl example.

References

Original article

David A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the Institute of Radio Engineers, volume 40, number 9, pages 1098-1102, September 1952.

Elements of Information Theory

Elements of Information Theory by Thomas Cover and Joy Thomas, (either the first or second edition) details Huffman encoding.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2015-2019 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.