NAME

Compress::Huffman - Huffman encode a symbol table

SYNOPSIS

# Turn an alphabet in the form of a hash from symbols to
# probabilities into a binary Huffman table.

use Compress::Huffman;
my $cf = Compress::Huffman->new ();
my %symbols = (
    a => 0.5,
    b => 0.25,
    c => 0.125,
    d => 0.125,
);
$cf->symbols (\%symbols);
my $table = $cf->table ();
for my $k (sort keys %symbols) {
    print "$k: <$table->{$k}> ";
}
print "\n";

# Turn an alphabet in the form of a hash from symbols to weights
# into a tertiary Huffman table.

$cf->symbols (\%symbols, size => 3, notprob => 1);

my $table3 = $cf->table ();
for my $k (sort keys %symbols) {
    print "$k: <$table3->{$k}> ";
}
print "\n";

produces output

a: <1> b: <01> c: <000> d: <001> 
a: <2> b: <1> c: <02> d: <01>

(This example is included as synopsis.pl in the distribution.)

VERSION

This documents version 0.08 of Compress-Huffman corresponding to git commit 2232ae39b5afa66e9e150706e07747447cd72ea2 released on Thu Jan 10 15:25:30 2019 +0900.

DESCRIPTION

Compress::Huffman converts a table of symbols and probabilities to a Huffman encoded form. The Huffman encoding is output as strings using a restricted alphabet, so binary encoding is a string of the form 0101 rather than genuine binary.

METHODS

new

my $cf = Compress::Huffman->new ();

This returns a new object.

symbols

$cf->symbols (\%symbols);

This converts a table of symbols into a Huffman code. The keys of %symbols are any set of symbols you want to encode. The values of %symbols must be numeric (this is checked with "looks_like_number" in Scalar::Util). Usually a Huffman code is used to compress a set of symbols associated with probability values, so usually the numerical values of %symbols should sum to one. If the values of %symbols do not sum to one, use the "notprob" option.

This method takes the following options, supplied as a hash after the initial hash reference:

size

$cf->symbols (\%symbols, size => 3);

This specifies the size of the output alphabet. If the size is not specified, the default is binary output, in other words the default value for size is two. For values of size up to 10, the numerals 0 to 9 are used to encode the output. If size is greater than 10, specify an alphabet with the "alphabet" option.

notprob

my %symbols = (a => 1, b => 2, c => 3);
$cf->symbols (\%symbols, notprob => 1);

This specifies that the values of %symbols are not a probability distribution. The function then calculates the probabilities of each symbol by summing the values of %symbols and dividing each value by the calculated sum. In either case, the values of %symbols must be numeric.

alphabet

$cf->symbols (\%symbols, alphabet => ['a', 'b', 'c']);

The default behaviour of "symbols" is to form a Huffman code using digits. The default Huffman code with "size" equal to two (binary Huffman encoding) encodes %symbols using a string consisting of 0's and 1's. For a different encoding, or if you set size to a value more than ten, specify the alphabet of symbols using alphabet => \@alphabet, where @alphabet is the list of symbols you want to use.

verbose

$cf->symbols (\%symbols, verbose => 1);

Any true value turns on debugging messages.

xl

my $expected_length = $cf->xl ();

This returns the expected length of the Huffman encoding, which is the sum of the probability of each symbol multiplied by the length of the code associated with that symbol. Please see, for example, "Elements of Information Theory" for details.

encode_array

my $huffout = $cf->encode (\@string);

Encode the string of symbols in @string into a Huffman-encoded format $huffout. The return value is an array reference.

If @string contains a symbol which is not in the symbol table, a warning is printed and the symbol is skipped over.

encode

my $huffout = $cf->encode (\@string);

Encode the symbols in @string into a Huffman-encoded format $huffout. The behaviour is identical to "encode_array" except that the return value is a string.

use utf8;
use Compress::Huffman;
my $ch = Compress::Huffman->new ();
my %symbols = (あ => 100, い => 200, う => 300, え => 400, お => 1000);
$ch->symbols (\%symbols, notprob => 1);
my $msg = $ch->encode ([qw/あ い う え お/]);
print "Huffman encoding is $msg\n";
my $recovered = $ch->decode ($msg);
print "Recovered @$recovered\n";

produces output

Huffman encoding is 01100111010001
Recovered あ い う え お

decode

my $string = $cf->decode ($huffout);

Decode the huffman-encoded $huffout into symbols. The return value, $string in the example, is an array reference containing the symbols supplied in symbols. Please see "encode" for an example.

table

my $table = $cf->table ();

The return value is a hash reference which is the symbol-to-Huffman code table within $cf. This is not a copy, so it shouldn't be altered by the user.

save

my $saved = $n->save ();

Save the Huffman table in JSON format to $saved.

This was added in version 0.04.

load

my $n = Compress::Huffman->new ();
$n->load ($saved);

Load a Huffman table in JSON format from $saved.

This was added in version 0.04.

ALGORITHM

(This is copied from the documentation of Algorithm::Huffman)

Here is an example illustrating the algorithm. Given the characters and occurences

a(15) b(7) c(6) d(6) e(5),

in the first step e and d are the rarest symbols, so we create this new heap and tree structure:

a(15) de(11) b(7) c(6)

      de
     /  \
 "0"/    \"1"
   d      e

Now combine the next two rarest characters, b and c:

a(15) bc(13) de(11)

      de                bc
     /  \              /  \
 "0"/    \"1"      "0"/    \"1"
   d      e          b      c

Now combine the next two rarest characters, bc and de.

a(15) bcde(24)

              bcde
            /      \
       "0"/          \"1"
        /              \
      de                bc
     /  \              /  \
 "0"/    \"1"      "0"/    \"1"
   d      e          b      c

The next step unifies all the steps:

                          Huffman-Table
                             /    \
                       "0"/          \"1"
                      /                  \
                  /                          \
             bcde                              a
           /      \
      "0"/          \"1"
       /              \
     de                bc
    /  \              /  \
"0"/    \"1"      "0"/    \"1"
  d      e          b      c

Finally this encoding table is created from the labels on the branches:

Huffman coding is ambiguous, and there is no rule defining what element in the tree is ordered to the left or right, so it's also possible to reverse the 0s and 1s to get:

In the case of non-binary Huffman encodings, dummy elements may also have to be added to the tree.

Please see "References" for more information about Huffman encodings.

DIAGNOSTICS

Bad size $size for Huffman table, must be integer >= 2: (Fatal) The user attempted to make a Huffman table which isn't possible.
Use $o->symbols (\%t, alphabet => ['a', 'b',...]) for table sizes bigger than 10: (Fatal) The user must supply an alphabet for the Huffman codes if size > 10
Symbol table has no entries: (Fatal) User supplied an empty hash reference in "symbols".
Non-numerical value '$s->{$k}' for key '$k': (Fatal) User supplied a hash reference with a non-numerical value in "symbols".
Input values don't sum to 1.0; use $o->symbols (\%s, notprob => 1) if not a probability distribution: (Fatal) User supplied a non-probability distribution hash table to "symbols".
Symbol 'X' is not in the symbol table: (Warning) The user tried to encode a list of symbols with "encode" which contained a symbol X which was not in the symbol table.
Input starting from XYZ was not Huffman encoded using this table: (Warning) The user tried to decode a Huffman-encoded string with "decode", but it doesn't seem to have been encoded using the table associated with the object.
Bad object: (Fatal) A user-supplied object didn't contain the correct tables.
Value N for symbol X is not a probability: (Fatal) A user-supplied value for symbol X was either greater than 1 or less than 0, and the user did not specify "notprob".
Negative weight N for symbol X: (Fatal) The weighting for symbol X was negative.

BUGS

The module is slow for large symbol tables. For example a hash with 100,000 entries might take several minutes to finish building the table.

Huffman encoding itself does not achieve any compression when there are, for example, two symbols A and B with probabilities 0.001 and 0.999, since every symbol takes at least one bit of information to encode. Arithmetic encoding and asymmetric numeral systems are two forms of encoding designed to overcome this issue.

DEPENDENCIES

JSON::Parse and JSON::Create are used by "save" and "load" for storing the Huffman table. The module also uses the core modules Carp, "looks_like_number" in Scalar::Util for detecting numbers, and "ceil" in POSIX to calculate the size of tables needed in encoding.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.

To install Compress::Huffman, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Compress::Huffman

CPAN shell

perl -MCPAN -e shell
install Compress::Huffman

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)

NAME

SYNOPSIS

VERSION

DESCRIPTION

METHODS

new

symbols

xl

encode_array

encode

decode

table

save

load

ALGORITHM

DIAGNOSTICS

BUGS

DEPENDENCIES

SEE ALSO

CPAN modules

Other software

References

AUTHOR

COPYRIGHT & LICENCE

NAME

SYNOPSIS

VERSION

DESCRIPTION

METHODS

new

symbols

xl

encode_array

encode

decode

table

save

load

ALGORITHM

DIAGNOSTICS

BUGS

DEPENDENCIES

SEE ALSO

CPAN modules

Other software

References

AUTHOR

COPYRIGHT & LICENCE

Module Install Instructions