NAME
Compress::Huffman - Huffman encode a symbol table
SYNOPSIS
# Turn an alphabet in the form of a hash from symbols to
# probabilities into a binary Huffman table.
use Compress::Huffman;
my $cf = Compress::Huffman->new ();
my %symbols = (
a => 0.5,
b => 0.25,
c => 0.125,
d => 0.125,
);
$cf->symbols (\%symbols);
my $table = $cf->table ();
for my $k (sort keys %symbols) {
print "$k: <$table->{$k}> ";
}
print "\n";
# Turn an alphabet in the form of a hash from symbols to weights
# into a tertiary Huffman table.
$cf->symbols (\%symbols, size => 3, notprob => 1);
my $table3 = $cf->table ();
for my $k (sort keys %symbols) {
print "$k: <$table3->{$k}> ";
}
print "\n";
produces output
a: <1> b: <01> c: <000> d: <001>
a: <2> b: <1> c: <02> d: <01>
(This example is included as synopsis.pl in the distribution.)
VERSION
This documents version 0.08 of Compress-Huffman corresponding to git commit 2232ae39b5afa66e9e150706e07747447cd72ea2 released on Thu Jan 10 15:25:30 2019 +0900.
DESCRIPTION
Compress::Huffman converts a table of symbols and probabilities to a Huffman encoded form. The Huffman encoding is output as strings using a restricted alphabet, so binary encoding is a string of the form 0101
rather than genuine binary.
METHODS
new
my $cf = Compress::Huffman->new ();
This returns a new object.
symbols
$cf->symbols (\%symbols);
This converts a table of symbols into a Huffman code. The keys of %symbols
are any set of symbols you want to encode. The values of %symbols
must be numeric (this is checked with "looks_like_number" in Scalar::Util). Usually a Huffman code is used to compress a set of symbols associated with probability values, so usually the numerical values of %symbols
should sum to one. If the values of %symbols
do not sum to one, use the "notprob" option.
This method takes the following options, supplied as a hash after the initial hash reference:
- size
-
$cf->symbols (\%symbols, size => 3);
This specifies the size of the output alphabet. If the size is not specified, the default is binary output, in other words the default value for
size
is two. For values ofsize
up to 10, the numerals 0 to 9 are used to encode the output. Ifsize
is greater than 10, specify an alphabet with the "alphabet" option. - notprob
-
my %symbols = (a => 1, b => 2, c => 3); $cf->symbols (\%symbols, notprob => 1);
This specifies that the values of
%symbols
are not a probability distribution. The function then calculates the probabilities of each symbol by summing the values of%symbols
and dividing each value by the calculated sum. In either case, the values of%symbols
must be numeric. - alphabet
-
$cf->symbols (\%symbols, alphabet => ['a', 'b', 'c']);
The default behaviour of "symbols" is to form a Huffman code using digits. The default Huffman code with "size" equal to two (binary Huffman encoding) encodes
%symbols
using a string consisting of 0's and 1's. For a different encoding, or if you set size to a value more than ten, specify the alphabet of symbols usingalphabet => \@alphabet
, where@alphabet
is the list of symbols you want to use. - verbose
-
$cf->symbols (\%symbols, verbose => 1);
Any true value turns on debugging messages.
xl
my $expected_length = $cf->xl ();
This returns the expected length of the Huffman encoding, which is the sum of the probability of each symbol multiplied by the length of the code associated with that symbol. Please see, for example, "Elements of Information Theory" for details.
encode_array
my $huffout = $cf->encode (\@string);
Encode the string of symbols in @string
into a Huffman-encoded format $huffout
. The return value is an array reference.
If @string
contains a symbol which is not in the symbol table, a warning is printed and the symbol is skipped over.
encode
my $huffout = $cf->encode (\@string);
Encode the symbols in @string
into a Huffman-encoded format $huffout
. The behaviour is identical to "encode_array" except that the return value is a string.
use utf8;
use Compress::Huffman;
my $ch = Compress::Huffman->new ();
my %symbols = (あ => 100, い => 200, う => 300, え => 400, お => 1000);
$ch->symbols (\%symbols, notprob => 1);
my $msg = $ch->encode ([qw/あ い う え お/]);
print "Huffman encoding is $msg\n";
my $recovered = $ch->decode ($msg);
print "Recovered @$recovered\n";
produces output
Huffman encoding is 01100111010001
Recovered あ い う え お
decode
my $string = $cf->decode ($huffout);
Decode the huffman-encoded $huffout
into symbols. The return value, $string
in the example, is an array reference containing the symbols supplied in symbols
. Please see "encode" for an example.
table
my $table = $cf->table ();
The return value is a hash reference which is the symbol-to-Huffman code table within $cf
. This is not a copy, so it shouldn't be altered by the user.
save
my $saved = $n->save ();
Save the Huffman table in JSON format to $saved
.
This was added in version 0.04.
load
my $n = Compress::Huffman->new ();
$n->load ($saved);
Load a Huffman table in JSON format from $saved
.
This was added in version 0.04.
ALGORITHM
(This is copied from the documentation of Algorithm::Huffman)
Here is an example illustrating the algorithm. Given the characters and occurences
a(15) b(7) c(6) d(6) e(5),
in the first step e and d are the rarest symbols, so we create this new heap and tree structure:
a(15) de(11) b(7) c(6)
de
/ \
"0"/ \"1"
d e
Now combine the next two rarest characters, b and c:
a(15) bc(13) de(11)
de bc
/ \ / \
"0"/ \"1" "0"/ \"1"
d e b c
Now combine the next two rarest characters, bc and de.
a(15) bcde(24)
bcde
/ \
"0"/ \"1"
/ \
de bc
/ \ / \
"0"/ \"1" "0"/ \"1"
d e b c
The next step unifies all the steps:
Huffman-Table
/ \
"0"/ \"1"
/ \
/ \
bcde a
/ \
"0"/ \"1"
/ \
de bc
/ \ / \
"0"/ \"1" "0"/ \"1"
d e b c
Finally this encoding table is created from the labels on the branches:
a 1
b 010
c 011
d 000
e 001
Huffman coding is ambiguous, and there is no rule defining what element in the tree is ordered to the left or right, so it's also possible to reverse the 0s and 1s to get:
a 0
b 100
c 101
d 110
e 111
In the case of non-binary Huffman encodings, dummy elements may also have to be added to the tree.
Please see "References" for more information about Huffman encodings.
DIAGNOSTICS
- Bad size $size for Huffman table, must be integer >= 2
-
(Fatal) The user attempted to make a Huffman table which isn't possible.
- Use $o->symbols (\%t, alphabet => ['a', 'b',...]) for table sizes bigger than 10
-
(Fatal) The user must supply an alphabet for the Huffman codes if size > 10
- Symbol table has no entries
-
(Fatal) User supplied an empty hash reference in "symbols".
- Non-numerical value '$s->{$k}' for key '$k'
-
(Fatal) User supplied a hash reference with a non-numerical value in "symbols".
- Input values don't sum to 1.0; use $o->symbols (\%s, notprob => 1) if not a probability distribution
-
(Fatal) User supplied a non-probability distribution hash table to "symbols".
- Symbol 'X' is not in the symbol table
-
(Warning) The user tried to encode a list of symbols with "encode" which contained a symbol
X
which was not in the symbol table. - Input starting from XYZ was not Huffman encoded using this table
-
(Warning) The user tried to decode a Huffman-encoded string with "decode", but it doesn't seem to have been encoded using the table associated with the object.
- Bad object
-
(Fatal) A user-supplied object didn't contain the correct tables.
- Value N for symbol X is not a probability
-
(Fatal) A user-supplied value for symbol X was either greater than 1 or less than 0, and the user did not specify "notprob".
- Negative weight N for symbol X
-
(Fatal) The weighting for symbol X was negative.
BUGS
The module is slow for large symbol tables. For example a hash with 100,000 entries might take several minutes to finish building the table.
Huffman encoding itself does not achieve any compression when there are, for example, two symbols A and B with probabilities 0.001 and 0.999, since every symbol takes at least one bit of information to encode. Arithmetic encoding and asymmetric numeral systems are two forms of encoding designed to overcome this issue.
DEPENDENCIES
JSON::Parse and JSON::Create are used by "save" and "load" for storing the Huffman table. The module also uses the core modules Carp, "looks_like_number" in Scalar::Util for detecting numbers, and "ceil" in POSIX to calculate the size of tables needed in encoding.
SEE ALSO
CPAN modules
- Algorithm::Huffman
-
Perl extension that implements the Huffman algorithm
Unfortunately this is 17 years old and seems to fail all its tests on most platforms.
Other software
- Rosetta code examples
-
This includes a Perl example.
References
- Original article
-
David A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the Institute of Radio Engineers, volume 40, number 9, pages 1098-1102, September 1952.
- Elements of Information Theory
-
Elements of Information Theory by Thomas Cover and Joy Thomas, (either the first or second edition) details Huffman encoding.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2015-2019 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.