NAME
Algorithm::Huffman - Perl extension that implements the Huffman algorithm
SYNOPSIS
use Algorithm::Huffman;
my $huff = Algorithm::Huffman->new(\%char_counting);
my $encode_hash = $huff->encode_hash;
my $decode_hash = $huff->decode_hash;
DESCRIPTION
This modules implements the huffman algorithm. The aim is to create a good coding scheme for a given list of different characters (or even strings) and their occurence numbers.
ALGORITHM
Please have a look to a good data compression book for a detailed view. However, the algorithm is like every good algorithm very easy.
Assume we have a heap (keys are the characters/strings; values are their occurencies). In each step of the algorithm, the two rarest characters are looked at. Both get a suffix (one "0", the other "1"). They are joined together and will occur from that time as one "element" in the heap with their summed occurencies. The joining creates a tree growing on while the heap is reducing.
Let's take an example. Given are the characters and occurencies.
a (15) b(7) c(6) d(6) e(5)
In the first step e and d are the rarest characters, so we create this new heap and tree structure:
a(15) de(11) b(7) c(6)
de
/ \
"0"/ \"1"
d e
Next Step:
a(15) bc(13) de(11)
de bc
/ \ / \
"0"/ \"1" "0"/ \"1"
d e b c
Next Step:
a(15) bcde(24)
bcde
/ \
"0"/ \"1"
/ \
de bc
/ \ / \
"0"/ \"1" "0"/ \"1"
d e b c
Next Step unifies the rest:
Huffman-Table
/ \
"0"/ \"1"
/ \
/ \
bcde a
/ \
"0"/ \"1"
/ \
de bc
/ \ / \
"0"/ \"1" "0"/ \"1"
d e b c
Finally this encoding table would be created:
a 1
b 010
c 011
d 000
e 001
Please note, that there is no rule defining what element in the tree is ordered to left or to right. So it's also possible to get e.g. the coding scheme:
a 0
b 100
c 101
d 110
e 111
METHODS
- my $huff = Algorithm::Huffman->new( HASHREF )
-
Creates a new Huffman table, based on the given occurencies of characters. The keys of the given hashref are the characters/strings, the values are their occurencies.
A hashref is used, as such a hash can become quite large (e.g. all three letter combinations).
- $huff->encode_hash
-
Returns a reference to the encoding hash. The keys of the encoding hash are the characters/strings passed at the construction. The values are their bit representation. Please note that the bit represantations are strings of ones and zeros is returned and not binary numbers.
- $huff->decode_hash
-
Returns a reference to the decoding hash. The keys of the decoding hash are the bit presentations, while the values are the characters/strings the bitstrings stands for. Please note that the bit represantations are strings of ones and zeros is returned and not binary numbers.
EXPORT
None by default.
BUGS
There is no great parameter validation. That will be changed in future versions.
If a character/string has occurs zero times, it is still coded. At the moment, you have to grep them out before. I don't plan to change it, as it can realistic happen and they would play a role. (Imagine, you would code all three letter combinations found in some english texts, you still would have to code all ASCII characters, even if they don't occur in the texts you have analyzed. Reason is that they could occur in other texts and you would have to guarantee that you can code every text without any lost information)
It isn't tested with a big histogram of characters/strings.
There could be some others, as this code is still in the ALPHA stadium.
TODO
I'll need a encode
and decode
method based on the created internal huffman table should be implemented.
SEE ALSO
Every good book about data compression.
AUTHOR
Janek Schleicher, <bigj@kamelfreund.de>
COPYRIGHT AND LICENSE
Copyright 2002 by Janek Schleicher
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.