Data compression has been an important subject since the dawn of the electronic age. It was even more important when the costs of transmitting information and of storing it on storage devices were high. Nowadays both of these costs are much lower than before. Is there still a place for data compression?
It is true that many of us no longer compress (or zip) our data files these days. But for photos and video we still rely on data compression. For example, the JPEG format that we use today (which typically achieves a compression ratio of about 10:1) is based on very sophisticated compression techniques.
Data compression can be traced back to Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, where he introduced the concept of “entropy” to measure the statistical nature of the source. By source we mean the set of objects or symbols we will assemble to transmit or store.
If the source consists of 4 symbols s1, s2, s3, s4, a simple way to transmit these symbols is to use 2 bits for their representations: for example, 00 for s1, 01 for s2, 10 for s3, and 11 for s4.
Within this article all logarithms are taken to base 2, so log n denotes the base-2 logarithm of n.
In general, if we have n symbols in the source, we need log n bits (rounded up to the nearest integer if necessary) to represent all of them. This is because if we use m binary bits to represent n symbols, then 2 to the power of m must be greater than or equal to n (2^m >= n). Therefore, by the definition of logarithm, m >= log n.
So
we need a string at least of log 4 = 2 bits to represent 4 symbols.
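As a quick check, this calculation can be sketched in a few lines of Python (the function name `bits_needed` is just an illustrative choice):

```python
import math

def bits_needed(n):
    """Smallest m with 2**m >= n, i.e. ceil(log2(n))."""
    return math.ceil(math.log2(n))

print(bits_needed(4))   # 2
print(bits_needed(26))  # 5 bits cover the 26 English letters
```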
Example 1: The source is still s1, s2, s3, s4, but now we assume the probability for s1 to occur is ½, the probability for s2 is ¼, and the probabilities for s3 and s4 are both 1/8. If we accept a variable-length coding scheme, then we can reduce the “average length” of these symbols.
In this case we use 0 for s1, 10 for s2, 110 for s3, and 111 for s4, so the numbers of bits used (the lengths) for coding these 4 symbols are 1, 2, 3, 3 respectively. The average length of this coding scheme is therefore 1 * ½ + 2 * ¼ + 3 * 1/8 + 3 * 1/8 = 1.75
This number is also the expected value of the code length in probability theory. Finding a coding scheme with the shortest average length is one of the main goals of coding theory.
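The average-length computation above can be sketched in Python (the name `average_length` is illustrative):

```python
def average_length(lengths, probs):
    """Expected codeword length: sum over symbols of length * probability."""
    return sum(l * p for l, p in zip(lengths, probs))

# Example 1: codeword lengths 1, 2, 3, 3 with probabilities 1/2, 1/4, 1/8, 1/8
print(average_length([1, 2, 3, 3], [0.5, 0.25, 0.125, 0.125]))  # 1.75
```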
A “real life example” of s1, s2, s3, s4 above: let them stand for sunny, cloudy, light rain, and heavy rain weather in a town in Florida, with the probabilities coming from its weather statistics. In that case, to tell people in New York or Toronto the current weather in Florida, only a single symbol needs to be transmitted, and the average length is only 1.75.
This variable-length transmission does have its own disadvantages. It is more difficult for the receiving end to decode the message, and it may cause confusion or ambiguity during decoding. For example, once you use 10 for s2, you can’t use 101 or 100 for the other symbols unless you have another means to resolve the ambiguity. Since there is no stop bit to mark the symbol boundary, the coded message may in some cases be interpreted in more than one way if 10 appears again as a prefix (the front part) of the codeword for another symbol.
The solution is to restrict your coding scheme to be “prefix-free”. In other words, a symbol’s codeword cannot appear as a prefix of the codeword for another symbol.
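The prefix-free condition is easy to check mechanically. A minimal Python sketch (the helper name `is_prefix_free` is a hypothetical choice, not from any library):

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_free(["0", "10", "110", "111"]))  # True
print(is_prefix_free(["10", "101", "0"]))         # False: 10 is a prefix of 101
```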
One may think that a coding scheme has to be decodable back to its original source (this kind of compression is called lossless compression). But there are lossy compressions in use today. The JPEG coding scheme for storing photo files doesn’t reproduce the exact image, yet we all use it because of its high compression ratio, and the discrepancy it introduces is not noticeable to human eyes. However, for text messages we normally demand lossless compression. A uniquely decodable variable-length coding scheme ensures that codewords can be recognized unambiguously, so that the decoding process is the exact inverse of the encoding process. A prefix-free coding scheme is always uniquely decodable. Please note that some authors use the term “prefix codes” for prefix-free coding schemes.
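Because a prefix-free code is uniquely decodable, a greedy left-to-right scan recovers the symbols. A minimal Python sketch using the code from Example 1 (the `decode` helper is an illustrative name):

```python
def decode(bits, code):
    """Greedy left-to-right decoding; unambiguous because the code is prefix-free."""
    inverse = {w: s for s, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # a complete codeword has been read
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

# Example 1's code: s1 -> 0, s2 -> 10, s3 -> 110, s4 -> 111
code = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}
print(decode("010110111", code))  # s1s2s3s4
```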
Exercise: The source consists of A (coded as 1), B (coded as 10), C (coded as 000), and D (coded as 100). Please decode the message 101100000101100. Is this coding scheme uniquely decodable? Is it also prefix-free?
Huffman Coding Method
David Huffman, a student at MIT, published the paper "A Method for the Construction of Minimum-Redundancy Codes" in 1952, in which he proposed a procedure for constructing a prefix-free coding scheme that minimizes the average code length.
The first step of this procedure is to arrange the symbols in decreasing order of their probabilities; then repeatedly join the two nodes (or symbols) with the smallest probabilities to form a new node carrying the sum of the probabilities just joined. Assign ‘1’ to one branch (the lower one in the following example) and ‘0’ to the other branch (the upper one). When we reach the top level, which has probability 1, we start constructing the codewords from the 1s and 0s we have assigned. Let’s use the following example for illustration.
Example 2: Symbol/Probability: A/0.3  B/0.3  C/0.13  D/0.12  E/0.1  F/0.05
Please note that the symbols have already been arranged in decreasing order of probabilities.
Step 1: We first work on the lowest pair, E and F; their combined probability is 0.15. We assign ‘0’ to E and ‘1’ to F. When combining two symbols we always assign ‘0’ to the symbol with the higher probability (the one above) and ‘1’ to the one with the lower probability (the one below).
A/0.3
B/0.3
C/0.13
D/0.12
E/0.1 ‘0’
EF 0.15
F/0.05 ‘1’
Step 2: Do the same to C and D, because they have the lowest probabilities at this point:
A/0.3
B/0.3
C/0.13 ‘0’
CD 0.25
D/0.12 ‘1’
E/0.1 ‘0’
EF 0.15
F/0.05 ‘1’
Step 3: Combine CD with EF to form CDEF, because 0.15 and 0.25 are now the two lowest probabilities.
A/0.3
B/0.3
C/0.13 ‘0’
CD 0.25 ‘0’
D/0.12 ‘1’
CDEF 0.4
E/0.1 ‘0’
EF 0.15 ‘1’
F/0.05 ‘1’
Step 4: Combine A and B to form AB, and then combine AB and CDEF:
A/0.3 ‘0’
AB 0.6
B/0.3 ‘1’
‘0’
ABCDEF 1
‘1’
C/0.13 ‘0’
CD 0.25 ‘0’
D/0.12 ‘1’
CDEF 0.4
E/0.1 ‘0’
EF 0.15 ‘1’
F/0.05 ‘1’
Now we have reached the top level, whose combined probability is 1, so we can start constructing the codewords. To obtain the codeword for A we start from the top level, where the branch for A and B is ‘0’ and the branch for CDEF is ‘1’, then go to the next level, where A itself is also ‘0’.
So A is represented by 00, B by 01, C by 100, D by 101, E by 110, and F by 111.
The average length = 0.3*2 + 0.3*2 + 0.13*3 + 0.12*3 + 0.1*3 + 0.05*3 = 2.4 bits. Certainly this is a prefix-free coding scheme.
When we create this “tree” we start from the lowest probabilities (at the bottom) and work our way up; the symbols with the highest probabilities, at the top, are handled last. In this way the bottom symbols with lower probabilities are always represented by more bits, because the codeword length increases by one each time we move up one level.
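The merging procedure can be sketched in Python with a priority queue. This is a standard textbook construction rather than Huffman's original presentation; when probabilities tie (as with A and B here), the tie-breaking order may swap some codewords, but the codeword lengths, and hence the average length of 2.4 bits, come out the same:

```python
import heapq
from itertools import count

def huffman(probs):
    """Build a prefix-free code for a {symbol: probability} table by
    repeatedly merging the two least probable nodes."""
    ticket = count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(ticket), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # lowest probability: branch '1'
        p2, _, c2 = heapq.heappop(heap)  # next lowest: branch '0'
        merged = {s: "1" + w for s, w in c1.items()}
        merged.update({s: "0" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(ticket), merged))
    return heap[0][2]

probs = {"A": 0.3, "B": 0.3, "C": 0.13, "D": 0.12, "E": 0.1, "F": 0.05}
code = huffman(probs)
avg = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(avg)  # approximately 2.4
```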
The following article shows the frequencies of each
of the 26 English letters based on a sample of 40,000 words:
https://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
In theory you could use these frequencies to construct a prefix-free coding scheme for English messages with the lowest average length.
According to Shannon, this type of coding scheme is based on first-order statistics.
If you want to take into account the conditional probability of one character following another, then the coding scheme has to be based on second-order statistics. We all know that if the character “q” appears in an English message, the next character is almost certain to be “u”; in the second-order statistics table of the article above, that probability is about 0.9949749. Coding schemes using second-order statistics are more complicated, and we will not discuss them further.
In the previous two examples we notice that the number of bits required for coding a symbol is inversely related to the probability of that symbol. In Example 1 we have:

symbol    s1    s2    s3    s4
p         1/2   1/4   1/8   1/8
length    1     2     3     3
Since log 2 = 1, log 4 = 2 and log 8 = 3, in this example log(1/p) equals the codeword length for each symbol, and the average length = p1 * log(1/p1) + p2 * log(1/p2) + p3 * log(1/p3) + p4 * log(1/p4) = -(p1 * log p1 + p2 * log p2 + p3 * log p3 + p4 * log p4) = 1.75.
The quantity -(p1 * log p1 + p2 * log p2 + p3 * log p3 + p4 * log p4) is called the entropy of the source.
So in Example 1 the average length of our coding scheme equals the entropy of the source.
However, in Example 2:

symbol    A     B     C      D      E     F
p         0.3   0.3   0.13   0.12   0.1   0.05
length    2     2     3      3      3     3
Entropy of the source = -(p1 * log p1 + p2 * log p2 + p3 * log p3 + p4 * log p4 + p5 * log p5 + p6 * log p6) = -(0.3*log 0.3 + 0.3*log 0.3 + 0.13*log 0.13 + 0.12*log 0.12 + 0.1*log 0.1 + 0.05*log 0.05) = 2.34018, which is not the same as the average length of 2.4.
In general, the average length of any uniquely decodable coding scheme must be greater than or equal to the entropy of the source. This is Shannon’s Source Coding Theorem of 1948.

When the average length of a uniquely decodable coding scheme equals the entropy of the source, the coding is optimal. So the coding scheme in Example 1 is optimal.
So far we have been talking about binary coding, where the bits we use are either 0 or 1. Now we temporarily lift that restriction. Suppose we have a source of n symbols s1, s2, ..., sn, and each of them is coded as a string of characters drawn from an alphabet of r distinct characters (analogous to our 0 and 1). If there exists a uniquely decodable coding scheme with codeword lengths l1, l2, ..., ln, then

1/r^l1 + 1/r^l2 + ... + 1/r^ln <= 1

Conversely, given a set of numbers l1, l2, ..., ln that satisfy this inequality, there exists a uniquely decodable coding scheme over an alphabet of r characters with these codeword lengths.
This is the famous Kraft-McMillan inequality in coding theory (1956).
In Example 1 we have r = 2 and codeword lengths 1, 2, 3, 3, and ½ + 1/(2^2) + 1/(2^3) + 1/(2^3) = 1.
In Example 2 we have r = 2 and codeword lengths 2, 2, 3, 3, 3, 3, and 1/(2^2) + 1/(2^2) + 1/(2^3) + 1/(2^3) + 1/(2^3) + 1/(2^3) = 1.
When the sum on the left side of the inequality reaches 1, we know that there is almost no room for improvement in terms of codeword lengths: if we shorten any codeword, the sum will exceed 1.
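The left-hand side of the Kraft-McMillan inequality for both examples can be checked with a small Python helper (the name `kraft_sum` is illustrative):

```python
def kraft_sum(lengths, r=2):
    """Left-hand side of the Kraft-McMillan inequality: sum of 1/r**l."""
    return sum(r ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))        # 1.0 for Example 1
print(kraft_sum([2, 2, 3, 3, 3, 3]))  # 1.0 for Example 2
```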
The Huffman coding method is still being used today. Part of the JPEG coding scheme is based on the Huffman method.