Wednesday, January 31, 2018

Introduction to Coding Theory and Data Compression - Part 1


Data compression has been an important subject since the dawn of the electronic age. It mattered even more when the costs of transmitting information and of storing it on storage devices were high. Nowadays both of these costs are much lower than before. Is there still a place for data compression?

It is true that many of us no longer compress (or zip) our data files these days. But for photos and video we still rely on data compression. For example, the JPEG format that we use today (which achieves a compression ratio of about 10:1) is based on very sophisticated compression techniques.

Data compression can be traced back to Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, where he introduced the concept of “entropy” to measure the statistical nature of the source.  By source we mean the set of objects or symbols we will assemble to transmit or store. 

If the source consists of 4 symbols s1, s2, s3, s4, a simple way to transmit them is to use 2 bits for each symbol. For example, 00 for s1, 01 for s2, 10 for s3, and 11 for s4.

Within this article all logarithms are taken in base 2, so log n means log base 2 of n.

In general, if we have n symbols in the source, we need log n bits (rounded up to the nearest integer if necessary) to represent all of them. This is because if we use m binary bits to represent n symbols, then 2 to the power of m must be greater than or equal to n (2^m >= n). Therefore, by the definition of the logarithm, we have m >= log n.

So we need a string of at least log 4 = 2 bits to represent 4 symbols.
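This ceiling calculation is easy to check in a few lines of Python (a quick sketch; the function name is mine, not from the article):

```python
import math

def bits_needed(n):
    """Minimum number of bits in a fixed-length binary code for n symbols."""
    return math.ceil(math.log2(n))

print(bits_needed(4))   # 2 bits for 4 symbols
print(bits_needed(26))  # 5 bits for, say, the 26 English letters
```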

Example 1:  The source is still s1, s2, s3, s4. But now we assume the probability for s1 to occur is ½, the probability for s2 is ¼, and the probabilities for s3 and s4 are both 1/8.  If we accept a variable-length coding scheme then we can reduce the “average length” of these symbols.

In this case we use 0 for s1, 10 for s2, 110 for s3, and 111 for s4. So the number of bits used (the length) for coding these 4 symbols are 1,  2, 3, 3 respectively.

The average length of this coding scheme is therefore  1 * ½  + 2 * ¼ + 3 * 1/8 + 3 * 1/8 = 1.75
This number is also the expected value of the code length in Probability Theory.
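As a sanity check, this expected value can be computed directly (a minimal sketch; average_length is a name I chose for illustration):

```python
def average_length(probs, lengths):
    """Expected codeword length: the sum of p_i * l_i over all symbols."""
    return sum(p * l for p, l in zip(probs, lengths))

# Example 1: probabilities 1/2, 1/4, 1/8, 1/8 with lengths 1, 2, 3, 3.
print(average_length([0.5, 0.25, 0.125, 0.125], [1, 2, 3, 3]))  # 1.75
```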

Finding a coding scheme with the shortest average length is one of the main goals in coding theory.

A “real-life example” of s1, s2, s3, s4 above is the sunny, cloudy, light-rain, and heavy-rain weather conditions for a town in Florida, with the probabilities coming from its weather statistics. In that case, to tell people in New York or Toronto the current weather in Florida, only a single symbol needs to be transmitted, and the average length is only 1.75 bits.


This variable-length transmission does have its own disadvantages. It is harder for the receiving end to decode the message, and it may cause confusion or ambiguity during decoding. For example, once you use 10 for s2, you cannot use 101 or 100 for the other symbols unless you have another means to resolve the ambiguity. Since there is no stop bit to mark symbol boundaries, the coded message may in some cases be interpreted in more than one way if 10 appears again as a prefix (the front part) of the codeword for another symbol.

The solution is to restrict your coding scheme to be “prefix-free”. In other words, a symbol’s codeword cannot appear as a prefix of the codeword for another symbol.

One may think that a coding scheme has to be decodable back to its original source (this kind of compression is called lossless compression). But there are lossy compressions in use today. The JPEG coding scheme for storing photo files doesn’t reproduce the exact image, yet we all use it because its compression ratio is high and the discrepancy it introduces is not noticeable to the human eye. However, for text messages we normally demand lossless compression.  A uniquely decodable scheme in variable-length codes ensures that codewords can be recognized unambiguously, so that the decoding process is the exact inverse of the encoding process. A prefix-free coding scheme is always uniquely decodable.  Please note that some authors use the term “prefix codes” for the prefix-free coding method.
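Both the prefix-free property and the simple left-to-right decoding it enables can be sketched in a few lines of Python (the helper names are mine; the code uses the Example 1 scheme):

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    words = list(codewords)
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode(bits, code):
    """Decode a bit string with a prefix-free code by greedy matching."""
    inverse = {w: s for s, w in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:        # a complete codeword has been read
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("message ends in the middle of a codeword")
    return "".join(decoded)

code = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}  # Example 1
print(is_prefix_free(code.values()))   # True
print(decode("010110111", code))       # s1s2s3s4
```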
 
Exercise: The source consists of  A  (coded as 1),  B (coded as 10), C (coded as 000), and D (coded as 100).   Please decode the message  101100000101100.
Is this coding scheme uniquely decodable?   Is it also prefix-free?

Huffman Coding Method

David Huffman, then a student at MIT, published the paper "A Method for the Construction of Minimum-Redundancy Codes" in 1952, in which he proposed a procedure for constructing a prefix-free coding scheme that minimizes the average code length.

The first step of this procedure is to arrange the symbols in decreasing order of their probabilities. Then repeatedly join the two nodes (or symbols) with the smallest probabilities to form a new node whose probability is the sum of the two just joined. Assign ‘1’ to one branch (the lower one in the following example) and ‘0’ to the other branch (the upper one).  When we reach the highest level, which has probability 1, we start constructing the codes from the 1s and 0s we have assigned.  Let’s use the following example as an illustration.

Example 2:    Symbol/ Probability:      A/0.3  B/0.3  C/0.13  D/0.12  E/0.1  F/0.05

Please note that it has already been arranged in decreasing order of probabilities.

Step 1:  We first work from the lowest pair E and F; their combined probability is 0.15. We assign ‘0’ to E and ‘1’ to F. When combining two symbols we always assign ‘0’ to the symbol with the higher probability (the one above) and ‘1’ to the one with the lower probability (the one below).

A/0.3   
B/0.3  
C/0.13   
D/0.12   
E/0.1        ‘0’       
                EF 0.15
F/0.05       ‘1’

Step 2: Do the same to C and D because they have the lowest probability at this point:

A/0.3      
B/0.3       
C/0.13       ‘0’
                CD 0.25
D/0.12      ‘1’
E/0.1          ‘0’
                EF 0.15
F/0.05       ‘1’


Step 3: Combine  CD with EF to form  CDEF because 0.15 and  0.25 are the lowest pair.

A/0.3      
B/0.3       
C/0.13      ‘0’
                CD 0.25        ‘0’
D/0.12       ‘1’
                                 CDEF 0.4
E/0.1         ‘0’
                EF 0.15         ‘1’
F/0.05        ‘1’


Step 4: Combine A and B to form  AB, and then combine AB and CDEF:

A/0.3        ‘0’
                AB 0.6            
B/0.3        ‘1’                                         
                                                           ‘0’
                                                       ABCDEF  1
                                                           ‘1’
C/0.13      ‘0’
                CD 0.25        ‘0’
D/0.12       ‘1’
                                 CDEF 0.4
E/0.1         ‘0’
                EF 0.15         ‘1’
F/0.05        ‘1’


Now we have reached the highest level, which has the combined probability of 1, so we can start constructing the codes. To obtain the code for A we start from the highest level, where ‘0’ is assigned to AB and ‘1’ to CDEF, then go to the next level, where A itself gets ‘0’.

So A is represented by 00.
B is represented by 01
C is represented by 100
D is represented by 101
E is represented by 110
F is represented by 111.

The average length =  0.3*2 + 0.3*2 + 0.13*3 + 0.12*3 + 0.1*3 + 0.05*3  = 2.4 bits
Certainly this is a prefix-free coding scheme.
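The merging procedure above translates naturally into code using a priority queue. Below is a minimal sketch (my own implementation, not the article's notation; tie-breaking between equal probabilities may differ from the hand-worked example, so individual codewords can come out differently, but the average length is the same):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build a prefix-free code of minimum average length (Huffman, 1952).

    freqs maps each symbol to its probability; returns symbol -> codeword.
    """
    ticket = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(ticket), {s: ""}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # smallest probability
        p1, _, code1 = heapq.heappop(heap)   # second smallest
        # Merge: prepend '0' to one subtree's codewords and '1' to the other's.
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(ticket), merged))
    return heap[0][2]

freqs = {"A": 0.3, "B": 0.3, "C": 0.13, "D": 0.12, "E": 0.1, "F": 0.05}
code = huffman(freqs)                              # Example 2
avg = sum(p * len(code[s]) for s, p in freqs.items())
print(round(avg, 2))  # 2.4
```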

When we create this “tree” we start from the lowest probabilities (at the bottom) and work our way up; the symbols with the highest probabilities at the top are handled last. In this way, the bottom symbols with lower probabilities will always be represented by more bits, because the number of bits increases by one each time we move up one level.

The following article shows the frequency of each of the 26 English letters, based on a sample of 40,000 words:
https://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
In theory you can construct a prefix-free coding scheme for English messages with the lowest average length.

According to Shannon, this type of coding scheme is based on first-order statistics. If you want to take into account the conditional probability of one character following another, then the coding scheme has to be based on second-order statistics. We all know that if the character “q” appears in a message, the next character is almost certain to be “u”.  In the second-order statistics table of the article above, that probability is about 0.9949749.  Coding schemes using second-order statistics are more complicated and we are not going to discuss them further.

In the previous two examples we notice that the number of bits required for coding a symbol is inversely related to the probability of that symbol.  In example 1 we have

              s1      s2     s3      s4
p            ½       ¼      1/8    1/8
length     1        2       3       3

Since log 2 = 1, log 4 = 2, and log 8 = 3, in this example log 1/p equals the codeword length for each symbol, and the average length = p1 * log 1/p1 + p2 * log 1/p2 + p3 * log 1/p3 + p4 * log 1/p4 = - ( p1 * log p1 + p2 * log p2 + p3 * log p3 + p4 * log p4 ) = 1.75.

The amount  - (  p1 * log p1 + p2 * log p2 + p3 * log p3 + p4 * log p4 ) is called the entropy of the source.

So in example 1 the average length of our coding scheme equals the entropy of the source.

However, in example 2

              A        B       C       D       E       F
p           0.3       0.3    0.13   0.12   0.1   0.05
length     2        2        3       3         3       3

Entropy of the source
= - ( p1 * log p1 + p2 * log p2 + p3 * log p3 + p4 * log p4 + p5 * log p5 + p6 * log p6 )
= - ( .3 * log .3 + .3 * log .3 + .13 * log .13 + .12 * log .12 + .1 * log .1 + .05 * log .05 )
= 2.34018, which is not the same as the average length of 2.4.

In general, the average length of any uniquely decodable coding scheme must
be greater than or equal to the entropy of the source. This is Shannon’s Source Coding Theorem of 1948.
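Both examples can be verified numerically with a short entropy function (a sketch; entropy is my own helper name):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))                     # 1.75 (Example 1)
print(round(entropy([0.3, 0.3, 0.13, 0.12, 0.1, 0.05]), 5))   # 2.34018 (Example 2)
```

For Example 1 the average length (1.75) meets this lower bound exactly; for Example 2 the Huffman average (2.4) sits just above the bound of 2.34018, as the theorem requires.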

When the average length of a uniquely decodable coding scheme equals the entropy of the source, the coding is optimal. So the coding scheme in Example 1 is optimal.

So far we have been talking about binary coding, where the bits we use are either 0 or 1. Now we temporarily lift that restriction.

Suppose we have a source of n symbols s1, s2, …, sn, and each of them is coded as a string of characters selected from a pool of r distinct characters (which plays the role of our 0 and 1). If there exists a uniquely decodable coding scheme with codeword lengths l1, l2, …, ln, then


1/(r^l1) + 1/(r^l2) + ..... + 1/(r^ln) <= 1

Conversely, given a set of numbers l1, l2, …, ln that satisfy this inequality, there exists a uniquely decodable coding scheme over a set of r characters with these codeword lengths.

This is the famous Kraft-McMillan Inequality in Coding theory (1956).

In example 1 we have  r = 2  and codeword lengths 1, 2, 3, 3.
and  ½ + 1 / (2^2)  + 1/ (2^3) + 1/(2^3) = 1

In example 2  we have r = 2 and codeword lengths 2, 2, 3, 3, 3,3
  and 1 / (2^2)  + 1/ (2^2) + 1/(2^3) + 1/(2^3) + 1/(2^3) + 1/(2^3) = 1

When the sum on the left side of the inequality reaches 1, we know there is almost no room for improvement in terms of codeword lengths: if we shortened any codeword, the sum would exceed 1 and violate the inequality.
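This check is a one-liner (a sketch; kraft_sum is my own name for the left-hand side):

```python
def kraft_sum(lengths, r=2):
    """Left-hand side of the Kraft-McMillan inequality: sum of r^(-l_i)."""
    return sum(r ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))        # 1.0   -- Example 1, no slack
print(kraft_sum([2, 2, 3, 3, 3, 3]))  # 1.0   -- Example 2, no slack
print(kraft_sum([1, 2, 3]))           # 0.875 -- room to spare
```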

The Huffman coding method is still being used today. Part of the JPEG coding scheme is based on the Huffman method.






