source coding

Source encoding is the first step in our information transfer pipeline. We will go through various definitions needed to design and quantify an efficient compact code. After that we will use these to implement specific source coding methods.

A source generates messages, which consist of “symbols”.
Each symbol is represented by “characters” from an alphabet. The alphabet for most electronic systems is {0,1} (binary).

In a memoryless source, there is no correlation between the generated symbols (having just seen an A makes it neither more nor less likely that the next symbol is an A).
If the probability distribution of the generated symbols does not change over time, the source is “stationary”.

properties

A source code is:

non-singular: if all codewords are different
uniquely decodable: if there is no message for which the decoding is ambiguous
instantaneously decodable: if it is uniquely decodable and if it is possible to decode each message without waiting for subsequent codewords
a block code: if all codewords have the same length
uniquely decodable if it is a non-singular block code

In general it is not trivial to determine if a code is uniquely decodable.
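Although there is no simple rule, unique decodability can be tested mechanically. The sketch below uses the Sardinas-Patterson test, a standard algorithm that these notes do not otherwise cover. It repeatedly collects the "dangling suffixes" left over when one codeword (or an earlier suffix) is a prefix of another, and reports a code as not uniquely decodable as soon as a dangling suffix is itself a codeword. The function names are illustrative assumptions.

```python
def dangling_suffixes(prefixes, words):
    """All non-empty w such that p + w == word for some p in prefixes, word in words."""
    return {
        word[len(p):]
        for p in prefixes
        for word in words
        if word.startswith(p) and len(word) > len(p)
    }


def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test on a list of codeword strings."""
    code = set(codewords)
    if len(code) != len(codewords):
        return False  # two symbols share a codeword: the code is singular

    seen = set()
    current = dangling_suffixes(code, code)
    while current:
        if current & code:
            return False  # a dangling suffix is a codeword: ambiguity exists
        seen |= current
        following = dangling_suffixes(code, current) | dangling_suffixes(current, code)
        current = following - seen  # only explore suffixes not handled yet
    return True


print(is_uniquely_decodable(["0", "11", "00", "010"]))   # 00 is s3 or s1s1
print(is_uniquely_decodable(["0", "01", "011", "111"]))
```

The loop always terminates because every dangling suffix is a suffix of some codeword, so only finitely many can ever appear.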

example source codes 1

Consider an information source that has 4 possible values: s₁, s₂, s₃ and s₄.
We could represent this information using a binary alphabet {0,1} so that:

Code A: s₁=0, s₂=11, s₃=00, s₄=11.
Code B: s₁=0, s₂=11, s₃=00, s₄=010.
Code C: s₁=00, s₂=01, s₃=10, s₄=11.

Code A is not non-singular because A(s₂) = A(s₄) = 11.
Code B is non-singular but not uniquely decodable because 00 can be decoded as s₃ or s₁s₁.
Code C is a non-singular block code of length 2, so it is uniquely decodable.
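The two simplest properties can be checked directly from their definitions. The following is a minimal sketch with hypothetical helper names, applied to codes A, B and C above.

```python
def is_non_singular(codewords):
    """True if all codewords are different."""
    return len(set(codewords)) == len(codewords)


def is_block_code(codewords):
    """True if all codewords have the same length."""
    return len({len(c) for c in codewords}) == 1


codes = {
    "A": ["0", "11", "00", "11"],
    "B": ["0", "11", "00", "010"],
    "C": ["00", "01", "10", "11"],
}
for name, codewords in codes.items():
    print(name, is_non_singular(codewords), is_block_code(codewords))
# A False False
# B True False
# C True True
```

Only code C passes both checks, which is enough to conclude that it is uniquely decodable. Code B passes the non-singularity check, but that alone does not settle unique decodability.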

example source codes 2

Again, consider an information source that has 4 possible values: s₁, s₂, s₃ and s₄, represented using a binary alphabet {0,1}.

Code A: s₁=0, s₂=10, s₃=110, s₄=1110.
Code B: s₁=0, s₂=01, s₃=011, s₄=0111.
Code C: s₁=0, s₂=01, s₃=011, s₄=111.

Codes A, B and C are uniquely decodable.
Only code A is instantaneously decodable, e.g. 01011101100100010 = s₁s₂s₄s₃s₁s₂s₁s₁s₂.
Code B is not instantaneously decodable, e.g. 01101011100 = s₃s₂s₄s₁s₁ (when you see the first 0, you don’t know if it’s s₁ or the start of s₂, s₃ or s₄; when you see 01, you don’t know if it’s s₂ or the start of s₃ or s₄).
Code C is not instantaneously decodable, e.g. 011111111… = ? (you won’t know if it’s s₁s₄s₄…, s₂s₄s₄… or s₃s₄s₄… until you finally see another 0, or the end of the message, and can count the 1s).

Prefix condition

A prefix of a codeword is a substring of the codeword consisting of consecutive characters of the codeword starting from the first character of the codeword.

The prefix condition is a necessary and sufficient condition for a code to be instantaneously decodable. It states that none of the codewords is a prefix of another codeword.
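The prefix condition translates directly into a check and a decoder. The sketch below is illustrative (the function names and the dictionary representation of a code are assumptions): `is_prefix_free` tests the condition, and `decode` emits a symbol as soon as the characters read so far form a codeword, which is exactly what "instantaneous" means.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(codewords)
        for j, b in enumerate(codewords)
    )


def decode(message, code):
    """Decode a string using a prefix-free code given as {symbol: codeword}."""
    lookup = {codeword: symbol for symbol, codeword in code.items()}
    symbols, buffer = [], ""
    for character in message:
        buffer += character
        if buffer in lookup:  # a complete codeword: emit it immediately
            symbols.append(lookup[buffer])
            buffer = ""
    if buffer:
        raise ValueError("message ends in the middle of a codeword")
    return symbols


code_a = {"s1": "0", "s2": "10", "s3": "110", "s4": "1110"}
print(is_prefix_free(list(code_a.values())))
print(decode("01011101100100010", code_a))
```

The decoder never looks ahead, so it only gives correct results when `is_prefix_free` holds for the code.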

exercise 1

Design an instantaneously decodable binary code with lengths: 3, 2, 3, 2, 2.

answer

3, 2, 3, 2, 2 -> 2, 2, 2, 3, 3: re-ordering the lengths makes it easier to come up with codewords that satisfy the prefix condition.
00
01
10
110
111

A=110, B=00, C=111, D=01, E=10

exercise 2

Design an instantaneously decodable ternary code with lengths: 2, 3, 1, 1, 2.

answer

lengths: 1, 1, 2, 2, 3 (ternary code, 3 possible values)
0
1
20
21
220

A=20, B=220, C=0, D=1, E=21

exercise 3

Design an instantaneously decodable binary code with lengths: 2, 3, 2, 2, 2.

answer

lengths: 2, 2, 2, 2, 3
00
01
10
11
ERROR

The four codewords of length 2 already use every possible 2-character prefix, so any codeword of length 3 would have one of them as a prefix. We therefore cannot make an instantaneously decodable code with these lengths.

Properties of instantaneously decodable codes

It is easy to check whether a code is instantaneously decodable using the prefix condition
It is easy to design an instantaneously decodable code with given lengths
Decoding is easy
Sensitivity to bit errors is much larger if the code is not a block code
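The procedure used in the exercise answers above (sort the lengths, then hand out the next free codeword) can be automated. The following is one possible sketch, not the only construction: it treats codewords as numbers in base r, counts upward, and appends zeros whenever the length increases. The names `to_base` and `build_prefix_code` are hypothetical, and the digit formatting assumes r ≤ 10.

```python
def to_base(value, r, width):
    """Write value as a string of exactly `width` base-r digits (r <= 10)."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, r)
        digits.append(str(digit))
    return "".join(reversed(digits))


def build_prefix_code(lengths, r=2):
    """Return prefix-free codewords in the order of `lengths`, or None if impossible."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, previous_length = 0, 0
    for i in order:
        length = lengths[i]
        value *= r ** (length - previous_length)  # append zeros for longer words
        if value >= r ** length:
            return None  # no unused prefix is left at this length
        codewords[i] = to_base(value, r, length)
        value += 1  # next free codeword of the same length
        previous_length = length
    return codewords


print(build_prefix_code([3, 2, 3, 2, 2]))        # ['110', '00', '111', '01', '10']
print(build_prefix_code([2, 3, 1, 1, 2], r=3))   # ['20', '220', '0', '1', '21']
print(build_prefix_code([2, 3, 2, 2, 2]))        # None
```

The three calls reproduce the answers to exercises 1, 2 and 3 above.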

Kraft’s inequality

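Kraft’s inequality is a necessary condition for a code with alphabet size r consisting of q codewords with lengths l₁, …, l_q to be instantaneously decodable:

$K = \sum_{i=1}^q r^{-l_i} \le 1$

Conversely, if Kraft’s inequality holds, there exists an instantaneously decodable code with these codeword lengths. The inequality is therefore a necessary and sufficient condition on the codeword lengths, but a particular code that satisfies it is not necessarily instantaneously decodable.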

example

Code A: s₁ = 0, s₂ = 100, s₃ = 110, s₄ = 111.
Code B: s₁ = 0, s₂ = 100, s₃ = 110, s₄ = 11.
Code C: s₁ = 0, s₂ = 10, s₃ = 110, s₄ = 11.

Code A meets the prefix condition, so it is instantaneously decodable as expected. $K = 2^{-1} + 2^{-3} + 2^{-3} + 2^{-3} = 0.875 \le 1$

Code B does not meet the prefix condition (s₄ is a prefix of s₃), so it is not instantaneously decodable. $K = 2^{-1} + 2^{-3} + 2^{-3} + 2^{-2} = 1 \le 1$. This means it is possible to design an instantaneously decodable code with the given codeword lengths, for instance: 0, 110, 111, 10.

Code C does not meet the prefix condition (s₄ is a prefix of s₃), so it is not instantaneously decodable. $K = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-2} = {9 \over 8} > 1$. This means it is not possible to design an instantaneously decodable code with the given codeword lengths.
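The three values of K above can be reproduced with a few lines of code. This is a small sketch with a hypothetical function name. It uses exact fractions so that the comparison with 1 is not affected by floating-point rounding.

```python
from fractions import Fraction


def kraft_sum(lengths, r=2):
    """K = sum over all codewords of r^(-l_i), computed exactly."""
    return sum(Fraction(1, r ** length) for length in lengths)


codes = {
    "A": ["0", "100", "110", "111"],
    "B": ["0", "100", "110", "11"],
    "C": ["0", "10", "110", "11"],
}
for name, codewords in codes.items():
    k = kraft_sum([len(c) for c in codewords])
    print(name, k, "Kraft holds" if k <= 1 else "Kraft violated")
```

Note that the function only looks at the lengths: code B passes even though code B itself is not instantaneously decodable.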

Average codeword length

The average codeword length L is defined as:

$L = \sum_{i=1}^q p_i l_i$

with p_i the probability of the symbols and l_i the individual lengths of the q codewords.

A compact code has an average codeword length that is smaller than or equal to the average codeword length of all other uniquely decodable codes with the same source symbols and the same code alphabet.

Lower bound of the average codeword length

Every instantaneously decodable code of the source S={s₁, …, s_q} with alphabet {0, 1} has an average codeword length L that is equal to or larger than the entropy of the source:

$L \ge H(S)$

with p_i the probability of the symbols, l_i the individual lengths of the q codewords and r the alphabet size (r = 2 for the binary alphabet above). The inequality becomes an equality when
$p_i = r^{-l_i}$ for i=1,2,...,q

code efficiency

The efficiency of a code is calculated with:

$\eta = {H(S) \over L} \times 100\%$

A “special source” is a source with symbol probabilities p_i with i=1,2,...,q such that all $\log_2({1 \over p_i})$ are integers (p_i = 0.5, 0.25, 0.125, …). For such a source it is possible to design an instantaneously decodable code that is 100% efficient, with codeword lengths $l_i = \log_2({1 \over p_i})$.

examples

example code efficiency

Source	p_i	Code A	Code B
s₁	0.5	00	1
s₂	0.1	01	000
s₃	0.2	10	001
s₄	0.2	11	01

The entropy of the source is:

$H(S) = - \sum_{i=1}^q p_i \log_2(p_i)$

$H(S) = -(0.5 \log_2(0.5) + 0.1 \log_2(0.1) + 2 \times 0.2 \log_2(0.2)) = 1.76096$ bits/symbol, which is equal to
$H(S) = 0.5 \log_2({1 \over 0.5}) + 0.1 \log_2({1 \over 0.1}) + 2 \times 0.2 \log_2({1 \over 0.2}) = 1.76096$ bits/symbol

Average codeword length is:

$L = \sum_{i=1}^q p_i l_i$

$L_a = (0.5 \times 2)+(0.1 \times 2)+2 \times (0.2 \times 2) = 2$
$L_b = (0.5 \times 1)+(0.1 \times 3)+(0.2 \times 3)+(0.2 \times 2) = 1.8$

Efficiency is:

$\eta = {H(S) \over L} \times 100\%$

$\eta_a = {1.76096 \over 2} \times 100\% = 88.05\%$
$\eta_b = {1.76096 \over 1.8} \times 100\% = 97.83\%$
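The calculation above can be checked numerically. The helper names in this sketch are illustrative, and the probabilities and codewords are copied from the table.

```python
import math


def entropy(probabilities):
    """H(S) in bits/symbol."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def average_length(probabilities, codewords):
    """L = sum of p_i * l_i."""
    return sum(p * len(c) for p, c in zip(probabilities, codewords))


def efficiency(probabilities, codewords):
    """Efficiency in percent."""
    return 100 * entropy(probabilities) / average_length(probabilities, codewords)


p = [0.5, 0.1, 0.2, 0.2]
code_a = ["00", "01", "10", "11"]
code_b = ["1", "000", "001", "01"]

print(f"H(S) = {entropy(p):.5f} bits/symbol")
print(f"L_a = {average_length(p, code_a):.2f}, efficiency = {efficiency(p, code_a):.2f}%")
print(f"L_b = {average_length(p, code_b):.2f}, efficiency = {efficiency(p, code_b):.2f}%")
```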

example special source

Source	p_i
s₁	0.125
s₂	0.25
s₃	0.5
s₄	0.125

This is a special source because all $\log_2({1 \over p_i})$ are integers. This means that a 100% efficient compact code can be designed with $l_i = \log_2({1 \over p_i})$.

l₁=3, l₂=2, l₃=1, l₄=3.
E.g. s₁=110, s₂=10, s₃=0, s₄=111.
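As a quick check of this example, the sketch below (hypothetical function name) derives the codeword lengths from the probabilities and confirms that the average codeword length equals the entropy, so the efficiency is 100%.

```python
import math


def special_source_lengths(probabilities):
    """Return l_i = log2(1/p_i) as integers, or None if the source is not special."""
    lengths = []
    for p in probabilities:
        length = math.log2(1 / p)
        if not math.isclose(length, round(length)):
            return None  # log2(1/p_i) is not an integer
        lengths.append(round(length))
    return lengths


p = [0.125, 0.25, 0.5, 0.125]
lengths = special_source_lengths(p)
entropy = sum(pi * math.log2(1 / pi) for pi in p)
average_length = sum(pi * li for pi, li in zip(p, lengths))

print(lengths)                  # [3, 2, 1, 3]
print(entropy, average_length)  # both 1.75
```

For a source that is not special, such as the one in the previous example, the function returns None.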

exercises

exercise 1

Determine if the following codes are uniquely decodable. If not, give two messages with the same code.
Determine if the following codes are instantaneously decodable. If not, can you design an instantaneously decodable code with the same codeword lengths? If you can, design such a code.

A: 000, 001, 010, 011, 100, 101
B: 0, 01, 011, 0111, 01111, 011111
C: 0, 10, 110, 1110, 11110, 111110
D: 0, 10, 110, 1110, 1011, 1101
F: 0, 100, 101, 110, 111, 001
E: 0, 10, 1100, 1101, 1110, 1111
H: 1010, 001, 101, 0001, 1101, 1011
G: 01, 011, 10, 1000, 1100, 0111

exercise 2

Can you design an instantaneously decodable code with the following codeword lengths? Give an example.

A: 2 2 2 4 4 4
B: 1 1 2 3 3
C: 1 1 2 2 2 2

exercise 3

A source with 6 symbols has the following probability distribution and codewords:

s₁ = 0; p₁ = 0.3
s₂ = 10; p₂ = 0.2
s₃ = 1110; p₃ = 0.1
s₄ = 1111; p₄ = 0.1
s₅ = 1100; p₅ = 0.2
s₆ = 1101; p₆ = 0.1

What is the efficiency of the code? Is it possible to design a more efficient code? If yes, design such a code and calculate the efficiency of that code.

N-th extension of a source

We refer to Sⁿ as the n-th extension of the source S if the alphabet of Sⁿ consists of all possible sequences of n symbols of S in which all symbols keep their probabilities.
Then the following holds:

$H( S^n ) = nH(S)$

example

S = {s₁, s₂} with p₁=0.1 and p₂=0.9.
S² = {s₁s₁, s₁s₂, s₂s₁, s₂s₂} with p₁₁=0.01, p₁₂=0.09, p₂₁=0.09, p₂₂=0.81.
Then $H(S^2) = 2H(S) \approx 2 \times 0.469 = 0.938$ bits per symbol of S².
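The relation can be verified numerically for this example. The sketch below assumes a memoryless source, so that the probability of a sequence is the product of the probabilities of its symbols. The function names are illustrative.

```python
import itertools
import math


def entropy(probabilities):
    """H in bits per symbol of the given source."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def extension(probabilities, n):
    """Probabilities of all sequences of n symbols of a memoryless source."""
    return [math.prod(seq) for seq in itertools.product(probabilities, repeat=n)]


s = [0.1, 0.9]
for n in (1, 2, 3):
    s_n = extension(s, n)
    print(n, len(s_n), round(entropy(s_n), 6), round(n * entropy(s), 6))
```

For each n, the last two columns agree: the entropy of the extension is n times the entropy of S.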

Shannon’s coding theorem

If L is the average codeword length of a compact code for the source S, then: $H(S) \le L \lt H(S)+1$.
If L_n is the average codeword length of a compact code for the n-th extension of the source S, then: $H(S) \le {L_n \over n} \lt H(S)+{1 \over n}$.
As a consequence, the code becomes 100% efficient in the limit: $\lim_{n \to \infty} {L_n \over n} = H(S)$.
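The bounds can be illustrated numerically. Constructing a true compact code is outside the scope of this section, so the sketch below makes a simplifying assumption: it uses the codeword lengths $l_i = \lceil \log_2({1 \over p_i}) \rceil$ (Shannon code lengths) instead. These lengths satisfy Kraft’s inequality, so an instantaneously decodable code with them exists, and a compact code can only be shorter. The values printed are therefore upper bounds on L_n/n for a compact code, and they already fall inside the interval given by the theorem. The function name is hypothetical and the source is assumed to be memoryless.

```python
import itertools
import math


def shannon_length_per_symbol(probabilities, n):
    """Average length per source symbol when the n-th extension is coded
    with lengths ceil(log2(1/p)); an upper bound for a compact code."""
    extended = [math.prod(seq) for seq in itertools.product(probabilities, repeat=n)]
    total = sum(p * math.ceil(math.log2(1 / p)) for p in extended)
    return total / n


s = [0.1, 0.9]
h = sum(p * math.log2(1 / p) for p in s)
for n in range(1, 7):
    per_symbol = shannon_length_per_symbol(s, n)
    print(n, round(per_symbol, 4), "upper limit:", round(h + 1 / n, 4))
```

The printed values need not decrease at every step, but each one stays below H(S) + 1/n, which itself shrinks toward H(S) ≈ 0.469 as n grows.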