Source encoding is the first step in our information transfer pipeline. We will go through various definitions needed to design and quantify an efficient compact code. After that we will use these to implement specific source coding methods.
A source generates messages; these messages consist of "symbols".
Each symbol consists of “characters” from an alphabet. The alphabet for most electronic systems is {0,1} (binary).
In a memoryless source, there is no correlation between the generated symbols: having just seen an A does not change the probability that the next symbol is an A.
If the probability distribution of the generated symbols is constant, the source is “stationary”.
properties
A source code is:
- non-singular: if all codewords are different
- uniquely decodable: if there is no message for which the decoding is ambiguous
- instantaneously decodable: if it is uniquely decodable and if it is possible to decode each message without waiting for subsequent codewords
- a block code: if all codewords have the same length
Note that every non-singular block code is automatically uniquely decodable.
In general it is not trivial to determine if a code is uniquely decodable.
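There is, however, a systematic test: the Sardinas–Patterson algorithm, which tracks the "dangling suffixes" left over when one codeword is a prefix of another. A minimal Python sketch (the function names are our own):

```python
def dangling_suffix(a, b):
    # if a is a proper prefix of b, return the remaining suffix of b
    if b.startswith(a) and len(b) > len(a):
        return b[len(a):]
    return None

def uniquely_decodable(codewords):
    code = set(codewords)
    if len(code) < len(codewords):
        return False          # singular codes are never uniquely decodable
    # initial dangling suffixes: one codeword is a proper prefix of another
    current = {dangling_suffix(x, y) for x in code for y in code} - {None}
    seen = set()
    while current:
        if current & code:    # a dangling suffix equals a codeword -> ambiguity
            return False
        seen |= current
        nxt = set()
        for s in current:
            for c in code:
                nxt.add(dangling_suffix(c, s))
                nxt.add(dangling_suffix(s, c))
        current = (nxt - {None}) - seen
    return True
```

The codes of the examples below can be fed straight into `uniquely_decodable`.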
example source codes 1
Consider an information source with 4 possible values: s1, s2, s3 and s4.
We could represent this information using a binary alphabet {0,1} so that:
- Code A: s1=0, s2=11, s3=00, s4=11.
- Code B: s1=0, s2=11, s3=00, s4=010.
- Code C: s1=00, s2=01, s3=10, s4=11.
Code A is singular because A(s2) = A(s4) = 11.
Code B is non-singular but not uniquely decodable because 00 can be decoded as s3 or s1s1.
Code C is a non-singular block code of length 2, so it is uniquely decodable.
example source codes 2
Consider again an information source with 4 possible values s1, s2, s3 and s4, represented using the binary alphabet {0,1}.
- Code A: s1=0, s2=10, s3=110, s4=1110.
- Code B: s1=0, s2=01, s3=011, s4=0111.
- Code C: s1=0, s2=01, s3=011, s4=111.
Codes A, B and C are uniquely decodable.
Only code A is instantaneously decodable, e.g. 01011101100100010 = s1s2s4s3s1s2s1s1s2.
Code B is not instantaneously decodable, e.g. 01101011100 = s3s2s4s1s1 (when you see the first 0, you don't know whether it is s1 or the start of s2, s3 or s4; when you see 01 you don't know whether it is s2 or the start of s3 or s4).
Code C is not instantaneously decodable, e.g. 011111111… = ? (you won't know whether the message starts with s1, s2 or s3 until you finally see another 0 and can count the 1s in between).
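Instantaneous decoding of code A can be illustrated with a short Python sketch (a hypothetical helper, not part of the course material): because no codeword is a prefix of another, a symbol can be emitted as soon as the buffer matches a codeword.

```python
def decode_instantaneous(bits, code):
    # code maps symbol -> codeword; assumes the code is prefix-free
    inv = {w: s for s, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inv:            # no codeword is a prefix of another,
            out.append(inv[buf])  # so we may emit as soon as we match
            buf = ""
    return out

code_A = {"s1": "0", "s2": "10", "s3": "110", "s4": "1110"}
# decode_instantaneous("01011101100100010", code_A)
# -> ["s1", "s2", "s4", "s3", "s1", "s2", "s1", "s1", "s2"]
```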
Prefix condition
A prefix of a codeword is a substring of the codeword consisting of consecutive characters of the codeword starting from the first character of the codeword.
The prefix condition is a necessary and sufficient condition for a code to be instantaneously decodable. It states that none of the codewords is a prefix of another codeword.
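The prefix condition is easy to check mechanically. A minimal Python sketch (the helper name is our own), using the fact that after lexicographic sorting a codeword that is a prefix of another ends up immediately before one of its extensions:

```python
def is_prefix_free(codewords):
    # after sorting, a codeword that is a prefix of another codeword
    # is immediately followed by one of its extensions
    words = sorted(codewords)
    return all(not nxt.startswith(w) for w, nxt in zip(words, words[1:]))
```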
exercise 1
Design an instantaneously decodable binary code with lengths: 3, 2, 3, 2, 2.
show answer
Re-ordering the lengths 3, 2, 3, 2, 2 as 2, 2, 2, 3, 3 makes it easier to quickly come up with codewords that satisfy the prefix condition:
00
01
10
110
111
A=110, B=00, C=111, D=01, E=10
exercise 2
Design an instantaneously decodable ternary code with lengths: 2, 3, 1, 1, 2.
show answer
lengths: 1, 1, 2, 2, 3 (ternary code, 3 possible values)
0
1
20
21
220
A=20, B=220, C=0, D=1, E=21
exercise 3
Design an instantaneously decodable binary code with lengths: 2, 3, 2, 2, 2.
show answer
lengths: 2, 2, 2, 2, 3
00
01
10
11
ERROR
We cannot create another length-3 codeword that does not have one of the existing codewords as a prefix, so it is not possible to make an instantaneously decodable code with these lengths.
Properties of instantaneously decodable codes
- Easy to prove if a code is instantaneously decodable using the prefix condition
- Easy to design an instantaneously decodable code with given lengths
- Decoding is easy
- Sensitivity to bit errors is much larger if the code is not a block code (a single bit error can shift all subsequent codeword boundaries)
Kraft’s inequality
Kraft’s inequality is a necessary and sufficient condition for a code with alphabet size r
consisting of q
codewords with lengths l1, …, lq to be instantaneously decodable.
Vice versa there exists an instantaneously decodable code with these codeword lengths if Kraft’s inequality holds.
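Both directions can be sketched in Python (the function names are our own): checking the inequality for given lengths, and constructing an instantaneously decodable code whenever the inequality holds, by assigning each codeword the r-ary expansion of the running Kraft sum.

```python
from fractions import Fraction

def kraft_sum(lengths, r=2):
    # sum of r^(-l) over all codeword lengths, computed exactly
    return sum(Fraction(1, r**l) for l in lengths)

def build_code(lengths, r=2):
    # construct an instantaneously decodable code with the given lengths,
    # or return None if Kraft's inequality is violated
    if kraft_sum(lengths, r) > 1:
        return None
    code, acc = {}, Fraction(0)
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        # write acc (a fraction in [0, 1)) as l digits in base r
        digits, x = [], acc
        for _ in range(l):
            x *= r
            digits.append(str(int(x)))
            x -= int(x)
        code[i] = "".join(digits)
        acc += Fraction(1, r**l)
    return code
```

`build_code([3, 2, 3, 2, 2])` reproduces the binary answer of exercise 1 above, and `build_code([2, 3, 2, 2, 2])` returns `None`, matching exercise 3.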
example
- Code A: s1 = 0, s2 = 100, s3 = 110, s4 = 111.
- Code B: s1 = 0, s2 = 100, s3 = 110, s4 = 11.
- Code C: s1 = 0, s2 = 10, s3 = 110, s4 = 11.
Code A meets the prefix condition, so it is instantaneously decodable as expected.
Code B does not meet the prefix condition (s4 is a prefix of s3), so it is not instantaneously decodable. Its lengths do satisfy Kraft's inequality: $2^{-1} + 2^{-3} + 2^{-3} + 2^{-2} = 1 \le 1$. This means it is possible to design an instantaneously decodable code with the given codeword lengths. For instance: 0, 110, 111, 10.
Code C does not meet the prefix condition (s4 is a prefix of s3), so it is not instantaneously decodable. Moreover, its lengths violate Kraft's inequality: $2^{-1} + 2^{-2} + 2^{-3} + 2^{-2} = 9/8 > 1$. This means it is not possible to design an instantaneously decodable code with the given codeword lengths.
Average codeword length
The average codeword length $L$ is defined as:

$$L = \sum_{i=1}^{q} p_i l_i$$

with $p_i$ the probability of the symbols and $l_i$ the individual lengths of the $q$ codewords.
A compact code has an average codeword length that is smaller than or equal to the average codeword length of all other uniquely decodable codes with the same source symbols and the same code alphabet.
Lower bound of the average codeword length
Every instantaneously decodable code of the source S = {s1, …, sq} with a code alphabet of size r has an average codeword length $L$ that is equal to or larger than the entropy of the source:

$$L \ge \frac{H(S)}{\log_2 r}$$

with $p_i$ the probability of the symbols, $l_i$ the individual lengths of the $q$ codewords and $r$ the alphabet size (for a binary alphabet, r = 2, this reduces to $L \ge H(S)$). The inequality becomes an equality when

$$l_i = -\log_r p_i, \quad i = 1, 2, \ldots, q$$
code efficiency
The efficiency of a code is calculated with:

$$\eta = \frac{H(S)}{L \log_2 r}$$

A "special source" is a source with symbol probabilities $p_i$, $i = 1, 2, \ldots, q$, such that all $-\log_r p_i$ are integers (for a binary alphabet: p = 0.5, or 0.25, or 0.125, or …). For such a source it is possible to design an instantaneously decodable code that is 100% efficient, with codeword lengths $l_i = -\log_r p_i$.
examples
example code efficiency
Source | pi | Code A | Code B |
---|---|---|---|
s1 | 0.5 | 00 | 1 |
s2 | 0.1 | 01 | 000 |
s3 | 0.2 | 10 | 001 |
s4 | 0.2 | 11 | 01 |
The entropy of the source is:

$$H(S) = -\sum_{i=1}^{4} p_i \log_2 p_i = -(0.5 \log_2 0.5 + 0.1 \log_2 0.1 + 0.2 \log_2 0.2 + 0.2 \log_2 0.2) \approx 1.76 \text{ bits}$$

The average codeword lengths are:

$$L_A = 0.5 \cdot 2 + 0.1 \cdot 2 + 0.2 \cdot 2 + 0.2 \cdot 2 = 2 \qquad L_B = 0.5 \cdot 1 + 0.1 \cdot 3 + 0.2 \cdot 3 + 0.2 \cdot 2 = 1.8$$

The efficiencies are:

$$\eta_A = \frac{1.76}{2} \approx 88\% \qquad \eta_B = \frac{1.76}{1.8} \approx 98\%$$
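These numbers can be reproduced with a few lines of Python (variable names are our own):

```python
import math

probs = [0.5, 0.1, 0.2, 0.2]
len_A = [2, 2, 2, 2]   # codeword lengths of block code A
len_B = [1, 3, 3, 2]   # codeword lengths of code B

H = -sum(p * math.log2(p) for p in probs)        # entropy, ~1.761 bits
L_A = sum(p * l for p, l in zip(probs, len_A))   # average length of A: 2.0
L_B = sum(p * l for p, l in zip(probs, len_B))   # average length of B: 1.8
eff_A, eff_B = H / L_A, H / L_B                  # ~0.881 and ~0.978
```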
example special source
Source | pi |
---|---|
s1 | 0.125 |
s2 | 0.25 |
s3 | 0.5 |
s4 | 0.125 |
This is a special source because all $-\log_2 p_i$ are integers. This means that a 100% efficient compact code can be designed with $l_i = -\log_2 p_i$:
l1=3, l2=2, l3=1, l4=3.
E.g. s1=110, s2=10, s3=0, s4=111.
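A quick numeric check of this example in Python (variable names are our own):

```python
import math

probs = [0.125, 0.25, 0.5, 0.125]
lengths = [int(-math.log2(p)) for p in probs]    # [3, 2, 1, 3]
H = -sum(p * math.log2(p) for p in probs)        # entropy: 1.75 bits
L = sum(p * l for p, l in zip(probs, lengths))   # average length: 1.75
# L equals H, so the efficiency H / L is exactly 100%
```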
exercises
exercise 1
Determine if the following codes are uniquely decodable. If not, give two messages with the same code.
Determine if the following codes are instantaneously decodable. If not, can you design an instantaneously decodable code with the same codeword lengths? If you can, design such a code.
- A: 000, 001, 010, 011, 100, 101
- B: 0, 01, 011, 0111, 01111, 011111
- C: 0, 10, 110, 1110, 11110, 111110
- D: 0, 10, 110, 1110, 1011, 1101
- E: 0, 10, 1100, 1101, 1110, 1111
- F: 0, 100, 101, 110, 111, 001
- G: 01, 011, 10, 1000, 1100, 0111
- H: 1010, 001, 101, 0001, 1101, 1011
exercise 2
Can you design an instantaneously decodable code with the following codeword lengths? Give an example.
- A: 2 2 2 4 4 4
- B: 1 1 2 3 3
- C: 1 1 2 2 2 2
exercise 3
A source with 6 symbols has the following probability distribution and codewords:
- s1 = 0; p1 = 0.3
- s2 = 10; p2 = 0.2
- s3 = 1110; p3 = 0.1
- s4 = 1111; p4 = 0.1
- s5 = 1100; p5 = 0.2
- s6 = 1101; p6 = 0.1
What is the efficiency of the code? Is it possible to design a more efficient code? If yes, design such a code and calculate the efficiency of that code.
N-th extension of a source
We refer to $S^n$ as the n-th extension of the source S if the alphabet of $S^n$ consists of all possible sequences of n symbols of S, in which all symbols keep their probabilities.
Then the following holds:

$$H(S^n) = n \cdot H(S)$$
example
S = {s1, s2} with p1=0.1 and p2=0.9.
S2 = {s1s1, s1s2, s2s1, s2s2} with p11=0.01, p12=0.09, p21=0.09, p22=0.81.
Then $H(S) \approx 0.469$ bits and $H(S^2) = 2 H(S) \approx 0.938$ bits.
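A numeric check of this example (Python sketch; names are our own):

```python
import math
from itertools import product

def entropy(probs):
    # source entropy in bits
    return -sum(p * math.log2(p) for p in probs)

p = [0.1, 0.9]
p2 = [a * b for a, b in product(p, repeat=2)]  # symbol probabilities of S^2
# entropy(p) ~ 0.469 bits, entropy(p2) ~ 0.938 bits = 2 * entropy(p)
```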
Shannon’s coding theorem
- If $L$ is the average codeword length of a compact code for the source S, then: $\frac{H(S)}{\log_2 r} \le L < \frac{H(S)}{\log_2 r} + 1$.
- If $L_n$ is the average codeword length of a compact code for the n-th extension of the source S, then: $\frac{n H(S)}{\log_2 r} \le L_n < \frac{n H(S)}{\log_2 r} + 1$, i.e. $\frac{H(S)}{\log_2 r} \le \frac{L_n}{n} < \frac{H(S)}{\log_2 r} + \frac{1}{n}$.
- As a consequence, the code becomes 100% efficient in the limit: $\lim_{n \to \infty} \frac{L_n}{n} = \frac{H(S)}{\log_2 r}$.
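The convergence can be illustrated numerically. The sketch below builds a compact code for each extension with Huffman's algorithm (a standard construction of compact codes, assumed here but not covered in this section) for the binary source of the previous example:

```python
import heapq, itertools, math

def huffman_lengths(probs):
    # codeword lengths of a binary Huffman (i.e. compact) code
    counter = itertools.count()            # tie-breaker for the heap
    depth = [0] * len(probs)
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for leaf in a + b:                 # merged leaves move one level down
            depth[leaf] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), a + b))
    return depth

p = [0.1, 0.9]
H = -sum(q * math.log2(q) for q in p)      # ~0.469 bits
ratios = {}
for n in (1, 2, 3, 4):
    probs = [math.prod(c) for c in itertools.product(p, repeat=n)]
    L_n = sum(q * l for q, l in zip(probs, huffman_lengths(probs)))
    ratios[n] = L_n / n   # satisfies H <= L_n / n < H + 1/n
```

The per-symbol length `ratios[n]` shrinks toward the entropy H as n grows, exactly as the theorem predicts.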