2. Introduction
What is Compression?
Data compression requires the identification and
extraction of source redundancy.
In other words, data compression seeks to
reduce the number of bits used to store or
transmit information.
There is a wide range of compression methods,
some so unlike one another that they have
little in common except that they compress
data.
The Need For Compression
In terms of storage, the capacity of a storage
device can be effectively increased with
methods that compress a body of data on its
way to a storage device and decompress it
when it is retrieved.
In terms of communications, the bandwidth of a
digital communication link can be effectively
increased by compressing data at the sending
end and decompressing data at the receiving
end.
3. A Brief History of Data
Compression
The late 1940s were the early years of
Information Theory; the idea of developing
efficient new coding methods was just
starting to be fleshed out. Ideas of entropy,
information content and redundancy were
explored.
One popular notion held that if the
probabilities of the symbols in a message were
known, there ought to be a way to code the
symbols so that the message takes up
less space.
The first well-known method for compressing
digital signals is now known as Shannon-Fano
coding. Shannon and Fano [~1948]
independently developed this algorithm, which
assigns binary codewords to the unique symbols
that appear within a given data file.
While Shannon-Fano coding was a great leap
forward, it had the unfortunate luck to be quickly
superseded by an even more efficient coding
system: Huffman coding.
4. A Brief History of Data
Compression
Huffman coding [1952] shares most
characteristics of Shannon-Fano coding.
Huffman coding performs effective data
compression by reducing the amount of
redundancy in the coding of symbols.
It has been proven to be optimal among
symbol-by-symbol (prefix) coding methods: no
other code that assigns one codeword per
symbol gives a shorter average length.
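The tree-building procedure behind Huffman coding can be sketched in Python. This is a minimal illustration (the example string and function name are my own, not from the slides): symbols are merged two at a time, lowest frequencies first, and each merge prepends one bit to the codewords of the merged side.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`.

    Minimal sketch: a min-heap of partial trees keyed on frequency;
    each entry is (frequency, tie-breaker, {symbol: code-so-far}).
    """
    freq = Counter(text)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # left branch
        merged.update({s: "1" + c for s, c in c2.items()})  # right branch
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("ABRACADABRA")
```

More frequent symbols (here A) receive shorter codewords, and no codeword is a prefix of another, so the output stream decodes unambiguously.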
In the last fifteen years, Huffman coding has
been replaced by arithmetic coding.
Arithmetic coding bypasses the idea of
replacing an input symbol with a specific
code.
It replaces a stream of input symbols with a
single floating-point output number.
More bits are needed in the output number
for longer, complex messages.
5. A Brief History of Data
Compression
Terminology
• Compressor – software (or hardware) device that
compresses data
• Decompressor – software (or hardware) device
that decompresses data
• Codec – software (or hardware) device that
compresses and decompresses data
• Algorithm – the logic that governs the
compression/decompression process
6. Compression can be
categorized in two broad ways:
Lossless compression
recovers the exact original data after
decompression.
mainly used for compressing database
records, spreadsheets or word
processing files, where exact
replication of the original is essential.
Lossy compression
results in a certain loss of accuracy in
exchange for a substantial increase in
compression.
more effective when used to compress graphic
images and digitised voice, where losses outside
visual or aural perception can be tolerated.
Most lossy compression techniques can be
adjusted to different quality levels, gaining higher
accuracy in exchange for less effective
compression.
8. Dictionary-based
compression algorithms
Dictionary-based compression algorithms
use a completely different method to
compress data.
They encode variable-length strings of
symbols as single tokens.
The token forms an index to a phrase
dictionary.
If the tokens are smaller than the phrases,
they replace the phrases and compression
occurs.
Suppose we want to encode the Oxford
Concise English dictionary, which contains
about 159,000 entries. Why not just transmit
each word as an 18-bit number
(2^18 = 262,144, enough for every entry)?
Problems:
too many bits,
everyone needs a copy of the dictionary,
it only works for English text.
Solution: find a way to build the dictionary
adaptively.
9. Dictionary-based
compression algorithms
Two dictionary-based compression
techniques called LZ77 and LZ78 have been
developed.
LZ77 is a "sliding window" technique in which
the dictionary consists of a set of fixed-length
phrases found in a "window" into the
previously seen text.
LZ78 takes a completely different approach to
building a dictionary. Instead of using
fixed-length phrases from a window into the
text, LZ78 builds phrases up one symbol at a
time, adding a new symbol to an existing
phrase when a match occurs.
The LZW compression algorithm can be
summarised as follows:
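A minimal Python sketch of LZW encoding (my own illustration; it assumes byte input and an unbounded dictionary, whereas real implementations bound the code width):

```python
def lzw_compress(data):
    """LZW sketch: the dictionary starts with all single bytes and
    grows adaptively; each output token indexes a dictionary phrase."""
    dictionary = {bytes([i]): i for i in range(256)}
    phrase = b""
    out = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                       # extend current match
        else:
            out.append(dictionary[phrase])           # emit longest match
            dictionary[candidate] = len(dictionary)  # learn a new phrase
            phrase = bytes([byte])
    if phrase:
        out.append(dictionary[phrase])               # flush the last match
    return out

tokens = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
```

The repeated text is captured by tokens ≥ 256 referring to learned phrases, so the 24-byte input compresses to fewer tokens without ever transmitting the dictionary.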
11. Dictionary-based compression
algorithms: Problem
What if we run out of dictionary space?
Solution 1: keep track of unused entries
and replace them with a least-recently-used
(LRU) policy.
Solution 2: monitor compression
performance and flush the dictionary when
performance is poor.
12. Repetitive
Sequence Suppression
Fairly straight forward to understand
and implement.
Simplicity is their downfall: NOT best
compression ratios.
Some methods have their applications,
e.g.Component of JPEG, Silence
Suppression.
If a series of n successive identical tokens
appears:
Replace the series with one token and a count
of the number of occurrences.
Usually a special flag is needed to denote
when the repeated token appears.
Example
89400000000000000000000000000000000
we can replace with 894f32, where f is the
flag for zero.
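The 894f32 replacement above can be reproduced with a short sketch (my own illustration; it assumes the flag character `f` never occurs in the data, and it ignores how a decoder would tell a count apart from ordinary digits):

```python
def suppress_zeros(text, flag="f"):
    """Replace each run of more than two zeros with flag + run length."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "0":
            j = i
            while j < len(text) and text[j] == "0":
                j += 1                    # find the end of the zero run
            run = j - i
            if run > 2:
                out.append(flag + str(run))   # worth suppressing
            else:
                out.append("0" * run)         # too short, keep as-is
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

encoded = suppress_zeros("894" + "0" * 32)  # -> "894f32"
```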
13. Repetitive Sequence Suppression
How Much Compression?
Compression savings depend on the content of
the data.
Applications of this simple compression
technique include:
Suppression of zeros in a file (Zero Length
Suppression)
Silence in audio data, Pauses in conversation
etc.
Bitmaps
Blanks in text or program source files
Backgrounds in images
Other regular image or data tokens
Run-length Encoding
15. Run-length Encoding
Uncompressed:
Blue White White White White White White Blue
White Blue White White White White White Blue etc.
Compressed:
1XBlue 6XWhite 1XBlue
1XWhite 1XBlue 5XWhite 1XBlue
etc.
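A minimal sketch of run-length encoding over a row of pixel values, producing the countXsymbol notation used above (function name and list representation are my own):

```python
def run_length_encode(symbols):
    """Group consecutive repeats into (count, symbol) runs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][1] == s:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, s])       # start a new run
    return [f"{n}X{s}" for n, s in runs]

row = ["Blue"] + ["White"] * 6 + ["Blue"]
print(run_length_encode(row))  # -> ['1XBlue', '6XWhite', '1XBlue']
```

Compression only pays off when runs are long; a row of alternating colours would grow rather than shrink.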
17. Entropy Encoding
The Shannon-Fano Coding
To create a code tree according to Shannon
and Fano, an ordered table is required
giving the frequency of each symbol.
Each part of the table is then divided into two
segments.
The algorithm has to ensure that the
upper and the lower part of each segment
have nearly the same sum of frequencies.
This procedure is repeated until only
single symbols are left.
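The splitting procedure can be sketched as follows (a minimal recursive version, my own illustration; the split point is chosen so the two halves' frequency sums are as close as possible):

```python
def shannon_fano(freqs):
    """Shannon-Fano sketch: sort symbols by frequency, split the
    ordered table where the two halves' sums are closest, prepend
    0/1 to each half, and recurse until single symbols remain."""
    symbols = sorted(freqs, key=freqs.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(freqs[s] for s in group)
        best, running, diff = 1, 0, float("inf")
        for i, s in enumerate(group[:-1]):
            running += freqs[s]
            d = abs(2 * running - total)   # imbalance if we cut after i
            if d < diff:
                diff, best = d, i + 1
        for s in group[:best]:
            codes[s] += "0"                # upper segment gets 0
        for s in group[best:]:
            codes[s] += "1"                # lower segment gets 1
        split(group[:best])
        split(group[best:])

    split(symbols)
    return codes

codes = shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
```

With this textbook frequency table the split yields A=00, B=01, C=10, D=110, E=111: frequent symbols get short codewords.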
38. Binary to decimal
0.1₂ = 0.5₁₀
0.01₂ = 0.25₁₀
0.001₂ = 0.125₁₀
0.0001₂ = 0.0625₁₀
0.00001₂ = 0.03125₁₀
What is the value of 0.01010101₂ in decimal?
0.33203125₁₀
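As a quick check, the value of a binary fraction can be computed directly (a tiny helper of my own, summing bit k times 2^-k):

```python
def binary_fraction_to_decimal(bits):
    """Value of the bits after the binary point, e.g. "01010101"."""
    return sum(int(b) / 2 ** (k + 1) for k, b in enumerate(bits))

value = binary_fraction_to_decimal("01010101")  # -> 0.33203125
```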
Generating the codeword for the encoder,
given the final range [0.33184, 0.33220]:
BEGIN
  code = 0;
  k = 1;
  while ( value(code) < low )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > high )
      replace the k-th bit by 0;
    k = k + 1;
  }
END
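The pseudocode above can be rendered as runnable Python (a sketch with a bit limit added to guarantee termination; names are my own):

```python
def generate_codeword(low, high, max_bits=32):
    """Build the shortest binary fraction whose value falls in
    [low, high]: tentatively set each bit to 1, and clear it again
    if the value overshoots high; stop once the value reaches low."""
    code_bits = []
    value = 0.0
    k = 1
    while value < low and k <= max_bits:
        bit_value = 2.0 ** -k
        value += bit_value            # assign 1 to the k-th bit
        code_bits.append("1")
        if value > high:              # overshot the range:
            value -= bit_value        # replace the k-th bit by 0
            code_bits[-1] = "0"
        k += 1
    return "0." + "".join(code_bits)

print(generate_codeword(0.33184, 0.33220))  # -> 0.01010101
```

Running it on the slides' range reproduces the worked example: the loop clears bits 1, 3, 5 and 7, keeps bits 2, 4, 6 and 8, and stops at 0.33203125, which lies inside [0.33184, 0.33220].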
39. Example 1
Range (0.33184, 0.33220)
BEGIN
  code = 0;
  k = 1;
  while ( value(code) < 0.33184 )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > 0.33220 )
      replace the k-th bit by 0;
    k = k + 1;
  }
END
1. Assign 1 to the first fraction bit
(codeword = 0.1₂) and compare with high
(0.33220₁₀):
value(0.1₂) = 0.5₁₀ > 0.33220₁₀ -> out of range.
Hence, we replace the first bit by 0.
value(0.0₂) < 0.33184₁₀ -> while loop continues.
2. Assign 1 to the second fraction bit (0.01₂)
= 0.25₁₀, which is less than high (0.33220₁₀),
so the bit is kept; it is still less than
low (0.33184₁₀), so the loop continues.
40. Example 1
Range (0.33184, 0.33220)
3. Assign 1 to the third fraction bit (0.011₂)
= 0.25₁₀ + 0.125₁₀ = 0.375₁₀, which is bigger
than high (0.33220₁₀), so replace the k-th bit
by 0. Now the codeword = 0.010₂.
4. Assign 1 to the fourth fraction bit (0.0101₂)
= 0.25₁₀ + 0.0625₁₀ = 0.3125₁₀, which is less
than high (0.33220₁₀). Now the codeword =
0.0101₂.
5. Continue…
Eventually, the binary codeword generated
is 0.01010101₂, which is 0.33203125₁₀ and
falls inside the range (0.33184, 0.33220).
This 8-bit binary codeword represents the
message CAEE$.