2. Introduction
What is Compression?
Data compression requires the identification and
extraction of source redundancy.
In other words, data compression seeks to
reduce the number of bits used to store or
transmit information.
There is a wide range of compression methods,
some so unlike one another that they have
little in common except that they compress
data.
The Need For Compression
In terms of storage, the capacity of a storage
device can be effectively increased with
methods that compress a body of data on its
way to a storage device and decompress it
when it is retrieved.
In terms of communications, the bandwidth of a
digital communication link can be effectively
increased by compressing data at the sending
end and decompressing data at the receiving
end.
3. A Brief History of Data
Compression
The late 1940s were the early years of
Information Theory; the idea of developing
efficient new coding methods was just
starting to be fleshed out. Ideas of entropy,
information content and redundancy were
explored.
One popular notion held that if the
probabilities of the symbols in a message were
known, there ought to be a way to code the
symbols so that the message takes up
less space.
The first well-known method for compressing
digital signals is now known as Shannon-Fano
coding. Shannon and Fano [~1948]
independently developed this algorithm, which
assigns binary codewords to the unique symbols
that appear within a given data file.
While Shannon-Fano coding was a great leap
forward, it had the unfortunate luck to be quickly
superseded by an even more efficient coding
system: Huffman coding.
4. A Brief History of Data
Compression
Huffman coding [1952] shares most
characteristics of Shannon-Fano coding.
Huffman coding performs effective data
compression by reducing the amount of
redundancy in the coding of symbols.
It has been proven to be optimal among
symbol-by-symbol (prefix) coding methods: no
other code that assigns one codeword per
symbol gives a shorter average length.
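The tree-building procedure behind Huffman coding can be sketched in Python. This is a minimal illustration (the example string and function name are my own, not from the slides): symbols are merged two at a time, lowest frequencies first, and each merge prepends one bit to the codewords of the merged side.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for the symbols in `text`.

    Minimal sketch: a min-heap of partial trees keyed on frequency;
    each entry is (frequency, tie-breaker, {symbol: code-so-far}).
    """
    freq = Counter(text)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # left branch
        merged.update({s: "1" + c for s, c in c2.items()})  # right branch
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("ABRACADABRA")
```

More frequent symbols (here A) receive shorter codewords, and no codeword is a prefix of another, so the output stream decodes unambiguously.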
In the last fifteen years, Huffman coding has
been replaced by arithmetic coding.
Arithmetic coding bypasses the idea of
replacing an input symbol with a specific
code.
It replaces a stream of input symbols with a
single floating-point output number.
More bits are needed in the output number
for longer, complex messages.
5. A Brief History of Data
Compression
Terminology
• Compressor – software (or hardware) device that
compresses data
• Decompressor – software (or hardware) device
that decompresses data
• Codec – software (or hardware) device that
compresses and decompresses data
• Algorithm – the logic that governs the
compression/decompression process
6. Compression can be
categorized in two broad ways:
Lossless compression
recovers the exact original data after
decompression.
mainly used for compressing database
records, spreadsheets or word
processing files, where exact
replication of the original is essential.
Lossy compression
results in a certain loss of accuracy in
exchange for a substantial increase in
compression.
more effective when used to compress graphic
images and digitised voice, where losses outside
visual or aural perception can be tolerated.
Most lossy compression techniques can be
adjusted to different quality levels, gaining higher
accuracy in exchange for less effective
compression.
8. Dictionary-based
compression algorithms
Dictionary-based compression algorithms
use a completely different method to
compress data.
They encode variable-length strings of
symbols as single tokens.
The token forms an index to a phrase
dictionary.
If the tokens are smaller than the phrases,
they replace the phrases and compression
occurs.
Suppose we want to encode the Oxford
Concise English dictionary, which contains
about 159,000 entries. Why not just transmit
each word as an 18-bit number
(2^18 = 262,144, enough for every entry)?
Problems:
too many bits,
everyone needs a copy of the dictionary,
it only works for English text.
Solution: find a way to build the dictionary
adaptively.
9. Dictionary-based
compression algorithms
Two dictionary-based compression
techniques called LZ77 and LZ78 have been
developed.
LZ77 is a "sliding window" technique in which
the dictionary consists of a set of fixed-length
phrases found in a "window" into the
previously seen text.
LZ78 takes a completely different approach to
building a dictionary. Instead of using
fixed-length phrases from a window into the
text, LZ78 builds phrases up one symbol at a
time, adding a new symbol to an existing
phrase when a match occurs.
The LZW compression algorithm can be
summarised as follows:
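A minimal Python sketch of LZW encoding (my own illustration; it assumes byte input and an unbounded dictionary, whereas real implementations bound the code width):

```python
def lzw_compress(data):
    """LZW sketch: the dictionary starts with all single bytes and
    grows adaptively; each output token indexes a dictionary phrase."""
    dictionary = {bytes([i]): i for i in range(256)}
    phrase = b""
    out = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                       # extend current match
        else:
            out.append(dictionary[phrase])           # emit longest match
            dictionary[candidate] = len(dictionary)  # learn a new phrase
            phrase = bytes([byte])
    if phrase:
        out.append(dictionary[phrase])               # flush the last match
    return out

tokens = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
```

The repeated text is captured by tokens ≥ 256 referring to learned phrases, so the 24-byte input compresses to fewer tokens without ever transmitting the dictionary.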
11. Dictionary-based compression
algorithms: Problem
What if we run out of dictionary space?
Solution 1: keep track of unused entries
and replace them with a least-recently-used
(LRU) policy.
Solution 2: monitor compression
performance and flush the dictionary when
performance is poor.
12. Repetitive
Sequence Suppression
Fairly straight forward to understand
and implement.
Simplicity is their downfall: NOT best
compression ratios.
Some methods have their applications,
e.g.Component of JPEG, Silence
Suppression.
If a series of n successive identical tokens
appears:
Replace the series with one token and a count
of the number of occurrences.
Usually a special flag is needed to denote
when the repeated token appears.
Example
89400000000000000000000000000000000
we can replace with 894f32, where f is the
flag for zero.
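The 894f32 replacement above can be reproduced with a short sketch (my own illustration; it assumes the flag character `f` never occurs in the data, and it ignores how a decoder would tell a count apart from ordinary digits):

```python
def suppress_zeros(text, flag="f"):
    """Replace each run of more than two zeros with flag + run length."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "0":
            j = i
            while j < len(text) and text[j] == "0":
                j += 1                    # find the end of the zero run
            run = j - i
            if run > 2:
                out.append(flag + str(run))   # worth suppressing
            else:
                out.append("0" * run)         # too short, keep as-is
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

encoded = suppress_zeros("894" + "0" * 32)  # -> "894f32"
```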
13. Repetitive Sequence Suppression
How Much Compression?
Compression savings depend on the content of
the data.
Applications of this simple compression
technique include:
Suppression of zeros in a file (Zero Length
Suppression)
Silence in audio data, Pauses in conversation
etc.
Bitmaps
Blanks in text or program source files
Backgrounds in images
Other regular image or data tokens
Run-length Encoding
15. Run-length Encoding
Uncompressed:
Blue White White White White White White Blue
White Blue White White White White White Blue etc.
Compressed:
1XBlue 6XWhite 1XBlue
1XWhite 1XBlue 5XWhite 1XBlue
etc.
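A minimal sketch of run-length encoding over a row of pixel values, producing the countXsymbol notation used above (function name and list representation are my own):

```python
def run_length_encode(symbols):
    """Group consecutive repeats into (count, symbol) runs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][1] == s:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, s])       # start a new run
    return [f"{n}X{s}" for n, s in runs]

row = ["Blue"] + ["White"] * 6 + ["Blue"]
print(run_length_encode(row))  # -> ['1XBlue', '6XWhite', '1XBlue']
```

Compression only pays off when runs are long; a row of alternating colours would grow rather than shrink.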
17. Entropy Encoding
The Shannon-Fano Coding
To create a code tree according to Shannon
and Fano, an ordered table is required
giving the frequency of each symbol.
Each part of the table is then divided into two
segments.
The algorithm has to ensure that the
upper and the lower part of each segment
have nearly the same sum of frequencies.
This procedure is repeated until only
single symbols are left.
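The splitting procedure can be sketched as follows (a minimal recursive version, my own illustration; the split point is chosen so the two halves' frequency sums are as close as possible):

```python
def shannon_fano(freqs):
    """Shannon-Fano sketch: sort symbols by frequency, split the
    ordered table where the two halves' sums are closest, prepend
    0/1 to each half, and recurse until single symbols remain."""
    symbols = sorted(freqs, key=freqs.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(freqs[s] for s in group)
        best, running, diff = 1, 0, float("inf")
        for i, s in enumerate(group[:-1]):
            running += freqs[s]
            d = abs(2 * running - total)   # imbalance if we cut after i
            if d < diff:
                diff, best = d, i + 1
        for s in group[:best]:
            codes[s] += "0"                # upper segment gets 0
        for s in group[best:]:
            codes[s] += "1"                # lower segment gets 1
        split(group[:best])
        split(group[best:])

    split(symbols)
    return codes

codes = shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
```

With this textbook frequency table the split yields A=00, B=01, C=10, D=110, E=111: frequent symbols get short codewords.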
38. Binary to decimal
0.1₂ = 0.5₁₀
0.01₂ = 0.25₁₀
0.001₂ = 0.125₁₀
0.0001₂ = 0.0625₁₀
0.00001₂ = 0.03125₁₀
What is the value of 0.01010101₂ in decimal?
0.33203125₁₀
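As a quick check, the value of a binary fraction can be computed directly (a tiny helper of my own, summing bit k times 2^-k):

```python
def binary_fraction_to_decimal(bits):
    """Value of the bits after the binary point, e.g. "01010101"."""
    return sum(int(b) / 2 ** (k + 1) for k, b in enumerate(bits))

value = binary_fraction_to_decimal("01010101")  # -> 0.33203125
```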
Generating the codeword for the encoder,
given the final range [0.33184, 0.33220]:
BEGIN
  code = 0;
  k = 1;
  while ( value(code) < low )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > high )
      replace the k-th bit by 0;
    k = k + 1;
  }
END
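The pseudocode above can be rendered as runnable Python (a sketch with a bit limit added to guarantee termination; names are my own):

```python
def generate_codeword(low, high, max_bits=32):
    """Build the shortest binary fraction whose value falls in
    [low, high]: tentatively set each bit to 1, and clear it again
    if the value overshoots high; stop once the value reaches low."""
    code_bits = []
    value = 0.0
    k = 1
    while value < low and k <= max_bits:
        bit_value = 2.0 ** -k
        value += bit_value            # assign 1 to the k-th bit
        code_bits.append("1")
        if value > high:              # overshot the range:
            value -= bit_value        # replace the k-th bit by 0
            code_bits[-1] = "0"
        k += 1
    return "0." + "".join(code_bits)

print(generate_codeword(0.33184, 0.33220))  # -> 0.01010101
```

Running it on the slides' range reproduces the worked example: the loop clears bits 1, 3, 5 and 7, keeps bits 2, 4, 6 and 8, and stops at 0.33203125, which lies inside [0.33184, 0.33220].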
39. Example 1
Range (0.33184, 0.33220)
BEGIN
  code = 0;
  k = 1;
  while ( value(code) < 0.33184 )
  {
    assign 1 to the k-th binary fraction bit;
    if ( value(code) > 0.33220 )
      replace the k-th bit by 0;
    k = k + 1;
  }
END
1. Assign 1 to the first fraction bit
(codeword = 0.1₂) and compare with high
(0.33220₁₀):
value(0.1₂) = 0.5₁₀ > 0.33220₁₀ -> out of range.
Hence, we replace the first bit by 0.
value(0.0₂) < 0.33184₁₀ -> while loop continues.
2. Assign 1 to the second fraction bit (0.01₂)
= 0.25₁₀, which is less than high (0.33220₁₀),
so the bit is kept; it is still less than
low (0.33184₁₀), so the loop continues.
40. Example 1
Range (0.33184, 0.33220)
3. Assign 1 to the third fraction bit (0.011₂)
= 0.25₁₀ + 0.125₁₀ = 0.375₁₀, which is bigger
than high (0.33220₁₀), so replace the k-th bit
by 0. Now the codeword = 0.010₂.
4. Assign 1 to the fourth fraction bit (0.0101₂)
= 0.25₁₀ + 0.0625₁₀ = 0.3125₁₀, which is less
than high (0.33220₁₀). Now the codeword =
0.0101₂.
5. Continue…
Eventually, the binary codeword generated
is 0.01010101₂, which is 0.33203125₁₀ and
falls inside the range (0.33184, 0.33220).
This 8-bit binary codeword represents the
message CAEE$.