Data Compression
Data compression is the process of converting an input data stream (the source stream, or the original raw data) into another data stream (the output, or compressed, stream) that has a smaller size, i.e. lower redundancy. The decompressor (decoder) converts in the opposite direction.
Data compression works by removing redundancy, where the term redundancy refers to similarity or repetition in the data.
1. Introduction:
Data compression is popular for two reasons:
1- Faster transmission.
2- Storing data in less memory.
The term "stream":
• Is used instead of "file". "Stream" is a more general term because the compressed data may be transmitted directly to the decoder instead of being written to a file and saved. A stream is either a file or a buffer in memory.
The term "bit stream":
• Is also used in the literature to indicate the compressed stream.
Symmetrical compression:
• Is the case where the compressor and decompressor use basically the same algorithm but work in "opposite" directions; the time required to compress and the time required to decompress are roughly the same.
In an asymmetric compression method:
• Either the compressor or the decompressor may have to work significantly harder. The time taken for compression is usually longer than the time taken for decompression.
The probability model:
This concept is important in statistical data compression methods.
When such a method is used, a model for the data has to be
constructed before compression can begin. A typical model is built
by reading the entire input stream, counting the number of times
each symbol appears (its frequency of occurrence), and computing
the probability of occurrence of each symbol. The data stream is
then input again, symbol by symbol, and is compressed using the
information in the probability model, as shown in Figure (1-1).
Figure (1-1).
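As a rough sketch of the modeling pass described above (the function name and the use of Python are my own; the notes do not give an implementation), the first pass over the stream can be written as:

from collections import Counter

def probability_model(stream):
    # First pass of a two-pass statistical compressor:
    # count how often each symbol occurs and turn the counts into probabilities.
    counts = Counter(stream)              # frequency of occurrence of each symbol
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

# Example: probability_model("ABRACADABRA")
# -> {'A': 0.4545..., 'B': 0.1818..., 'R': 0.1818..., 'C': 0.0909..., 'D': 0.0909...}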
2. Categories of Compression
Data compression methods fall into two categories:
• Lossless methods (text or programs): Run-Length Encoding, Huffman, Shannon-Fano.
• Lossy methods (image, video, audio): JPEG, MPEG, MP3.
Lossless Data Compression:
• Preserves the data perfectly, without losing any information.
• The decompressed output (e.g. an image) is identical to the original, with no loss of information.
• Based on exploiting statistical redundancy alone.
• Provides a relatively low compression ratio.
• Used for text files, especially files containing computer programs, and also for medical, military, and space-program images, where keeping all information is prioritized at the expense of compression.
Note: Two points should be mentioned regarding text files:
• (1) If a text file contains the source code of a program, many blank spaces can normally be eliminated, since they are disregarded by the compiler anyway.
• (2) When the output of a word processor is saved in a text file, the file may contain information about the different fonts used in the text.
3. Lossless data compression
Most text compression methods are either statistical or dictionary
based. The latter class breaks the text into fragments that are saved
in a data structure called a dictionary. When a fragment of new text
is found to be identical to one of the dictionary entries, a pointer to
that entry is written on the compressed stream, to become the
compression of the new fragment. The former class, on the other hand, consists of methods that develop statistical models of the text. A common statistical method consists of a modeling stage followed by a coding stage.
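To make the dictionary-based idea concrete, here is a toy sketch (my own illustration, not an actual LZ algorithm from the notes): the "dictionary" is simply a fixed list of fragments, and matching fragments of the text are replaced by pointers into it.

def dictionary_encode(text, dictionary):
    # Toy dictionary coder: whenever the upcoming characters match a dictionary
    # fragment, output a pointer (the fragment's index); otherwise output the
    # literal character. Real methods (LZ77/LZ78) build the dictionary adaptively.
    out, i = [], 0
    while i < len(text):
        match = max((frag for frag in dictionary if text.startswith(frag, i)),
                    key=len, default=None)       # longest matching fragment, if any
        if match is not None:
            out.append(("ptr", dictionary.index(match)))
            i += len(match)
        else:
            out.append(("lit", text[i]))
            i += 1
    return out

# dictionary_encode("the cat and the dog", ["the ", "and "])
# -> [('ptr', 0), ('lit', 'c'), ('lit', 'a'), ('lit', 't'), ('lit', ' '),
#     ('ptr', 1), ('ptr', 0), ('lit', 'd'), ('lit', 'o'), ('lit', 'g')]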
4. Text compression (lossless)
If a data item d occurs n consecutive times in the input stream, we can replace the n occurrences with the single pair nd. This idea works for text compression and also for image compression.
5. Run-Length Encoding (RLE):
Note: RLE is one of the basic techniques used in lossless compression.
Ex: Compress the following text using run-length encoding:
2_all_is_too_well
Sol:
2_a2l_is_t2o_we2l
In the example above, compressing the text 2_all_is_too_well this way will not work. Clearly, the decompressor needs a way to tell that the first 2 is part of the text while the other 2s are repetition counts. One way to solve this problem is to precede each repetition with a special escape character (@). Then the output becomes:
2_a@2l_is_t@2o_we@2l
This string is longer than the original string, because it replaces two consecutive letters with three characters. We therefore have to adopt the convention that only three or more repetitions of the same character will be replaced with a repetition factor.
RLE Text Disadvantages:
1. In plain English text there are not many repetitions. There are many "doubles", but a "triple" is rare. The most repetitive character is the space. Dashes or asterisks may sometimes also repeat. In mathematical texts, digits may repeat.
2. The character "@" may itself be part of the text in the input stream.
3. Since the repetition count is written on the output stream as a byte, it is limited to counts of up to 255 (Figure 2-1). This limitation can be softened somewhat when we realize that the existence of a repetition count means that there is a repetition (at least three identical consecutive characters). We may adopt the convention that a repeat count of 0 means three repeated characters, which implies that a repeat count of 255 means a run of 258 identical characters.
Figure (2-1)
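A minimal RLE sketch along these lines (the function names and the min_run parameter are my own; a real implementation would also escape a literal "@" and store the count in a single byte using the 0-means-3 convention above):

ESCAPE = "@"

def rle_encode(text, min_run=3):
    # Replace runs of min_run or more identical characters by ESCAPE + count + char;
    # shorter runs are copied unchanged (the convention adopted above).
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                        # find the end of the current run
        run = j - i
        out.append(f"{ESCAPE}{run}{text[i]}" if run >= min_run else text[i] * run)
        i = j
    return "".join(out)

def rle_decode(data):
    # Inverse of rle_encode: digits after ESCAPE are read back as the repeat count.
    out, i = [], 0
    while i < len(data):
        if data[i] == ESCAPE:
            i += 1
            num = ""
            while data[i].isdigit():
                num += data[i]
                i += 1
            out.append(data[i] * int(num))
        else:
            out.append(data[i])
        i += 1
    return "".join(out)

# rle_encode("AAAABBBCD")                     -> "@4A@3BCD"
# rle_encode("2_all_is_too_well", min_run=2)  -> "2_a@2l_is_t@2o_we@2l"  (the example above)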
Source Code Efficiency
L = average length of the code, L = ∑ Pi li
H(X) = entropy, H(X) = -∑ Pi log2 Pi
The optimal state is L = H(X). Our intention is that L approaches H(X), but in practice L ≥ H(X), and the efficiency of a code is H(X)/L.
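Both quantities can be computed directly; a small sketch (the function names are mine), using log base 2 so that both are measured in bits per symbol:

import math

def average_length(probs, lengths):
    # L = sum(Pi * li): average code length in bits per symbol
    return sum(probs[s] * lengths[s] for s in probs)

def entropy(probs):
    # H(X) = -sum(Pi * log2 Pi): the lower bound on L for any prefix code
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)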
6. Huffman coding:
• Huffman coding is credited to David Albert Huffman.
• Huffman coding is a popular method for data compression.
• Huffman coding is an entropy encoding algorithm used for lossless data compression.
• Huffman coding is a method of storing strings of data as binary code in an efficient manner.
• Huffman coding uses variable-length coding, which means that symbols in the data being encoded are converted into binary code words based on how often each symbol is used.
• There is a way to decide what binary code to give to each character using trees.
Huffman coding is a successful compression method used originally for text compression. It assumes that each character is stored as an 8-bit ASCII code.
Huffman’s idea
Instead of using a fixed-length code for each symbol
1. Represent a frequently occurring character in a source
with a shorter code
2. Represent a less frequently occurring one with a longer
code.
3. The total number of bits in this way of representation is,
hopefully, significantly reduced.
The algorithm for Huffman encoding involves the following steps :
1. Constructing a frequency table sorted in descending order.
2. Building a binary tree, carrying out iterations until a complete binary tree is obtained:
a) Merge the last two items (which have the minimum frequencies) of the frequency table to form a new combined item whose frequency is the sum of the two.
b) Insert the combined item and update the frequency table.
3. Deriving the Huffman tree: starting at the root, trace down to every leaf, marking '0' for a left branch and '1' for a right branch.
4. Generating the Huffman code: collect the 0s and 1s along each path from the root to a leaf and assign the resulting 0-1 code word to that symbol.
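A minimal sketch of steps 1-4 using a priority queue (Python's heapq); this is my own illustration, and since ties between equal frequencies may be broken differently than in a hand-drawn tree, the individual code words can differ from the tables below while still forming a valid Huffman code.

import heapq
from itertools import count

def huffman_codes(freqs):
    # freqs: dict mapping symbol -> frequency (or probability).
    # Returns a dict mapping symbol -> binary code word (a string of '0'/'1').
    tie = count()                                   # tie-breaker so tuples always compare
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                              # degenerate case: one distinct symbol
        return {s: "0" for _, _, d in heap for s in d}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)           # the two least frequent items...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}         # left branch -> '0'
        merged.update({s: "1" + c for s, c in right.items()})  # right branch -> '1'
        heapq.heappush(heap, (f1 + f2, next(tie), merged))     # ...are merged and reinserted
    return heap[0][2]

# huffman_codes({"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}) gives code lengths
# 1, 3, 3, 3, 3 (the same lengths as in Example 2 below).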
Example 1: In the example below, we will use a simple alphabet with the following frequencies of occurrence.
Let us have the text message:
"AFFCCFFGFFAABFFCCBBGGBBCCCCEEFEECCDDDDCCCCDDFFFDDDADFDEEDGDEEAEECCEEEAAAEEFFAFCCCCFFFFAFFFGGGDDGFFGG"
1. Constructing the frequency table:
Character    A     B     C     D     E     F     G
Probability  0.1   0.05  0.2   0.15  0.15  0.25  0.1
2. Sort the table in descending order:
Character    F     C     D     E     A     G     B
Probability  0.25  0.2   0.15  0.15  0.1   0.1   0.05
3. Deriving the Huffman tree: build the tree by repeatedly merging the two least-frequent items of the sorted table above (the tree figure is not reproduced here).
4. Generating the Huffman code:
Character    F     C     D     E     A     G     B
Probability  0.25  0.2   0.15  0.15  0.1   0.1   0.05
Code         01    000   001   100   101   110   111
Length       2     3     3     3     3     3     3
Comparison of the use of Huffman coding with the use of 8-bit ASCII or EBCDIC coding:
ASCII / EBCDIC: 7 symbols * 8 bits = 56 bits
Huffman: 20 bits
The average code size is 2 bits * (0.25) + 3 bits * (0.2 + 0.15 + 0.15 + 0.1 + 0.1 + 0.05) = 2.75 bits per symbol.
L = ∑ Pi li = 2.75 bits        H = -∑ Pi log2 Pi ≈ 2.67 bits
Code efficiency = (H / L) * 100 = (2.67 / 2.75) * 100 ≈ 97%, and H ≤ L as expected.
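Using the small helper functions sketched earlier (average_length and entropy), the figures above can be checked as follows:

probs   = {"F": 0.25, "C": 0.2, "D": 0.15, "E": 0.15, "A": 0.1, "G": 0.1, "B": 0.05}
lengths = {"F": 2, "C": 3, "D": 3, "E": 3, "A": 3, "G": 3, "B": 3}   # from the code table above
L = average_length(probs, lengths)         # 2.75 bits per symbol
H = entropy(probs)                         # about 2.67 bits per symbol
print(f"efficiency = {100 * H / L:.0f}%")  # about 97%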
Example 2: Huffman coding. Say we want to encode a text with the characters:
"AADDAAAEEAAACCCAAEAADDABAAAABBAABBBAAAABBBBCCCAACCCCDDDDEEBBEEE"
1. Constructing the frequency table:
Character  A   D   C   B   E
Frequency  24  8   10  12  8
2. Sort the table in descending order:
Character  A   B   C   D   E
Frequency  24  12  10  8   8
• Building a binary tree, carrying out iterations until a complete binary tree is obtained:
Character  A      B      C      D      E
Frequency  24     12     10     8      8
Prob.      0.387  0.193  0.161  0.129  0.129
Code       1      000    001    010    011
Length     1      3      3      3      3
Comparison of the use of Huffman coding with the use of 8-bit ASCII or EBCDIC coding:
ASCII / EBCDIC: 5 symbols * 8 bits = 40 bits
Huffman: 13 bits
The average code size is 1 bit * (0.387) + 3 bits * (0.193 + 0.161 + 0.129 + 0.129) = 2.223 bits per symbol.
The final code table (the same tree with the 0 and 1 branch labels swapped, so the code word lengths are unchanged):
Character  A      B      C      D      E
Frequency  24     12     10     8      8
Prob.      0.387  0.193  0.161  0.129  0.129
Code       0      100    101    110    111
Length     1      3      3      3      3
L = ∑ Pi li = 2.223 bits        H = -∑ Pi log2 Pi ≈ 2.174 bits
Code efficiency = (H / L) * 100 = (2.174 / 2.223) * 100 ≈ 98%, and H ≤ L as expected.
7. Shannon-Fano coding:
• This is another approach, very similar to Huffman coding. In fact, it was the first well-known coding method.
• It was proposed by C. Shannon (Bell Labs) and R. M. Fano (MIT) in the late 1940s.
• The Shannon-Fano coding algorithm also uses the probability of each symbol's occurrence to construct a code in which each code word can be of a different length.
The Shannon-Fano idea:
1. Represent a frequently occurring character in a source with a
shorter code
2. Represent a less frequently occurring one with a longer code.
3. The total number of bits in this way of representation is,
hopefully, significantly reduced.
Given a list of symbols, the Shannon-Fano encoding algorithm involves the following steps:
1. Develop a frequency (or probability) table.
2. Sort the table according to frequency (the most frequent symbol at the top).
3. Divide the table into two halves so that the sums of the frequencies of the two halves are as close as possible.
4. Assign the upper half of the list a 1 and the lower half a 0.
5. Recursively apply the division (step 3) and assignment (step 4) to the two halves, subdividing groups and adding bits to the code words, until each symbol has become a corresponding leaf on the tree.
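A minimal sketch of these steps (my own illustration; the input must already be sorted by descending probability, and the upper group is given '1' and the lower group '0', as in step 4 above):

def shannon_fano(symbols):
    # symbols: list of (symbol, probability) pairs sorted by descending probability.
    # Returns a dict mapping symbol -> binary code word.
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(p for _, p in group)
        running, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):            # find the split point where the two
            running += group[i - 1][1]            # halves have the closest total probability
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_diff, best_i = diff, i
        upper, lower = group[:best_i], group[best_i:]
        for s, _ in upper:
            codes[s] += "1"                       # upper half gets a 1
        for s, _ in lower:
            codes[s] += "0"                       # lower half gets a 0
        split(upper)                              # recurse on both halves
        split(lower)

    split(list(symbols))
    return codes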
Example: In the example below, we will use a simple alphabet with the following frequencies of occurrence.
"AFFCCFFGFFAABFFCCBBGGBBCCCCEEFEECCDDDDCCCCDDFFFDDDADFDEEDGDEEAEECCEEEAAAEEFFAFCCCCFFFFAFFFGGGDDGFFGG"
1. Constructing the frequency table:
Character    A     B     C     D     E     F     G
Probability  0.1   0.05  0.2   0.15  0.15  0.25  0.1
2. Sort the table in descending order:
Character    F     C     D     E     A     G     B
Probability  0.25  0.2   0.15  0.15  0.1   0.1   0.05
3. Table division:
• Divide the table into two halves so that the sums of the frequencies of the two halves are as close as possible.
• Assign one bit to each group (e.g. the upper group gets 1s and the lower group 0s).
First division:   {F, C} | {D, E, A, G, B}
Second division:  {F} | {C}   and   {D, E} | {A, G, B}
Third division:   {D} | {E}   and   {A} | {G, B}
Fourth division:  {G} | {B}
5. So we have the following code (consisting of 7 code words) when the recursive process ends:
Character    F     C     D     E     A     G     B
Probability  0.25  0.2   0.15  0.15  0.1   0.1   0.05
Code         11    10    011   010   001   0001  0000
Length       2     2     3     3     3     4     4
Comparison of the use of Shannon-Fano coding with the use of 8-bit ASCII or EBCDIC coding:
ASCII / EBCDIC: 7 symbols * 8 bits = 56 bits
Shannon-Fano: 21 bits
The average code size is 2 bits * (0.25 + 0.2) + 3 bits * (0.15 + 0.15 + 0.1) + 4 bits * (0.1 + 0.05) = 2.7 bits per symbol.
L = ∑ Pi li = 2.7 bits        H = -∑ Pi log2 Pi ≈ 2.67 bits
Code efficiency = (H / L) * 100 = (2.67 / 2.7) * 100 ≈ 99%, and H ≤ L as expected.
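Applied to the sorted table of this example, the shannon_fano sketch given earlier reproduces the same code table:

symbols = [("F", 0.25), ("C", 0.2), ("D", 0.15), ("E", 0.15),
           ("A", 0.1), ("G", 0.1), ("B", 0.05)]
codes = shannon_fano(symbols)
# {'F': '11', 'C': '10', 'D': '011', 'E': '010', 'A': '001', 'G': '0001', 'B': '0000'}
L = sum(p * len(codes[s]) for s, p in symbols)   # 2.7 bits per symbol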
8. Conclusion:
Compression is used on all types of data to save space and time. There are two types of data compression: lossy and lossless. Lossy techniques are used for images, video, and audio, where we can tolerate some data loss. Lossless techniques are used for textual data, which can be encoded through run-length, Huffman, or Shannon-Fano coding.
9. References:
• David Salomon, Data Compression: The Complete Reference, 4th Edition.