International Journal of Modern Computer Science and Applications (IJMCSA) ISSN: 2321-2632 (Online)
Volume No.-1, Issue No.-2, May, 2013
RES Publication © 2012, http://www.resindia.org
Sunzip user tool for data reduction using Huffman algorithm
Ramesh Jangid #1
M.Tech-Computer Science
Jagannath University
Jaipur, India
E-mail: engr.ramesh29@gmail.com
Sandeep Kumar#2
Asst. Prof., Computer Science
Jagannath University
Jaipur, India
E-mail: sandpoonia@gmail.com
Abstract: Smart Huffman Compression is a software appliance designed to compress a file in a better way. Implemented as a JSP application, it builds on the high-level abstraction that JSP provides over Java Servlets. Smart Huffman Compression encodes digital information using fewer bits, reducing the size of a file without loss of data, in a single, easy-to-manage software appliance form factor. It also provides a decompression facility. Smart Huffman Compression offers an organization an effective solution for reducing file size through lossless compression of data, and its encoding functionality also supports data security. It is necessary to analyze the relationship between the different methods and put them into a framework in order to better understand and better exploit the possibilities that compression provides for image compression, data compression, audio compression, video compression, etc. [1]
Keywords: Data Reduction, Java Servlet, Compression, Encoding, JSP
I. INTRODUCTION
Smart Huffman Compression/Decompression is a software application designed to simplify the compression of a file and make more efficient use of disk space. It also allows better utilization of bandwidth when transferring data. The forms of data that are easy to manage through this application are:
Data Compression: Simplifies the compression of text in digital form. The text is encoded using fewer bits, and the original text is replaced by those bits.
Image Compression: Includes segmentation, filtering of pixels and altering of colours to reduce the size of a digital image.
Audio Compression: Helps to reduce the size of digital audio streams and files. It has the potential to reduce the transmission bandwidth and storage requirements of audio data.
Video Compression: Reduces the size of digital video streams and files. It combines spatial image compression with temporal motion compensation and is a practical implementation of source coding in information theory. Video compression typically operates on square-shaped groups of neighbouring pixels, often called macroblocks.
The concept behind the Huffman algorithm is that it uses a variable-length code for each of the elements within the information. It analyzes the information to determine the probability of each element: the most probable elements are coded with a few bits and the least probable with a greater number of bits.
II. HUFFMAN ALGORITHM
The Huffman algorithm is a compression technique with variable-length codes. Given the data symbols and their frequencies of occurrence (their probabilities), it constructs a set of variable-length codewords with the shortest average length and assigns them to the symbols. It generally produces better codes than the Shannon-Fano method, and like that method, it produces the best variable-length codes when the probabilities of the symbols are negative powers of 2. The main difference between the two methods is that Shannon-Fano constructs its codes from top to bottom (and the bits of each codeword from left to right), while Huffman constructs a code tree from the bottom up (and the bits of each codeword from right to left).
Huffman Encoding Algorithm
Step 1: Find the frequency of occurrence or probability of each symbol in the given text.
Step 2: List all the source symbols in order of decreasing probability in a tabular format.
Step 3: Combine the probabilities of the two symbols having the lowest probabilities, and reorder the resulting probabilities in decreasing order. This step is called reduction 1.
Step 4: Repeat step 3 until only two ordered probabilities remain.
Step 5: Now go back and assign 0 and 1 to the two probabilities that were combined in the last reduction step, retaining all assignments already made.
Step 6: Keep regressing this way until the first column is reached.
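A minimal sketch of these steps in Java is given below (illustrative only; class and method names such as HuffmanSketch, Node and buildCodes are not part of the SunZip tool). It builds the tree bottom-up with a priority queue by repeatedly merging the two least probable nodes and then reads the 0/1 assignments off the finished tree. The exact bit patterns depend on how ties are broken, but the code lengths agree with the worked example that follows.

import java.util.*;

public class HuffmanSketch {
    static class Node {
        final char symbol;      // meaningful only for leaf nodes
        final int freq;
        final Node left, right;
        Node(char symbol, int freq, Node left, Node right) {
            this.symbol = symbol; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Steps 1-2: count frequencies; steps 3-4: repeatedly merge the two least
    // probable nodes; steps 5-6: read the 0/1 assignments off the finished tree.
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));

        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node('\0', a.freq + b.freq, a, b));   // internal node
        }

        Map<Character, String> codes = new HashMap<>();
        if (!pq.isEmpty()) assign(pq.peek(), "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) { codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        System.out.println(buildCodes("SIDVICIIISIDIDVI"));
    }
}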
Example
Let the given text be:
SIDVICIIISIDIDVI
There are five symbols in this text. We now find the probability of each symbol from the text. These are:
Symbol Probability
'C' 1/16 0.0625
'D' 3/16 0.1875
'I' 8/16 0.5
'S' 2/16 0.125
'V' 2/16 0.125
Now, according to step 2:
Symbol Probability
'I' 0.5
'D' 0.1875
'S' 0.125
'V' 0.125
'C' 0.0625
According to steps 3, 4 and 5:
Figure 1: Procedure of Huffman Encoding
Now we can write the codes for particular symbols as:
Symbols Code Code Length
'C' 1001 4
'D' 11 2
'I' 0 1
'S' 101 3
'V' 1000 4
From the code lengths we can calculate the average code length for this text.
Formula:
L = Σ (i = 1 to m) Pi × Ni
where Pi = probability of the symbol at index i
and Ni = code length of the symbol at index i.
So L = (0.0625 × 4) + (0.1875 × 2) + (0.5 × 1) + (0.125 × 3) + (0.125 × 4)
L = 2
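The same figure can be checked with a few lines of Java (a throwaway snippet, not part of the tool); the probabilities and code lengths are the ones listed above.

public class AverageCodeLength {
    public static void main(String[] args) {
        double[] p = {0.0625, 0.1875, 0.5, 0.125, 0.125};  // probabilities of C, D, I, S, V
        int[]    n = {4, 2, 1, 3, 4};                      // corresponding code lengths
        double L = 0;
        for (int i = 0; i < p.length; i++) L += p[i] * n[i];
        System.out.println("Average code length L = " + L); // prints 2.0
    }
}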
Encoded Message becomes:
SIDVICIIISIDIDVI =
101 0 11 1000 0 1001 0 0 0 101 0 11 0 11 1000 0
The spaces are only to make reading easier. The compressed output takes 32 bits, and we need at least 10 bits to transfer the Huffman tree by sending the code lengths. The message originally took 48 bits; now it takes at least 42 bits.
The codes are used to construct the Huffman Tree.
Huffman Tree
Figure 2: Huffman Tree
Huffman Decoding
The codes of each symbol are based on the probabilities or
frequencies of occurrence of the symbols. The probabilities or
frequencies have to be written, as side information, on the
output, so that any Huffman decompressor (decoder) will be
able to decompress the data. This is easy, because the
frequencies are integers and the probabilities can be written as
scaled integers. It normally adds just a few hundred bytes to
the output. It is also possible to write the variable-length codes
themselves on the output, but this may be awkward, because
the codes have different sizes.
The algorithm for decoding is simple. Start at the root and read
the first bit off the input i.e. the compressed file. If it is zero,
follow the bottom edge of the tree; if it is one, follow the top
edge. Read the next bit and move another edge toward the
leaves of the tree. When the decoder arrives at a leaf, it finds
there the original, uncompressed symbol (normally its ASCII
code), and that code is emitted by the decoder. The process
starts again at the root with the next bit.
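This procedure can be sketched in Java as follows (assuming the Node class and tree from the earlier encoding sketch, with the root being the single node left in the priority queue, and taking the compressed input as a string of '0'/'1' characters; in this sketch 0 selects the left child and 1 the right child, matching the code assignment used there).

// Walk the tree one bit at a time; every time a leaf is reached, emit its symbol
// and restart at the root. Assumes the Node class from the encoding sketch above.
static String decode(Node root, String bits) {
    StringBuilder out = new StringBuilder();
    Node cur = root;
    for (int i = 0; i < bits.length(); i++) {
        cur = (bits.charAt(i) == '0') ? cur.left : cur.right;
        if (cur.isLeaf()) {
            out.append(cur.symbol);   // a leaf holds an original, uncompressed symbol
            cur = root;               // start again at the root for the next codeword
        }
    }
    return out.toString();
}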
III. HUFFMAN PERFORMANCE
Huffman coding has been the subject of intensive research in data compression; one line of work takes an algebraic approach to constructing the Huffman code. Robert Gallager showed that the redundancy of Huffman coding is at most p1 + 0.086, where p1 is the probability of the most common symbol in the alphabet.
The redundancy is the difference between the average
Huffman codeword length and the entropy. Given a large
alphabet, such as the set of letters, digits and punctuation
marks used by a natural language, the largest symbol
probability is typically around 15–20%, bringing the value of
the quantity p1 + 0.086 to around 0.1. This means that
Huffman codes are at most 0.1 bit longer per symbol than an
ideal entropy encoder, such as arithmetic coding. The Huffman
method assumes that the frequencies of occurrence of all the
symbols of the alphabet are known to the compressor. In
practice, the frequencies are seldom, if ever, known in
advance. One approach to this problem is for the compressor
to read the original data twice. The first time, it only counts
the frequencies; the second time, it compresses the data.
Between the two passes, the compressor constructs the
Huffman tree. Such a two-pass method is sometimes called
semi-adaptive and is normally too slow to be practical. The
method that is used in practice is called adaptive (or dynamic)
Huffman coding. This method is the basis of the UNIX
compact program. The method was originally developed by
Faller and Gallager, with substantial improvements by Knuth.
The main idea is for the compressor and the decompressor to start with an empty Huffman tree and to modify it as symbols are being read and processed (in the case of the compressor, "processed" means compressed; in the case of the decompressor, it means decompressed). The compressor and decompressor should modify the tree in the same way, so at any point in the process they should use the same codes, although those codes may change from step to step. We say that the compressor and decompressor are
synchronized or that they work in lockstep, although they
don't necessarily work together; compression and
decompression normally take place at different times. The
term mirroring is perhaps a better choice. The decoder mirrors
the operations of the encoder. Initially, the compressor starts
with an empty Huffman tree. No symbols have been assigned
codes yet. The first symbol being input is simply written on
the output in its uncompressed form. The symbol is then added
to the tree and a code assigned to it. The next time this symbol
is encountered, its current code is written on the output, and its
frequency incremented by 1. Since this modifies the tree, the
tree is examined to see whether it is still a Huffman tree (best
codes). If not, it is rearranged, an operation that results in
modified codes.
The decompressor mirrors the same steps. When it reads the uncompressed form of a symbol, it adds it to the tree and assigns it a code. When it reads a compressed variable-length code, it scans the current tree to determine which symbol the code belongs to, increments the symbol's frequency, and rearranges the tree in the same way as the compressor. It is immediately clear that the decompressor needs to know whether the item it has just input is an uncompressed symbol (normally an 8-bit ASCII code) or a variable-length code. To remove any ambiguity, each uncompressed symbol is preceded by a special, variable-size escape code. When the decompressor reads this code, it knows that the next eight bits are the ASCII code of a symbol that appears in the compressed file for the first time.
The trouble is that the escape code should not be any of the variable-length codes used for the symbols. These codes, however, are modified every time the tree is rearranged, which is why the escape code should also be modified. A natural way to do this is to add an empty leaf to the tree, a leaf with a zero frequency of occurrence, that is always assigned to the 0-branch of the tree. Since the leaf is in the tree, it is assigned a variable-length code. This code is the escape code that precedes every uncompressed symbol. As the tree is rearranged, the position of the empty leaf (and thus its code) changes, but this escape code is always used to identify uncompressed symbols in the compressed file.
IV. ADVANCEMENTS IN HUFFMAN
Huffman coding is a process that replaces fixed-length symbols of 8-bit bytes with variable-length codes. GNU zip, also known as GZIP, is a compression technique originally intended to replace the compress program used in early Unix systems. GZIP is an advancement over the plain Huffman algorithm. It is based on an algorithm known as DEFLATE, which is also a lossless data compression algorithm and uses both the LZ77 algorithm and Huffman coding. Essentially, GZIP refers to the file format of the same name. The format consists of a 10-byte header containing a magic number (a numerical or text value that never changes and is used to signify a file format or protocol), extra headers that may or may not actually be present (the original file name, for example), a body containing the DEFLATE-compressed payload (the data that the headers describe), and an 8-byte footer containing a CRC-32 checksum as well as the length of the original uncompressed data. It is typically used when a huge file is compressed and is very beneficial when we need to save space and time. Because GZIP compresses one large file instead of multiple smaller ones, it can take advantage of the redundancy across the files to reduce the file size even further. GZIP is purely a compression tool for compressing a file; it relies on another tool, tar, to archive files. Compression is a technique used to reduce the size of a file, while archiving is a technique used to combine multiple files into a single one; GZIP therefore archives all the files into a single tarball before compression. GZIP is used on UNIX-like operating systems such as the Linux distributions.
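As an illustration (not part of the SunZip tool), the standard java.util.zip classes expose this DEFLATE-based format directly. The sketch below compresses a file and decompresses it again; the file names are placeholders, and InputStream.transferTo requires Java 9 or later.

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    public static void main(String[] args) throws IOException {
        // Compress input.txt into input.txt.gz using DEFLATE (LZ77 + Huffman coding).
        try (InputStream in = new FileInputStream("input.txt");
             OutputStream out = new GZIPOutputStream(new FileOutputStream("input.txt.gz"))) {
            in.transferTo(out);
        }
        // Decompress it back; the GZIP header and CRC-32 footer are handled automatically.
        try (InputStream in = new GZIPInputStream(new FileInputStream("input.txt.gz"));
             OutputStream out = new FileOutputStream("input_restored.txt")) {
            in.transferTo(out);
        }
    }
}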
V. BENEFITS OF HUFFMAN ENCODING
Huffman encoding is one of the best compression techniques. It is fast, simple and easy to implement. It starts with a set of symbols whose probabilities are known and constructs a code tree. When the tree is completed, it determines the variable-length prefix codewords for the individual symbols in the text. The Huffman compression algorithm is typically deployed in one of the following ways.
The implementer of a Huffman compressor/decompressor selects a set of documents that are judged to be typical. The implementer analyses the documents and counts the occurrences of each symbol. Based on these occurrences, he constructs the Huffman code tree. These codes may not conform to the symbol probabilities of the particular input file being compressed, but this approach is simple and fast, so it is used in fax machines.
A two-pass compression job produces the ideal codewords for the input file, but the input file is read twice, so this approach is slow. In the first pass, the encoder counts the symbol occurrences and determines the probability of each symbol. It uses this information to construct the Huffman codewords for the input file being compressed. In the second pass, the encoder actually compresses the data by replacing each symbol with its respective codeword (a minimal sketch of this approach follows this list).
Adaptive Huffman compression starts with an empty Huffman code tree and updates the tree as the input symbols are read and processed. When a symbol is input, the tree is searched for it. If the symbol is in the tree, its codeword is used; otherwise, it is added to the tree and a new codeword is assigned to it. In that case the tree is examined and rearranged to keep it a Huffman code tree. This process has to be done carefully to make sure that the decoder can perform it in the same way as the encoder, in lockstep. This is difficult to implement.
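The two-pass approach in the second item can be sketched as follows, reusing the illustrative buildCodes() helper from the earlier Huffman sketch; the input file name is a placeholder, the output is left as an in-memory string of bits for brevity, and Files.readString requires Java 11 or later.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class TwoPassDemo {
    public static void main(String[] args) throws Exception {
        String text = Files.readString(Path.of("input.txt"));   // placeholder file name

        // Pass 1: count symbol occurrences and derive the Huffman codewords.
        Map<Character, String> codes = HuffmanSketch.buildCodes(text);

        // Pass 2: replace each symbol with its codeword.
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(codes.get(c));

        System.out.println("Original size (8 bits per symbol): " + text.length() * 8 + " bits");
        System.out.println("Huffman-coded size:                " + bits.length() + " bits");
    }
}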
VI. SAMPLE SCREENSHOTS
Step-by-step screenshots:
Figure 1
Figure 2
These screenshots show the SunZip tool for compression, which displays the original file size, the number of distinct characters, the compressed file size and the compression ratio. Full details for data file formats such as text and MP3 are shown in the tables below.
VII. COMPARISON TABLE
Type of File: TXT

Algorithm Name    | S No | Original File Size | Compressed File Size | Compression Ratio | Distinct Characters | Best
HUFFMAN           | 1    | 1702   | 1081   | 63.51%  | 50  |
                  | 2    | 334    | 321    | 96.11%  | 45  |
                  | 3    | 48890  | 32249  | 65.96%  | 93  |
SHANNON FANO      | 1    | 1702   | 1114   | 65.45%  | 50  |
                  | 2    | 334    | 331    | 99.10%  | 45  |
                  | 3    | 48890  | 33666  | 68.86%  | 93  |
GZIP              | 1    | 1702   | 812    | 47.71%  | -   | yes
                  | 2    | 334    | 183    | 54.79%  | -   |
                  | 3    | 48890  | 10734  | 21.96%  | -   |
COSMO             | 1    | 1702   | 1335   | 78.44%  | 50  |
                  | 2    | 334    | 304    | 91.02%  | 45  |
                  | 3    | 48890  | 42880  | 87.71%  | 93  |
JUNK CODE BINARY  | 1    | 1702   | 1205   | 70.80%  | 50  |
                  | 2    | 334    | 276    | 82.63%  | 45  |
                  | 3    | 48890  | 37950  | 77.62%  | 93  |
LZW               | 1    | 1702   | 1273   | 74.79%  | -   | yes
                  | 2    | 334    | 333    | 99.70%  | -   |
                  | 3    | 48890  | 23058  | 47.16%  | -   |

Table 1
Remark:
Time: GZIP, HUFFMAN, SHANNON FANO, JUNK CODE BINARY, LZW
Compression order: GZIP, LZW, HUFFMAN, JUNK CODE BINARY, SHANNON FANO, COSMO
Space required: GZIP, LZW, HUFFMAN, JUNK CODE BINARY, SHANNON FANO, COSMO
In Table 1 the file format is a text file, and the table shows the data reduction techniques distinguished by time, compression ratio and space required. Table 1 indicates that GZIP and LZW are the better techniques for compressing text files.
In the next comparison, Table 2, the file format is MP3, for which Huffman and GZIP are the better techniques for data compression; for this format Huffman therefore provides better data reduction.
Type of File: MP3
Algorithm Name    | S No | Original File Size | Compressed File Size | Compression Ratio | Distinct Characters | Better Algo
HUFFMAN           | 1    | 7323106 | 7294912  | 99.62%  | 256 | yes
                  | 2    | 4789888 | 4781161  | 99.82%  | 256 |
                  | 3    | 5933509 | 5904888  | 99.52%  | 256 |
SHANNON FANO      | 1    | 7323106 | 7404275  | 101.11% | 256 |
                  | 2    | 4789888 | 4865326  | 101.57% | 256 |
                  | 3    | 5933509 | 5986265  | 100.89% | 256 |
GZIP              | 1    | 7323106 | 7223973  | 98.65%  | -   | yes
                  | 2    | 4789888 | 4733205  | 98.82%  | -   |
                  | 3    | 5933509 | 5846223  | 98.53%  | -   |
JUNK CODE BINARY  | 1    | 7323106 | 7854595  | 107.26% | 256 |
                  | 2    | 4789888 | 5153463  | 107.59% | 256 |
                  | 3    | 5933509 | 6363777  | 107.25% | 256 |
RLE               | 1    | 7323106 | 7411408  | 101.21% | -   |
                  | 2    | 4789888 | 4834808  | 100.94% | -   |
                  | 3    | 5933509 | 5994496  | 101.03% | -   |
LZW               | 1    | 7323106 | 10186483 | 139.10% | -   |
                  | 2    | 4789888 | 6820672  | 142.40% | -   |
                  | 3    | 5933509 | 8217964  | 138.50% | -   |

Table 2
Remark:
Time: RLE, GZIP, HUFFMAN, SHANNON FANO, JUNK CODE BINARY, LZW
Compression order: GZIP, HUFFMAN, RLE, SHANNON FANO, JUNK CODE BINARY, LZW
Space required: GZIP, HUFFMAN, RLE, SHANNON FANO, JUNK CODE BINARY, LZW
VIII. CONCLUSION
Huffman Algorithm is a lossless compression technique.
Huffman is the most efficient but requires two passes over the
data. The amount of compression, of course, depends on the
type of file being compressed. Random data, such as
executable programs or object code files, typically has low
compression resulting in a file which is 50 to 95% of the
original file size. Still images and animation files tend to have
high compression and typically result in a file which is
between only 2 and 20% of the original file size. It should be
noted that once a file has been compressed there is virtually no
gain in compressing it again. Thus storing or transmitting
compressed files over a system which has further compression
will not increase the compression ratio. Huffman codes are
used to differentiate between data, i.e., literal values and back references.
IX. REFERENCES
[1] D. W. Gillman, M. Mohtashemi, and R. L. Rivest, "On breaking a Huffman code," IEEE Trans. Inform. Theory, vol. 42, no. 3, pp. 972-976, May 1996.
[2] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, May 1977.
[3] Mridual K. M., "Lossless Huffman coding technique for image compression and reconstruction using binary trees," IJCTA, vol. 3, no. 1, pp. 76-79, Feb. 2012.
[4] A. B. Watson, "Image Compression Using the DCT," Mathematica Journal, 1995, pp. 81-88.
[5] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. 23, pp. 337-342, 1977.
[6] D. E. Knuth, "Dynamic Huffman Coding," Journal of Algorithms, vol. 6, 1983, pp. 163-180.
[7] Dzung Tien Hoang and Jeffrey Scott Vitter, "Fast and Efficient Algorithms for Video Compression and Rate Control," June 20, 1998.
AUTHORS' BIOGRAPHIES
First Author
Ramesh Jangid is an M.Tech (Computer Science) student in computer science engineering at Jagannath University, Jaipur, and a member of IACSIT and IAENG. He completed his B.E. in computer science engineering from Rajasthan University, Jaipur, in 2008. His specializations are data structures, computer networking, Red Hat Linux, real-time systems and cloud computing.
Second Author
Mr. Sandeep Kumar is an Assistant Professor in the computer science department at Jagannath University, Jaipur. He holds an M.Tech, is pursuing a Ph.D., and has published various journal and international papers. He is a member of IACSIT and IAENG. His areas of specialization are data structures, computer networks, artificial intelligence and database management systems.