5. #11 Choosing a Compression Codec
•Advantage:
•less network I/O and disk space.
•Disadvantage:
•CPU overhead.
•In short: a trade-off.
Programming Hive Reading #4 5
6. #11 Choosing a Compression Codec
•“why do we need different compression
schemes?”
•speed
•minimizing size
•‘splittable’ or not.
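The speed versus size trade-off can be observed even inside a single codec. The following is a minimal sketch, not taken from the slides, that uses the JDK's `java.util.zip.Deflater` (DEFLATE) at its fastest and its strongest level. The class name `CodecTradeoffSketch` and the log-like sample data are hypothetical, and the timings are only indicative.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class CodecTradeoffSketch {
    /** Compresses the input with DEFLATE at the given level and returns the compressed size in bytes. */
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds repetitive, log-like sample data (hypothetical, for illustration only). */
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2012-03-13\tuser").append(i % 100).append("\tGET /index.html\t200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(50_000);
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d raw=%d compressed=%d time=%dus%n",
                    level, data.length, size, micros);
        }
    }
}
```

Splittability is a separate property of the container or codec and is not modelled here.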
7. #11 Choosing a Compression Codec
•“why do we need different compression
schemes?”
http://comphadoop.weebly.com/
8. take a break : algorithm
•lossless compression
•LZ77(LZSS), LZ78, etc...
•DEFLATE (LZ77 with Huffman coding)
•LZH (LZ77 with Static Huffman coding)
•BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman coding)
•lossy compression
•used for JPEG, MPEG, etc. (omitted here)
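As an illustration of one stage in the BZIP2 pipeline listed above, here is a minimal sketch of the Move-to-Front transform. It assumes input characters below 256; the class name `MoveToFrontSketch` is hypothetical and this is not bzip2's actual implementation. Runs of the same symbol turn into runs of zeros, which a later Huffman stage can encode cheaply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MoveToFrontSketch {
    /** Initial symbol table: characters 0..255 in order (assumption: 8-bit alphabet). */
    private static List<Character> initialTable() {
        List<Character> table = new ArrayList<>();
        for (char c = 0; c < 256; c++) {
            table.add(c);
        }
        return table;
    }

    /** Replaces each character by its current table index, then moves it to the front. */
    static int[] encode(String text) {
        List<Character> table = initialTable();
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            int idx = table.indexOf(text.charAt(i));
            if (idx < 0) {
                throw new IllegalArgumentException("character outside the 8-bit alphabet");
            }
            out[i] = idx;
            table.add(0, table.remove(idx));
        }
        return out;
    }

    /** Reverses encode() by replaying the same table updates. */
    static String decode(int[] codes) {
        List<Character> table = initialTable();
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            char c = table.remove(code);
            sb.append(c);
            table.add(0, c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = encode("bbbaaaccc");
        System.out.println(Arrays.toString(codes));
        System.out.println(decode(codes));
    }
}
```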
9. take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
11. take a break : algorithm
•Burrows–Wheeler Transform(BWT)
•block sorting
•“abracadabra$” → BWT → “ard$rcaaaabb” ($ marks the end of the string)
(columns: rotations, sorted rotations, last column L = BWT output, first column F, restored text)
abracadabra$ $abracadabra a $ a
bracadabra$a a$abracadabr r a b
racadabra$ab abra$abracad d a r
acadabra$abr abracadabra$ $ a a
cadabra$abra acadabra$abr r a c
adabra$abrac adabra$abrac c a a
dabra$abraca bra$abracada a b d
abra$abracad bracadabra$a a b a
bra$abracada cadabra$abra a c b
ra$abracadab dabra$abraca a d r
a$abracadabr ra$abracadab b r a
$abracadabra racadabra$ab b r $
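The table above can be reproduced with a naive sketch: build every rotation, sort them, and read off the last column. The class name `BwtSketch` is hypothetical, and both directions are deliberately simple rather than efficient. The sketch assumes the input ends with a unique `$` that sorts before every other character.

```java
import java.util.Arrays;

final class BwtSketch {
    /** Forward transform: sort all rotations and concatenate their last characters. */
    static String transform(String s) {
        int n = s.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = s.substring(i) + s.substring(0, i);
        }
        Arrays.sort(rotations);
        StringBuilder last = new StringBuilder(n);
        for (String rotation : rotations) {
            last.append(rotation.charAt(n - 1));
        }
        return last.toString();
    }

    /** Naive inverse: repeatedly prepend the BWT column and re-sort until full rows are rebuilt. */
    static String inverse(String bwt) {
        int n = bwt.length();
        String[] table = new String[n];
        Arrays.fill(table, "");
        for (int round = 0; round < n; round++) {
            for (int i = 0; i < n; i++) {
                table[i] = bwt.charAt(i) + table[i];
            }
            Arrays.sort(table);
        }
        // The original text is the rotation that ends with the end marker.
        for (String row : table) {
            if (row.endsWith("$")) {
                return row;
            }
        }
        throw new IllegalArgumentException("input has no '$' end marker");
    }

    public static void main(String[] args) {
        String bwt = transform("abracadabra$");
        System.out.println(bwt);          // prints ard$rcaaaabb
        System.out.println(inverse(bwt)); // prints abracadabra$
    }
}
```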
12. take a break : algorithm
•BWT with Suffix Array
•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
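A minimal sketch of the suffix array idea: when the text ends with a unique, smallest end marker, sorting suffixes gives the same order as sorting rotations, so the BWT is simply the character preceding each sorted suffix. The class name `SuffixArrayBwtSketch` is hypothetical, and the suffix array is built by plain comparison sorting, which real implementations replace with faster construction algorithms.

```java
import java.util.Arrays;

final class SuffixArrayBwtSketch {
    /** Builds the suffix array by sorting start positions by the suffix they begin (naive). */
    static Integer[] suffixArray(String s) {
        Integer[] sa = new Integer[s.length()];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = i;
        }
        Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));
        return sa;
    }

    /** BWT[i] is the character just before suffix sa[i], wrapping around for position 0. */
    static String transform(String s) {
        int n = s.length();
        StringBuilder out = new StringBuilder(n);
        for (int start : suffixArray(s)) {
            out.append(s.charAt((start + n - 1) % n));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "abracadabra$"; // assumption: '$' is unique and sorts first
        System.out.println(Arrays.toString(suffixArray(text)));
        System.out.println(transform(text)); // prints ard$rcaaaabb
    }
}
```

Only the array of positions is stored, so the full table of rotations never has to be materialised.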
13. take a break : algorithm
•LZO
•“Compression is comparable in speed to
DEFLATE compression.”
•“Very fast decompression”
• http://www.oberhumer.com/opensource/lzo/
14. take a break : algorithm
•Google Snappy
•“very high speeds and reasonable
compression”
• https://code.google.com/p/snappy/
•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
15. take a break : algorithm
•LZ4
•“very fast lossless compression algorithm”
• https://code.google.com/p/lz4/
•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
16. take a break : algorithm
•“Add support for LZ4 compression”
•Fix Version: 0.23.1, 0.24.0 (also in CDH4)
•ref. https://issues.apache.org/jira/browse/HADOOP-7657
17. take a break : Implementation Codec
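// Skeleton of a custom codec. HogeCodec and HogeCompressor are placeholder
// names; bufferSize and compressionOverhead are fields not shown on the slide.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.compress.BlockCompressorStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

public class HogeCodec implements CompressionCodec {
    @Override
    public CompressionOutputStream createOutputStream(OutputStream out,
            Compressor compressor) throws IOException {
        // Wrap the raw stream so that data is compressed block by block.
        return new BlockCompressorStream(out, compressor, bufferSize,
                compressionOverhead);
    }

    @Override
    public Class<? extends Compressor> getCompressorType() {
        return HogeCompressor.class;
    }

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out)
            throws IOException {
        // Convenience overload: use a freshly created compressor.
        return createOutputStream(out, createCompressor());
    }

    @Override
    public Compressor createCompressor() {
        return new HogeCompressor();
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in)
            throws IOException {
        return createInputStream(in, createDecompressor());
    }
    ............
ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html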
23. #11 Sequence File
•Sequence File Format
• Header
• Record
• Record length
• Key length
• Key
• Value
• A sync-marker every few 100 bytes or so.
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
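The record layout above (record length, key length, key, value, with periodic sync markers) can be sketched with plain JDK streams. This is a simplified illustration, not the real SequenceFile on-disk format: there is no header, and the class name `RecordLayoutSketch`, the escape value -1, the 16-byte marker and the 200-byte interval are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RecordLayoutSketch {
    static final int SYNC_ESCAPE = -1;       // assumption: a length of -1 announces a sync marker
    static final int SYNC_INTERVAL = 200;    // assumption: emit a marker roughly every 200 bytes
    static final byte[] SYNC = new byte[16]; // assumption: fixed 16-byte marker
    static {
        Arrays.fill(SYNC, (byte) 0x5A);
    }

    /** Writes {key, value} pairs as: record length, key length, key bytes, value bytes. */
    static byte[] write(List<String[]> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            int lastSync = 0;
            for (String[] kv : records) {
                if (out.size() - lastSync >= SYNC_INTERVAL) {
                    out.writeInt(SYNC_ESCAPE);
                    out.write(SYNC);
                    lastSync = out.size();
                }
                byte[] key = kv[0].getBytes(StandardCharsets.UTF_8);
                byte[] value = kv[1].getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length + value.length); // record length
                out.writeInt(key.length);                // key length
                out.write(key);
                out.write(value);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the records back, skipping any sync markers it meets. */
    static List<String[]> read(byte[] data) {
        List<String[]> records = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        try {
            while (in.available() > 0) {
                int recordLength = in.readInt();
                if (recordLength == SYNC_ESCAPE) {
                    in.readFully(new byte[SYNC.length]);
                    continue;
                }
                byte[] key = new byte[in.readInt()];
                byte[] value = new byte[recordLength - key.length];
                in.readFully(key);
                in.readFully(value);
                records.add(new String[] {
                    new String(key, StandardCharsets.UTF_8),
                    new String(value, StandardCharsets.UTF_8)});
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

The markers give a reader that starts in the middle of a file a place to resynchronise, which is what makes this kind of file splittable.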
24. #11 Sequence File
•Compression Type
•NONE: records are not compressed
•RECORD: each record is compressed individually (only the values)
•BLOCK: keys and values are collected into blocks and compressed per block
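To see why the compression type matters, the sketch below compares compressing many small values one by one against compressing them together as a single block, using the JDK's `Deflater`. It only models the values, not SequenceFile itself; the class name `CompressionTypeSketch` and the sample values are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

final class CompressionTypeSketch {
    /** Returns the DEFLATE-compressed size of the input in bytes. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** RECORD-like: every value is compressed on its own. */
    static int recordCompressedSize(List<String> values) {
        int total = 0;
        for (String value : values) {
            total += deflatedSize(value.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** BLOCK-like: all values are concatenated and compressed once. */
    static int blockCompressedSize(List<String> values) {
        StringBuilder block = new StringBuilder();
        for (String value : values) {
            block.append(value);
        }
        return deflatedSize(block.toString().getBytes(StandardCharsets.UTF_8));
    }

    /** Small, similar-looking values (hypothetical sample data). */
    static List<String> sampleValues(int count) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            values.add("status=200 path=/index.html user=" + (i % 50));
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> values = sampleValues(1000);
        System.out.println("record-wise: " + recordCompressedSize(values) + " bytes");
        System.out.println("block-wise:  " + blockCompressedSize(values) + " bytes");
    }
}
```

Small values carry little redundancy on their own, so compressing them together usually yields a much smaller result, at the cost of having to decompress a whole block to reach one record.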
28. #15 Record Format
•TEXTFILE
•SEQUENCEFILE
•RCFILE
CREATE TABLE hoge (
  ........
)
STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
29. #15 Record Format
•RCFile(Record Columnar File)
•fast data loading
•fast query processing
•highly efficient storage space utilization
•strong adaptivity to dynamic data access patterns.
•ref. "A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems (ICDE’11)"
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
30. #15 Record Format
•RCFile Format
•records are partitioned into Row Groups
•1 HDFS block = one or more Row Groups
•Row Group
•a sync marker
•metadata header
•table data (stored column by column)
•the ‘metadata header’ section is compressed with the RLE (run-length encoding) algorithm.
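The two ideas on this slide, storing a row group column by column and run-length encoding its metadata, can be sketched as follows. This is not Hive's RCFile code: the class name `RowGroupSketch` is hypothetical, and treating the metadata as a list of per-field byte lengths is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class RowGroupSketch {
    /** Transposes row-major records into column-major order within one row group. */
    static String[][] toColumns(String[][] rows) {
        int columnCount = rows[0].length;
        String[][] columns = new String[columnCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < columnCount; c++) {
                columns[c][r] = rows[r][c];
            }
        }
        return columns;
    }

    /** Byte length of every field in one column (assumed metadata content). */
    static int[] fieldLengths(String[] column) {
        int[] lengths = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            lengths[i] = column[i].getBytes(StandardCharsets.UTF_8).length;
        }
        return lengths;
    }

    /** Run-length encodes a sequence as {value, runLength} pairs. */
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int value : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == value) {
                runs.get(runs.size() - 1)[1]++;
            } else {
                runs.add(new int[] {value, 1});
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"101", "tokyo", "ok"},
            {"102", "osaka", "ok"},
            {"103", "kyoto", "ng"},
        };
        for (String[] column : toColumns(rows)) {
            StringBuilder line = new StringBuilder(String.join(",", column)).append(" lengths:");
            for (int[] run : runLengthEncode(fieldLengths(column))) {
                line.append(' ').append(run[0]).append('x').append(run[1]);
            }
            System.out.println(line);
        }
    }
}
```

Fields in the same column often have the same length, so the length list collapses into very few runs.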
31. #15 Record Format
•Implementation of RCFile
•Input Format
•org.apache.hadoop.hive.ql.io.RCFileInputFormat
•Output Format
•org.apache.hadoop.hive.ql.io.RCFileOutputFormat
•SerDe
•org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
32. #15 Record Format
•Tuning of RCFile
•“hive.io.rcfile.record.buffer.size”
•defines the Row Group size (default: 4 MB)
33. #15 Record Format
•ref. “HDFS and Hive storage - comparing file formats and compression methods”
• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
•"In term of file size, the “RCFILE” format with
the “default” and “gz” compression achieve the
best results."
•"In term of speed, the “RCFILE” formats with the
“lzo” and “snappy” are very fast while preserving
a high compression rate."