Programming Hive
   Reading #4

    @just_do_neet
Chapters 11 and 15

            •Chapter 11. ‘Other File Formats and
             Compression’

                •Choosing / Enabling / Action / HAR / etc...

            •Chapter 15. ‘Customizing Hive File and Record
             Formats’

                •Demystifying DML / File Formats / etc...

                •exclude "SerDe" related topics at this
                 presentation...


Programming Hive Reading #4                                    3
#11 Determining Installed Codecs

        $ hive -e "set io.compression.codecs"
        io.compression.codecs=
         org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         com.hadoop.compression.lzo.LzoCodec,
         org.apache.hadoop.io.compress.SnappyCodec




Programming Hive Reading #4                            4
#11 Choosing a Compression Codec

            •Advantage:

                •less network I/O, less disk space.

            •Disadvantage:

                •CPU overhead.

            •In short: it's a trade-off.




Programming Hive Reading #4                  5
#11 Choosing a Compression Codec

            •“why do we need different compression
             schemes?”

                •speed

                •minimizing size

                •‘splittable’ or not.




Programming Hive Reading #4                          6
#11 Choosing a Compression Codec

            •“why do we need different compression
             schemes?”




                              http://comphadoop.weebly.com/




Programming Hive Reading #4                                   7
take a break : algorithm

            •lossless compression

                •LZ77(LZSS), LZ78, etc...

                     •DEFLATE (LZ77 with Huffman coding)

                     •LZH (LZ77 with Static Huffman coding)

                •BZIP2 (Burrows–Wheeler transform, Move-to-Front, Huffman coding)

            •lossy

                •for JPEG, MPEG,etc...(snip.)
Programming Hive Reading #4                                   8
take a break : algorithm




                        http://www.slideshare.net/moaikids/ss-2638826



Programming Hive Reading #4                                             9
take a break : algorithm




                        http://www.slideshare.net/moaikids/ss-2638826



Programming Hive Reading #4                                             10
take a break : algorithm

            •Burrows–Wheeler Transform(BWT)

                •block sorting

            •bwt(“abracadabra$”) = “ard$rcaaaabb” (L = last column of the sorted rotations)

                  rotations      sorted         F   L
                  abracadabra$   $abracadabra   $   a
                  bracadabra$a   a$abracadabr   a   r
                  racadabra$ab   abra$abracad   a   d
                  acadabra$abr   abracadabra$   a   $
                  cadabra$abra   acadabra$abr   a   r
                  adabra$abrac   adabra$abrac   a   c
                  dabra$abraca   bra$abracada   b   a
                  abra$abracad   bracadabra$a   b   a
                  bra$abracada   cadabra$abra   c   a
                  ra$abracadab   dabra$abraca   d   a
                  a$abracadabr   ra$abracadab   r   b
                  $abracadabra   racadabra$ab   r   b



Programming Hive Reading #4                                 11
take a break : algorithm

            •BWT with Suffix Array

                •ref. http://d.hatena.ne.jp/naoya/20081016/1224173077

                •ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt




Programming Hive Reading #4                                                          12
take a break : algorithm

            •LZO

                •“Compression is comparable in speed to
                 DEFLATE compression.”

                •“Very fast decompression”
                • http://www.oberhumer.com/opensource/lzo/




Programming Hive Reading #4                                  13
take a break : algorithm

            •Google Snappy

                •“very high speeds and reasonable
                 compression”
                • https://code.google.com/p/snappy/


            •ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889




Programming Hive Reading #4                                                       14
take a break : algorithm

            •LZ4

                •“very fast lossless compression algorithm”
                • https://code.google.com/p/lz4/


            •ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4




Programming Hive Reading #4                                              15
take a break : algorithm

            •“Add support for LZ4 compression”

                •fix versions: 0.23.1, 0.24.0 (CDH4)

                •ref. https://issues.apache.org/jira/browse/HADOOP-7657
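
              A minimal sketch of actually selecting LZ4 once it is available; the
              Lz4Codec class name comes from HADOOP-7657, and whether it works
              depends on your Hadoop build:

              SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;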




Programming Hive Reading #4                                               16
take a break : Implementing a Codec

  ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html

  public class HogeCodec implements CompressionCodec {

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out,
                                                      Compressor compressor)
        throws IOException {
      return new BlockCompressorStream(out, compressor, bufferSize,
          compressionOverhead);
    }

    @Override
    public Class<? extends Compressor> getCompressorType() {
      return HogeCompressor.class;
    }

    @Override
    public CompressionOutputStream createOutputStream(OutputStream out)
        throws IOException {
      return createOutputStream(out, createCompressor());
    }

    @Override
    public Compressor createCompressor() {
      return new HogeCompressor();
    }

    @Override
    public CompressionInputStream createInputStream(InputStream in)
        throws IOException {
      return createInputStream(in, createDecompressor());
    }
    ............
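
  To make Hadoop and Hive pick such a custom codec up, its class name must be
  added to io.compression.codecs (normally in core-site.xml; shown here as a
  session setting, with a hypothetical package for the HogeCodec above):

  SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.SnappyCodec,com.example.HogeCodec;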

Programming Hive Reading #4                                                                         17
#11 Enabling Compression

            •Intermediate Compression(hive, mapred)

            •Final Output Compression(hive, mapred)




Programming Hive Reading #4                           18
#11 Enabling Compression

            •Intermediate Compression(hive, mapred)

                •setting enable flag
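
                  A minimal sketch of the flag behind this slide (property names as
                  used by Hive and Hadoop 1.x):

                  SET hive.exec.compress.intermediate=true;
                  -- Hive then sets the underlying Hadoop flag for map output:
                  -- mapred.compress.map.output=true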




Programming Hive Reading #4                           19
#11 Enabling Compression

            •Intermediate Compression(hive, mapred)

                •setting codec
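
                  For example, picking a fast codec for the intermediate map output
                  (SnappyCodec is just one of the codecs listed on slide 4):

                  SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;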




Programming Hive Reading #4                           20
#11 Enabling Compression

            •Final Output Compression(hive, mapred)

                •setting enable flag
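
                  The corresponding flag for the final job output, as a sketch:

                  SET hive.exec.compress.output=true;
                  -- equivalent Hadoop-level flag: mapred.output.compress=true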




Programming Hive Reading #4                           21
#11 Enabling Compression

            •Final Output Compression(hive, mapred)

                •setting codec
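
                  For example, gzip for the final output (the splittability
                  considerations from the earlier slides apply here):

                  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;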




Programming Hive Reading #4                           22
#11 Sequence File

            •Sequence File Format


                • Header
                • Record
                     • Record length
                     • Key length
                     • Key
                     • Value
                • A sync-marker every few 100 bytes or so.
                  http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html




Programming Hive Reading #4                                                                23
#11 Sequence File

            •Compression Type

                •NONE : no compression

                •RECORD : each record is compressed individually

                •BLOCK : records are compressed together in blocks
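
                  For example, writing a BLOCK-compressed SequenceFile from Hive
                  (the hoge_seq table name is hypothetical):

                  SET hive.exec.compress.output=true;
                  SET mapred.output.compression.type=BLOCK;
                  CREATE TABLE hoge_seq STORED AS SEQUENCEFILE
                  AS SELECT * FROM hoge;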




Programming Hive Reading #4                          24
#11 Compression in Action

            •(DEMO)
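
              The demo itself is not in the slides; a minimal sketch of the kind of
              session it likely walks through (table names hypothetical, default
              warehouse path assumed):

              SET hive.exec.compress.output=true;
              SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
              CREATE TABLE hoge_gz AS SELECT * FROM hoge;
              -- the table's files should now carry a .gz suffix:
              dfs -ls /user/hive/warehouse/hoge_gz/;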




Programming Hive Reading #4          25
#11 Archive Partition

            •Using ‘HAR’

                •ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html

            •Archiving
              $ SET hive.archive.enabled=true;
              $ ALTER TABLE hoge ARCHIVE PARTITION(folder='fuga');


            •Unarchiving
              $ ALTER TABLE hoge UNARCHIVE PARTITION(folder='fuga');




Programming Hive Reading #4                                                       26
Break :)
#15 Record Format

            •TEXTFILE

            •SEQUENCEFILE

            •RCFILE

              CREATE TABLE hoge (
              ........
              )
              STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE];




Programming Hive Reading #4                              28
#15 Record Format

            •RCFile(Record Columnar File)

                •fast data loading

                •fast query processing

                •highly efficient storage space utilization

                •a strong adaptivity to dynamic data access
                 patterns.

            •ref. "A Fast and Space-efficient Data Placement Structure in
              MapReduce-based Warehouse Systems (ICDE’11)"
              http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
Programming Hive Reading #4                                                      29
#15 Record Format

            •RCFile Format
                •1 RCFile consists of multiple row groups

                •1 HDFS block holds one or more row groups

                •Row Group
                     •a sync marker
                     •metadata header
                     •table data

                •uses the RLE algorithm to compress the ‘metadata header’ section.
Programming Hive Reading #4                                     30
#15 Record Format

            •Implementation of RCFile

                •Input Format

                     •o.a.h.h.ql.io.RCFileInputFormat

                •Output Format

                     •o.a.h.h.ql.io.RCFileOutputFormat

                •SerDe

                     •o.a.h.h.serde2.columnar.ColumnarSerDe
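
                  Spelled out as DDL, these classes are what STORED AS RCFILE
                  expands to (table and column names below are hypothetical):

                  CREATE TABLE hoge_rc (col1 STRING, col2 INT)
                  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
                  STORED AS
                    INPUTFORMAT  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
                    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';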

Programming Hive Reading #4                                   31
#15 Record Format

            •Tuning of RCFile

                •“hive.io.rcfile.record.buffer.size”

                     •defines the “RowGroup” size (default: 4 MB)
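
                       For example, doubling the row group size to 8 MB (value in
                       bytes; a larger row group can improve compression at the
                       cost of more memory per writer):

                       SET hive.io.rcfile.record.buffer.size=8388608;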




Programming Hive Reading #4                                 32
#15 Record Format

            •ref. “HDFS and Hive storage - comparing file
             formats and compression methods”
                • http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/


            •"In term of file size, the “RCFILE” format with
             the “default” and “gz” compression achieve the
             best results."

            •"In term of speed, the “RCFILE” formats with the
             “lzo” and “snappy” are very fast while preserving
             a high compression rate."

Programming Hive Reading #4                                                          33
#Appendix - trevni

            •ref. https://github.com/cutting/trevni/

            •ref. http://avro.apache.org/docs/current/trevni/spec.html




Programming Hive Reading #4                                              34
#Appendix - trevni

       [diagram: trevni file layout]

         file             = file header, column, column, ... , column
         file header      = magic, number of rows, number of columns,
                            file metadata, column metadata (name, type, codec, etc.),
                            column start positions
         column           = number of blocks, block descriptor(s), block, block, ...
         block            = row, row, ... , row
         block descriptor = number of rows, uncompressed bytes, compressed bytes

Programming Hive Reading #4                                                                                                  35
#Appendix - ORCFile


            •ref. http://hortonworks.com/blog/100x-faster-hive/

            •ref. https://issues.apache.org/jira/browse/HIVE-3874

            •ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx




Programming Hive Reading #4                                36
#Appendix - ORCFile


            •ref. data size




Programming Hive Reading #4    37
#Appendix - ORCFile


            •ref. comparison




Programming Hive Reading #4    38
#Appendix - Column-Oriented Storage


            •ref. http://arxiv.org/pdf/1105.4252.pdf




Programming Hive Reading #4                            39
#Appendix - more informations




          http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=

Programming Hive Reading #4                                                     40
Thanks for listening :)
