Efficient Object Model in Java


Slides by Zheng Shao, Facebook
Part of Apache Hadoop Hive Project
Object Inspector
On-disk Data Format
▪   Single on-disk format systems
    ▪   Simplicity
▪   Multiple on-disk format systems
    ▪   Ease-of-use
    ▪   Ease-of-integration
    ▪   Flexibility: better trade-off between space, performance, etc.
▪   Hive allows multiple on-disk formats
Example: Multiple On-disk Formats
▪   File Format:
    ▪   Row-based
    ▪   Column-based
    ▪   Block-based
▪   Row format:
    ▪   Text-based
    ▪   Binary-based
    ▪   Customized
▪   Index format
In-memory Data Format
▪   Single in-memory format systems
    ▪   Simplicity: simpler code
▪   Multiple in-memory format systems
    ▪   Ease-of-integration: other systems may use their own formats
    ▪   Performance:
        ▪   Multiple on-disk formats/external formats + efficient loading => multiple in-memory formats
▪   Hive allows multiple in-memory formats
Example: Multiple In-memory Formats
▪   Integer:
    ▪   Integer
    ▪   IntWritable
    ▪   LazyInteger
▪   String:
    ▪   String
    ▪   Text
Multiple In-memory Format Design Patterns
▪   Object-oriented:
    ▪   A single interface/base class for Integer
    ▪   Multiple derived classes
▪   Delegation:
    ▪   data stored in object
    ▪   format/operations stored in objectInspector
    ▪   a pair of object and objectInspector represents a data unit
▪   It's possible to wrap either one up to conform to the other's pattern.
Multiple In-memory Format Design Patterns
▪   In OO, we need an interface HiveInteger to represent Integers
    ▪   Make the Integer and IntWritable classes all implement it.
    ▪   However, the Integer class is final (not extendable) and does not implement HiveInteger
    ▪   We need to do a conversion every time we exchange data with UDF, SerDe (Thrift), or other libraries (unless they know HiveInteger, which is a bad assumption to make in an open system).
▪   Delegation will be a better idea because
    ▪   For Integer, we have a JavaIntegerObjectInspector
    ▪   For IntWritable, we have a WritableIntegerObjectInspector
    ▪   We convert parameters and return values only if necessary
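The delegation idea above can be sketched in a few lines of Java. This is a minimal illustration, not Hive's actual API: the inspector class names are taken from the slide, while MutableInt (a stand-in for Hadoop's IntWritable so the example needs no external library), the get method and the sum operator are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the class names follow the slide, not Hive's real API.
class IntegerDelegationSketch {

    /** Stand-in for Hadoop's IntWritable: a mutable, reusable int holder. */
    static final class MutableInt {
        int value;
        MutableInt(int value) { this.value = value; }
    }

    /** The inspector, not the data object, knows how to read the int. */
    interface IntegerObjectInspector {
        int get(Object o);
    }

    /** Reads java.lang.Integer, which is final and cannot implement HiveInteger. */
    static final class JavaIntegerObjectInspector implements IntegerObjectInspector {
        @Override public int get(Object o) { return (Integer) o; }
    }

    /** Reads the mutable holder in place, with no conversion to Integer. */
    static final class WritableIntegerObjectInspector implements IntegerObjectInspector {
        @Override public int get(Object o) { return ((MutableInt) o).value; }
    }

    /** An operator written once against the inspector works on either format. */
    static long sum(List<?> values, IntegerObjectInspector oi) {
        long total = 0;
        for (Object v : values) {
            total += oi.get(v);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> boxed = Arrays.asList(1, 2, 3);
        List<MutableInt> holders = Arrays.asList(new MutableInt(10), new MutableInt(20));
        System.out.println(sum(boxed, new JavaIntegerObjectInspector()));       // prints 6
        System.out.println(sum(holders, new WritableIntegerObjectInspector())); // prints 30
    }
}
```

Neither data class had to change or implement a shared interface; only the inspector passed alongside the data differs.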
Delegation Method List
▪   General methods:
    ▪   isNull(object o)
    ▪   hashCode(object o)
    ▪   compare(object o)
    ▪   clone(object o)
▪   Primitive Objects:
    ▪   primitive getValue(object o)
▪   String Objects:
    ▪   String getString(object o)
    ▪   Text getText(object o)
▪   List Objects:
    ▪   getListSize(object o)
    ▪   getListElement(object o)
    ▪   getList(object o)
▪   Map Objects:
    ▪   getMapSize(object o)
    ▪   getValueForKey(object o)
    ▪   getMap(object o)
▪   Struct Objects:
    ▪   getStructField(object o)
    ▪   getStructAsAList(object o)
SerDe
Where is SerDe?
[Figure: the layers that data passes through between files on HDFS and the Hive Operators in the Mapper and Reducer.]
▪   Hive Operator: works on Hierarchical Objects through an ObjectInspector
▪   Hierarchical Object kinds:
    ▪   Java Object: object of a Java class
    ▪   Standard Object: use ArrayList for struct and array, use HashMap for map
    ▪   LazyObject: lazily deserialized
▪   SerDe: converts between Hierarchical Objects and Writables
    ▪   BytesWritable(\x3F\x64\x72\x00)
    ▪   Text('imp 1.0 3 54') // UTF8 encoded
▪   FileFormat / Hadoop Serialization: converts between Writables and a File on HDFS, a Stream (User Script) or the Map Output File
    ▪   Stream of thrift_record<…>
    ▪   Text rows: imp 1.0 3 54 / Imp 0.2 1 33 / clk 2.2 8 212 / Imp 0.7 2 22
SerDe, ObjectInspector and TypeInfo
[Figure: how a SerDe, its ObjectInspectors and the Type Info relate for one example row.]
▪   Java classes: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } and class ClassC { Integer a; Integer b; }
▪   Type Info: struct< map<string,string>, int, list<struct<int,int>>, string >
▪   Writable: Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00)
▪   SerDe: deserialize and serialize convert between the Writable and the Hierarchical Object; getOI returns ObjectInspector1
▪   Hierarchical Object: List(HashMap("a", "av", "b", "bv"), 23, List(List(1,null), List(2,4), List(5,null)), "abcd")
▪   ObjectInspector1: getStructField returns the Hierarchical Object HashMap("a", "av", "b", "bv"); getFieldOI returns ObjectInspector2
▪   ObjectInspector2: getMapValue returns the String Object "av"; getMapValueOI returns ObjectInspector3
▪   getType on each ObjectInspector returns the matching node of the Type Info
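The navigation in the figure (getStructField on the row, then getMapValue on the returned map) can be sketched as follows. This is only an illustration under assumptions: the method names are taken from the figure, the classes PrimitiveOI, MapOI and StructOI are hypothetical, the row uses the standard-object form (List for struct, HashMap for map), and the list field c is left out for brevity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: method names follow the figure, not Hive's real API.
class InspectorNavigationSketch {

    interface ObjectInspector {
        String getType();
    }

    static final class PrimitiveOI implements ObjectInspector {
        private final String name;
        PrimitiveOI(String name) { this.name = name; }
        @Override public String getType() { return name; }
    }

    /** Inspects a map kept as a java.util.Map. */
    static final class MapOI implements ObjectInspector {
        private final ObjectInspector keyOI;
        private final ObjectInspector valueOI;
        MapOI(ObjectInspector keyOI, ObjectInspector valueOI) {
            this.keyOI = keyOI;
            this.valueOI = valueOI;
        }
        Object getMapValue(Object map, Object key) { return ((Map<?, ?>) map).get(key); }
        ObjectInspector getMapValueOI() { return valueOI; }
        @Override public String getType() {
            return "map<" + keyOI.getType() + "," + valueOI.getType() + ">";
        }
    }

    /** Inspects a struct kept as a java.util.List with one element per field. */
    static final class StructOI implements ObjectInspector {
        private final List<String> names;
        private final List<ObjectInspector> fieldOIs;
        StructOI(List<String> names, List<ObjectInspector> fieldOIs) {
            this.names = names;
            this.fieldOIs = fieldOIs;
        }
        Object getStructField(Object struct, int i) { return ((List<?>) struct).get(i); }
        ObjectInspector getFieldOI(int i) { return fieldOIs.get(i); }
        @Override public String getType() {
            StringBuilder sb = new StringBuilder("struct<");
            for (int i = 0; i < names.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(names.get(i)).append(':').append(fieldOIs.get(i).getType());
            }
            return sb.append('>').toString();
        }
    }

    /** Fields a, b and d of the example row; the list field c is omitted. */
    static Object sampleRow() {
        Map<String, String> a = new HashMap<>();
        a.put("a", "av");
        a.put("b", "bv");
        return Arrays.<Object>asList(a, 23, "abcd");
    }

    static StructOI sampleRowOI() {
        ObjectInspector string = new PrimitiveOI("string");
        return new StructOI(
                Arrays.asList("a", "b", "d"),
                Arrays.<ObjectInspector>asList(new MapOI(string, string), new PrimitiveOI("int"), string));
    }

    public static void main(String[] args) {
        Object row = sampleRow();
        StructOI rowOI = sampleRowOI();
        Object field = rowOI.getStructField(row, 0);     // the map object
        MapOI fieldOI = (MapOI) rowOI.getFieldOI(0);      // its inspector
        Object value = fieldOI.getMapValue(field, "a");  // the string object
        System.out.println(rowOI.getType());
        System.out.println(value + " is a " + fieldOI.getMapValueOI().getType());
    }
}
```

At every level the caller holds a plain Object plus the inspector that explains it, which is why the same operator code could run over Java objects, standard objects or lazy objects.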
LazySimpleSerDe components
[Figure: the LazyObject tree and the matching ObjectInspector tree for byte[] data = byte[]('a=av:b=bv 23 1:2=4:5 abcd').]
▪   Hierarchical Object / LazyObject (one per SerDe instance):
    ▪   LazyStruct (the row) holds LazyMap, LazyInteger, LazyArray, LazyString
    ▪   LazyMap holds LazyString keys and LazyString values
    ▪   LazyArray holds LazyStruct elements, each holding LazyInteger fields
▪   LazyObjectInspector (singleton):
    ▪   LazyStructOI(" ") for the row, with LazyMapOI(":", "="), LazyArrayOI(":") and LazyStringOI below it
    ▪   LazyStringOI for the map keys and values
    ▪   LazyStructOI("=") for the array elements, with LazyIntegerOI below it
    ▪   StandardIntegerOI
LazyPrimitive
▪   LazyString/LazyInteger
    ▪   setAll(byte[] data, int start, int length)
        ▪   LazyString: parse the data and create a String object
        ▪   LazyInteger: parse the data and create an Integer object
    ▪   getObject(): returns the corresponding String/Integer object
▪   Future
    ▪   Replace String/Integer with Text/IntWritable
    ▪   The Text/IntWritable object is owned by the LazyString/LazyInteger object.
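A LazyInteger-style setAll can read the digits straight from the byte range, without building a String or running a UTF-8 decoder. The class below is a minimal sketch under assumptions, not the Hive class: the name LazyIntegerSketch, the static parse helper and the choice to return null for malformed or out-of-range input are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the actual Hive LazyInteger class.
class LazyIntegerSketch {
    private Integer value; // null if the bytes are not a valid int (an assumption of this sketch)

    /** Parses data[start, start + length) and keeps the resulting Integer. */
    void setAll(byte[] data, int start, int length) {
        value = parse(data, start, length);
    }

    Integer getObject() {
        return value;
    }

    /** Reads ASCII digits directly from the bytes; no String, no charset decoding. */
    static Integer parse(byte[] data, int start, int length) {
        if (length <= 0) return null;
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') i++;
        if (i == end) return null; // a sign with no digits
        long result = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) return null;
            result = result * 10 + digit;
            if (result > (long) Integer.MAX_VALUE + 1) return null; // overflow
        }
        if (negative) result = -result;
        if (result > Integer.MAX_VALUE) return null;
        return (int) result;
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.US_ASCII);
        LazyIntegerSketch lazy = new LazyIntegerSketch();
        lazy.setAll(row, 10, 2);              // the bytes '2', '3'
        System.out.println(lazy.getObject()); // prints 23
    }
}
```

The same object can be pointed at a new byte range with another setAll call, so one instance serves every row.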
LazyNonPrimitive
▪   LazyStruct/LazyArray/LazyMap
    ▪   setAll(byte[] data, int start, int length)
        ▪   Remember data, start and length, and set parsed to false.
    ▪   getStructField/getArrayElement/getMapValue
        ▪   If not parsed yet, parse the bytes and remember the starting positions of each field/element/key/value
        ▪   For Struct/Array, do setAll on the corresponding LazyObject and return it
        ▪   For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
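The struct case of the steps above can be sketched like this. It is a simplified illustration rather than the Hive implementation: every field is treated as a string, the nested LazyString class, the parseCalls counter and the single-byte separator are assumptions of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not the actual Hive LazyStruct class.
class LazyStructSketch {

    /** Minimal stand-in for LazyString: creates its String in setAll. */
    static final class LazyString {
        private String value;
        void setAll(byte[] data, int start, int length) {
            value = new String(data, start, length, StandardCharsets.UTF_8);
        }
        String getObject() { return value; }
    }

    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    private int fieldCount;
    private int[] fieldStart = new int[0];           // fieldStart[i] = first byte of field i
    private boolean[] fieldReady = new boolean[0];   // has setAll run for field i in this row?
    private LazyString[] fields = new LazyString[0]; // child objects, reused across rows
    int parseCalls;                                  // exposed only to show the laziness

    LazyStructSketch(char separator) { this.separator = (byte) separator; }

    /** Only remembers the byte range; no scanning happens here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Scans the row and records where each field starts. */
    private void parse() {
        int end = start + length;
        int count = 1;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) count++;
        }
        if (fields.length < count) {
            fields = Arrays.copyOf(fields, count); // keeps the LazyStrings already allocated
        }
        fieldStart = new int[count + 1];
        fieldReady = new boolean[count];
        int f = 0;
        fieldStart[f++] = start;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) fieldStart[f++] = i + 1;
        }
        fieldStart[count] = end + 1; // sentinel: field i ends at fieldStart[i + 1] - 1
        fieldCount = count;
        parsed = true;
        parseCalls++;
    }

    /** Parses on first access, then does setAll on the child and returns it. */
    LazyString getStructField(int i) {
        if (!parsed) parse();
        if (i < 0 || i >= fieldCount) return null;
        if (!fieldReady[i]) {
            if (fields[i] == null) fields[i] = new LazyString();
            fields[i].setAll(data, fieldStart[i], fieldStart[i + 1] - 1 - fieldStart[i]);
            fieldReady[i] = true;
        }
        return fields[i];
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.UTF_8);
        LazyStructSketch struct = new LazyStructSketch(' ');
        struct.setAll(row, 0, row.length);
        System.out.println(struct.parseCalls);                    // prints 0
        System.out.println(struct.getStructField(3).getObject()); // prints abcd
        System.out.println(struct.getStructField(1).getObject()); // prints 23
        System.out.println(struct.parseCalls);                    // prints 1
    }
}
```

Fields that are never requested are never turned into objects, and the child objects survive from one row to the next.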
Why another SerDe?
▪   Functionality:
    ▪   MetadataTypedColumnSetSerDe can only deal with String columns
    ▪   DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
▪   Efficiency:
    ▪   Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
Features of LazySimpleSerDe
▪   Functionality:
    ▪   Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
    ▪   Fully supports all nested types (Map keys must be primitive)
▪   Efficiency:
    ▪   Fully supports lazy deserialization: only deserializes a field (and creates Objects) when asked.
    ▪   Reuses multiple levels of LazyObjects.
    ▪   Reads numbers without UTF-8 decoding
    ▪   (TODO) Fully reuse objects: IntWritable for Integer, Text for String
    ▪   (TODO) Write numbers without UTF-8 encoding
Profiling result of a mapper
▪   17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪   22%: Operator.close
▪   |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪   |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪   50%: Operator.forward
▪   |-18%: Text.decode (from LazySerDe)
▪   | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪   | |- 5%: toString() (where we create the string object)
▪   |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪   |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪   |- 8%: GroupByOperator.processHashAggr
▪   |- 3%: HashMap.get() in GroupByOperator




▪   * Performance Data from Rodrigo Schmidt
TypeInfo String specification
▪   Why not Thrift?
    ▪   Hard to parse
▪   Simple Syntax
    ▪   Type: PrimitiveType | MapType | ArrayType | StructType
    ▪   PrimitiveType: int | bigint | tinyint | smallint | double | string
    ▪   MapType: map<Type, Type>
    ▪   ArrayType: array<Type>
    ▪   StructType: struct< [Name : Type]+ >
▪   Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
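The grammar above is small enough for a recursive-descent parser. The sketch below is an illustration only, not Hive's own TypeInfo code: the class and method names are hypothetical, and it assumes struct fields are separated by commas, as in the example string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a recursive-descent parser; not Hive's own TypeInfo code.
class TypeInfoParserSketch {

    static final List<String> PRIMITIVES =
            Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");

    /** A parsed type: a category, its child types, and field names for structs. */
    static final class TypeInfo {
        final String category; // a primitive name, "map", "array" or "struct"
        final List<String> fieldNames = new ArrayList<>();
        final List<TypeInfo> children = new ArrayList<>();
        TypeInfo(String category) { this.category = category; }

        @Override public String toString() {
            if (children.isEmpty()) return category;
            StringBuilder sb = new StringBuilder(category).append('<');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                if (category.equals("struct")) sb.append(fieldNames.get(i)).append(':');
                sb.append(children.get(i));
            }
            return sb.append('>').toString();
        }
    }

    private final String input;
    private int pos;

    private TypeInfoParserSketch(String input) { this.input = input; }

    static TypeInfo parse(String typeString) {
        TypeInfoParserSketch p = new TypeInfoParserSketch(typeString.replace(" ", ""));
        TypeInfo type = p.parseType();
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return type;
    }

    /** Type: PrimitiveType | MapType | ArrayType | StructType */
    private TypeInfo parseType() {
        String word = readWord();
        TypeInfo type = new TypeInfo(word);
        if (word.equals("map")) {
            expect('<');
            type.children.add(parseType());
            expect(',');
            type.children.add(parseType());
            expect('>');
        } else if (word.equals("array")) {
            expect('<');
            type.children.add(parseType());
            expect('>');
        } else if (word.equals("struct")) {
            expect('<');
            do {
                type.fieldNames.add(readWord());
                expect(':');
                type.children.add(parseType());
            } while (consumeIf(','));
            expect('>');
        } else if (!PRIMITIVES.contains(word)) {
            throw new IllegalArgumentException("Unknown type: " + word);
        }
        return type;
    }

    private String readWord() {
        int begin = pos;
        while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) pos++;
        if (begin == pos) throw new IllegalArgumentException("Expected a name at position " + pos);
        return input.substring(begin, pos);
    }

    private boolean consumeIf(char c) {
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!consumeIf(c)) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
    }

    public static void main(String[] args) {
        TypeInfo t = parse("array<map<string,struct<a:int,b:array<string>,c:double>>>");
        System.out.println(t.category + " of " + t.children.get(0).category); // prints array of map
        System.out.println(t); // prints the type string back
    }
}
```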
Future Works
Future Works of ObjectInspector
▪   Delegate all methods described earlier
    ▪   isNull(), hashCode(), compare(), etc. are not delegated yet
▪   Support UNION data type: HIVE-537
Future Works of SerDe
▪   LazyBinarySerDe: HIVE-553
    ▪   A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order
    ▪   A binary-format compact SerDe: saving space
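To illustrate what "serialized sorting order is the same as deserialized sorting order" means, here is one common way to get that property for a 32-bit int: write it big-endian with the sign bit flipped, so that comparing the raw bytes as unsigned values gives the same result as comparing the ints. This is a generic technique shown as a sketch; the slides do not say that HIVE-553 uses this encoding, and all names below are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of an order-preserving int encoding; not taken from Hive.
class SortableIntEncodingSketch {

    /** Encodes an int as 4 big-endian bytes with the sign bit flipped. */
    static byte[] encode(int v) {
        int u = v ^ 0x80000000; // negatives now sort below non-negatives as unsigned
        return new byte[] {(byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u};
    }

    static int decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16) | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return u ^ 0x80000000;
    }

    /** Compares two byte arrays byte by byte, treating each byte as unsigned. */
    static int compareBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (c != 0) return c;
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        int[] values = {3, -1, 42, -5, 0};
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encode(values[i]);
        }
        // Sort the serialized keys without deserializing them.
        Arrays.sort(keys, SortableIntEncodingSketch::compareBytes);
        int[] sorted = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = decode(keys[i]);
        }
        System.out.println(Arrays.toString(sorted)); // prints [-5, -1, 0, 3, 42]
    }
}
```

With such an encoding, keys can be ordered by a raw byte comparison without deserializing them first.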

Mais conteúdo relacionado

Mais procurados

Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startupsbmlever
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingToni Cebrián
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapperpsoo1978
 
OQGraph at MySQL Users Conference 2011
OQGraph at MySQL Users Conference 2011OQGraph at MySQL Users Conference 2011
OQGraph at MySQL Users Conference 2011Antony T Curtis
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템HyeonSeok Choi
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)Qiangning Hong
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock AnalysisVaibhav Jain
 
2014 holden - databricks umd scala crash course
2014   holden - databricks umd scala crash course2014   holden - databricks umd scala crash course
2014 holden - databricks umd scala crash courseHolden Karau
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­TimeSeven Nguyen
 

Mais procurados (19)

Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 
Scoobi - Scala for Startups
Scoobi - Scala for StartupsScoobi - Scala for Startups
Scoobi - Scala for Startups
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
Pig
PigPig
Pig
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Python Objects
Python ObjectsPython Objects
Python Objects
 
Writing A Foreign Data Wrapper
Writing A Foreign Data WrapperWriting A Foreign Data Wrapper
Writing A Foreign Data Wrapper
 
OQGraph at MySQL Users Conference 2011
OQGraph at MySQL Users Conference 2011OQGraph at MySQL Users Conference 2011
OQGraph at MySQL Users Conference 2011
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Python高级编程(二)
Python高级编程(二)Python高级编程(二)
Python高级编程(二)
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 
2014 holden - databricks umd scala crash course
2014   holden - databricks umd scala crash course2014   holden - databricks umd scala crash course
2014 holden - databricks umd scala crash course
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­Time
 

Semelhante a Efficient Java Object Model in Hive

Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formatsVigen Sahakyan
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemCloudera, Inc.
 
Avro Data | Washington DC HUG
Avro Data | Washington DC HUGAvro Data | Washington DC HUG
Avro Data | Washington DC HUGCloudera, Inc.
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it bettergvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?gvernik
 
The InfoGrid Graph DataBase
The InfoGrid Graph DataBaseThe InfoGrid Graph DataBase
The InfoGrid Graph DataBaseInfoGrid.org
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInVitaly Gordon
 
Oop c++class(final).ppt
Oop c++class(final).pptOop c++class(final).ppt
Oop c++class(final).pptAlok Kumar
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
The design, architecture, and tradeoffs of FluidDB
The design, architecture, and tradeoffs of FluidDBThe design, architecture, and tradeoffs of FluidDB
The design, architecture, and tradeoffs of FluidDBTerry Jones
 
JVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentationJVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentationAzul Systems, Inc.
 
Session 14 - Object Class
Session 14 - Object ClassSession 14 - Object Class
Session 14 - Object ClassPawanMM
 

Semelhante a Efficient Java Object Model in Hive (20)

Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop Ecosystem
 
Avro Data | Washington DC HUG
Avro Data | Washington DC HUGAvro Data | Washington DC HUG
Avro Data | Washington DC HUG
 
Unit 3
Unit 3Unit 3
Unit 3
 
Ruby1_full
Ruby1_fullRuby1_full
Ruby1_full
 
Ruby1_full
Ruby1_fullRuby1_full
Ruby1_full
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
The InfoGrid Graph DataBase
The InfoGrid Graph DataBaseThe InfoGrid Graph DataBase
The InfoGrid Graph DataBase
 
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
 
Oop c++class(final).ppt
Oop c++class(final).pptOop c++class(final).ppt
Oop c++class(final).ppt
 
מיכאל
מיכאלמיכאל
מיכאל
 
Jena Programming
Jena ProgrammingJena Programming
Jena Programming
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
core java
core javacore java
core java
 
The design, architecture, and tradeoffs of FluidDB
The design, architecture, and tradeoffs of FluidDBThe design, architecture, and tradeoffs of FluidDB
The design, architecture, and tradeoffs of FluidDB
 
JVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentationJVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentation
 
RaleighFS v5
RaleighFS v5RaleighFS v5
RaleighFS v5
 
Session 14 - Object Class
Session 14 - Object ClassSession 14 - Object Class
Session 14 - Object Class
 
Python redis talk
Python redis talkPython redis talk
Python redis talk
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Efficient Java Object Model in Hive

  • 1. Efficient Object Model in Java Slides by Zheng Shao, Facebook Part of Apache Hadoop Hive Project
  • 3. On-disk Data Format ▪ Single on-disk format systems ▪ Simplicity ▪ Multiple on-disk format systems ▪ Ease-of-use ▪ Ease-of-integration ▪ Flexibility: better trade-off between space, performance, etc. ▪ Hive allows multiple on-disk formats
  • 4. Example Multiple On-disk Formats ▪ File format: ▪ Row-based ▪ Column-based ▪ Block-based ▪ Row format: ▪ Text-based ▪ Binary-based ▪ Customized ▪ Index format
  • 5. In-memory Data Format ▪ Single in-memory format systems ▪ Simplicity: simpler code ▪ Multiple in-memory format systems ▪ Ease-of-integration: other systems may use their own format ▪ Performance: ▪ Multiple on-disk format/external format + efficient loading → multiple in-memory format ▪ Hive allows multiple in-memory formats
  • 6. Example Multiple In-memory Formats ▪ Integer: ▪ Integer ▪ IntWritable ▪ LazyInteger ▪ String: ▪ String ▪ Text
  • 7. Multiple In-memory Format Design Patterns ▪ Object-oriented: ▪ A single interface/base class for Integer ▪ Multiple derived classes ▪ Delegation: ▪ data stored in object ▪ format/operations stored in objectInspector ▪ a pair of object and objectInspector represents a data unit ▪ It's possible to wrap either one up to conform to the other's pattern.
  • 8. Multiple In-memory Format Design Patterns ▪ In OO, we need an interface HiveInteger to represent Integers ▪ Make Integer, IntWritable classes all implement it. ▪ However, Integer class is final (not extendable) and does not implement HiveInteger ▪ We need to do a conversion every time we exchange data with UDF, SerDe (Thrift), or other libraries (unless they know HiveInteger – this is a bad assumption to make in open systems). ▪ Delegation will be a better idea because ▪ For Integer, we have a JavaIntegerObjectInspector ▪ For IntWritable, we have a WritableIntegerObjectInspector ▪ We convert params and return values only if necessary
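The delegation argument on slide 8 can be sketched in a few lines of Java. This is a minimal illustration, not Hive's actual ObjectInspector API: `IntInspector`, `JavaIntegerInspector`, `MutableIntInspector` and `MutableInt` (a dependency-free stand-in for Hadoop's IntWritable) are hypothetical names. The point is that the same routine reads both in-memory formats without converting either one to a common class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's IntWritable, so the sketch has no dependencies.
final class MutableInt {
    int value;
    MutableInt(int value) { this.value = value; }
}

// Hypothetical, simplified inspector: the object holds the data,
// the inspector knows how to read it.
interface IntInspector {
    int get(Object o);
}

// Reads plain java.lang.Integer objects.
final class JavaIntegerInspector implements IntInspector {
    public int get(Object o) { return (Integer) o; }
}

// Reads the IntWritable-like stand-in.
final class MutableIntInspector implements IntInspector {
    public int get(Object o) { return ((MutableInt) o).value; }
}

final class DelegationSketch {
    // Works on any in-memory format as long as a matching inspector is supplied.
    static long sum(List<?> values, IntInspector inspector) {
        long total = 0;
        for (Object v : values) {
            total += inspector.get(v);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object> boxed = new ArrayList<>();
        boxed.add(1);
        boxed.add(2);
        boxed.add(3);
        List<Object> writable = new ArrayList<>();
        writable.add(new MutableInt(4));
        writable.add(new MutableInt(5));
        System.out.println(sum(boxed, new JavaIntegerInspector()));
        System.out.println(sum(writable, new MutableIntInspector()));
    }
}
```

Because `Integer` is final, the object-oriented alternative would need a wrapper or a copy per value; here the only per-format code is the small inspector class.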
  • 9. Delegation Method List ▪ General methods: ▪ isNull(object o) ▪ hashCode(object o) ▪ compare(object o) ▪ clone(object o) ▪ Primitive Objects: ▪ primitive getValue(object o) ▪ String Objects: ▪ String getString(object o) ▪ Text getText(object o) ▪ List Objects: ▪ getListSize(object o) ▪ getListElement(object o) ▪ getList(object o) ▪ Map Objects: ▪ getMapSize(object o) ▪ getValueForKey(object o) ▪ getMap(object o) ▪ Struct Objects: ▪ getStructField(object o) ▪ getStructAsAList(object o)
  • 10. SerDe
  • 11. Where is SerDe? [Diagram] ▪ Hive operators in the mapper and reducer access hierarchical objects through ObjectInspector ▪ Hierarchical object variants: Standard Object (use ArrayList for struct and HashMap for map), LazyObject (lazily-deserialized), Java Object (object of a Java class) ▪ SerDe sits between hierarchical objects and Writables, e.g. Text('imp 1.0 3 54') // UTF-8 encoded, or BytesWritable(\x3F\x64\x72\x00) ▪ FileFormat / Hadoop Serialization sits between Writables and streams: File on HDFS, Map Output File, User Script ▪ Sample rows: imp 1.0 3 54; imp 0.2 1 33; clk 2.2 8 212; imp 0.7 2 22; thrift_record<…>
  • 12. SerDe, ObjectInspector and TypeInfo [Diagram] ▪ Java classes: class HO { HashMap<String, String> a, Integer b, List<ClassC> c, String d; } Class ClassC { Integer a, Integer b; } ▪ Writable: Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00) ▪ SerDe: deserialize, serialize, getOI ▪ Hierarchical Object (Struct): HashMap("a" → "av", "b" → "bv"), 23, List(List(1,null),List(2,4),List(5,null)), "abcd" ▪ ObjectInspector1: getType, getStructField, getFieldOI ▪ ObjectInspector2 (on the HashMap<String, String> field): getType, getMapValue, getMapValueOI ▪ ObjectInspector3 (on the String value "av"): getType ▪ TypeInfo: struct, map, list, int, string
  • 13. LazySimpleSerDe components [Diagram] ▪ byte[] data: 'a=av:b=bv 23 1:2=4:5 abcd' ▪ Hierarchical Object / LazyObject: a LazyStruct containing a LazyMap, a LazyInteger, a LazyArray and a LazyString, with nested LazyString, LazyInteger and LazyStruct objects ▪ LazyObjectInspector: LazyStructOI(" "), LazyMapOI(":", "="), LazyArrayOI(":"), LazyStructOI("="), LazyStringOI, LazyIntegerOI, StandardIntegerOI ▪ Labels: One per SerDe instance; Singleton
  • 14. LazyPrimitive ▪ LazyString/LazyInteger ▪ setAll(byte[] data, int start, int length) ▪ LazyString: parse the data and create a String object ▪ LazyInteger: parse the data and create an Integer object ▪ getObject() – returns the corresponding String/Integer object ▪ Future ▪ Replace String/Integer with Text/IntWritable ▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
  • 15. LazyNonPrimitive ▪ LazyStruct/LazyArray/LazyMap ▪ setAll(byte[] data, int start, int length) ▪ Remember data, start and length, and set parsed to false. ▪ getStructField/getArrayElement/getMapValue ▪ If not parsed yet, parse the bytes and remember the starting positions of each field/element/key/value ▪ For Struct/Array, do setAll on the corresponding LazyObject and return it ▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
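The lazy-parsing behaviour on slide 15 might look roughly like the following. This is a simplified sketch under stated assumptions, not the LazySimpleSerDe source: the class name `LazyStructSketch`, the single-byte separator, and returning a `String` from `getField` (instead of calling setAll on a child LazyObject) are all choices made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lazy struct: fields are separated by a single byte.
final class LazyStructSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    // fieldStarts.get(i) is the offset of field i; the last entry is a sentinel.
    private final List<Integer> fieldStarts = new ArrayList<>();

    LazyStructSketch(byte separator) { this.separator = separator; }

    // Only remembers the byte range; no parsing and no object creation here.
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    boolean isParsed() { return parsed; }

    // Scans the bytes once and records where each field starts.
    private void parse() {
        fieldStarts.clear();
        fieldStarts.add(start);
        int end = start + length;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) {
                fieldStarts.add(i + 1);
            }
        }
        fieldStarts.add(end + 1); // sentinel, as if a separator sat at 'end'
        parsed = true;
    }

    // Returns field i decoded as UTF-8, or null if the row has fewer fields.
    String getField(int i) {
        if (!parsed) {
            parse();
        }
        if (i < 0 || i >= fieldStarts.size() - 1) {
            return null;
        }
        int from = fieldStarts.get(i);
        int to = fieldStarts.get(i + 1) - 1; // exclude the separator
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }
}
```

Calling `setAll` again on the same instance resets `parsed`, so one object can be reused for every row, and only the fields a query actually asks for are decoded.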
  • 16. Why another SerDe? ▪ Functionality: ▪ MetadataTypedColumnSetSerDe can only deal with String columns ▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet. ▪ Efficiency: ▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
  • 17. Features of LazySimpleSerDe ▪ Functionality: ▪ Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated ▪ Fully supports all nested types (Map key must be primitive) ▪ Efficiency: ▪ Fully supports lazy deserialization - only deserialize the field (and create Objects) when asked. ▪ Reuse multiple levels of LazyObjects. ▪ Read numbers without UTF-8 decoding ▪ (TODO) Fully reuse objects - IntWritable for Integer, Text for String ▪ (TODO) Write numbers without UTF-8 encoding
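"Read numbers without UTF-8 decoding" on slide 17 presumably means turning ASCII digit bytes into an int directly, without first building a String. A minimal sketch of that idea follows; `ByteIntParserSketch` and its error handling are assumptions for illustration, not the code Hive uses.

```java
// Hypothetical helper: parses a decimal int straight from ASCII bytes.
final class ByteIntParserSketch {
    static int parseInt(byte[] data, int start, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty input");
        }
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
            if (i == end) {
                throw new NumberFormatException("sign without digits");
            }
        }
        // Accumulate in a long so the overflow check stays simple.
        long value = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("overflow");
            }
        }
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("overflow");
        }
        return (int) value;
    }
}
```

For the sample row `a=av:b=bv 23 1:2=4:5 abcd`, the integer field occupies two bytes starting at offset 5, so `parseInt(row, 5, 2)` reads it in place with no intermediate String.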
  • 18. Profiling result of a mapper ▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression) ▪ 22%: Operator.close ▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding) ▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat) ▪ 50%: Operator.forward ▪ |-18%: Text.decode (from LazySerDe) ▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding) ▪ | |- 5%: toString() (where we create the string object) ▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row) ▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData) ▪ |- 8%: GroupByOperator.processHashAggr ▪ |- 3%: HashMap.get() in GroupByOperator ▪ * Performance data from Rodrigo Schmidt
  • 19. TypeInfo String specification ▪ Why not Thrift? ▪ Hard to parse ▪ Simple Syntax ▪ Type: PrimitiveType | MapType | ArrayType | StructType ▪ PrimitiveType: int | bigint | tinyint | smallint | double | string ▪ MapType: map<Type, Type> ▪ ArrayType: array<Type> ▪ StructType: struct< [Name : Type]+ > ▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
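The grammar on slide 19 is small enough for a recursive-descent parser. The sketch below is one possible reading of it, not Hive's TypeInfo parser: `TypeNode` and `TypeStringParserSketch` are hypothetical names, struct fields are assumed to be comma-separated as in the slide's example, and whitespace is not accepted, matching the compact form of that example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parse-tree node; not a Hive class.
final class TypeNode {
    final String kind;                                  // primitive name, "array", "map" or "struct"
    final List<String> fieldNames = new ArrayList<>();  // filled for structs only
    final List<TypeNode> children = new ArrayList<>();  // element, key/value, or field types
    TypeNode(String kind) { this.kind = kind; }
}

final class TypeStringParserSketch {
    private static final Set<String> PRIMITIVES = new HashSet<>(
            Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string"));

    private final String s;
    private int pos = 0;

    private TypeStringParserSketch(String s) { this.s = s; }

    // Parses a whole type string; throws IllegalArgumentException if malformed.
    static TypeNode parse(String typeString) {
        TypeStringParserSketch p = new TypeStringParserSketch(typeString);
        TypeNode root = p.type();
        if (p.pos != typeString.length()) {
            throw p.error("unexpected trailing input");
        }
        return root;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNode type() {
        String name = word();
        TypeNode node = new TypeNode(name);
        if (PRIMITIVES.contains(name)) {
            return node;
        }
        if (name.equals("array")) {
            expect('<');
            node.children.add(type());
        } else if (name.equals("map")) {
            expect('<');
            node.children.add(type());
            expect(',');
            node.children.add(type());
        } else if (name.equals("struct")) {
            expect('<');
            do { // one or more name:Type pairs
                String field = word();
                if (field.isEmpty()) {
                    throw error("expected a field name");
                }
                node.fieldNames.add(field);
                expect(':');
                node.children.add(type());
            } while (consumeIf(','));
        } else {
            throw error("unknown type '" + name + "'");
        }
        expect('>');
        return node;
    }

    // Reads a run of letters, digits and underscores (possibly empty).
    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) {
            pos++;
        }
        return s.substring(begin, pos);
    }

    private void expect(char c) {
        if (!consumeIf(c)) {
            throw error("expected '" + c + "'");
        }
    }

    private boolean consumeIf(char c) {
        if (pos < s.length() && s.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos + " in \"" + s + "\"");
    }
}
```

Parsing the slide's example yields an `array` node whose single child is a `map` from `string` to a `struct` with fields `a`, `b` and `c`; a misspelled primitive such as `doube` is rejected as an unknown type.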
  • 21. Future Works of ObjectInspector ▪ Delegate all methods described earlier ▪ isNull(), hashCode(), compare() etc are not delegated yet ▪ Support UNION data type: HIVE-537
  • 22. Future Works of SerDe ▪ LazyBinarySerDe: HIVE-553 ▪ A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order ▪ A binary-format compact SerDe: saving space
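The "sortable" property on slide 22 means that comparing serialized bytes gives the same order as comparing the original values. One common way to get this for a 32-bit int is big-endian bytes with the sign bit flipped. The sketch below illustrates that technique only; `SortableIntSketch` is a hypothetical name and the slides do not say which encoding the planned SerDe would use.

```java
// Hypothetical illustration of an order-preserving binary encoding for int.
final class SortableIntSketch {
    // Big-endian bytes with the sign bit flipped, so negative values sort first
    // when the bytes are compared as unsigned.
    static byte[] encode(int v) {
        int u = v ^ 0x80000000;
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    // Inverse of encode.
    static int decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return u ^ 0x80000000;
    }

    // Lexicographic comparison that treats each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With this encoding, `compareBytes(encode(x), encode(y))` has the same sign as `Integer.compare(x, y)`, so rows can be sorted on their serialized form without deserializing the key.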