This document discusses the efficient object model in Java used by Apache Hive. It covers object inspectors, on-disk and in-memory data formats, delegation patterns for multiple in-memory formats, the role of SerDes, and optimizations for the LazySimpleSerDe. Future work areas include fully delegating object inspector methods and supporting union data types.
3. On-disk Data Format
▪ Single on-disk format systems
▪ Simplicity
▪ Multiple on-disk format systems
▪ Ease-of-use
▪ Ease-of-integration
▪ Flexibility: better trade-off between space, performance, etc.
▪ Hive allows multiple on-disk formats
4. Example Multiple On-disk Formats
▪ File Format:
▪ Row-based
▪ Column-based
▪ Block-based
▪ Row format:
▪ Text-based
▪ Binary-based
▪ Customized
▪ Index format
5. In-memory Data Format
▪ Single in-memory format systems
▪ Simplicity: Simpler code
▪ Multiple in-memory format systems
▪ Ease-of-integration: other systems may use their own formats
▪ Performance:
▪ Multiple on-disk formats/external formats + efficient loading => Multiple in-memory formats
▪ Hive allows multiple in-memory formats
6. Example Multiple In-memory Formats
▪ Integer:
▪ Integer
▪ IntWritable
▪ LazyInteger
▪ String:
▪ String
▪ Text
7. Multiple In-memory Format Design Patterns
▪ Object-oriented:
▪ A single interface/base class for Integer
▪ Multiple derived classes
▪ Delegation:
▪ data stored in object
▪ format/operations stored in objectInspector
▪ a pair of object and objectInspector represents a data unit
▪ It's possible to wrap either one up to conform to the other's pattern.
8. Multiple In-memory Format Design Patterns
▪ In OO, we need an interface HiveInteger to represent Integers
▪ Make the Integer and IntWritable classes all implement it.
▪ However, the Integer class is final (not extendable) and does not implement HiveInteger
▪ We need to do a conversion every time we exchange data with UDF, SerDe (Thrift), or other libraries (unless they know HiveInteger; this is a bad assumption to make in open systems).
▪ Delegation will be a better idea because
▪ For Integer, we have a JavaIntegerObjectInspector
▪ For IntWritable, we have a WritableIntegerObjectInspector
▪ We convert params and return values only if necessary
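The delegation idea above can be sketched in a few lines of Java. This is only an illustration, not Hive's actual API: the `IntInspector` interface, the two inspector classes, and the `MutableInt` holder (a stand-in for Hadoop's IntWritable, so that the example needs no external jars) are all hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hadoop's IntWritable: a mutable, reusable int holder.
final class MutableInt {
    private int value;
    MutableInt(int value) { this.value = value; }
    int get() { return value; }
    void set(int value) { this.value = value; }
}

// Hypothetical inspector interface: the format knowledge lives here, not in the data object.
interface IntInspector {
    int get(Object data);
}

// Knows how to read a plain java.lang.Integer.
final class JavaIntegerInspector implements IntInspector {
    @Override public int get(Object data) { return (Integer) data; }
}

// Knows how to read the mutable holder.
final class MutableIntInspector implements IntInspector {
    @Override public int get(Object data) { return ((MutableInt) data).get(); }
}

final class DelegationDemo {
    // An "operator" that works on any in-memory format: it only sees (object, inspector) pairs.
    static long sum(List<Object> values, IntInspector inspector) {
        long total = 0;
        for (Object v : values) {
            total += inspector.get(v);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object> boxed = Arrays.<Object>asList(1, 2, 3);
        List<Object> holders = Arrays.<Object>asList(new MutableInt(10), new MutableInt(20));
        System.out.println(sum(boxed, new JavaIntegerInspector()));
        System.out.println(sum(holders, new MutableIntInspector()));
    }
}
```

Neither `Integer` nor the holder class has to implement a shared interface, and no value is converted on the way into `sum`; supporting a third format would only need a third inspector.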
11. Where is SerDe?
[Figure: where the SerDe sits in the data flow of a Hive job. Layers, top to bottom:]
▪ Hive Operators in the Mapper and Reducer work on Hierarchical Objects, accessed through an ObjectInspector
▪ Hierarchical Object variants: Standard Object (use ArrayList for struct and array, HashMap for map), Java Object (object of a Java class), LazyObject (lazily-deserialized)
▪ SerDe: sits between Hierarchical Objects and Writables, e.g. Text('imp 1.0 3 54') // UTF8 encoded, or BytesWritable(x3Fx64x72x00)
▪ FileFormat / Hadoop Serialization: sits between Writables and files or streams
▪ Storage and streams: File on HDFS (input), Map Output File, File on HDFS (output), and streams to and from a User Script
▪ Sample rows: thrift_record<…> records; text rows "imp 1.0 3 54", "Imp 0.2 1 33", "clk 2.2 8 212", "Imp 0.7 2 22"
12. SerDe, ObjectInspector and TypeInfo
[Figure: how SerDe, ObjectInspector and TypeInfo relate, using one example row.]
▪ Java classes: class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; } and class ClassC { Integer a; Integer b; }
▪ Writable (serialized): Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(x3Fx64x72x00)
▪ SerDe: deserialize turns the Writable into a Hierarchical Object, serialize does the reverse, and getOI returns the ObjectInspector
▪ Hierarchical Object (struct): List(HashMap("a" → "av", "b" → "bv"), 23, List(List(1,null),List(2,4),List(5,null)), "abcd")
▪ ObjectInspector1 (struct): getType, getStructField, getFieldOI
▪ ObjectInspector2 (map): getType, getMapValue, getMapValueOI; applied to HashMap("a" → "av", "b" → "bv")
▪ ObjectInspector3 (string): getType; applied to "av"
▪ TypeInfo: struct of (map of string to string, int, list of struct of (int, int), string)
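As a rough illustration of how inspectors navigate the example row from this slide, the sketch below builds the row as a standard object (a `List` for the struct, a `HashMap` for the map) and reads it only through inspectors. The classes `ListStructInspector` and `JavaMapInspector` are hypothetical simplifications; real inspectors would also return a child inspector for each field (getFieldOI, getMapValueOI).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical inspector for a struct stored as a java.util.List of field values.
final class ListStructInspector {
    private final List<String> fieldNames;
    ListStructInspector(List<String> fieldNames) { this.fieldNames = fieldNames; }

    // Look up a field by name; the data object itself carries no field names.
    Object getStructField(Object struct, String name) {
        int index = fieldNames.indexOf(name);
        if (index < 0) throw new IllegalArgumentException("no such field: " + name);
        return ((List<?>) struct).get(index);
    }
}

// Hypothetical inspector for a map stored as a java.util.Map.
final class JavaMapInspector {
    Object getMapValue(Object map, Object key) {
        return ((Map<?, ?>) map).get(key);
    }
}

final class HierarchicalObjectDemo {
    // Build the example row from the slide as a standard object.
    static Object exampleRow() {
        Map<String, String> a = new HashMap<>();
        a.put("a", "av");
        a.put("b", "bv");
        List<Object> c = Arrays.<Object>asList(
            Arrays.asList(1, null), Arrays.asList(2, 4), Arrays.asList(5, null));
        return Arrays.<Object>asList(a, 23, c, "abcd");
    }

    public static void main(String[] args) {
        Object row = exampleRow();
        ListStructInspector structOI = new ListStructInspector(Arrays.asList("a", "b", "c", "d"));
        JavaMapInspector mapOI = new JavaMapInspector();
        Object mapField = structOI.getStructField(row, "a");
        System.out.println(mapOI.getMapValue(mapField, "a"));
        System.out.println(structOI.getStructField(row, "d"));
    }
}
```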
14. LazyPrimitive
▪ LazyString/LazyInteger
▪ setAll(byte[] data, int start, int length)
▪ LazyString: parse the data and create a String object
▪ LazyInteger: parse the data and create an Integer object
▪ getObject() –returns the corresponding String/Integer object
▪ Future
▪ Replace String/Integer with Text/IntWritable
▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
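The setAll/getObject contract described above might look roughly like this. The class names `LazyIntegerSketch` and `LazyStringSketch` are hypothetical, and the sketch follows the current behavior on the slide (parse eagerly in setAll, create a new String/Integer each time) rather than the planned Text/IntWritable reuse. Error handling for malformed numbers is left out.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a lazy integer: one wrapper object is reused for many values.
final class LazyIntegerSketch {
    private Integer value;

    // Parse the given byte range and create the Integer object.
    void setAll(byte[] data, int start, int length) {
        String text = new String(data, start, length, StandardCharsets.UTF_8);
        value = Integer.valueOf(text);
    }

    // Return the Integer produced by the most recent setAll call.
    Integer getObject() { return value; }
}

// Hypothetical sketch of a lazy string with the same two-method contract.
final class LazyStringSketch {
    private String value;

    void setAll(byte[] data, int start, int length) {
        value = new String(data, start, length, StandardCharsets.UTF_8);
    }

    String getObject() { return value; }
}

final class LazyPrimitiveDemo {
    public static void main(String[] args) {
        byte[] row = "clk 212".getBytes(StandardCharsets.UTF_8);
        LazyStringSketch name = new LazyStringSketch();
        LazyIntegerSketch count = new LazyIntegerSketch();
        name.setAll(row, 0, 3);   // bytes of "clk"
        count.setAll(row, 4, 3);  // bytes of "212"
        System.out.println(name.getObject() + " -> " + count.getObject());
    }
}
```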
15. LazyNonPrimitive
▪ LazyStruct/LazyArray/LazyMap
▪ setAll(byte[] data, int start, int length)
▪ Remember data, start and length, and set parsed to false.
▪ getStructField/getArrayElement/getMapValue
▪ If not parsed yet, parse the bytes and remember the starting positions of each field/element/key/value
▪ For Struct/Array, do setAll on the corresponding LazyObject and return it
▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
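The struct case above can be sketched as follows, under simplifying assumptions: fields are separated by a single byte, every field is a string, and the class names (`LazyTextField`, `LazyStructSketch`) and the `isParsed` accessor are hypothetical, added only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical lazy primitive: decodes its byte range when setAll is called on it.
final class LazyTextField {
    private String value;
    void setAll(byte[] data, int start, int length) {
        value = new String(data, start, length, StandardCharsets.UTF_8);
    }
    String getObject() { return value; }
}

// Hypothetical lazy struct over a row whose fields are separated by a single byte.
final class LazyStructSketch {
    private final byte separator;
    private final LazyTextField[] fields;  // field objects are reused across rows
    private final int[] startPositions;    // first byte of each field, plus one sentinel entry
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    private int fieldCount;                // number of fields found in the current row

    LazyStructSketch(int maxFields, byte separator) {
        this.separator = separator;
        this.fields = new LazyTextField[maxFields];
        for (int i = 0; i < maxFields; i++) fields[i] = new LazyTextField();
        this.startPositions = new int[maxFields + 1];
    }

    // Remember the byte range only; no parsing happens here.
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    boolean isParsed() { return parsed; }

    // Scan for separators once and remember where each field starts.
    private void parse() {
        int end = start + length;
        int field = 0;
        startPositions[0] = start;
        for (int i = start; i < end && field < fields.length - 1; i++) {
            if (data[i] == separator) startPositions[++field] = i + 1;
        }
        fieldCount = field + 1;
        startPositions[fieldCount] = end + 1; // sentinel: field length = next start - this start - 1
        parsed = true;
    }

    // Parse on first access, then point the reused field object at its bytes.
    LazyTextField getStructField(int index) {
        if (!parsed) parse();
        if (index >= fieldCount) return null; // row has fewer fields than declared
        int fieldStart = startPositions[index];
        int fieldLength = startPositions[index + 1] - fieldStart - 1;
        fields[index].setAll(data, fieldStart, fieldLength);
        return fields[index];
    }
}
```

A row whose fields are never requested costs only the three assignments in setAll, and a field that is never requested is never decoded.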
16. Why another SerDe?
▪ Functionality:
▪ MetadataTypedColumnSetSerDe can only deal with String columns
▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
▪ Efficiency:
▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
17. Features of LazySimpleSerDe
▪ Functionality:
▪ Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
▪ Fully supports all nested types (Map keys must be primitive)
▪ Efficiency:
▪ Fully supports lazy deserialization: only deserializes a field (and creates Objects) when asked.
▪ Reuses multiple levels of LazyObjects.
▪ Reads numbers without UTF-8 decoding
▪ (TODO) Fully reuse objects: IntWritable for Integer, Text for String
▪ (TODO) Write numbers without UTF-8 encoding
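To illustrate what "reading numbers without UTF-8 decoding" can mean, the sketch below parses a decimal int straight from a byte range, so no intermediate String is created. The class and method names are hypothetical, and this is not presented as the code LazySimpleSerDe uses. It relies on ASCII digits being single bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class ByteNumberParser {
    // Parse a base-10 int directly from ASCII digits in a byte range, with no String allocation.
    static int parseInt(byte[] data, int start, int length) {
        if (length <= 0) throw new NumberFormatException("empty input");
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
            if (i == end) throw new NumberFormatException("sign without digits");
        }
        // Accumulate as a negative number so that Integer.MIN_VALUE is representable.
        int limit = negative ? Integer.MIN_VALUE : -Integer.MAX_VALUE;
        int multMin = limit / 10;
        int result = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) throw new NumberFormatException("not a digit at offset " + i);
            if (result < multMin) throw new NumberFormatException("overflow");
            result *= 10;
            if (result < limit + digit) throw new NumberFormatException("overflow");
            result -= digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        byte[] row = "imp 1.0 3 54".getBytes(StandardCharsets.US_ASCII);
        // Bytes 10..11 hold the last field of the row.
        System.out.println(parseInt(row, 10, 2));
    }
}
```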
18. Profiling result of a mapper
▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪ 22%: Operator.close
▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪ 50%: Operator.forward
▪ |-18%: Text.decode (from LazySerDe)
▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪ | |- 5%: toString() (where we create the string object)
▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪ |- 8%: GroupByOperator.processHashAggr
▪ |- 3%: HashMap.get() in GroupByOperator
▪ * Performance Data from Rodrigo Schmidt
19. TypeInfo String specification
▪ Why not Thrift?
▪ Hard to parse
▪ Simple Syntax
▪ Type: PrimitiveType | MapType | ArrayType | StructType
▪ PrimitiveType: int | bigint | tinyint | smallint | double | string
▪ MapType: map<Type, Type>
▪ ArrayType: array<Type>
▪ StructType: struct< [Name : Type]+ >
▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
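The grammar above is small enough for a hand-written recursive-descent parser, which is one way to see why a simple syntax is easy to parse. The sketch below is an assumption-laden illustration, not Hive's own parser: `TypeNode` and `TypeStringParser` are hypothetical names, struct fields are assumed to be comma-separated (as in the example string), and spaces are simply ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parse-tree node: a type name plus child types (and field names for structs).
final class TypeNode {
    final String category;                              // primitive name, "map", "array" or "struct"
    final List<String> fieldNames = new ArrayList<>();  // used by struct only
    final List<TypeNode> children = new ArrayList<>();
    TypeNode(String category) { this.category = category; }
}

final class TypeStringParser {
    private static final List<String> PRIMITIVES =
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");
    private static final List<String> COMPLEX = Arrays.asList("map", "array", "struct");

    private final String input;
    private int pos;

    private TypeStringParser(String input) { this.input = input; }

    static TypeNode parse(String typeString) {
        TypeStringParser p = new TypeStringParser(typeString.replace(" ", ""));
        TypeNode node = p.parseType();
        if (p.pos != p.input.length()) throw p.error("trailing characters");
        return node;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNode parseType() {
        String word = readWord();
        TypeNode node = new TypeNode(word);
        if (PRIMITIVES.contains(word)) return node;
        if (!COMPLEX.contains(word)) throw error("unknown type '" + word + "'");
        expect('<');
        if (word.equals("map")) {
            node.children.add(parseType());
            expect(',');
            node.children.add(parseType());
        } else if (word.equals("array")) {
            node.children.add(parseType());
        } else {
            do {                                   // struct: one or more name:Type pairs
                node.fieldNames.add(readWord());
                expect(':');
                node.children.add(parseType());
            } while (tryConsume(','));
        }
        expect('>');
        return node;
    }

    private String readWord() {
        int begin = pos;
        while (pos < input.length()
                && (Character.isLetterOrDigit(input.charAt(pos)) || input.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) throw error("expected a name");
        return input.substring(begin, pos);
    }

    private boolean tryConsume(char c) {
        if (pos < input.length() && input.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!tryConsume(c)) throw error("expected '" + c + "'");
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos + " in " + input);
    }
}
```

Calling `TypeStringParser.parse` on the example string yields an `array` node whose single child is a `map` from `string` to a three-field `struct`; a misspelled type name such as `doube` is rejected with an `IllegalArgumentException`.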
21. Future Works of ObjectInspector
▪ Delegate all methods described earlier
▪ isNull(), hashCode(), compare() etc are not delegated yet
▪ Support UNION data type: HIVE-537
22. Future Works of SerDe
▪ LazyBinarySerDe: HIVE-553
▪ A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order
▪ A binary-format compact SerDe: saving space
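As a small worked example of the sortable property, one common technique (named here as an illustration, not as the design planned for Hive) is to write an int as four big-endian bytes with the sign bit flipped. Comparing the encoded bytes as unsigned values then gives the same order as comparing the ints. The class name `SortableIntEncoding` is hypothetical.

```java
final class SortableIntEncoding {
    // Encode an int so that unsigned byte-wise comparison matches numeric order.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000; // flip the sign bit: negatives now sort before positives
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16), (byte) (flipped >>> 8), (byte) flipped
        };
    }

    // Compare two byte arrays lexicographically, treating each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // Both comparisons agree in sign: -5 sorts before 3 in either form.
        System.out.println(Integer.compare(-5, 3));
        System.out.println(compareBytes(encode(-5), encode(3)));
    }
}
```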