Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
1. H104: Harnessing the Hadoop Ecosystem
Optimizations in Apache Hive
Jason Huang, Senior Solutions Architect – Qubole, Inc.
May 12, 2015
NYC Data Summit Hadoop Day
2. A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by the pioneers of “big data” at
Facebook and the creators of the Apache Hive Project.
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed,
Norwest Ventures.
2015 CNBC Disruptor 50 Companies – announced today!
World-class product and engineering team from: [company logos]
3. Hive – SQL on Hadoop
● A system for managing and querying unstructured data as if it were
structured
● Uses Map-Reduce for execution
● HDFS for Storage (or Amazon S3)
● Key Building Principles
● SQL as a familiar data warehousing tool
● Extensibility (Pluggable map/reduce scripts in the language of your
choice, Rich and User Defined Data Types, User Defined Functions)
● Interoperability (Extensible Framework to support different file and data
formats)
● Performance
4. Why Hive?
● Problem: unlimited data
● Terabytes every day
● Wide Adoption of Hadoop
● Scalable/Available
● But, Hadoop can be …
● Complex
● Different Paradigm
● Map-Reduce hard to program
5. Qubole DataFlow Diagram
[Architecture diagram: users reach Qubole via the browser UI, SDK, or ODBC, over a REST API (HTTPS). Qubole's AWS account runs an ephemeral web tier (web servers, encrypted result cache), an RDS store for Qubole user and account configurations (encrypted credentials), and the default Hive metastore. The customer's AWS account runs ephemeral Hadoop clusters (master/slave nodes with encrypted HDFS) managed by Qubole over SSH, Amazon S3 with S3 server-side encryption, an optional custom Hive metastore, and optional data flow to other RDS/Redshift.]
Encryption options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
6. De-normalizing data:
Normalization:
- models data tables with certain rules to deal with redundancy
- normalizing creates multiple relational tables
- requires joins at runtime to produce results
Joins are expensive operations and one of the most common causes of
performance issues. Because of this, it’s a good idea to avoid highly
normalized table structures, since they require join queries at runtime
to derive the desired metrics.
7. Partitioning Tables:
Hive partitioning is an effective method to improve query performance
on larger tables. The partition key is best chosen as a low-cardinality attribute.
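A sketch of what this looks like in HiveQL (the table and the `dt` date column are hypothetical examples; a date is a typical low-cardinality partition key):

```sql
-- Each distinct dt value becomes its own partition directory.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING);

-- A filter on the partition key lets Hive prune partitions:
-- only the matching directory is scanned, not the whole table.
SELECT COUNT(*)
FROM page_views
WHERE dt = '2015-05-01';
```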
9. Bucketing:
- improves the join performance if the bucket key and join keys are
common
- distributes the data in different buckets based on the hash results on
the bucket key
- Reduces I/O scans during the join process if the process is happening
on the same keys (columns)
Note: set the bucketing flag (hive.enforce.bucketing) each time before
writing data to the bucketed table.
To leverage bucketing in the join operation we should set
hive.optimize.bucketmapjoin=true. This setting hints to Hive to do a
bucket-level join during the map-stage join.
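Putting the two settings above together in one hedged sketch (the `clicks` and `users` tables are hypothetical; for a bucket map join, both tables should be bucketed on the join key):

```sql
-- Both tables clustered on the join key, same bucket count.
CREATE TABLE clicks (user_id INT, url STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
CREATE TABLE users (user_id INT, country STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Must be set in every session that writes to a bucketed table,
-- so Hive hashes rows into the declared number of buckets.
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE clicks
SELECT user_id, url FROM raw_clicks;

-- Hint Hive to join bucket-to-bucket in the map stage,
-- avoiding a full shuffle of both tables.
SET hive.optimize.bucketmapjoin = true;
SELECT c.url, u.country
FROM clicks c
JOIN users u ON c.user_id = u.user_id;
```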
11. File Input Formats:
- play a critical role in Hive performance
Text-based input formats, e.g. JSON:
- not a good choice for a large production system where data volume is
really high
- readable formats take a lot of space and carry parsing overhead
(e.g. JSON parsing)
To address these problems, Hive comes with columnar input formats like
RCFile, ORC etc. Columnar formats reduce read operations in queries by
allowing each column to be accessed individually.
Other binary formats like Avro, sequence files, Thrift can be effective in
various use cases.
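Switching a table to a columnar format is a one-line change in the DDL. A sketch (table and columns are hypothetical):

```sql
-- Store the table in ORC instead of delimited text.
CREATE TABLE page_views_orc (user_id STRING, url STRING, dt STRING)
STORED AS ORC;

-- A query that touches only one column reads just that column's
-- data from each ORC stripe, instead of parsing whole rows.
SELECT COUNT(DISTINCT user_id) FROM page_views_orc;
```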
12. Compress map/reduce output:
- reduce the intermediate data volume
- reduces the amount of data transfers between mappers and reducers
over the network
Note: gzip-compressed files are not splittable – so apply with caution
File size should not be larger than a few hundred megabytes
- otherwise it can potentially lead to an imbalanced job
- compression codec options: e.g. Snappy, LZO, bzip2, etc.
For map output compression: set mapred.compress.map.output=true
For job output compression: set mapred.output.compress=true
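The slide names the on/off switches; choosing the codec takes two more properties. A sketch using Snappy (the property names below are the classic MRv1 names that match those on the slide):

```sql
-- Compress intermediate map output (shuffled over the network).
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final job output written to HDFS/S3.
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

Snappy trades a lower compression ratio for fast compress/decompress, which suits intermediate data well.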
13. Parallel execution:
Hadoop can execute MapReduce jobs in parallel, and several queries
executed on Hive automatically use this parallelism.
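The slide does not show the relevant switch; in Hive this behavior is controlled by the following settings (a sketch):

```sql
-- Run independent stages of a query's job DAG concurrently
-- instead of strictly one after another.
SET hive.exec.parallel=true;

-- Cap on how many stages may run at once (Hive's default is 8).
SET hive.exec.parallel.thread.number=8;
```

This helps queries whose plan contains stages with no data dependency between them, e.g. a UNION ALL of independent subqueries.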
14. Vectorization:
- allows Hive to process a batch of rows together instead of processing
one row at a time (currently requires the ORC format)
Each batch consists of column vectors, which are usually arrays of
primitive types. Operations are performed on the entire column vector,
which improves instruction pipelining and cache usage.
To enable: set hive.vectorized.execution.enabled=true
15. Sampling:
- allows users to take a subset of dataset and analyze it, without having
to analyze the entire data set
Hive offers a built-in TABLESAMPLE clause that allows you to sample
your tables.
TABLESAMPLE can sample at various granularity levels
- return only subsets of buckets (bucket sampling)
- HDFS blocks (block sampling)
- first N records from each input split
Alternatively, you can implement your own UDF that filters out records
according to your sampling algorithm.
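The three granularities above map onto TABLESAMPLE variants like so (a sketch; the `clicks` table and `user_id` column are hypothetical):

```sql
-- Bucket sampling: take bucket 1 of 10, hashing rows on user_id.
SELECT * FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 10 ON user_id);

-- Block sampling: read roughly 1% of the table's HDFS blocks.
SELECT * FROM clicks TABLESAMPLE (1 PERCENT);

-- Row sampling: the first 10 rows from each input split.
SELECT * FROM clicks TABLESAMPLE (10 ROWS);
```

Bucket sampling is most effective on tables that are already bucketed on the sampled column, since Hive can then scan only the matching bucket files.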
17. Unit Testing:
- In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries
and more.
- Verify the correctness of your whole HiveQL query without touching a
Hadoop cluster.
- Executing a HiveQL query in local mode takes literally seconds,
compared to minutes, hours or days if it runs in Hadoop mode.
Various tools are available: e.g. HiveRunner, Hive_test and Beetest.
21.
“Qubole has enabled more users within Pinterest to get to the
data and has made the data platform lot more scalable and
stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because of stability, and rapidly
expanded big data usage by giving access to data to users beyond
developers.
User and Query Growth:
- Rapid expansion of big data beyond developers (240 users out of a
600-person company)
Use Cases:
- Rapid expansion in use cases ranging from ETL, search, ad-hoc
querying, product analytics, etc.
- Rock-solid infrastructure sees 50% fewer failures as compared to
AWS Elastic MapReduce
- Enterprise-scale processing and data access
22.
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to big data on the cloud (from internal Oracle clusters) because
getting to analysis was much quicker than operating infrastructure
themselves. Used to answer client queries and power client
dashboards.
# Commands Per Month: [chart: number of queries per month, Aug-13
through Feb-14, on a scale of 0 to 5,000]
Use Cases:
- Segment audiences based on their behavior, including such topics as
user pathway and multi-dimensional recency analysis
- Build customer profiles (both uni/multivariate) across thousands of
first-party (i.e., client CRM files) and third-party (i.e.,
demographic) segments
- Simplify attribution insights showing the effects of upper-funnel
prospecting on lower-funnel remarketing media strategies