Hive on Spark: An Efficient Way to Run SQL Queries

Hive on Spark
Szehon Ho // Cloudera Software Engineer, Apache Hive PMC

© 2014 Cloudera, Inc. All rights reserved.
Background (Hive)

•  Apache Hive: SQL-based data query and management tool for distributed datasets
•  Founded in 2007 at Facebook; most of our customers run Hive jobs in production.

Background (Hive)

•  Inflexibility of MapReduce framework => Inefficient Hive
•  Map(), Reduce() primitives, not designed for long data pipelines
•  Complex SQL-like queries inefficiently expressed as many MR stages.
•  Disk IO between MR’s
•  Shuffle-sort between M+R
[Diagram: a Hive query runs as a chain of Map() -> Red() stages, with HDFS between stages]

Background (Hive)

•  In 2013, the Hive community started work on Hive on Tez
•  Tez DAG execution graph
[Diagram: a Hive query runs as a single DAG of Map() and Red() stages over HDFS]

Background (Spark)

•  Generalized distributed processing framework created in ~2011 by
UC Berkeley AMPLab
•  Popular framework, on track to succeed MapReduce

Background (Spark)
•  Clean programming abstraction: Resilient Distributed Dataset (RDD):
•  A fault-tolerant dataset, can be a stage in a data pipeline.
•  Created from an existing data set like an HDFS file, or by transformation from another RDD (chain-up RDDs)
•  Expressive APIs, much more than MapReduce
•  Transformations: map, filter, groupBy
•  Actions: cache, save
•  => More efficient representation of Hive queries
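The idea of chaining transformations and only running them when an action is called can be sketched in plain Python. This is a toy stand-in, not Spark's API: the class `ToyRDD` and its method names are made up for illustration, and everything runs in one process.

```python
from collections import defaultdict


class ToyRDD:
    """Toy stand-in for an RDD: holds a recipe, not data, until an action runs."""

    def __init__(self, compute):
        # compute is a zero-argument function that returns an iterator of rows.
        self._compute = compute

    @classmethod
    def from_list(cls, rows):
        return cls(lambda: iter(rows))

    # Transformations: each returns a new ToyRDD chained onto this one.
    # No rows are processed at this point.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def group_by(self, key_fn):
        def grouped():
            groups = defaultdict(list)
            for x in self._compute():
                groups[key_fn(x)].append(x)
            # Sorted by key only so the toy output is deterministic.
            return iter(sorted(groups.items()))
        return ToyRDD(grouped)

    # Action: runs the whole chain and returns the rows.
    def collect(self):
        return list(self._compute())


# Hypothetical (region, value) rows.
rows = [("us", 3), ("eu", 5), ("us", 7), ("eu", 1)]
result = (
    ToyRDD.from_list(rows)
    .filter(lambda r: r[1] > 2)
    .group_by(lambda r: r[0])
    .map(lambda kv: (kv[0], len(kv[1])))
    .collect()
)
print(result)
```

The whole filter, group, count pipeline is one chained object, which is the property that lets a multi-stage Hive query map onto a single RDD chain rather than several separate jobs.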

Background (Spark)

•  Community Momentum:
•  Spark Summit 2014: already the most active project in the Hadoop ecosystem, and among the top 3 most active Apache projects.
•  Since Spark 1.0 in June, two more major releases: 1.1 and 1.2
[Charts: activity in past 6 months compared to other projects (MapReduce, YARN, HDFS, Storm, Spark), by commits and by lines of code changed]

Background (Spark)

•  Community Momentum:
•  Advanced analytics, data science, ML, graph processing, etc.
•  Integration with many Hadoop tools, e.g. Pig, Flume, Mahout, Crunch, Solr
•  Hive jobs can now leverage these Spark clusters as well

Hive on Spark
•  Shark Project:
•  AMPLab GitHub project, fork of Hive
•  Not maintained by Hive community, sunsetted 2014
•  Hive on Spark:
•  Done in Hive community
•  Architecturally compatible, by keeping the same physical abstraction for Hive on Spark as for Hive on Tez/MR.
•  Code maintenance
•  Maximize re-use of common functionality across execution engines

High-Level Design
[Diagram: Hive Query -> Logical Op Tree -> TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler) -> Task (MapRedTask, TezTask, SparkTask) -> Work (MapWork, ReduceWork)]
Common across engines:
•  HQL syntax
•  Tool Integrations (auditing plugins, authorization, Drivers, Thrift clients, UDF, StorageHandler)
•  Logical optimizations
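The design above, one shared logical operator tree with an engine-specific compiler at the end, can be sketched as a small plug-in structure. This is only an illustration in Python, not Hive's actual Java classes: the class names are borrowed from the diagram, the method `compile_tree` and the tuple it returns are invented, and the engine keys are assumed to match the values of Hive's `hive.execution.engine` setting.

```python
from abc import ABC, abstractmethod


class TaskCompiler(ABC):
    """Turns the shared logical operator tree into an engine-specific task."""

    task_name = "Task"

    @abstractmethod
    def compile_tree(self, operator_tree):
        ...


class MapRedCompiler(TaskCompiler):
    task_name = "MapRedTask"

    def compile_tree(self, operator_tree):
        return (self.task_name, list(operator_tree))


class TezCompiler(TaskCompiler):
    task_name = "TezTask"

    def compile_tree(self, operator_tree):
        return (self.task_name, list(operator_tree))


class SparkCompiler(TaskCompiler):
    task_name = "SparkTask"

    def compile_tree(self, operator_tree):
        return (self.task_name, list(operator_tree))


# Assumed engine keys; only the final compilation step depends on the engine.
COMPILERS = {"mr": MapRedCompiler, "tez": TezCompiler, "spark": SparkCompiler}


def compile_query(operator_tree, engine):
    return COMPILERS[engine]().compile_tree(operator_tree)


logical_tree = ["TableScan", "Filter", "Select", "Group-By", "Select"]
print(compile_query(logical_tree, "spark"))
print(compile_query(logical_tree, "mr"))
```

Everything before `compile_query` (parsing, logical optimization, tool integrations) stays the same no matter which engine is picked, which is the re-use the slide is pointing at.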

Simple Example
Hive Query:
SELECT COUNT(*) FROM status_updates
WHERE ds = '2014-10-01' GROUP BY region;
Operator Tree:
TableScan (status_updates) -> Filter (ds='2014-10-01') -> Select (region) -> Group-By (count) -> Select
•  GBY triggers the reduce-boundary
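To make the operator tree concrete, here is a small Python sketch that applies the same operators, in the same order, to a few in-memory rows. The sample rows and the helper function names are hypothetical; Hive's real operators are Java classes working on serialized rows.

```python
from collections import Counter

# Hypothetical sample rows for the status_updates table.
status_updates = [
    {"ds": "2014-10-01", "region": "us", "text": "a"},
    {"ds": "2014-10-01", "region": "eu", "text": "b"},
    {"ds": "2014-09-30", "region": "us", "text": "c"},
    {"ds": "2014-10-01", "region": "us", "text": "d"},
]


def table_scan(table):
    # TableScan: stream every row of the table.
    yield from table


def filter_op(rows, ds):
    # Filter: keep rows of the requested partition date.
    return (r for r in rows if r["ds"] == ds)


def select_op(rows, column):
    # Select: project a single column.
    return (r[column] for r in rows)


def group_by_count(values):
    # Group-By: count rows per distinct value.
    return Counter(values)


counts = group_by_count(
    select_op(filter_op(table_scan(status_updates), "2014-10-01"), "region")
)
# Final Select: the query returns only COUNT(*), one row per region.
final_rows = sorted(counts.values())
print(final_rows)
```

The Group-By is the only step that needs all rows for a region in one place, which is why it marks the reduce-boundary.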

Simple Example
MapRed Work Tree:
•  Mapper: TableScan -> Filter -> Select -> Group-By -> ReduceSink
•  Map->Reduce ShuffleSort
•  Reducer: GroupBy -> Select -> FileOutput
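The MapReduce plan above can be sketched in a few lines of Python: cut the operator list at ReduceSink, then run map, shuffle-sort, and reduce on toy data. The function names and the input rows are assumptions for illustration, and the map-side partial aggregation that Hive performs is left out to keep it short.

```python
from itertools import groupby

operators = ["TableScan", "Filter", "Select", "Group-By", "ReduceSink",
             "GroupBy", "Select", "FileOutput"]


def split_at_reduce_sink(ops):
    # Everything up to and including ReduceSink runs in the mapper;
    # the rest runs in the reducer.
    cut = ops.index("ReduceSink") + 1
    return ops[:cut], ops[cut:]


def run_map_reduce(regions):
    # Map side: emit one (key, 1) pair per surviving row.
    emitted = [(region, 1) for region in regions]
    # Shuffle-sort: MapReduce always sorts by key between map and reduce.
    emitted.sort(key=lambda kv: kv[0])
    # Reduce side: sum the values of each run of equal keys.
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(emitted, key=lambda kv: kv[0])
    ]


map_work, reduce_work = split_at_reduce_sink(operators)
print(map_work)
print(reduce_work)
print(run_map_reduce(["us", "eu", "us"]))
```

The sort in the middle is unconditional here, as it is in MapReduce, even though a plain count per key does not need ordered keys.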

Simple Example
Spark Work Tree:
•  RDD Chain
•  mapPartition(): TableScan -> Filter -> Select -> Group-By -> ReduceSink
•  groupBy() (no sorting)
•  mapPartition(): GroupBy -> Select -> FileOutput

Join Example
Hive Query:
SELECT * FROM
(SELECT key FROM src WHERE src.key < 10) src1
JOIN
(SELECT key FROM src WHERE src.key < 10) src2
ON src1.key = src2.key
ORDER BY src1.key;
Operator Tree:
•  TableScan -> Filter -> Select (one branch per subquery)
•  Join -> Select -> Sort -> Select

Join Example
MapRed Work Tree:
•  2 MapReduce Works
•  Work 1: Map (TableScan -> Filter -> Select -> Reduce Sink), once per subquery; ShuffleSort; Reduce (Join -> Select -> FileOutput)
•  Disk IO (HDFS) between the two works
•  Work 2: Map (TableScan -> ReduceSink (Sort)); ShuffleSort; Reduce (Select -> FileOutput)

Join Example
Spark Work Tree:
•  RDD Transform Chain
•  mapPartition(): TableScan -> Filter -> Select -> Reduce Sink (once per subquery)
•  union(), Partition/Sort()
•  mapPartition(): Join -> Select -> Reduce Sink
•  sortBy() (no spill to disk)
•  mapPartition(): Select -> FileOutput
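The shape of the Spark plan for the join (two filtered branches, a union, a partition by key, a join, then a sort) can be walked through in plain Python. This is a single-process sketch with made-up contents for `src`; it tags each row with the branch it came from, which is an assumption about how the two inputs are told apart after the union.

```python
from collections import defaultdict

# Hypothetical contents of the src table.
src = [{"key": k} for k in (12, 3, 7, 3, 25)]


def scan_filter_select(table, tag):
    # First stage: TableScan -> Filter (key < 10) -> Select (key).
    # The tag records which side of the join the row belongs to.
    return [(row["key"], tag) for row in table if row["key"] < 10]


left = scan_filter_select(src, 0)
right = scan_filter_select(src, 1)

# union(), then partition by key so matching rows meet in one place.
by_key = defaultdict(lambda: ([], []))
for key, tag in left + right:
    by_key[key][tag].append(key)

# Join: pair every left row with every right row that shares the key.
joined = [
    (l, r)
    for sides in by_key.values()
    for l in sides[0]
    for r in sides[1]
]

# ORDER BY src1.key: the final sort stays in the same chain,
# with no intermediate file written between join and sort.
result = sorted(joined, key=lambda pair: pair[0])
print(result)
```

In the MapReduce plan the joined rows would be written out and read back before the sort; here `joined` is handed straight to the next step.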

Demo

Improvements to Spark
•  Largest MR Java app ported on to Spark, can serve as reference.
•  Spark Umbrella JIRA for improvements needed by Hive: SPARK-3145
•  Implement Java version of Scala APIs (various), shade Spark Guava Library: SPARK-2848
•  Monitoring APIs (SPARK-2636, various)
•  Shuffle-Sort Transform: SPARK-2978
•  Spark had group(), sort(), but not partition+sort like MR-style shuffle-sort.
•  Elastic scaling of Spark application: SPARK-3174
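The partition+sort behaviour mentioned above (route each record to a partition by key, then sort within each partition, without a global sort) can be sketched in plain Python. The function name, the modulo partitioner, and the integer keys are assumptions for illustration. Spark has a transformation with this shape, `repartitionAndSortWithinPartitions`, but the sketch does not use Spark.

```python
def partition_and_sort(records, num_partitions):
    """MR-style shuffle-sort on a list of (key, value) pairs."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # Partition step: integer keys are assumed, routed by key modulo.
        partitions[key % num_partitions].append((key, value))
    for part in partitions:
        # Sort step: each partition is sorted on its own by key.
        part.sort(key=lambda kv: kv[0])
    return partitions


records = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c")]
parts = partition_and_sort(records, 2)
print(parts)
```

Each partition comes out sorted, but the partitions are not ordered relative to each other, which is what a reducer in MapReduce sees and what a global sort() would overdo.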

Community
•  Thanks to contributors from many organizations.
•  Follow our progress on HIVE-7292
•  Thank you!
