Hive on Spark
Szehon Ho // Cloudera Software Engineer, Apache Hive PMC
© 2014 Cloudera, Inc. All rights reserved.
Background (Hive)
•  Apache Hive: SQL-based data query and management tool for a
distributed dataset
•  Founded in 2007 at Facebook; most of our customers run Hive jobs in production.
Background (Hive)
•  Inflexibility of MapReduce framework => Inefficient Hive
•  Map(), Reduce() primitives, not designed for long data pipelines
•  Complex SQL-like queries inefficiently expressed as many MR stages.
•  Disk IO between MR stages
•  Shuffle-sort between each Map and Reduce
[Diagram: Hive query as a chain of Map() -> Red() jobs over HDFS]
Background (Hive)
•  2013: Hive community started work on Hive on Tez
•  Tez DAG execution graph
[Diagram: Hive query as a Tez DAG of Map() and Red() vertices over HDFS]
Background (Spark)
•  Generalized distributed processing framework created in ~2011 by
UC Berkeley AMPLab
•  Popular framework, on track to succeed MapReduce
Background (Spark)
•  Clean programming abstraction: Resilient Distributed Dataset (RDD)
•  A fault-tolerant dataset that can be a stage in a data pipeline.
•  Created from an existing data set such as an HDFS file, or by transformation from another RDD (chaining RDDs)
•  Expressive APIs, much more than MapReduce
•  Transformations: map, filter, groupBy
•  Actions: collect, save (cache is lazy persistence rather than an action)
•  => More efficient representation of Hive queries
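The RDD idea on this slide (transformations are chained lazily and only run when an action is called) can be sketched in plain Python. This is a toy model and not the Spark API: the class `MiniRDD` and its method names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class MiniRDD:
    """Toy stand-in for an RDD: a source plus a chain of lazy transformations."""

    def __init__(self, source, ops=()):
        self._source = source      # any iterable, standing in for an HDFS file
        self._ops = tuple(ops)     # recorded transformations, not yet executed

    # Transformations: return a new MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._ops + (("filter", pred),))

    def group_by(self, key_fn):
        return MiniRDD(self._source, self._ops + (("group_by", key_fn),))

    # Action: runs the whole recorded chain and returns a result.
    def collect(self):
        data = list(self._source)
        for kind, fn in self._ops:
            if kind == "map":
                data = [fn(x) for x in data]
            elif kind == "filter":
                data = [x for x in data if fn(x)]
            else:  # group_by
                groups = defaultdict(list)
                for x in data:
                    groups[fn(x)].append(x)
                # Sorted only to make this sketch's output deterministic.
                data = sorted(groups.items())
        return data


rows = [("us", 3), ("eu", 5), ("us", 7), ("eu", 1)]
chain = MiniRDD(rows).filter(lambda r: r[1] > 2).group_by(lambda r: r[0])
result = chain.collect()   # nothing is computed until this call
```

Because each transformation only records a step, a whole query can be expressed as one chain instead of separate jobs with storage in between.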
  
Background (Spark)
•  Community Momentum:
•  Spark Summit 2014: Already the most active project in the Hadoop ecosystem, and among the top 3 most active Apache projects.
•  Since Spark 1.0 in June, two more major releases: 1.1 and 1.2
[Charts: activity in past 6 months compared to other projects (MapReduce, YARN, HDFS, Storm, Spark); panels: Commits, Lines of Code Changed]
Background (Spark)
•  Community Momentum:
•  Advanced analytics, data science, ML, graph processing, etc.
•  Integration with many Hadoop tools, e.g. Pig, Flume, Mahout, Crunch, Solr
•  Hive jobs can now leverage these Spark clusters as well
Hive on Spark
•  Shark Project:
•  AMPLab github project, fork of Hive
•  Not maintained by Hive community, sunsetted 2014
•  Hive on Spark:
•  Done in Hive community
•  Architecturally compatible, by keeping same physical abstraction for Hive on Spark as Hive on Tez/MR.
•  Code maintenance
•  Maximize re-use of common functionality across execution engines
High-Level Design
[Diagram: a Hive query is compiled to a Logical Op Tree; an engine-specific TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler) turns it into a Task (MapRedTask, TezTask, SparkTask) made up of Work units (MapWork, ReduceWork)]
Common across engines:
•  HQL syntax
•  Tool Integrations (auditing plugins, authorization, Drivers, Thrift clients, UDF, StorageHandler)
•  Logical optimizations
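The layering on this slide can be sketched roughly as one shared operator tree plus a per-engine compiler that wraps the same work units in an engine-specific task. Every class and function name below is a hypothetical simplification for illustration, not Hive's actual Java classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Work:
    kind: str              # "map" or "reduce"
    operators: List[str]   # operator names executed inside this unit


def split_at_reduce_sink(op_tree: List[str]) -> List[Work]:
    """Shared step: cut a linear operator list into works after each ReduceSink."""
    works, current, kind = [], [], "map"
    for op in op_tree:
        current.append(op)
        if op == "ReduceSink":
            works.append(Work(kind, current))
            current, kind = [], "reduce"
    if current:
        works.append(Work(kind, current))
    return works


class MapRedCompiler:
    engine = "mr"

    def compile(self, op_tree):
        return {"engine": self.engine, "works": split_at_reduce_sink(op_tree)}


class SparkCompiler:
    engine = "spark"

    def compile(self, op_tree):
        # Same works; only the wrapping task differs per engine.
        return {"engine": self.engine, "works": split_at_reduce_sink(op_tree)}


COMPILERS = {c.engine: c for c in (MapRedCompiler(), SparkCompiler())}

op_tree = ["TableScan", "Filter", "Select", "GroupBy", "ReduceSink",
           "GroupBy", "Select", "FileOutput"]
spark_task = COMPILERS["spark"].compile(op_tree)
mr_task = COMPILERS["mr"].compile(op_tree)
```

In this sketch both engines receive identical map and reduce works, which is the point of keeping the physical abstraction common: only the final task wrapper is engine-specific.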
Simple Example
Hive Query:
SELECT COUNT(*) FROM status_updates
WHERE ds = '2014-10-01' GROUP BY region;
Operator Tree:
TableScan (status_updates) -> Filter (ds = '2014-10-01') -> Select (region) -> Group-By (count) -> Select
GBY triggers reduce-boundary
Simple Example
MapRed Work Tree:
•  Map->Reduce
•  Mapper: TableScan -> Filter -> Select -> Group-By -> ReduceSink
•  ShuffleSort
•  Reducer: GroupBy -> Select -> FileOutput
Simple Example
Spark Work Tree:
•  RDD Chain
•  mapPartition(): TableScan -> Filter -> Select -> Group-By -> ReduceSink
•  groupBy() (No sorting)
•  mapPartition(): GroupBy -> Select -> FileOutput
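The Spark work tree for the count query can be sketched in plain Python as three steps: a map side that filters and pre-aggregates, a hash-based grouping that never sorts keys, and a reduce side that merges partial counts. The function names and the sample rows are hypothetical, and the code does not call Spark.

```python
from collections import Counter, defaultdict


def map_side(partition, ds):
    """Scan, filter on ds, project region, and pre-aggregate counts per region."""
    partial = Counter()
    for row in partition:
        if row["ds"] == ds:
            partial[row["region"]] += 1
    return list(partial.items())      # what a ReduceSink would emit: (key, value)


def group_by_key(pairs):
    """Hash-based grouping: keys are bucketed, never sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_side(groups):
    """Merge the partial counts for each region."""
    return {region: sum(counts) for region, counts in groups.items()}


partitions = [
    [{"ds": "2014-10-01", "region": "us"}, {"ds": "2014-10-01", "region": "eu"}],
    [{"ds": "2014-10-01", "region": "us"}, {"ds": "2014-09-30", "region": "us"}],
]
shuffled = [kv for p in partitions for kv in map_side(p, "2014-10-01")]
counts = reduce_side(group_by_key(shuffled))
```

A GROUP BY only needs equal keys to end up together, so the sort that MapReduce always performs during its shuffle is unnecessary work for this query.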
Join Example
Hive Query:
SELECT * FROM
(SELECT key FROM src WHERE src.key < 10) src1
JOIN
(SELECT key FROM src WHERE src.key < 10) src2
ON src1.key = src2.key
ORDER BY src1.key;
Operator Tree:
•  TableScan -> Filter -> Select (one branch per subquery)
•  Join -> Select -> Sort -> Select
Join Example
MapRed Work Tree:
•  2 MapReduce Works
•  Work 1: two Maps (TableScan -> Filter -> Select -> Reduce Sink), ShuffleSort, Reduce (Join -> Select -> FileOutput)
•  Disk IO: intermediate result goes through HDFS
•  Work 2: Map (TableScan -> ReduceSink (Sort)), ShuffleSort, Reduce (Select -> FileOutput)
Join Example
Spark Work Tree:
•  RDD Transform Chain
•  Two mapPartition() branches: TableScan -> Filter -> Select -> Reduce Sink
•  union(), Partition/Sort()
•  mapPartition(): Join -> Select -> Reduce Sink
•  sortBy() (No spill to disk)
•  mapPartition(): Select -> FileOutput
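The join pipeline can be sketched as a single in-memory chain in plain Python: two filtered branches, a union, a partition by join key, the join itself, and a final sort. The sample rows in `src` and the helper name are hypothetical, and the code does not call Spark.

```python
from collections import defaultdict
from itertools import product

src = [(k, f"val_{k}") for k in (12, 3, 7, 3, 15, 0)]   # hypothetical (key, value) rows


def scan_filter_select(rows, tag):
    """TableScan -> Filter (key < 10) -> Select key; tag marks the join side."""
    return [(key, tag) for key, _ in rows if key < 10]


# Two map-side branches, one per subquery, then union().
unioned = scan_filter_select(src, "src1") + scan_filter_select(src, "src2")

# Partition by join key so both sides of a key meet in one place.
by_key = defaultdict(lambda: {"src1": [], "src2": []})
for key, tag in unioned:
    by_key[key][tag].append(key)

# Join: cross product of the two sides for each key.
joined = [pair for sides in by_key.values()
          for pair in product(sides["src1"], sides["src2"])]

# ORDER BY src1.key, done as an in-memory sort in this sketch.
ordered = sorted(joined, key=lambda pair: pair[0])
```

Nothing is written out between the join and the sort here, which mirrors the slide's contrast with the two MapReduce works that hand off through HDFS.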
Demo
Improvements to Spark
•  Largest MR Java app ported to Spark, can serve as reference.
•  Spark Umbrella JIRA for improvements needed by Hive: SPARK-3145
•  Implement Java version of Scala APIs (various), shade Spark Guava Library: SPARK-2848
•  Monitoring APIs (SPARK-2636, various)
•  Shuffle-Sort Transform: SPARK-2978
•  Spark had group(), sort(), but not partition+sort like MR-style shuffle-sort.
•  Elastic scaling of Spark application: SPARK-3174
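The MR-style shuffle-sort mentioned above combines two things: routing each record to a partition by key, and sorting records within each partition. A minimal plain-Python sketch of that behavior follows; the function name is hypothetical and the code does not call Spark (Spark exposes a transformation of this shape as `repartitionAndSortWithinPartitions`).

```python
def partition_and_sort(pairs, num_partitions):
    """Route each (key, value) to a partition by key hash, then sort within it."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    # Sort each partition independently; there is no global order across partitions.
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


pairs = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (2, "bb")]
parts = partition_and_sort(pairs, num_partitions=2)
```

This differs from a plain group (no ordering at all) and from a full sort (one global order): each reducer sees its own keys in order, which is what operators such as a sort-merge join rely on.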
  
Community
•  Thanks to contributors from many organizations.
•  Follow our progress on HIVE-7292
•  Thank you!
Human Factors of XR: Using Human Factors to Design XR Systems
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Hive on Spark: An Efficient Way to Run SQL Queries

  • 1. Hive on Spark Szehon Ho // Cloudera Software Engineer, Apache Hive PMC
  • 2. Background (Hive)
      •  Apache Hive: SQL-based data query and management tool for distributed datasets
      •  Founded in 2007 at Facebook; most of our customers run Hive jobs in production.
      © 2014 Cloudera, Inc. All rights reserved.
  • 3. Background (Hive)
      •  Inflexibility of the MapReduce framework => inefficient Hive
      •  Map(), Reduce() primitives are not designed for long data pipelines
      •  Complex SQL-like queries are inefficiently expressed as many MR stages
      •  Disk IO between MR jobs
      •  Shuffle-sort between map and reduce
      •  [Diagram: a Hive query executed as three Map()/Red() stages over HDFS]
  • 4. Background (Hive)
      •  In 2013 the Hive community started work on Hive on Tez
      •  Tez DAG execution graph
      •  [Diagram: a Hive query executed as a Tez DAG of Map() and Red() vertices over HDFS]
  • 5. Background (Spark)
      •  Generalized distributed processing framework created in ~2011 by UC Berkeley AMPLab
      •  Popular framework, on track to succeed MapReduce
  • 6. Background (Spark)
      •  Clean programming abstraction: Resilient Distributed Dataset (RDD)
      •  A fault-tolerant dataset; can be a stage in a data pipeline.
      •  Created from an existing data set such as an HDFS file, or by transformation from another RDD (chaining RDDs)
      •  Expressive APIs, much more so than MapReduce
      •  Transformations: map, filter, groupBy
      •  Actions: cache, save
      •  => More efficient representation of Hive queries
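The idea of chaining transformations on a dataset and recomputing it from its lineage can be sketched in plain Python. This is a toy, single-process illustration only: `MiniRDD` and its methods are hypothetical names made up for this sketch and are not Spark's API, and the input rows are invented sample data.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    It stores only the source data and the chain of transformations
    (its lineage); nothing is computed until collect() is called.
    """

    def __init__(self, source, ops=()):
        self._source = source      # re-iterable input, e.g. a list of lines
        self._ops = tuple(ops)     # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a new MiniRDD, does no work yet.
        return MiniRDD(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new MiniRDD, does no work yet.
        return MiniRDD(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded chain over the source data.
        rows = iter(self._source)
        for kind, fn in self._ops:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


# Invented sample rows in "region,ds" form.
lines = MiniRDD(["us,2014-10-01", "eu,2014-09-30", "us,2014-10-01"])
regions = (
    lines.map(lambda line: line.split(","))
         .filter(lambda row: row[1] == "2014-10-01")
         .map(lambda row: row[0])
)
result = regions.collect()
```

Because each `MiniRDD` keeps its full chain of operations, a lost result can be rebuilt by calling `collect()` again, which is a simplified picture of how lineage gives an RDD its fault tolerance.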
  • 7. Background (Spark)
      •  Community momentum:
      •  Spark Summit 2014: already the most active project in the Hadoop ecosystem, and among the top 3 most active Apache projects.
      •  Since Spark 1.0 in June, two more major releases: 1.1 and 1.2
      •  [Chart: "Activity in past 6 months", Spark compared to other projects (MapReduce, YARN, HDFS, Storm); panels: Commits, Lines of Code Changed]
  • 8. Background (Spark)
      •  Community momentum:
      •  Advanced analytics, data science, ML, graph processing, etc.
      •  Integration with many Hadoop tools, e.g. Pig, Flume, Mahout, Crunch, Solr
      •  Hive jobs can now leverage these Spark clusters as well
  • 9. Hive on Spark
      •  Shark project:
      •  AMPLab GitHub project, fork of Hive
      •  Not maintained by the Hive community, sunsetted in 2014
      •  Hive on Spark:
      •  Done in the Hive community
      •  Architecturally compatible, by keeping the same physical abstraction for Hive on Spark as for Hive on Tez/MR.
      •  Code maintenance
      •  Maximize re-use of common functionality across execution engines
  • 10. High-Level Design
      •  [Diagram labels: Hive Query, Logical Op Tree, TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler), Task (MapRedTask, TezTask, SparkTask), Work (MapWork, ReduceWork)]
      •  Common across engines:
      •  HQL syntax
      •  Tool integrations (auditing plugins, authorization, drivers, Thrift clients, UDF, StorageHandler)
      •  Logical optimizations
  • 11. Simple Example
      •  Hive query: SELECT COUNT(*) FROM status_updates WHERE ds = '2014-10-01' GROUP BY region;
      •  Operator tree: TableScan (status_updates) -> Filter (ds = '2014-10-01') -> Select (region) -> Group-By (count) -> Select
      •  GBY triggers a reduce boundary
  • 12. Simple Example
      •  MapRed work tree:
      •  Mapper: TableScan, Filter, Select, Group-By, ReduceSink
      •  Map -> Reduce: shuffle-sort
      •  Reducer: GroupBy, Select, FileOutput
  • 13. Simple Example
      •  Spark work tree (RDD chain):
      •  mapPartition(): TableScan, Filter, Select, Group-By, ReduceSink
      •  groupBy(): no sorting
      •  mapPartition(): GroupBy, Select, FileOutput
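The difference between the two work trees for the COUNT(*) ... GROUP BY region query can be sketched in plain Python. This is a simplified, in-memory illustration rather than Hive's or Spark's actual code: the sample rows, the function names, and the two-partition split are all assumptions made for the sketch.

```python
from collections import Counter, defaultdict
from itertools import groupby

# Invented sample rows of status_updates: (region, ds).
ROWS = [
    ("us", "2014-10-01"),
    ("eu", "2014-10-01"),
    ("us", "2014-10-01"),
    ("us", "2014-09-30"),
    ("asia", "2014-10-01"),
]


def map_side(partition, ds):
    """TableScan -> Filter -> Select -> partial Group-By -> ReduceSink."""
    partial = Counter(region for region, row_ds in partition if row_ds == ds)
    return list(partial.items())  # (region, partial_count) pairs


def reduce_with_sort(pairs):
    """MapReduce style: sort every pair by key, then sum each run of equal keys."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def reduce_with_hash(pairs):
    """Spark groupBy style: group by hashing the key, with no sort step."""
    totals = defaultdict(int)
    for region, count in pairs:
        totals[region] += count
    return dict(totals)


def run_query(rows, ds, reducer, num_partitions=2):
    # Split the input into partitions, run the map side on each,
    # then hand all emitted pairs to the chosen reduce side.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    shuffled = [pair for part in partitions for pair in map_side(part, ds)]
    return reducer(shuffled)


mr_result = run_query(ROWS, "2014-10-01", reduce_with_sort)
spark_result = run_query(ROWS, "2014-10-01", reduce_with_hash)
```

Both reducers return the same counts; the sketch only shows that a group-by aggregation does not need its input sorted, which is the saving the slide attributes to groupBy().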
  • 14. Join Example
      •  Hive query: SELECT * FROM (SELECT key FROM src WHERE src.key < 10) src1 JOIN (SELECT key FROM src WHERE src.key < 10) src2 ON src1.key = src2.key ORDER BY src1.key;
      •  Operator tree: two branches of TableScan, Filter, Select feed Join, followed by Select, Sort, Select
  • 15. Join Example
      •  MapRed work tree: 2 MapReduce works
      •  [Diagram: two Maps (TableScan, Filter, Select, ReduceSink) -> ShuffleSort -> Reduce (Join, Select, FileOutput) -> HDFS (disk IO) -> Map (TableScan, ReduceSink (Sort)) -> ShuffleSort -> Reduce (Select, FileOutput)]
  • 16. Join Example
      •  Spark work tree: RDD transform chain
      •  [Diagram: two mapPartition() stages (TableScan, Filter, Select, ReduceSink) -> union() -> Partition/Sort() -> mapPartition() (Join, Select, ReduceSink) -> sortBy() -> mapPartition() (Select, FileOutput); no spill to disk]
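The shape of the Spark work tree for the join query can be sketched in plain Python as a chain of in-memory steps. This is a rough illustration under stated assumptions: the sample keys, the function names, the branch tags 0 and 1, and the use of two partitions are all invented for the sketch, and none of it is Hive's or Spark's actual implementation.

```python
from itertools import groupby

# Invented sample values of src.key.
SRC = [3, 12, 7, 3, 25, 9]


def scan_filter_select(rows, tag):
    """TableScan -> Filter (key < 10) -> Select -> ReduceSink.

    Each surviving key is tagged with the branch it came from (0 or 1).
    """
    return [(key, tag) for key in rows if key < 10]


def partition_and_sort(pairs, num_partitions):
    """Hash-partition the tagged rows by key, then sort each partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, tag in pairs:
        partitions[hash(key) % num_partitions].append((key, tag))
    return [sorted(part) for part in partitions]


def join_partition(partition):
    """Join rows with equal keys inside one key-sorted partition."""
    joined = []
    for key, group in groupby(partition, key=lambda kv: kv[0]):
        tags = [tag for _, tag in group]
        # Every left row pairs with every right row that has the same key.
        joined.extend([(key, key)] * (tags.count(0) * tags.count(1)))
    return joined


# union() of the two filtered branches.
unioned = scan_filter_select(SRC, 0) + scan_filter_select(SRC, 1)
# Partition/Sort() followed by the per-partition join.
joined = [
    row
    for part in partition_and_sort(unioned, 2)
    for row in join_partition(part)
]
# sortBy() for ORDER BY src1.key.
result = sorted(joined)
```

Every intermediate list here stays in memory and is passed directly to the next step, which mirrors the slide's point that the chain avoids the HDFS write between the two MapReduce works.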
  • 17. Demo
  • 18. Improvements to Spark
      •  Largest MR Java app ported onto Spark; can serve as a reference.
      •  Spark umbrella JIRA for improvements needed by Hive: SPARK-3145
      •  Implement Java versions of Scala APIs (various), shade Spark Guava library: SPARK-2848
      •  Monitoring APIs (SPARK-2636, various)
      •  Shuffle-sort transform: SPARK-2978
      •  Spark had group(), sort(), but not partition+sort like MR-style shuffle-sort.
      •  Elastic scaling of Spark application: SPARK-3174
  • 19. Community
      •  Thanks to contributors from many organizations:
      •  Follow our progress on HIVE-7292
      •  Thank you!