SlideShare a Scribd company logo
1 of 38
1© Cloudera, Inc. All rights reserved.
Hive Now Sparks
Rui Li
Intel
Hive Committer
Xuefu Zhang
Cloudera
Hive PMC
2© Cloudera, Inc. All rights reserved.
Agenda

Background

Architecture and Design

Challenges

Current Status

Benchmarks

Q&A
3© Cloudera, Inc. All rights reserved.
Background

Apache Hive: popular SQL-based data processing tool in Hadoop with
large user base

Apache Spark: generalized distributed processing framework to succeed
MapReduce

Marrying the two benefits users in both communities

HIVE-7292, the most watched (160+)
4© Cloudera, Inc. All rights reserved.
Motivation

Succeed MapReduce as the generalized batch processing engine

Introduce faster batch processing in Hive

Greater adoption for both projects (Apache Hive and Apache Spark)
5© Cloudera, Inc. All rights reserved.
Community Involvement

Effort from both communities (Hive and Spark)

Contributions from many organizations

Free lancer contributors, early adopters
6© Cloudera, Inc. All rights reserved.
Design Principles

No or limited impact on Hive's existing code path

Maximal code reuse

Minimum feature customization

Low future maintenance cost
7© Cloudera, Inc. All rights reserved.
Hive Internal
Hive Client
(Beeline, JDBC)
Parser Semantic
Analyzer
Task CompilerHadoop Client
Hadoop
HQL AST
Operator
Tree
TasksResult
HiveServer2
MetaStore
8© Cloudera, Inc. All rights reserved.
Class Hierarchy
MapRedCompiler
TaskCompiler
TezCompiler MapRedTask
Task
TezTask MapRedWork
Work
TezWork
generates described by
9© Cloudera, Inc. All rights reserved.
Class Hierarchy
MapRedCompiler
TaskCompiler
TezCompiler
SparkCompiler
MapRedTask
Task
TezTask
SparkTask
MapRedWork
Work
TezWork
SparkWork
generates described by
10© Cloudera, Inc. All rights reserved.
Work – Metadata for Task

MapRedWork contains one MapWork and a possible ReduceWork

SparkWork contains a graph of MapWorks and ReduceWorks
MapWork1
ReduceWork1
MapWork2
ReduceWork2
MapWork1
ReduceWork1
ReduceWork2
Ex. Query:
select name, sum(value)
as v
from dec
group by name
order by v;
Spark
Job
MR
job 1
MR
job 2
11© Cloudera, Inc. All rights reserved.
Spark Client

Abreast with MR client, Tez client

Talking to Spark cluster

Support local, local-cluster, standalone, yarn-cluster, yarn-client

Job submission, monitoring, error reporting, statistics, metrics, counters
12© Cloudera, Inc. All rights reserved.
Spark Context

Core of Spark client

Heavy-weighted, thread-unsafe

Designed for a single-user application

Doesn't work in multi-session environment

Doesn't scale with user sessions
13© Cloudera, Inc. All rights reserved.
Remote Spark Context (RSC)

Being created and living outside HiveServer2

In yarn-cluster mode, Spark context lives in application master (AM)

Otherwise, Spark context lives in a separate process (other than HS2)
User 1
User 2
Session 1
Session2
HiveServer 2
AM (RSC)
Node 2
Node 3 Yarn
cluster
Node 1
AM (RSC)
14© Cloudera, Inc. All rights reserved.
Data Processing via MapReduce

Table as HDFS files is read by MR framework

Map-side processing

Map output is shuffled by MR framework

Reduce-side processing

Reduce output is written to disk as part of reduce-side processing

Output may be further processed by next MR job or returned to client
15© Cloudera, Inc. All rights reserved.
Data Processing via Spark

Treat Table as HadoopRDD (input RDD)

Apply the function that wraps MR's map-side processing

Shuffle map output using Spark's transformations (groupByKey,
sortByKey, etc)

Apply the function that wraps MR's reduce-side processing

Output is either written to file or being shuffled again
16© Cloudera, Inc. All rights reserved.
Spark Plan

MapInput – encapsulate a table

MapTran – map-side processing

ShuffleTran – shuffling

ReduceTran – reduce-side processing
MapInput MapTran ShuffleTran
ShuffleTran ReduceTranReduceTran
Query: select name, sum(value)
as v from dec group by name
order by v;
17© Cloudera, Inc. All rights reserved.
Advantages

Reuse existing map-side and reduce-side processing

Agnostic to Spark's special transformations or actions

No need to reinvent wheels

Adopt existing features: authorization, window functions, UDFs, etc

Open to future features
18© Cloudera, Inc. All rights reserved.
Challenges

Missing features or functional gaps in Spark
Concurrent data pipelines
Spark Context issues
Scala vs Java API
Scheduling issues

Large code base in Hive, many contributors working in different areas

Library dependency conflicts among projects
19© Cloudera, Inc. All rights reserved.
Dynamic Executor Scaling

Spark cluster per user session

Heavy user vs light user

Big query vs small query

Solution: executors up and down based on load
20© Cloudera, Inc. All rights reserved.
Current Status

All functionality is implemented

First round of optimization is completed

More optimization and benchmarking

Beta release in CDH in Feb. 2015

Released in Apache Hive 1.1

Included in CDH5.4 (coming soon) with support for selected customers

Follow HIVE-7292 for current and future work
21© Cloudera, Inc. All rights reserved.
Optimizations

Map join (bucket map join, SMB, skew join (static and dynamic))

Split generation and grouping

CBO, vectorization

More to come, including table caching, dynamic partition pruning
22© Cloudera, Inc. All rights reserved.
Summary

Community driven project

Multi-Organization support

Combining merits from multiple projects

Benefiting a large user base

Bearing solid foundations

A fast-moving, evolving project
23© Cloudera, Inc. All rights reserved.
BenchMarks – Cluster Setup

8 physical nodes

Each has 32 logical cores and 64GB memory

10Gb/s network between the nodes
Component Version
Hive Spark-branch
Spark 1.3.0
Hadoop 2.6.0
Tez 0.5.3
24© Cloudera, Inc. All rights reserved.
BenchMarks – Test Configurations

320GB and 4TB TPC-DS datasets

Three engines share the most configurations
– Each node is allocated 32 cores and 48GB memory
– Vectorization enabled
– CBO enabled
– hive.auto.convert.join.noconditionaltask.size = 600MB
25© Cloudera, Inc. All rights reserved.
BenchMarks – Test Configurations
Hive on Spark
•
spark.master = yarn-client
•
spark.executor.memory = 5120m
•
spark.yarn.executor.memoryOverhead = 1024
•
spark.executor.cores = 4
•
spark.kryo.referenceTracking = false
•
spark.io.compression.codec = lzf
Hive on Tez
•
hive.prewarm.numcontainers = 250
•
hive.tez.auto.reducer.parallelism = true
•
hive.tez.dynamic.partition.pruning = true
26© Cloudera, Inc. All rights reserved.
BenchMarks – Data Collecting

We run each query 2 times and measure the 2nd run

Spark on yarn waits for a number of executors to register before
scheduling tasks, thus with a bigger start-up overhead

We also measure a few queries for Tez with dynamic partition pruning
disabled for fair comparison, as this optimization hasn't been
implemented in Hive on Spark yet
27© Cloudera, Inc. All rights reserved.
MR vs Spark vs Tez, 320GB
28© Cloudera, Inc. All rights reserved.
MR vs Spark vs Tez, 320GB
DPP helps Tez
29© Cloudera, Inc. All rights reserved.

Prune partitions at runtime

Can dramatically improve performance when tables are joined on
partitioned columns

To be implemented for Hive on Spark
Benchmarks - Dynamic Partition Pruning
30© Cloudera, Inc. All rights reserved.
Spark vs Tez vs Tez w/o DPP, 320GB
31© Cloudera, Inc. All rights reserved.
MR vs Spark vs Tez, 4TB
32© Cloudera, Inc. All rights reserved.
Spark vs Tez, 4TB
33© Cloudera, Inc. All rights reserved.
Spark vs Tez, 4TB
DPP helps Tez
34© Cloudera, Inc. All rights reserved.
Spark vs Tez vs Tez w/o DPP, 4TB
35© Cloudera, Inc. All rights reserved.
Spark vs Tez, 4TB
Spark is faster
36© Cloudera, Inc. All rights reserved.
Spark vs Tez, 4TB
Tez is faster
37© Cloudera, Inc. All rights reserved.
BenchMark - Summary

Spark is as fast as or faster than Tez on many queries (Q28, Q82, Q88,
etc.)

Dynamic partition pruning helps Tez a lot on certain queries (Q3, Q15,
Q19). Without DPP for Tez, Spark is close or even faster.

Spark is slower on certain queries (common join, Q84) than Tez.
Investigation and improvement is on the way.

Bigger dataset seems helping Spark more.

Spark will likely be faster after DPP is implemented.
38© Cloudera, Inc. All rights reserved.
Q&A

More Related Content

What's hot

Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
DataWorks Summit
 

What's hot (20)

Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the Cloud
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 

Viewers also liked

Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
DataWorks Summit
 

Viewers also liked (20)

October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Hadoop 生態系十年回顧與未來展望
Hadoop 生態系十年回顧與未來展望Hadoop 生態系十年回顧與未來展望
Hadoop 生態系十年回顧與未來展望
 
reference letter christof
reference letter christofreference letter christof
reference letter christof
 
Apache Kite
Apache KiteApache Kite
Apache Kite
 
Overview of the Hive Stinger Initiative
Overview of the Hive Stinger InitiativeOverview of the Hive Stinger Initiative
Overview of the Hive Stinger Initiative
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub唯品会大数据实践 Sacc pub
唯品会大数据实践 Sacc pub
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 

Similar to Hive Now Sparks

BigData Clusters Redefined
BigData Clusters RedefinedBigData Clusters Redefined
BigData Clusters Redefined
DataWorks Summit
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 

Similar to Hive Now Sparks (20)

Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
BigData Clusters Redefined
BigData Clusters RedefinedBigData Clusters Redefined
BigData Clusters Redefined
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
 
Spark For Plain Old Java Geeks (June2014 Meetup)
Spark For Plain Old Java Geeks (June2014 Meetup)Spark For Plain Old Java Geeks (June2014 Meetup)
Spark For Plain Old Java Geeks (June2014 Meetup)
 
Spark 101
Spark 101Spark 101
Spark 101
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data Meetup
 
One Hadoop, Multiple Clouds
One Hadoop, Multiple CloudsOne Hadoop, Multiple Clouds
One Hadoop, Multiple Clouds
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Hive Now Sparks

  • 1. 1© Cloudera, Inc. All rights reserved. Hive Now Sparks Rui Li Intel Hive Committer Xuefu Zhang Cloudera Hive PMC
  • 2. 2© Cloudera, Inc. All rights reserved. Agenda  Background  Architecture and Design  Challenges  Current Status  Benchmarks  Q&A
  • 3. 3© Cloudera, Inc. All rights reserved. Background  Apache Hive: popular SQL-based data processing tool in Hadoop with large user base  Apache Spark: generalized distributed processing framework to succeed MapReduce  Marrying the two benefits users in both communities  HIVE-7292, the most watched (160+)
  • 4. 4© Cloudera, Inc. All rights reserved. Motivation  Succeed MapReduce as the generalized batch processing engine  Introduce faster batch processing in Hive  Greater adoption for both projects (Apache Hive and Apache Spark)
  • 5. 5© Cloudera, Inc. All rights reserved. Community Involvement  Effort from both communities (Hive and Spark)  Contributions from many organizations  Free lancer contributors, early adopters
  • 6. 6© Cloudera, Inc. All rights reserved. Design Principles  No or limited impact on Hive's existing code path  Maximal code reuse  Minimum feature customization  Low future maintenance cost
  • 7. 7© Cloudera, Inc. All rights reserved. Hive Internal Hive Client (Beeline, JDBC) Parser Semantic Analyzer Task CompilerHadoop Client Hadoop HQL AST Operator Tree TasksResult HiveServer2 MetaStore
  • 8. 8© Cloudera, Inc. All rights reserved. Class Hierarchy MapRedCompiler TaskCompiler TezCompiler MapRedTask Task TezTask MapRedWork Work TezWork generates described by
  • 9. 9© Cloudera, Inc. All rights reserved. Class Hierarchy MapRedCompiler TaskCompiler TezCompiler SparkCompiler MapRedTask Task TezTask SparkTask MapRedWork Work TezWork SparkWork generates described by
  • 10. 10© Cloudera, Inc. All rights reserved. Work – Metadata for Task  MapRedWork contains one MapWork and a possible ReduceWork  SparkWork contains a graph of MapWorks and ReduceWorks MapWork1 ReduceWork1 MapWork2 ReduceWork2 MapWork1 ReduceWork1 ReduceWork2 Ex. Query: select name, sum(value) as v from dec group by name order by v; Spark Job MR job 1 MR job 2
  • 11. 11© Cloudera, Inc. All rights reserved. Spark Client  Abreast with MR client, Tez client  Talking to Spark cluster  Support local, local-cluster, standalone, yarn-cluster, yarn-client  Job submission, monitoring, error reporting, statistics, metrics, counters
  • 12. 12© Cloudera, Inc. All rights reserved. Spark Context  Core of Spark client  Heavy-weighted, thread-unsafe  Designed for a single-user application  Doesn't work in multi-session environment  Doesn't scale with user sessions
  • 13. 13© Cloudera, Inc. All rights reserved. Remote Spark Context (RSC)  Being created and living outside HiveServer2  In yarn-cluster mode, Spark context lives in application master (AM)  Otherwise, Spark context lives in a separate process (other than HS2) User 1 User 2 Session 1 Session2 HiveServer 2 AM (RSC) Node 2 Node 3 Yarn cluster Node 1 AM (RSC)
  • 14. 14© Cloudera, Inc. All rights reserved. Data Processing via MapReduce  Table as HDFS files is read by MR framework  Map-side processing  Map output is shuffled by MR framework  Reduce-side processing  Reduce output is written to disk as part of reduce-side processing  Output may be further processed by next MR job or returned to client
  • 15. 15© Cloudera, Inc. All rights reserved. Data Processing via Spark  Treat Table as HadoopRDD (input RDD)  Apply the function that wraps MR's map-side processing  Shuffle map output using Spark's transformations (groupByKey, sortByKey, etc)  Apply the function that wraps MR's reduce-side processing  Output is either written to file or being shuffled again
  • 16. 16© Cloudera, Inc. All rights reserved. Spark Plan  MapInput – encapsulate a table  MapTran – map-side processing  ShuffleTran – shuffling  ReduceTran – reduce-side processing MapInput MapTran ShuffleTran ShuffleTran ReduceTranReduceTran Query: select name, sum(value) as v from dec group by name order by v;
  • 17. 17© Cloudera, Inc. All rights reserved. Advantages  Reuse existing map-side and reduce-side processing  Agnostic to Spark's special transformations or actions  No need to reinvent wheels  Adopt existing features: authorization, window functions, UDFs, etc  Open to future features
  • 18. 18© Cloudera, Inc. All rights reserved. Challenges  Missing features or functional gaps in Spark Concurrent data pipelines Spark Context issues Scala vs Java API Scheduling issues  Large code base in Hive, many contributors working in different areas  Library dependency conflicts among projects
  • 19. 19© Cloudera, Inc. All rights reserved. Dynamic Executor Scaling  Spark cluster per user session  Heavy user vs light user  Big query vs small query  Solution: executors up and down based on load
  • 20. 20© Cloudera, Inc. All rights reserved. Current Status  All functionality is implemented  First round of optimization is completed  More optimization and benchmarking  Beta release in CDH in Feb. 2015  Released in Apache Hive 1.1  Included in CDH5.4 (coming soon) with support for selected customers  Follow HIVE-7292 for current and future work
  • 21. 21© Cloudera, Inc. All rights reserved. Optimizations  Map join (bucket map join, SMB, skew join (static and dynamic))  Split generation and grouping  CBO, vectorization  More to come, including table caching, dynamic partition pruning
  • 22. 22© Cloudera, Inc. All rights reserved. Summary  Community driven project  Multi-Organization support  Combining merits from multiple projects  Benefiting a large user base  Bearing solid foundations  A fast-moving, evolving project
  • 23. 23© Cloudera, Inc. All rights reserved. BenchMarks – Cluster Setup  8 physical nodes  Each has 32 logical cores and 64GB memory  10Gb/s network between the nodes Component Version Hive Spark-branch Spark 1.3.0 Hadoop 2.6.0 Tez 0.5.3
  • 24. 24© Cloudera, Inc. All rights reserved. BenchMarks – Test Configurations  320GB and 4TB TPC-DS datasets  Three engines share the most configurations – Each node is allocated 32 cores and 48GB memory – Vectorization enabled – CBO enabled – hive.auto.convert.join.noconditionaltask.size = 600MB
  • 25. 25© Cloudera, Inc. All rights reserved. BenchMarks – Test Configurations Hive on Spark • spark.master = yarn-client • spark.executor.memory = 5120m • spark.yarn.executor.memoryOverhead = 1024 • spark.executor.cores = 4 • spark.kryo.referenceTracking = false • spark.io.compression.codec = lzf Hive on Tez • hive.prewarm.numcontainers = 250 • hive.tez.auto.reducer.parallelism = true • hive.tez.dynamic.partition.pruning = true
  • 26. 26© Cloudera, Inc. All rights reserved. BenchMarks – Data Collecting  We run each query 2 times and measure the 2nd run  Spark on yarn waits for a number of executors to register before scheduling tasks, thus with a bigger start-up overhead  We also measure a few queries for Tez with dynamic partition pruning disabled for fair comparison, as this optimization hasn't been implemented in Hive on Spark yet
  • 27. 27© Cloudera, Inc. All rights reserved. MR vs Spark vs Tez, 320GB
  • 28. 28© Cloudera, Inc. All rights reserved. MR vs Spark vs Tez, 320GB DPP helps Tez
  • 29. 29© Cloudera, Inc. All rights reserved.  Prune partitions at runtime  Can dramatically improve performance when tables are joined on partitioned columns  To be implemented for Hive on Spark Benchmarks - Dynamic Partition Pruning
  • 30. 30© Cloudera, Inc. All rights reserved. Spark vs Tez vs Tez w/o DPP, 320GB
  • 31. 31© Cloudera, Inc. All rights reserved. MR vs Spark vs Tez, 4TB
  • 32. 32© Cloudera, Inc. All rights reserved. Spark vs Tez, 4TB
  • 33. 33© Cloudera, Inc. All rights reserved. Spark vs Tez, 4TB DPP helps Tez
  • 34. 34© Cloudera, Inc. All rights reserved. Spark vs Tez vs Tez w/o DPP, 4TB
  • 35. 35© Cloudera, Inc. All rights reserved. Spark vs Tez, 4TB Spark is faster
  • 36. 36© Cloudera, Inc. All rights reserved. Spark vs Tez, 4TB Tez is faster
  • 37. 37© Cloudera, Inc. All rights reserved. BenchMark - Summary  Spark is as fast as or faster than Tez on many queries (Q28, Q82, Q88, etc.)  Dynamic partition pruning helps Tez a lot on certain queries (Q3, Q15, Q19). Without DPP for Tez, Spark is close or even faster.  Spark is slower on certain queries (common join, Q84) than Tez. Investigation and improvement is on the way.  Bigger dataset seems helping Spark more.  Spark will likely be faster after DPP is implemented.
  • 38. 38© Cloudera, Inc. All rights reserved. Q&A