
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production

2,910 views


It’s no secret that Apache Spark is becoming the successor to MapReduce for data processing in Hadoop. With its easy development, flexible API, and performance benefits, Spark is a powerful data processing engine that has quickly gained popularity within the community. Hive, on the other hand, continues to be the most widely used data warehouse/ETL engine, with large-scale adoption across enterprises. It is therefore imperative to enable Spark as the underlying execution engine for Hive, so that existing and future Hive workloads can seamlessly leverage the advantages of Spark.

With the recent release of Cloudera 5.7, we have delivered on this goal by adding support for Hive-on-Spark. Data engineers and ETL developers can now move their Hive workloads from MR to Spark seamlessly, benefiting from the advantages of Spark without any disruption on their end.

Join Santosh Kumar, Senior Product Manager at Cloudera, and Rui Li, Apache Hive committer and engineer at Intel, as we discuss:
  • An introduction to Spark and its advantages over MR
  • An introduction to Hive-on-Spark: goals and design principles
  • Migrating to HoS and a live demo
  • Configuring and tuning for batch workloads
  • What’s next for both tools

Published in: Software


  1. Faster Batch Processing with Hive-on-Spark
     Santosh Kumar | Cloudera
     Rui Li | Intel
     © Cloudera, Inc. All rights reserved.
  2. Agenda
     • What is Hive-on-Spark?
     • Using Hive-on-Spark
     • Performance Metrics
     • Configuration & Tuning
     • What’s Next?
     • Q&A
  3. Apache Spark: Flexible, in-memory data processing for Hadoop
     • Easy Development
       • Rich APIs for Scala, Java, and Python
       • Interactive shell
     • Flexible Extensible API
       • APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
     • Fast Batch & Stream Processing
       • In-memory processing and caching
  4. Spark Takes Advantage of Memory
     • Resilient Distributed Datasets (RDD)
       • In-memory data structure partitioned across a set of machines
       • Can fall back to disk when the data set does not fit in memory
       • Created by parallel transformations on data in stable storage
       • Provides fault tolerance through the concept of lineage
  5. Introduction
     • Enables Hive to use Spark as underlying execution engine
     • Motivations
       • Consolidation of Spark as execution engine
       • Better performance
       • Increased adoption of Hive (e.g. for Spark users)
     • Community effort by Cloudera, IBM, Intel, MapR, and others
  6. Choosing the Right SQL Engine: Know Your Audience, Know Your Use Case
     • Batch Processing
     • BI and SQL Analytics
     • Procedural Development
     [Graphic mapping engines to use cases; surviving labels: SQL, OR, Impala]
  7. Current State of Hive-on-Spark (HoS)
     • Fully supported production release in C5.7
     • Functional parity with Hive-on-MapReduce (HoMR)
     • Average 3x performance improvement vs HoMR
     • Automatic configuration and optimizations via Cloudera Manager
     • Strong early user base
     • Early commitment for future collaboration from Intel and others
  8. Design Principles
     • Minimize impact on existing code path
       • Minimizes functional and performance impact
       • Minimizes maintenance
       • Maximizes support for Hive features, current as well as future
     • Spark invoked only at execution layer
       • HoS produces a logical operator plan similar to HoMR's
       • Logical plan runs on low-level Spark primitives
       • Minimizes usage of advanced Spark primitives
  9. Getting Started with Hive-on-Spark
  10. Configuration
      • Minimal configuration needed
      • Via Cloudera Manager: Set “Spark on YARN Service” (internally sets spark.master=yarn-cluster)
      • Set hive.execution.engine=spark per service or query
      • Only yarn-cluster is supported
      • Cloudera Manager auto-configures most settings
      • Configuration & Tuning Guide available on Docs
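Switching the engine per query, as the Configuration slide describes, comes down to issuing a `SET hive.execution.engine=...` statement ahead of the query text. The sketch below is only an illustration: `with_engine` is a hypothetical helper that builds the statement string and does not connect to HiveServer2.

```python
# Hypothetical helper (not part of Hive): prefix a HiveQL query with a
# session-level engine setting so that it runs on the chosen engine.
VALID_ENGINES = {"mr", "tez", "spark"}  # values accepted by hive.execution.engine


def with_engine(query: str, engine: str = "spark") -> str:
    """Return the query preceded by a SET statement selecting the engine."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    body = query.strip().rstrip(";")  # normalize a trailing semicolon
    return f"SET hive.execution.engine={engine};\n{body};"


print(with_engine("SELECT COUNT(*) FROM status_updates"))
```

Setting the property once at the service level has the same effect for every session; the per-query form is handy for comparing one workload on both engines.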
  11. Performance: Avg. ~3X faster than Hive-on-MapReduce
      More Suitable
      • Complex workloads w/ multiple MR stages, e.g. filter followed by JOIN followed by GROUP BY
      • Disk-bound w/ multiple disk reads/writes
      • Workloads requiring mins to hours for completion
      Less Suitable
      • Simple workloads, e.g. select *
      • CPU-bound workloads, e.g. complex UDFs
      • Workloads typically requiring <1 min
  12. Query Execution: Background
      Input
      • status_updates(userid int, status string, ds string)
      • profiles(userid int, school string, gender int)
      Output
      • school_summary(school string, cnt int, ds string)
      • gender_summary(gender int, cnt int, ds string)
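The transcript lists only the table schemas, not the query itself. Assuming the example is a join of status_updates with profiles on userid for one day (ds), followed by one count per school and one per gender, the computation can be sketched in plain Python. The function name and the sample rows are made up for illustration.

```python
from collections import Counter


def summarize(status_updates, profiles, ds):
    """Join updates to profiles on userid, then count per school and per gender.

    status_updates: iterable of (userid, status, ds)
    profiles:       iterable of (userid, school, gender)
    Returns (school_summary, gender_summary), each a sorted list of
    (key, cnt, ds) rows matching the output schemas on the slide.
    """
    by_user = {userid: (school, gender) for userid, school, gender in profiles}
    school_counts, gender_counts = Counter(), Counter()
    for userid, _status, row_ds in status_updates:
        if row_ds != ds or userid not in by_user:
            continue  # filter on the day, inner-join on userid
        school, gender = by_user[userid]
        school_counts[school] += 1
        gender_counts[gender] += 1
    school_summary = sorted((s, c, ds) for s, c in school_counts.items())
    gender_summary = sorted((g, c, ds) for g, c in gender_counts.items())
    return school_summary, gender_summary


# Made-up sample rows.
updates = [(1, "hi", "2009-03-20"), (2, "yo", "2009-03-20"),
           (1, "again", "2009-03-20"), (3, "old", "2009-03-19")]
people = [(1, "MIT", 0), (2, "CMU", 1), (3, "MIT", 1)]
schools, genders = summarize(updates, people, "2009-03-20")
```

Both summaries come from the same joined rows. That sharing is why a multi-stage plan that writes the join result to disk and scans it again is costly.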
  13. Query Execution: MapReduce [diagram; stage labels: BEGINS, CONTINUES, CONTINUES, ENDS]
  14. Query Execution: MapReduce [diagram; stage labels: BEGINS, CONTINUES, CONTINUES, ENDS]
  15. Query Execution: MapReduce [diagram; stage labels: BEGINS, CONTINUES, CONTINUES, ENDS]
      • FileSinkOperator (disk write) and TableScanOperator (disk read) are very costly
  16. Query Execution: Hive-on-Spark, Costly Steps Removed [diagram; stage labels: BEGINS, CONTINUES, CONTINUES, ENDS]
  17. Query Execution: Hive-on-Spark, Costly Steps Removed [diagram; stage labels: BEGINS, CONTINUES, CONTINUES, ENDS]
  18. Optimization for Resource Management: Long-Lived Executors (LLE)
      • MR: Each query is an independent YARN application
      • Spark: Each SQL session is a long-lived YARN application
        • First query of a session spawns a YARN app
        • Subsequent queries re-use the same YARN app as well as its containers
        • Session disconnect shuts down the YARN app and releases container resources
  19. Long-Lived Executors Details
      • Hive user session submits a Spark application to YARN
      • Spark YARN application:
        • YARN containers: Spark executors live in YARN containers
        • YARN Application Master = RemoteDriver
          • Submits Spark ‘jobs’, aka Hive queries, to Spark executors
          • Connects back to HS2 to report job progress from Spark executors
      [Diagram: User1 and User2 connect to HiveServer2 as Session1 and Session2; in the YARN cluster, each session has its own AM (RemoteDriver1, RemoteDriver2) with containers (executors)]
  20. Configuration and Tuning Hive-on-Spark
  21. Spark Configuration
      • Size of executors
        • Bigger and fewer executors
          • Thread contention
          • GC pressure
        • Smaller and more executors
          • Less memory efficient
          • Bigger start-up overhead
  22. Spark Configuration
      • CPU
        • Around 5-7 cores per executor
      • Memory
        • Leave 10% for OS cache
        • Executor memory overhead
          • Tune by case
          • Can be heavily used by Netty
          • Usually 15% - 20%
        • Around 3GB per core
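The rules of thumb on this slide can be turned into a small worked calculation. The sketch below makes two assumptions the slide does not state: the overhead percentage is taken as a share of each executor's total container memory, and the example node (32 cores, 128 GB) is hypothetical. The dictionary keys are illustrative labels, not Spark property names.

```python
def size_executors(node_cores, node_mem_gb, cores_per_executor=6,
                   os_reserve=0.10, overhead_fraction=0.15):
    """Split one worker node into executors using the slide's rules of thumb."""
    executors = node_cores // cores_per_executor
    if executors < 1:
        raise ValueError("node has fewer cores than one executor needs")
    usable_gb = node_mem_gb * (1.0 - os_reserve)    # leave ~10% for the OS cache
    container_gb = usable_gb / executors            # heap + overhead per executor
    overhead_gb = container_gb * overhead_fraction  # assumption: share of the container
    heap_gb = container_gb - overhead_gb
    return {
        "executors_per_node": executors,
        "executor_cores": cores_per_executor,       # would feed spark.executor.cores
        "executor_memory_gb": round(heap_gb, 2),    # would feed spark.executor.memory
        "memory_overhead_gb": round(overhead_gb, 2),  # spark.yarn.executor.memoryOverhead
        "heap_gb_per_core": round(heap_gb / cores_per_executor, 2),
    }


# Hypothetical worker node: 32 cores, 128 GB RAM.
plan = size_executors(node_cores=32, node_mem_gb=128)
print(plan)
```

For this hypothetical node the heap works out to a little over 3 GB per core, in line with the "around 3GB per core" guidance. A real cluster also has to leave room for YARN and other services, which this sketch ignores.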
  23. Spark Configuration
      • Serialization
        • spark.serializer: Kryo performs better and is REQUIRED by HoS
        • spark.kryo.referenceTracking: disable to avoid a Java performance issue
      • Shuffle
        • spark.shuffle.compress
        • spark.shuffle.spill.compress
        • Trade CPU for I/O
        • Increase number of reducers
  24. Partitioning
      • Number of mappers
        • Inputformat
        • mapreduce.input.fileinputformat.split.maxsize
      • Number of reducers
        • hive.exec.reducers.bytes.per.reducer
        • mapreduce.job.reduces
        • HoS tends to launch more reducers
      • Merge small files
        • hive.merge.sparkfiles
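To see how hive.exec.reducers.bytes.per.reducer influences parallelism, here is a rough sketch of the usual estimate: input size divided by bytes per reducer, rounded up and capped. The cap corresponds to hive.exec.reducers.max, which the slide does not mention. This mirrors the well-known Hive-on-MapReduce heuristic; HoS may pick a different number (the slide notes it tends to launch more reducers), so treat this as an approximation rather than the engine's exact logic.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Estimate the reducer count from input size (simplified heuristic)."""
    if bytes_per_reducer <= 0 or max_reducers < 1:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))  # at least 1, at most the cap


# Example values chosen for illustration: 10 GiB of input, 256 MiB per reducer.
reducers = estimate_reducers(10 * 1024**3, 256 * 1024**2, max_reducers=1009)
```

Lowering bytes per reducer raises the reducer count. Setting mapreduce.job.reduces to a fixed value bypasses the estimate altogether.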
  25. Hive Configuration
      • General optimizations
        • Enable vectorization
        • Enable CBO
        • Map join auto conversion
        • Map-side aggregation
        • Etc.
  26. Hive Configuration
      • Map join
        • hive.auto.convert.join.noconditionaltask.size
        • HoS doesn’t support conditional map join yet
        • HoS uses raw data size as small table size, which is different from MR
        • hive.stats.collect.rawdatasize
      • Skew join
        • Compile time: same as MR
        • Runtime: HoS will split the original task at join
  27. Resource Allocation
      • Static allocation
        • spark.executor.instances
        • Won’t release executors until session is closed
        • Recommended for benchmarking
      • Dynamic allocation
        • spark.dynamicAllocation.enabled
        • spark.dynamicAllocation.initialExecutors
        • spark.dynamicAllocation.minExecutors
        • spark.dynamicAllocation.maxExecutors
        • Number of executors per Spark application scales up and down
        • Suited for multi-tenancy scenarios (multi-session)
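The two allocation modes can be expressed as small sets of session settings. The sketch below uses Spark's documented `spark.dynamicAllocation.*` property names. The helper functions are hypothetical, and the executor counts are placeholders rather than recommendations.

```python
def static_allocation(instances):
    """Fixed executor count, held until the session closes."""
    if instances < 1:
        raise ValueError("need at least one executor")
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(instances),
    }


def dynamic_allocation(minimum, initial, maximum):
    """Executor count that scales between minimum and maximum."""
    if not 0 <= minimum <= initial <= maximum:
        raise ValueError("expected 0 <= minimum <= initial <= maximum")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(minimum),
        "spark.dynamicAllocation.initialExecutors": str(initial),
        "spark.dynamicAllocation.maxExecutors": str(maximum),
    }


def as_set_statements(settings):
    """Render settings as HiveQL SET statements, one per property."""
    return [f"SET {key}={value};" for key, value in settings.items()]


for line in as_set_statements(dynamic_allocation(minimum=1, initial=2, maximum=8)):
    print(line)
```

Static allocation keeps benchmark runs comparable because the executor count never changes. Dynamic allocation lets idle sessions hand executors back to the cluster.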
  28. Resource Allocation
      • Pre-warm containers
        • hive.prewarm.enabled
        • spark.scheduler.maxRegisteredResourcesWaitingTime
        • spark.scheduler.minRegisteredResourcesRatio
        • Attempts to achieve better parallelism
        • Considerable delay for the start-up job
        • Not recommended for short-lived sessions
  29. Configuration and Tuning Summary
      • Number and size of executors are the most important determinants of performance
      • Resolve query performance problems and failures by allocating more executors with more CPU and RAM
        • spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.yarn.executor.memoryOverhead
      • Cloudera Manager takes care of most of the optimizations
      • Most Hive config settings are applicable to HoS, but a few have different semantics
      • See Config and Tuning Guide for details
  30. Roadmap
      • Additional Optimizations
        • Dynamic Partition Pruning
        • Vectorization support
        • Cost-Based Optimizer
        • Others: caching RDDs across queries, optimizing self join/union, etc.
      • Supportability Enhancements
        • Better support for debugging and logging
        • More informative stage description in WebUI
        • Others: improve Hue integration, additional metrics specific to HoS, etc.
      • Rebase to Spark 2.0 and Parquet 1.8
  31. More Information & Next Steps
      • Get Started: Download C5.7 at www.cloudera.com/downloads
      • Release Notes: www.cloudera.com/documentation/enterprise/latest/topics/rg_release_notes.html
      • Training Classes: university.cloudera.com
  32. Questions?
