Spark on YARN
Best practices
Adarsh Pannu
IBM Analytics Platform
DRAFT: This is work in progress. Please send comments to adarshrp@us.ibm.com
Spark and Cluster Management
Spark supports four different cluster managers:
●  Local: Useful only for development
●  Standalone: Bundled with Spark, doesn’t play well with other applications, fine for PoCs
●  YARN: Highly recommended for production
●  Mesos: Not supported in BigInsights
Each mode has a similar "logical" architecture, although the physical details differ in terms of which
processes and threads are launched, and where.
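The choice of cluster manager typically surfaces as the master URL passed at submission time. A sketch with illustrative hostnames, ports, and jar names:

```
spark-submit --master local[4]          app.jar   # local: development only
spark-submit --master spark://m1:7077   app.jar   # standalone
spark-submit --master yarn-client       app.jar   # YARN, Spark 1.x client deploy mode
```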
Spark Cluster Architecture: Logical View
Driver runs the main() function of the application. This can run outside (“client”) or
inside the cluster (“cluster”)
SparkContext is the main entry point for Spark functionality. Represents the
connection to a Spark cluster.
Executor is a JVM that runs tasks and keeps data in memory or disk storage across
them. Each application has its own executors spread across a cluster.
[Diagram: Driver Program (with SparkContext) → Cluster Manager → Executors, each holding a cache and running tasks]
Spark: What’s Inside an Executor?
[Diagram: a single Executor JVM. Three tasks process partitions from two different RDDs; cached partitions from yet another RDD sit in memory; free task slots ("cores") remain; internal threads handle shuffle, transport, GC, and other system work]
Spark: Standalone Cluster Manager
[Diagram: a Master JVM plus one Worker JVM per machine (Machine 1, Machine 2); each Worker launches Executor JVMs on behalf of the connected clients (Client 1, Client 2). All orange boxes are JVMs; deploy mode = "client"; inter-process communication not shown]
Standalone Mode: Configuration
         Per Worker Node        Per Application     Per Executor
CPU      SPARK_WORKER_CORES     spark.cores.max     (n/a)
Memory   SPARK_WORKER_MEMORY    (n/a)               spark.executor.memory

SPARK_WORKER_CORES: # of cores to give to underlying Executors (default: all available cores)
SPARK_WORKER_MEMORY: total memory to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB)
spark.cores.max: maximum # of cores to request for the application across the cluster (default: all available cores)
spark.executor.memory: memory per executor (default: 512m)
Standalone mode uses a FIFO scheduler. As applications launch, it tries to balance resource
consumption across the cluster. Strangely, cores are specified per application, yet memory is per
executor!
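For instance, these settings might be combined as follows (all values illustrative, not recommendations):

```
# conf/spark-env.sh -- per worker node
SPARK_WORKER_CORES=16       # cores offered by this worker
SPARK_WORKER_MEMORY=32g     # memory offered by this worker

# conf/spark-defaults.conf -- per application / per executor
spark.cores.max        48   # app-wide core cap, across the cluster
spark.executor.memory  8g   # heap per executor JVM
```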
Spark on YARN: Architecture
[Diagram: a Client on Machine 0 submits to the YARN Resource Manager; Node Managers on Machines 1 and 2 host containers running the Spark Application Master and the Executors. All orange boxes are JVMs; inter-process communication not shown]
Spark Configuration
Spark has scores of configuration options:
•  For many options, the defaults generally work well
•  However, there are some critical “knobs” that should be carefully tuned
Several settings are cluster manager specific. When running Spark on YARN, you must examine:
•  YARN-specific settings: scheduler type and queues
•  Spark specific settings for YARN: # of executors, per-executor memory and cores, and more
Other general techniques will improve your applications on any cluster manager. For example:
•  Java object serialization schemes (Kryo vs Java)
•  Proper partitioning and parallelism levels
•  On-disk data formats (Parquet vs AVRO vs JSON vs ...)
•  And many more ... (to be covered elsewhere)
Spark on YARN: Managing queues
Your cluster may serve different applications/users, each with differing expectations:
•  Batch jobs could possibly wait but interactive users may not
•  Tight SLAs need to be honored, often at the expense of others
There may be more than one instance of the same type of application, and yet, they may need to be
treated differently. E.g. different Spark jobs may have differing needs.
Step 1: Divide up your cluster resources into “queues” that are organized by target needs:
•  Choose scheduling strategy: Capacity vs. Fair.
•  Capacity scheduler is best for applications that need guarantees on availability of cluster resources
(although at the cost of elasticity)
•  Fair scheduler is best for applications that want to share resources in some pre-determined
proportions.
•  (This aspect is not covered in this document as it’s adequately documented elsewhere)
Step 2: Configure resources for Spark jobs based on the queue capacities.
•  Described in the next slide
Step 3: In your Spark application, designate the right queue via --queue or by setting "spark.yarn.queue"
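For example, targeting a hypothetical queue named "analytics" (the queue name is illustrative):

```
spark-submit --master yarn --queue analytics ...
# equivalently:
spark-submit --master yarn --conf spark.yarn.queue=analytics ...
```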
Spark on YARN: Basic Configuration
                  YARN Settings (Per Node, not Per Queue)    Spark Settings (Per Executor)
Executor Count    (n/a)                                      --num-executors OR spark.executor.instances
CPU               yarn.nodemanager.resource.cpu-vcores       --executor-cores OR spark.executor.cores
Memory            yarn.nodemanager.resource.memory-mb        --executor-memory OR spark.executor.memory

You need to specify the Spark settings explicitly.

Spark internally adds an overhead to spark.executor.memory to account for off-heap JVM usage:
overhead = MAX(384 MB, 10% of spark.executor.memory) // As of Spark 1.4
YARN further adjusts the requested container size:
1.  Ensures memory is a multiple of yarn.scheduler.minimum-allocation-mb. Despite its name, this
isn't merely a lower bound. CAUTION: Setting yarn.scheduler.minimum-allocation-mb too
high can over-allocate memory because of rounding up.
2.  Ensures the request size is bounded by yarn.scheduler.maximum-allocation-mb
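The two adjustments above can be sketched in a few lines. This is a back-of-the-envelope helper, not Spark's exact implementation; the function name and default allocation bounds are illustrative assumptions:

```python
import math

def yarn_container_size_mb(executor_memory_mb, min_alloc_mb=1024, max_alloc_mb=65536):
    """Approximate the container size YARN grants for a Spark executor
    (Spark 1.4-era formula; a sketch, not the exact implementation)."""
    # Off-heap overhead added by Spark: MAX(384 MB, 10% of executor memory)
    overhead = max(384, 0.10 * executor_memory_mb)
    requested = executor_memory_mb + overhead
    # YARN rounds the request up to a multiple of the minimum allocation ...
    granted = math.ceil(requested / min_alloc_mb) * min_alloc_mb
    # ... and caps it at the maximum allocation.
    return min(granted, max_alloc_mb)

# A 16 GB executor actually occupies ~18 GB of cluster memory:
print(yarn_container_size_mb(16 * 1024))  # → 18432
```

Note how a small 1 GB executor still pays the flat 384 MB overhead and then rounds up to two whole 1024 MB allocation units, i.e. a 100% memory tax.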
Spark on YARN: Memory Usage Inside an Executor
An executor's heap is carved into three regions (annotations from the slide diagram):

•  spark.shuffle.memoryFraction (default = 0.2, i.e. 20%): used for shuffles. Increase this
for shuffle-intensive applications wherein spills happen often.
•  spark.storage.memoryFraction (default = 0.6, i.e. 60%): used for cached RDDs, useful
if .cache() or .persist() is called.
•  Application objects: this is what is left after setting the other two. If
you're seeing OOMs in your code, you need more memory here!

Guideline: Stick with defaults, and check execution statistics to tweak settings.
You may need to tweak these Executor memory breakdowns too.
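The resulting heap split is simple arithmetic. A sketch, with a hypothetical helper name, using the Spark 1.x default fractions:

```python
def executor_memory_breakdown_mb(executor_memory_mb,
                                 shuffle_fraction=0.2,   # spark.shuffle.memoryFraction
                                 storage_fraction=0.6):  # spark.storage.memoryFraction
    """Rough heap split for a Spark 1.x executor (helper name is illustrative)."""
    shuffle = executor_memory_mb * shuffle_fraction
    storage = executor_memory_mb * storage_fraction
    # Application objects get whatever is left over.
    app = executor_memory_mb - shuffle - storage
    return {"shuffle": shuffle, "storage": storage, "app_objects": app}

# For a 16 GB executor with default fractions:
# shuffle ≈ 3276.8 MB, storage ≈ 9830.4 MB, app objects ≈ 3276.8 MB
breakdown = executor_memory_breakdown_mb(16 * 1024)
```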
Spark on YARN: Sizing up Executors
How many Executors? How many cores? How much memory?
Setting spark.executor.memory
•  Size up this number first.
•  Don't use excessively large executors, as GC pauses become a problem.
•  Don't use overly skinny executors, since JVM overhead becomes proportionately higher.
•  10 GB <= spark.executor.memory <= 48 GB could be a good guideline.
•  Choose towards the higher end when working with bigger data partitions, using large broadcast
variables, etc.
Setting spark.executor.instances
•  Given spark.executor.memory, compute spark.executor.instances to saturate available memory.
•  In reality, spark.executor.memory and spark.executor.instances are computed hand-in-hand.
•  Don't forget to account for overheads (daemons, application master, driver, etc.)
•  spark.executor.instances ~ #nodes * (yarn.nodemanager.resource.memory-mb * queue-fraction /
spark.executor.memory)
Setting spark.executor.cores
•  Over-request cores by 2 to 3 times the number of actual cores in your cluster.
•  Why? Not all tasks are CPU bound at the same time.
Spark on YARN: Sizing up Executors (Example)
Sample Cluster Configuration:
8 nodes, 32 cores/node (256 total), 128 GB/node (1024 GB total)
Running YARN Capacity Scheduler
Spark queue has 50% of the cluster resources
Naive Configuration:
spark.executor.instances = 8 (one Executor per node)
spark.executor.cores = 32 * 0.5 = 16 => Undersubscribed
spark.executor.memory = 128 GB * 0.5 = 64 GB => GC pauses
Better Configuration:
spark.executor.memory = 16 GB (just as an example)
spark.executor.instances = 8 * (128 GB * 0.5 / 16 GB) = 32 total
spark.executor.cores = total-available-cores * over-subscription-factor / spark.executor.instances
= (256 * 0.5) * 2.5 / 32 = 10
These calculations aren’t perfect -- they don’t account for overheads, for the Application Master
container, etc. But hopefully you get the idea ☺
Different applications dictate different settings. EXPERIMENT and FINE TUNE!
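The worked example above reduces to two formulas. A sketch (function name and over-subscription default are illustrative; it still ignores the overheads noted above):

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   queue_fraction, executor_memory_gb, oversub=2.5):
    """Back-of-the-envelope executor sizing from the worked example."""
    # Instances: saturate the queue's share of memory across all nodes.
    instances = int(nodes * (mem_per_node_gb * queue_fraction / executor_memory_gb))
    # Cores: over-request by 2-3x, since not all tasks are CPU-bound at once.
    cores = int(cores_per_node * nodes * queue_fraction * oversub / instances)
    return instances, cores

# 8 nodes, 32 cores and 128 GB each, Spark queue = 50%, 16 GB executors:
print(size_executors(8, 32, 128, 0.5, 16))  # → (32, 10)
```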
Spark on YARN: Exploiting Data Locality
•  Spark tries to execute tasks on nodes such that there will be minimal data movement (data locality)
!  Loss of data locality = suboptimal performance
•  These tasks are run on executors, which are (usually) launched when a SparkContext is spawned,
and well before Spark knows what data will be “touched.”
•  Problem: How does Spark tell YARN where to launch Executors?
•  Your application can tell Spark the list of nodes that hold data (“preferred locations”). Using a simple
API, you can supply this information when instantiating a SparkContext
•  See SparkContext constructor (argument preferredNodeLocationData)
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
val hdfspath = "hdfs://..."
val sc = new SparkContext(sparkConf,
  InputFormatInfo.computePreferredLocations(
    Seq(new InputFormatInfo(conf,
      classOf[org.apache.hadoop.mapred.TextInputFormat],
      hdfspath))))
Spark on YARN: Dynamic Allocation
•  Prior to Release 1.3, Spark acquired all executors at application startup and held onto them for the
lifetime of an application.
•  Starting Release 1.3, Spark supports “dynamic allocation” of executors. This allows applications to
launch executors when more tasks are queued up, and release resources when the application is
idle.
•  Ideally suited for interactive applications that may see user down-time.
•  Major caveat: Spark may release executors with cached RDDs! Ouch! So if your application uses
rdd.cache() or rdd.persist() to materialize expensive computations, you may not want to use dynamic
allocation for that application.
•  On the other hand, you could consider “caching” expensive computations in HDFS.
Spark on YARN: Dynamic Allocation settings
spark.dynamicAllocation.enabled (default: false)
  Set to true to get elasticity.
spark.dynamicAllocation.minExecutors (default: 0)
  Lower bound on # executors. Leave as is.
spark.dynamicAllocation.maxExecutors (default: infinity)
  Upper bound on # executors. Set based on the worksheet in the previous slide.
spark.dynamicAllocation.executorIdleTimeout (default: 600 secs, i.e. 10 mins)
  How long to wait before giving up idle executors. Consider a lower value, say 1 minute.
spark.dynamicAllocation.schedulerBacklogTimeout /
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default: 5 secs)
  How fast to launch new executors to meet incoming demand. Executors are launched in waves
  of exponentially increasing numbers. Leave as is.
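Putting it together, a spark-defaults.conf fragment enabling dynamic allocation might look like this (the maxExecutors value is illustrative and should come from the sizing worksheet; note that dynamic allocation on YARN also requires the external shuffle service):

```
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.maxExecutors         32
spark.dynamicAllocation.executorIdleTimeout  60s
spark.shuffle.service.enabled                true   # required for dynamic allocation on YARN
```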
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 

Spark on YARN

6. Standalone Mode: Configuration

             Per Worker Node       Per Application     Per Executor
CPU          SPARK_WORKER_CORES    spark.cores.max
Memory       SPARK_WORKER_MEMORY                       spark.executor.memory

SPARK_WORKER_CORES: # of cores to give to underlying Executors (default: all available cores)
SPARK_WORKER_MEMORY: total memory to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB)
spark.cores.max: maximum # of cores to request for the application across the cluster (default: all available cores)
spark.executor.memory: memory per executor (default: 512m)

Standalone mode uses a FIFO scheduler. As applications launch, it tries to balance resource consumption across the cluster. Strangely, cores are specified per application, yet memory is per executor!
7. Spark on YARN: Architecture

[Diagram: a Client on Machine 0 talks to the YARN Resource Manager; Node Managers on Machines 1 and 2 host containers running the Spark Application Master and the Executors.]
•  Inter-process communication not shown.
•  All orange boxes are JVMs.
8. Spark Configuration

Spark has scores of configuration options:
•  For many options, the defaults generally work fine.
•  However, there are some critical "knobs" that should be carefully tuned.

Several settings are cluster-manager specific. When running Spark on YARN, you must examine:
•  YARN-specific settings: scheduler type and queues
•  Spark-specific settings for YARN: # of executors, per-executor memory and cores, and more

Other general techniques will improve your applications on any cluster manager. For example:
•  Java object serialization schemes (Kryo vs. Java)
•  Proper partitioning and parallelism levels
•  On-disk data formats (Parquet vs. Avro vs. JSON vs. ...)
•  And many more ... (to be covered elsewhere)
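As one concrete illustration of a general setting from the list above, Kryo serialization can be enabled in spark-defaults.conf (or via --conf on spark-submit). A minimal sketch:

```
# spark-defaults.conf (illustrative fragment)
spark.serializer    org.apache.spark.serializer.KryoSerializer
```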
9. Spark on YARN: Managing Queues

Your cluster may serve different applications/users, each with differing expectations:
•  Batch jobs could possibly wait, but interactive users may not.
•  Tight SLAs need to be honored, often at the expense of others.

There may be more than one instance of the same type of application, and yet they may need to be treated differently. E.g., different Spark jobs may have differing needs.

Step 1: Divide up your cluster resources into "queues" organized by target needs:
•  Choose a scheduling strategy: Capacity vs. Fair.
•  The Capacity scheduler is best for applications that need guarantees on availability of cluster resources (although at the cost of elasticity).
•  The Fair scheduler is best for applications that want to share resources in some pre-determined proportions.
•  (This aspect is not covered in this document as it's adequately documented elsewhere.)

Step 2: Configure resources for Spark jobs based on the queue capacities (described in the next slide).

Step 3: In your Spark application, designate the right queue via --queue or by setting "spark.yarn.queue".
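Step 3 above can be sketched as a spark-submit invocation. The queue name, class, and jar below are hypothetical, purely for illustration:

```
spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue spark_interactive \
  --class com.example.MyApp \
  my-app.jar
```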
10. Spark on YARN: Basic Configuration

                 YARN Settings (Per Node, not Per Queue)    Spark Settings (Per Executor)
Executor Count                                              --num-executors OR spark.executor.instances
CPU              yarn.nodemanager.resource.cpu-vcores       --executor-cores OR spark.executor.cores
Memory           yarn.nodemanager.resource.memory-mb        --executor-memory OR spark.executor.memory

You need to specify the Spark settings yourself.

Spark internally adds an overhead to spark.executor.memory to account for off-heap JVM usage:
overhead = MAX(384 MB, 10% of spark.executor.memory) // As of Spark 1.4

YARN further adjusts the requested container size:
1.  It ensures memory is a multiple of yarn.scheduler.minimum-allocation-mb. Despite its name, this isn't merely a minimum bound. CAUTION: setting yarn.scheduler.minimum-allocation-mb too high can over-allocate memory because of rounding up.
2.  It ensures the request size is bounded by yarn.scheduler.maximum-allocation-mb.
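The overhead and rounding rules above can be sketched in a few lines of Python. The default minimum/maximum allocation values below are illustrative assumptions, not your cluster's actual settings:

```python
import math

def yarn_container_mb(executor_memory_mb,
                      min_allocation_mb=1024,    # assumed yarn.scheduler.minimum-allocation-mb
                      max_allocation_mb=65536):  # assumed yarn.scheduler.maximum-allocation-mb
    """Sketch of the container size YARN grants for a requested
    spark.executor.memory, per the Spark 1.4 rules described above."""
    # Off-heap overhead: the larger of 384 MB and 10% of executor memory.
    overhead = max(384, int(0.10 * executor_memory_mb))
    requested = executor_memory_mb + overhead
    # YARN rounds the request up to a multiple of the minimum allocation...
    granted = math.ceil(requested / min_allocation_mb) * min_allocation_mb
    # ...and bounds it by the maximum allocation.
    return min(granted, max_allocation_mb)
```

For example, requesting a 4 GB executor (4096 MB) incurs a 409 MB overhead and, after rounding up to a multiple of 1024 MB, yields a 5 GB container -- noticeably more than asked for, which is the over-allocation the CAUTION above warns about.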
11. Spark on YARN: Memory Usage Inside an Executor

You may need to tweak the Executor memory breakdown too:

spark.shuffle.memoryFraction (default = 0.2, i.e. 20%): Used for shuffles. Increase this for shuffle-intensive applications wherein spills happen often.
spark.storage.memoryFraction (default = 0.6, i.e. 60%): Used for cached RDDs; useful if .cache() or .persist() is called.
App objects: This is the memory for application objects. It is what is left after setting the other two. If you're seeing OOMs in your code, you need more memory here!

Guideline: Stick with the defaults, and check execution statistics to decide whether to tweak these settings.
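A quick sketch of how those fractions carve up an executor's heap. This is a simplification: the actual accounting also applies safety fractions not shown here:

```python
def executor_memory_regions(executor_memory_mb,
                            shuffle_fraction=0.2,   # spark.shuffle.memoryFraction default
                            storage_fraction=0.6):  # spark.storage.memoryFraction default
    """Rough split of executor memory among shuffle, storage (cached RDDs),
    and application objects, per the default fractions above."""
    shuffle = executor_memory_mb * shuffle_fraction
    storage = executor_memory_mb * storage_fraction
    # Application objects get whatever is left after the other two.
    app = executor_memory_mb - shuffle - storage
    return {"shuffle": shuffle, "storage": storage, "app": app}
```

With a 10 GB executor and the defaults, roughly 2 GB goes to shuffles, 6 GB to cached RDDs, and only 2 GB is left for your application's objects -- which is why OOMs in application code often call for more executor memory or smaller fractions elsewhere.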
12. Spark on YARN: Sizing up Executors

How many executors? How many cores? How much memory?

Setting spark.executor.memory:
•  Size up this number first.
•  Don't use excessively large executors, as GC pauses become a problem.
•  Don't use overly skinny executors, since JVM overhead becomes proportionately higher.
•  10 GB <= spark.executor.memory <= 48 GB could be a good guideline.
•  Choose towards the higher end when working with bigger data partitions, using large broadcast variables, etc.

Setting spark.executor.instances:
•  Given spark.executor.memory, compute spark.executor.instances to saturate available memory. In reality, spark.executor.memory and spark.executor.instances are computed hand-in-hand.
•  Don't forget to account for overheads (daemons, application master, driver, etc.).
•  spark.executor.instances ~ #nodes * (yarn.nodemanager.resource.memory-mb * queue-fraction / spark.executor.memory)

Setting spark.executor.cores:
•  Over-request cores by 2 to 3 times the number of actual cores in your cluster.
•  Why? Not all tasks are CPU-bound at the same time.
13. Spark on YARN: Sizing up Executors (Example)

Sample cluster configuration:
•  8 nodes, 32 cores/node (256 total), 128 GB/node (1024 GB total)
•  Running the YARN Capacity Scheduler; the Spark queue has 50% of the cluster resources

Naive configuration:
spark.executor.instances = 8 (one executor per node)
spark.executor.cores = 32 * 0.5 = 16 => undersubscribed
spark.executor.memory = 128 GB * 0.5 = 64 GB => GC pauses

Better configuration:
spark.executor.memory = 16 GB (just as an example)
spark.executor.instances = 8 * (128 GB * 0.5 / 16 GB) = 32 total
spark.executor.cores = total-available-cores * over-subscription-factor / spark.executor.instances = (256 * 0.5) * 2.5 / 32 = 10

These calculations aren't perfect -- they don't account for overheads, for the Application Master container, etc. But hopefully you get the idea ☺ Different applications dictate different settings. EXPERIMENT and FINE-TUNE!
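The worksheet above can be captured as a small helper. This mirrors the slide's arithmetic and, like it, deliberately ignores the Application Master container and daemon overheads:

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   queue_fraction, executor_memory_gb,
                   oversubscription=2.5):
    """Back-of-the-envelope executor sizing per the worksheet above.
    Returns (spark.executor.instances, spark.executor.cores)."""
    # Executors = nodes * (per-node memory available to the queue / executor size)
    executors = int(nodes * (mem_per_node_gb * queue_fraction / executor_memory_gb))
    # Cores available to the queue across the cluster, over-requested by 2-3x.
    total_cores = nodes * cores_per_node * queue_fraction
    cores_per_executor = int(total_cores * oversubscription / executors)
    return executors, cores_per_executor
```

Plugging in the sample cluster (8 nodes, 32 cores and 128 GB each, a 50% queue, 16 GB executors) reproduces the "better configuration": 32 executors with 10 cores each.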
14. Spark on YARN: Exploiting Data Locality

•  Spark tries to execute tasks on nodes such that there will be minimal data movement (data locality). Loss of data locality = suboptimal performance.
•  These tasks run on executors, which are (usually) launched when a SparkContext is spawned, well before Spark knows what data will be "touched."
•  Problem: How does Spark tell YARN where to launch executors?
•  Your application can tell Spark the list of nodes that hold data ("preferred locations"). Using a simple API, you can supply this information when instantiating a SparkContext.
•  See the SparkContext constructor (argument preferredNodeLocationData): https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

val hdfspath = "hdfs://..."
val sc = new SparkContext(sparkConf,
  InputFormatInfo.computePreferredLocations(
    Seq(new InputFormatInfo(conf,
      classOf[org.apache.hadoop.mapred.TextInputFormat], hdfspath))))
15. Spark on YARN: Dynamic Allocation

•  Prior to release 1.3, Spark acquired all executors at application startup and held onto them for the lifetime of the application.
•  Starting with release 1.3, Spark supports "dynamic allocation" of executors. This allows applications to launch executors when more tasks are queued up, and to release resources when the application is idle.
•  Ideally suited for interactive applications that might see user down-time.
•  Major caveat: Spark may release executors holding cached RDDs! Ouch! So if your application uses rdd.cache() or rdd.persist() to materialize expensive computations, you may not want to use dynamic allocation for that application.
•  On the other hand, you could consider "caching" expensive computations in HDFS.
16. Spark on YARN: Dynamic Allocation Settings

spark.dynamicAllocation.enabled (default: false): Set to true to get elasticity.
spark.dynamicAllocation.minExecutors (default: 0): Lower bound on # executors. Leave as is.
spark.dynamicAllocation.maxExecutors (default: infinity): Upper bound on # executors. Set based on the worksheet in the previous slide.
spark.dynamicAllocation.executorIdleTimeout (default: 600 secs, i.e. 10 mins): How long to wait before giving up idle executors. Set to a lower value, say 1 minute.
spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default: 5 secs): Control how new executors are launched to meet incoming demand. Executors are launched in waves of exponentially increasing numbers. Leave as is.
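Putting the suggestions above together, a hypothetical spark-defaults.conf fragment might read as follows. The maxExecutors value of 32 is borrowed from the earlier sizing example, and the idle timeout is lowered to one minute as suggested; note that dynamic allocation also requires the external shuffle service to be enabled:

```
# spark-defaults.conf (illustrative fragment)
spark.dynamicAllocation.enabled                true
spark.shuffle.service.enabled                  true
spark.dynamicAllocation.maxExecutors           32
spark.dynamicAllocation.executorIdleTimeout    60
```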