Apache Spark is a dynamic execution engine that can take relatively simple Scala code and turn it into complex, optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then explain why this translation is central to Spark's design and how the tight interplay between these components enables very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who understand these concepts will interact with Spark more effectively.
Apache Spark Performance is too hard. Let's make it easier
1. Spark performance is too hard, let's make it easier
Sean Suchter, CTO @ Pepperdata
2. Pepperdata does performance (for Big Data)
• 15 thousand production nodes
• 50 million jobs/year
• 200 trillion performance data points
3. Today’s talk will cover…
• How code translates to execution
• How to find common, known problems
• For the rest of the problems…
– Why debugging performance problems is hard
– Data elements needed for a complete view of application performance, drawn from separate tools
– Bringing these elements together in a single tool
4. Brief terminology about Spark
• An app contains multiple jobs
• A job contains multiple stages
• A stage contains multiple tasks
• Executors run tasks
5. Example App
A word count app:
val textFile = sc.textFile("hdfs:/dict.txt")           // (1)
val counts = textFile.flatMap(line => line.split(" ")) // (2)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:/wordcounts.txt")          // (3)
1. Declares input from external storage
2. Specifies transformations
3. Triggers an action
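A quick way to see that steps 1 and 2 only build a plan, and that nothing runs until the action in step 3, is to print the lineage Spark has recorded; a minimal sketch using the standard RDD API:

// Print the lineage (DAG) Spark recorded for the transformations; until an
// action such as saveAsTextFile runs, this plan is all that exists.
println(counts.toDebugString)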
6. Distributed Architecture
Spark executes a job using multiple machines.
[Diagram: the Spark driver process sends tasks to Spark executor processes 1 through N]
17. 3 Classes of Spark Heuristics
• Configuration Settings
• Simple Alarms on Stage/Job Failure
• Data-Dependent Tuning Suggestions
18. Configuration Heuristic
• Display some basic config settings for your app
• Complain if some settings are not explicitly set
• Recommend configuring an external shuffle service, especially if dynamic allocation is enabled (see the sketch below)
• These recommendations won't change over multiple runs of an application
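The external-shuffle-service recommendation maps to a pair of standard Spark settings; a minimal sketch of setting them explicitly (the memory and core values are illustrative assumptions, not recommendations):

import org.apache.spark.SparkConf

// Explicitly set the settings a configuration heuristic would check.
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.executor.memory", "4g")              // example value
  .set("spark.executor.cores", "4")                // example value
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")    // external shuffle service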
19. Stages and Jobs Heuristics
• Simple alarms showing stage and job failure rates (a minimal sketch follows below)
• Good for seeing when there's a problem
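A stage-failure alarm of this kind can be built on Spark's public listener API; a minimal sketch, in which the 10% threshold and the println alert are placeholder assumptions:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Count completed stages and flag a high failure rate.
class StageFailureAlarm extends SparkListener {
  private var completed = 0
  private var failed = 0
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    completed += 1
    if (event.stageInfo.failureReason.isDefined) failed += 1
    if (failed.toDouble / completed > 0.1)           // assumed threshold
      println(s"ALARM: $failed of $completed stages failed")
  }
}
// Register it on the SparkContext: sc.addSparkListener(new StageFailureAlarm)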
20. Executors Heuristic
• Looks at the distribution across executors of several different metrics (see the sketch below)
• Outliers in these distributions probably indicate:
– Suboptimal partitioning
– One or more slow executors due to external circumstances (cluster weather)
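You can eyeball those per-executor distributions yourself through Spark's monitoring REST API, served by the application UI; a quick sketch, assuming the default local UI address:

import scala.io.Source

// Fetch per-executor metrics (totalTasks, totalDuration, memoryUsed, ...)
// from the app's monitoring REST endpoint. localhost:4040 is the default
// UI address; adjust for your deployment.
val url = s"http://localhost:4040/api/v1/applications/${sc.applicationId}/executors"
println(Source.fromURL(url).mkString)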
21. Partitions Heuristic
• Ideally the data for each task will fit into the RAM available to that task.
• Sandy Ryza (formerly of Cloudera) has an excellent blog post on Spark tuning, which suggests a partition count of roughly:

(observed shuffle write) × (observed shuffle spill memory) × (spark.executor.cores)
───────────────────────────────────────────────────────────────────────────────────
(observed shuffle spill disk) × (spark.executor.memory) × (spark.shuffle.memoryFraction) × (spark.shuffle.safetyFraction)

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
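Plugging in numbers makes the formula concrete; a worked sketch with made-up observed values (take the real ones from the Spark UI's shuffle metrics; the fraction defaults below are the Spark 1.x values):

// All observed values below are invented for illustration.
val shuffleWriteGB     = 20.0  // observed shuffle write
val shuffleSpillMemGB  = 12.0  // observed shuffle spill (memory)
val shuffleSpillDiskGB =  3.0  // observed shuffle spill (disk)
val executorMemoryGB   =  8.0  // spark.executor.memory
val executorCores      =  4.0  // spark.executor.cores
val memoryFraction     =  0.2  // spark.shuffle.memoryFraction (Spark 1.x default)
val safetyFraction     =  0.8  // spark.shuffle.safetyFraction (Spark 1.x default)

val recommendedPartitions =
  (shuffleWriteGB * shuffleSpillMemGB * executorCores) /
  (shuffleSpillDiskGB * executorMemoryGB * memoryFraction * safetyFraction)
println(f"~$recommendedPartitions%.0f partitions")  // ≈ 250 with these values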
24. Pepperdata Application Profiler
• Benefits to our users:
– Provide simple answers to simple questions
– Combination of metrics for experts
– Simple actionable insights for all users
– Pepperdata support
• Why stay close to open source?
– Heuristics
28. Reason #1
Same external symptom ("too slow"), but many possible causes:
• code
• data
• configuration
• cluster weather
29. Reason #2
Existing tools provide limited visibility
• Spark Web UI is the most popular
– Good view of query execution plan (job/stages/DAG)
– Limited view of aggregate performance data
• Time series
– Ganglia, Ambari, CM, etc. provide time series data for the cluster (but not specific to Spark apps)
– Spark Sink metrics can be fed to InfluxDB and others, yielding partial Spark app metrics
• Code execution is not connected to resource consumption
• Load from other apps goes unaccounted for
30. 3 data elements form a complete picture of Spark application performance
1. Code execution plan
– Indicates which block of code is being executed, and where
2. Time series view
– Visual of the application's resource consumption
– Outliers in resource usage are very easy to detect
3. Cluster weather
– A view of all applications running on the cluster
50. Code Analyzer for Apache Spark
• Free during Early Access starting today
• Early Access is for development teams
• To learn more visit booth #101
• info@pepperdata.com
pepperdata.com/products/code-analyzer
51. Other performance tools mentioned
• Dr Elephant
– github.com/linkedin/dr-elephant
• Application Profiler
– www.pepperdata.com/products/application-profiler/
52. To recap
• Use heuristics to find known problems
• Execution plan + time series = powerful visualization
• Knowing cluster weather can prevent time wasted debugging performance "issues" that aren't the app's fault
53. Spark Summit Talk Plugs
Tuesday 11:40AM – Connect Code to Resource Consumption to Scale Your Production Spark Applications (Vinod @ Pepperdata)
Tuesday 12:50PM – Kubernetes SIG Big Data Birds-of-a-Feather session (many)
Tuesday 3:20PM – Apache Spark on Kubernetes (Anirudh @ Google, Tim @ Hyperpilot)
Wednesday 11:00AM – HDFS on Kubernetes – Lessons Learned (Kimoon @ Pepperdata)
Wednesday 11:00AM – Dr Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop (Carl @ LinkedIn, Simon @ Pepperdata)