Making sense of performance and identifying stragglers in Data Analytics Framework
1. Making sense of performance and identifying stragglers in Data Analytics Framework
CSCI 8780 Advanced Distributed Systems
Manish Ranjan and Narita Pandhe
2. Introduction
- Large-scale data analytics has become widespread
- Research has been devoted to improving the performance of data analytics frameworks
- BUT comparatively little effort has been spent on identifying the performance bottlenecks!
11. What Cluster Configuration did we use?
- 1 master, 6 slaves
- Master config:
  - 64-bit
  - 8 GB RAM
  - 2 cores
  - 50 GB SSD
- Slaves config (each):
  - 64-bit
12. First: Benchmarking the NameNode
To first test the NameNode hardware and configuration: NNBench
What it does:
Generates a large number of HDFS-related requests
Why it does it:
To put high HDFS management stress on the NameNode
How it does it:
Simulates requests for creating, reading, renaming and deleting files on HDFS
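To make the request pattern concrete, here is a minimal Python sketch of the same four kinds of operations (create, read, rename, delete) on many small files. It is not NNBench itself: it runs against a local temporary directory rather than HDFS, and the function name `metadata_stress` and the parameter `num_files` are hypothetical names chosen for illustration.

```python
import os
import tempfile
import time


def metadata_stress(root, num_files=100):
    """Create, read, rename and delete small files under root.

    Returns a dict mapping each operation name to the elapsed seconds.
    Illustrative stand-in only: it exercises a local filesystem, not a NameNode.
    """
    paths = [os.path.join(root, f"file_{i}") for i in range(num_files)]
    renamed = [p + ".renamed" for p in paths]
    timings = {}

    start = time.perf_counter()
    for p in paths:
        with open(p, "wb") as f:
            f.write(b"x")  # tiny payload: the cost is dominated by metadata work
    timings["create"] = time.perf_counter() - start

    start = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:
            f.read()
    timings["read"] = time.perf_counter() - start

    start = time.perf_counter()
    for src, dst in zip(paths, renamed):
        os.rename(src, dst)
    timings["rename"] = time.perf_counter() - start

    start = time.perf_counter()
    for p in renamed:
        os.remove(p)
    timings["delete"] = time.perf_counter() - start

    return timings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for op, secs in metadata_stress(root).items():
            print(f"{op}: {secs:.4f} s")
```

Because the files are tiny, almost all of the time goes into namespace operations rather than data transfer, which is the kind of load the benchmark is meant to place on the NameNode.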
13. What Workload did we use?
- TeraSort benchmark suite
- Goal of TeraSort: sort 1TB of data (or any other amount of data you want) as
fast as possible.
- Limited by our cluster configuration, we performed several experiments with data of size
1GB, 5GB and 10GB.
- The TeraSort benchmark can be used to iron out problems in your Hadoop configuration
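As a rough sizing aid for the experiments above: the Hadoop TeraGen data generator writes fixed 100-byte rows and takes a row count rather than a byte size. The sketch below converts the three data sizes into row counts. It assumes decimal gigabytes (10^9 bytes); the helper name `teragen_rows` is hypothetical, and other TeraSort ports may accept a size string directly instead.

```python
ROW_BYTES = 100  # TeraGen rows are 100 bytes: a 10-byte key plus 90 bytes of value


def teragen_rows(size_gb, bytes_per_gb=10**9):
    """Number of 100-byte rows needed for size_gb gigabytes (decimal GB assumed)."""
    return size_gb * bytes_per_gb // ROW_BYTES


if __name__ == "__main__":
    for size_gb in (1, 5, 10):
        print(f"{size_gb} GB -> {teragen_rows(size_gb):,} rows")
```

Under these assumptions, 1 GB corresponds to 10,000,000 rows, 5 GB to 50,000,000 rows and 10 GB to 100,000,000 rows.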
21. Conclusions
- A straggler task spends an unusually long amount of time in a particular part of task execution.
- It is usually not too hard to find a straggler for a specific execution; what is hard is to find one consistently!
- We were lucky enough to spot a few even on a modestly provisioned cluster, which emphasizes the need to understand the cluster meta info well.
E.g.: DFS disk read time, shuffle write time, shuffle read time, and Java’s garbage collection
- Since Spark:
- often breaks jobs into many more tasks
22. References
- Making Sense of Performance in Data Analytics Frameworks, Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun (UC Berkeley, ICSI, VMware, Seoul National University)
- No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics
https://www.cs.duke.edu/starfish/files/socc11-cluster-sizing.pdf
- http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
- https://github.com/ehiggs/spark-terasort
- aws.amazon.com
A straggler is a task with inverse progress rate greater than 1.5× the median inverse progress rate for the stage.
Many stragglers can be explained by the fact that the straggler task spends an unusually long amount of time in a particular part of task execution, e.g. DFS disk read time, shuffle write time, shuffle read time, or Java’s garbage collection.
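The 1.5× rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the experiments. It assumes the inverse progress rate of a task is its runtime divided by the bytes of input it read (the notes do not spell this out), and the function name `find_stragglers` and the sample task tuples are hypothetical.

```python
from statistics import median


def find_stragglers(tasks, threshold=1.5):
    """Return ids of tasks whose inverse progress rate exceeds
    threshold times the stage median.

    tasks: iterable of (task_id, runtime_seconds, input_bytes).
    Inverse progress rate is assumed to be runtime / input bytes.
    """
    rates = {
        task_id: runtime / input_bytes
        for task_id, runtime, input_bytes in tasks
        if input_bytes > 0  # skip tasks with no input to avoid division by zero
    }
    if not rates:
        return []
    cutoff = threshold * median(rates.values())
    return sorted(task_id for task_id, rate in rates.items() if rate > cutoff)


if __name__ == "__main__":
    # Made-up numbers for one stage: (task id, runtime in s, input in bytes)
    stage = [
        ("t0", 10.0, 128e6),
        ("t1", 11.0, 128e6),
        ("t2", 9.5, 128e6),
        ("t3", 30.0, 128e6),
        ("t4", 12.0, 64e6),
    ]
    print(find_stragglers(stage))  # ['t3', 't4']
```

Note that `t4` is flagged even though its runtime is close to the others: it read half as much data, so it made progress at roughly half the rate. Normalizing by input size is what separates this definition from a plain runtime threshold.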