Making sense of performance and identifying stragglers in Data Analytics Framework

A study of performance and stragglers in data analytics frameworks, run on an AWS cluster with Hadoop and Spark.

Making sense of performance
and identifying stragglers in
Data Analytics Framework
CSCI 8780 Advanced Distributed Systems
Manish Ranjan and Narita Pandhe

Introduction
- Large-scale data analytics has become widespread
- Much research is devoted to improving the performance of data analytics frameworks
- BUT comparatively little effort has been spent on identifying the performance bottlenecks!

What Cluster Configuration did we use?
- 1 master, 6 slaves
- Master config:
  - 64-bit
  - 8GB RAM
  - 2 cores
  - 50GB SSD
- Slaves config (each):
  - 64-bit

Benchmarking the namenode first
To test the namenode hardware and configuration first: NNBench
What it does:
Generates a large number of HDFS-related requests
Why:
To put high HDFS management stress on the namenode
How:
Simulates requests for creating, reading, renaming and deleting files on HDFS
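
The request mix described above can be pictured with a small local analogue. The sketch below is not NNBench and does not talk to HDFS: it assumes a temporary local directory as a stand-in, and the helper name `metadata_stress` and its parameters are hypothetical. It only runs the same four operation types (create, read, rename, delete) on many tiny files and times each phase.

```python
import os
import tempfile
import time


def metadata_stress(base_dir, num_files=100):
    """Time create/read/rename/delete over many tiny files.

    Local stand-in for the metadata-heavy load NNBench puts on a namenode.
    The function name and parameters are hypothetical, not part of Hadoop.
    """
    paths = [os.path.join(base_dir, f"file_{i}") for i in range(num_files)]
    renamed = [p + ".renamed" for p in paths]
    timings = {}

    # Create: one tiny file per path.
    start = time.perf_counter()
    for p in paths:
        with open(p, "w") as f:
            f.write("x")
    timings["create"] = time.perf_counter() - start

    # Read: open and read every file back.
    start = time.perf_counter()
    for p in paths:
        with open(p) as f:
            f.read()
    timings["read"] = time.perf_counter() - start

    # Rename: a pure metadata operation.
    start = time.perf_counter()
    for src, dst in zip(paths, renamed):
        os.rename(src, dst)
    timings["rename"] = time.perf_counter() - start

    # Delete: remove every renamed file.
    start = time.perf_counter()
    for p in renamed:
        os.remove(p)
    timings["delete"] = time.perf_counter() - start

    return timings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for op, secs in metadata_stress(d).items():
            print(f"{op:>6}: {secs:.4f} s")
```

The payload is kept to a single byte on purpose, so that the time goes into file bookkeeping rather than data transfer, which is the part of the system a namenode benchmark is meant to stress.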

What Workload did we use?
- TeraSort benchmark suite
- Goal of TeraSort: sort 1TB of data (or any other amount of data you want) as fast as possible.
- Limited by our cluster configuration, we ran several experiments with data sizes of 1GB, 5GB and 10GB.
- The TeraSort benchmark can also be used to iron out your Hadoop configuration
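
TeraGen, the data generator of the TeraSort suite, takes its size as a number of 100-byte rows rather than as bytes. A minimal sketch of that arithmetic for the three data sizes used here, assuming decimal gigabytes (1 GB = 10^9 bytes); the helper name `teragen_rows` is hypothetical.

```python
ROW_BYTES = 100  # each TeraGen row is 100 bytes


def teragen_rows(size_gb, bytes_per_gb=10**9):
    """Number of 100-byte rows needed for `size_gb` gigabytes of input.

    Assumes decimal gigabytes by default; pass bytes_per_gb=2**30 for GiB.
    """
    return size_gb * bytes_per_gb // ROW_BYTES


if __name__ == "__main__":
    for size_gb in (1, 5, 10):
        print(f"{size_gb:>2} GB -> {teragen_rows(size_gb):,} rows")
```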

[Slide 14 chart: Hadoop]
Nodes: i-6c76c1da (M), i-40684ef0 (s1), i-41684ef1 (s2), i-42684ef2 (s3), i-43684ef3 (s4), i-4e684efe (s5), i-4f684eff (s6)

[Slide 15 chart, same nodes. Red: s6, Dark Green: s4]

[Slide 16 chart, same nodes: Observations for 10GB. Red: s6, Dark Green: s4]

[Slide 17 chart, same nodes: Observations for 10GB. Red: s6, Dark Green: s4]

[Slide 18 chart, same nodes: Identified Stragglers]

[Slide 19 chart, same nodes: Spark. Orange: s2, Red: s6]

[Slide 20 charts: Hadoop and Spark. Red: s6, Bright Blue: s5, Orange: s2]

Conclusions
- A straggler task spends an unusually long time in a particular part of task execution.
- It is usually not too hard to find a straggler in a specific execution; what is hard is to observe it consistently!
- Still, we were lucky enough to spot a few even on a modest cluster, which emphasizes the need to understand the cluster's meta information well, e.g. DFS disk read time, shuffle write time, shuffle read time, and Java’s garbage collection.
- Since Spark:
  - often breaks jobs into many more tasks
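
One way to make the consistency point concrete is to count, over repeated runs, how often each node hosts a straggler. The sketch below assumes the straggler nodes of each run have already been identified; the function names, the 0.75 cut-off, and the node labels and run data are all invented for illustration and are not the measured results of this study.

```python
from collections import Counter


def straggler_frequency(runs):
    """Fraction of runs in which each node was flagged as a straggler.

    `runs` is a list with one set of node labels per run (hypothetical input
    format). Nodes that were never flagged do not appear in the result.
    """
    counts = Counter(node for flagged in runs for node in flagged)
    return {node: counts[node] / len(runs) for node in sorted(counts)}


def consistent_stragglers(runs, min_fraction=0.75):
    """Nodes flagged in at least `min_fraction` of the runs (assumed cut-off)."""
    freq = straggler_frequency(runs)
    return [node for node, frac in freq.items() if frac >= min_fraction]


# Invented example: four runs, with the nodes flagged in each one.
example_runs = [{"node-a"}, {"node-a", "node-b"}, {"node-a"}, {"node-c"}]

if __name__ == "__main__":
    print(straggler_frequency(example_runs))
    print(consistent_stragglers(example_runs))
```

A node that shows up once may just have been unlucky in that run; a node that crosses the cut-off across many runs is a better candidate for a hardware or configuration problem.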

References
- Making Sense of Performance in Data Analytics Frameworks, Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun (UC Berkeley, ICSI, VMware, Seoul National University)
- No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics, https://www.cs.duke.edu/starfish/files/socc11-cluster-sizing.pdf
- http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
- https://github.com/ehiggs/spark-terasort
- aws.amazon.com

Slide 3: More resource efficient, Faster

Slide 10: Experiments

Editor's Notes

A straggler is a task with an inverse progress rate greater than 1.5× the median inverse progress rate for the stage. Many stragglers can be explained by the task spending an unusually long time in a particular part of task execution, e.g. DFS disk read time, shuffle write time, shuffle read time, or Java’s garbage collection.
Node i-4e684efe (s5, "efe") was almost always lagging behind.
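
The straggler definition in these notes can be sketched in a few lines. The sketch assumes that a task's inverse progress rate is its duration divided by the amount of input data it read, and that per-phase times are available for each task; the `Task` record, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Task:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    duration_s: float
    input_bytes: int  # assumed to be greater than zero
    phases: dict = field(default_factory=dict)  # phase name -> seconds

    @property
    def inverse_progress_rate(self):
        # Assumed definition: seconds spent per byte of input read.
        return self.duration_s / self.input_bytes


def find_stragglers(tasks, threshold=1.5):
    """Tasks whose inverse progress rate exceeds threshold x the stage median."""
    stage_median = median(t.inverse_progress_rate for t in tasks)
    return [t for t in tasks if t.inverse_progress_rate > threshold * stage_median]


def dominant_phase(task):
    """Name of the phase in which the task spent the most time."""
    return max(task.phases, key=task.phases.get)


# Invented stage of four tasks with equal input sizes.
stage = [
    Task("t0", 10.0, 128_000_000, {"dfs_read": 3.0, "shuffle_write": 2.0, "gc": 0.5}),
    Task("t1", 11.0, 128_000_000, {"dfs_read": 3.5, "shuffle_write": 2.0, "gc": 0.5}),
    Task("t2", 12.0, 128_000_000, {"dfs_read": 4.0, "shuffle_write": 2.5, "gc": 0.5}),
    Task("t3", 30.0, 128_000_000, {"dfs_read": 4.0, "shuffle_write": 2.0, "gc": 18.0}),
]

if __name__ == "__main__":
    for t in find_stragglers(stage):
        print(t.task_id, dominant_phase(t))
```

Normalizing by input size keeps a task from being flagged only because it was handed more data than its peers; looking at the dominant phase afterwards is a crude way to see which part of execution the extra time went into.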
