Making sense of performance
and identifying stragglers in
Data Analytics Framework
CSCI 8780 Advanced Distributed Systems
Manish Ranjan and Narita Pandhe
Introduction
- Large-scale data analytics has become widespread
- Much research is devoted to improving the performance of data analytics frameworks
- BUT comparatively little effort has been spent on identifying the performance bottlenecks!
More resource efficient Faster
Experiments
What Cluster Configuration did we use?
- 1 master, 6 slaves
- Master config:
  - 64-bit
  - 8 GB RAM
  - 2 cores
  - 50 GB SSD
- Slave config (each):
  - 64-bit
Benchmarking the NameNode first
To test the NameNode hardware and configuration first: NNBench
What it does:
Generates a large number of HDFS-related requests
Why:
To put high HDFS management stress on the NameNode
How:
Simulates requests for creating, reading, renaming and deleting files on HDFS
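The four operation types NNBench exercises can be illustrated with a toy local analogue. This is only a sketch: it runs against a temporary directory on the local filesystem, not HDFS, and does not reproduce NNBench itself. The function name `metadata_stress` and the default file count are assumptions made for illustration.

```python
import os
import tempfile
import time


def metadata_stress(num_files=200):
    """Time create, read, rename and delete operations on many small files."""
    timings = {}
    with tempfile.TemporaryDirectory() as base:
        paths = [os.path.join(base, f"file_{i}") for i in range(num_files)]
        renamed = [p + ".renamed" for p in paths]

        # Create: one tiny file per path, so metadata work dominates.
        start = time.perf_counter()
        for p in paths:
            with open(p, "w") as fh:
                fh.write("x")
        timings["create"] = time.perf_counter() - start

        # Read: open and read every file back.
        start = time.perf_counter()
        for p in paths:
            with open(p) as fh:
                fh.read()
        timings["read"] = time.perf_counter() - start

        # Rename: a pure metadata operation.
        start = time.perf_counter()
        for src, dst in zip(paths, renamed):
            os.rename(src, dst)
        timings["rename"] = time.perf_counter() - start

        # Delete: remove the renamed files.
        start = time.perf_counter()
        for p in renamed:
            os.remove(p)
        timings["delete"] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for op, seconds in metadata_stress().items():
        print(f"{op}: {seconds:.4f} s")
```

On a real cluster the same four phases would go through the NameNode, which is where the stress lands; the local timings here say nothing about HDFS performance.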
What Workload did we use?
- TeraSort benchmark suite
- Goal of TeraSort: sort 1TB of data (or any other amount of data you want) as
fast as possible.
- Limited by our cluster configuration, we performed several experiments with data sizes of
1GB, 5GB and 10GB.
- The TeraSort benchmark can be used to iron out your Hadoop configuration
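As a rough sketch of what these data sizes mean in TeraSort terms: the benchmark works on 100-byte records with a 10-byte key, so each size maps to a record count. The helper names `rows_for` and `sort_records` are hypothetical, decimal gigabytes (10^9 bytes) are assumed, and the in-memory sort only illustrates the key ordering, not the distributed job.

```python
import os

RECORD_BYTES = 100  # one TeraSort record: 10-byte key followed by a 90-byte value
KEY_BYTES = 10


def rows_for(size_gb, bytes_per_gb=10**9):
    """Number of 100-byte records needed for size_gb (decimal GB assumed)."""
    return size_gb * bytes_per_gb // RECORD_BYTES


def sort_records(data):
    """Split a byte string into 100-byte records and sort them by key."""
    records = [data[i:i + RECORD_BYTES] for i in range(0, len(data), RECORD_BYTES)]
    return sorted(records, key=lambda record: record[:KEY_BYTES])


if __name__ == "__main__":
    for size_gb in (1, 5, 10):
        print(f"{size_gb} GB -> {rows_for(size_gb):,} records")
    sample = os.urandom(RECORD_BYTES * 1000)  # random bytes, not real TeraGen output
    ordered = sort_records(sample)
    print(len(ordered), "sample records sorted")
```

Under the decimal assumption, 1GB corresponds to 10,000,000 records, 5GB to 50,000,000 and 10GB to 100,000,000.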
Hadoop
Nodes: i-6c76c1da (M), i-40684ef0 (s1), i-41684ef1 (s2), i-42684ef2 (s3), i-43684ef3 (s4), i-4e684efe (s5), i-4f684eff (s6)
[Figure] Red: s6, Dark Green: s4
[Figure] Observations for 10GB. Red: s6, Dark Green: s4
[Figure] Observations for 10GB. Red: s6, Dark Green: s4
[Figure] Identified Stragglers
Spark
[Figure] Orange: s2, Red: s6
[Figure] Hadoop and Spark. Red: s6, Bright Blue: s5, Orange: s2
Conclusions
- A straggler task spends an unusually long amount of time in a particular part of task
execution, e.g. DFS disk read time, shuffle write time, shuffle read time, or Java's garbage collection.
- It is usually not too hard to find a straggler for a specific execution; what is hard is to
find it consistently enough!
- We were nonetheless lucky enough to spot a few even in a cluster of mediocre strength, which
emphasizes the necessity of understanding the cluster meta info well.
- Since Spark:
  - often breaks jobs into many more tasks
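One way to approach the consistency problem noted above is to repeat the job and count how often each node is flagged. The sketch below assumes you already have, for each run, a list of nodes that hosted a straggler; the function names, the 0.5 threshold and the node labels are hypothetical, and the sample data is invented for illustration rather than taken from these experiments.

```python
from collections import Counter


def straggler_frequency(runs):
    """Return, per node, the fraction of runs in which it was flagged."""
    # set() ensures a node counts at most once per run.
    counts = Counter(node for flagged in runs for node in set(flagged))
    total = len(runs)
    return {node: n / total for node, n in counts.items()}


def consistent_stragglers(runs, min_fraction=0.5):
    """Nodes flagged in at least min_fraction of the runs, sorted by name."""
    freq = straggler_frequency(runs)
    return sorted(node for node, f in freq.items() if f >= min_fraction)


if __name__ == "__main__":
    # Invented example data: four runs, each listing the flagged nodes.
    runs = [["node-a", "node-b"], ["node-a"], ["node-c", "node-a"], ["node-b"]]
    print(straggler_frequency(runs))
    print(consistent_stragglers(runs))
```

A node that shows up in most runs is a better candidate for a hardware or configuration problem than one flagged once.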
References
- Making Sense of Performance in Data Analytics Frameworks,
Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun (UC Berkeley, ICSI,
VMware, Seoul National University)
- No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics,
https://www.cs.duke.edu/starfish/files/socc11-cluster-sizing.pdf
- http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
- https://github.com/ehiggs/spark-terasort
- aws.amazon.com


Editor's Notes

  1. A straggler is a task whose inverse progress rate is greater than 1.5× the median inverse progress rate for its stage. Many stragglers can be explained by the fact that the straggler task spends an unusually long amount of time in a particular part of task execution, e.g. DFS disk read time, shuffle write time, shuffle read time, or Java's garbage collection.
  2. efe (i-4e684efe, s5): almost always lagging behind
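The straggler definition in note 1 can be sketched in a few lines. This is a minimal illustration, assuming inverse progress rate means task duration divided by input bytes read and that every task read a non-zero amount of input; the function names and the sample task tuples are hypothetical, not measurements from this deck.

```python
from statistics import median


def inverse_progress_rate(duration_s, input_bytes):
    """Seconds spent per byte of input; larger means slower progress."""
    return duration_s / input_bytes


def find_stragglers(tasks, threshold=1.5):
    """Return ids of tasks whose rate exceeds threshold x the stage median.

    tasks: list of (task_id, duration_s, input_bytes) for a single stage.
    """
    rates = {tid: inverse_progress_rate(d, b) for tid, d, b in tasks}
    cutoff = threshold * median(rates.values())
    return [tid for tid, rate in rates.items() if rate > cutoff]


if __name__ == "__main__":
    # Invented example: four tasks in one stage, each reading 100 bytes.
    stage = [("t1", 10.0, 100), ("t2", 12.0, 100), ("t3", 11.0, 100), ("t4", 30.0, 100)]
    print(find_stragglers(stage))
```

Normalizing by input size keeps a task from being flagged merely because it was handed more data than its peers.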