Leveraging your Hadoop
cluster better
running efficient code at scale
Michael Kopp, Technology Strategist
Why am I giving this talk?
Effectiveness vs. Efficiency
• Effective: Adequate to accomplish a purpose; producing the intended or expected result [1]
• Efficient: Performing or functioning in the best possible manner with the least waste of time and effort [1]
…and resources
[1] http://www.dailyblogtips.com/effective-vs-efficient-difference/
An Efficient Hadoop Cluster
• Is Effective → Gets the job done (in time)
• Highly Utilized when Active (unused resources are wasted resources)
What is an efficient Hadoop Job?
…efficiency is a measurable concept, quantitatively determined by the ratio of output to input…
• same output in less time
• less resource usage with same output and same time
• more output with same resources in the same time
Efficient jobs are effective without adding more hardware!
Efficiency – Using everything we have
Utilization and Distribution
CPU spikes, but no real overall usage
Not fully utilized
Reasons why your Cluster is not utilized
• Map and Reduce Slots
• Data Distribution
• Bottlenecks
– Spill
– Shuffle
– Reduce
– Code
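The first lever, slot counts, is a static per-node setting. A minimal sketch of the relevant Hadoop 1.x (MR1) knobs in mapred-site.xml; the values here are illustrative only and must be sized to the cores and RAM of each node:

```xml
<!-- mapred-site.xml on each TaskTracker; restart required.
     Values are illustrative, not recommendations. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```

As the notes below warn, simply raising these can backfire: beyond a point the extra tasks just add scheduling overhead.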
Which Job(s) are dominating the cluster?
Which User? Which Pool?
Pushing the Boundaries – High Utilization
• Figure out Spill and Shuffle Bottlenecks
• Remove Idle Times, Wait Times, Sync Times
• Hotspot Analysis Tools can pinpoint those Items quickly
Identify the Jobs
Job Bottlenecks – Mapping Phase
Mapper is waiting for the spill thread
io.sort.spill.percent
io.sort.mb
Wait Time?
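When the mapper blocks on the spill thread, the map-side sort buffer is a first suspect. A hedged sketch of the two MR1 parameters named on this slide (defaults per the Hadoop 1.x docs; the values shown are illustrative starting points, not tuned recommendations):

```xml
<!-- Per-job settings; values are illustrative starting points only. -->
<property>
  <!-- map-side sort buffer size in MB; MR1 default is 100 -->
  <name>io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <!-- buffer fill fraction that triggers a background spill; default 0.80 -->
  <name>io.sort.spill.percent</name>
  <value>0.80</value>
</property>
```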
Job Bottleneck – Shuffle
Reducer is waiting for memory
mapred.job.shuffle.input.buffer.percent
mapred.job.reduce.total.mem.bytes
mapred.reduce.parallel.copies
Wait Time?
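The shuffle-side counterparts can be sketched the same way. The two parameters below are the MR1 names I am confident about; values are illustrative only:

```xml
<property>
  <!-- fraction of reducer heap used to buffer map outputs; default 0.70 -->
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<property>
  <!-- parallel threads fetching map outputs; default 5 -->
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>
```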
Cluster after simple “Fixes”
Jobs are now resource bound
Efficiency – Use what we have better
Performance Optimization
1. Identify Bounding Resource
2. Optimize and reduce its usage
3. Identify new Bounding Resource
Hot Spot Analysis Tools are again the best way to go
Identify Hotspots – which Phase
Cluster Usage
Mapping Phase Hotspot in Outage Analyzer
70% is our own code!
CPU Hotspot in Mapping Phase
10% of Mapping CPU
20% of Mapping CPU
Hotspot Analysis of Reduce Phase
Wow!
Three simple Hotspots
Before Fix: 6h 30 minutes…
…After Fix: 3.5 hours
Utilization went up!
Map Reduce Run Comparison
Chart callouts: 10% of mapping CPU; 3 reducers running
Conclusion
• Understand your bottleneck!
• Understand the bounding resource
• Small fixes can have huge yields…but this requires tools
What else did we find?
• Short Mappers due to small files
– High merge time due to large number of spills
– Too much data shuffled → add a Combiner, but…
• Tried Task reuse
– Nearly no effect?
– 5% less Map Time, but…?
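Task JVM reuse, as tried above, is a single MR1 job setting; a hedged sketch (a value of -1 reuses the JVM for an unlimited number of tasks of the same job, the default of 1 means a fresh JVM per task):

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```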
Why did the reuse not help?
Chart callouts: map phase over; 5 more reducers; shuffle
What’s next?
• Bigger Files
• Add Combiners to reduce shuffle
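What a combiner buys is local pre-aggregation of map output before it is shuffled. The WordCount-style sketch below is a plain-Java illustration of that effect (no Hadoop dependency; the key set is a made-up example, not the actual job's data):

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
    // A combiner collapses per-record map output into one
    // (key, partial sum) pair per key, so far fewer records
    // cross the network during the shuffle.
    public static Map<String, Integer> combine(String[] mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String[] records = {"a", "b", "a", "c", "a", "b"};
        Map<String, Integer> out = combine(records);
        // 6 map output records shrink to 3 shuffled pairs
        System.out.println(out.size() + " pairs shuffled instead of " + records.length);
    }
}
```

The caveat on the previous slide still applies: a combiner only pays off when keys actually repeat within one mapper's output.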
What about Hive or PIG?
• Identify which stage is slow
• Identify configuration Issues
• Identify HBase or UDF issues
HBase PIG Job lasting for 15 hours…
Major HBase hotspot…
Wow!
A roundtrip for every single row
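A roundtrip per row is the classic symptom of HBase scanner caching left at its old default of one row per RPC. One way to raise it cluster-wide is hbase-site.xml (the parameter name is the real HBase client setting; the value is illustrative and the deck does not state which fix was actually applied):

```xml
<property>
  <!-- rows fetched per scanner RPC; the old HBase default was 1 -->
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
```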
Cluster Utilization after fix
Performance after Fix: 75 minutes!
Summary
• Drive up utilization
• Remove Blocks and Sync points
• Optimize Big Hotspots
Michael Kopp, Technology Strategist
michael.kopp@compuware.com
@mikopp
apmblog.compuware.com
javabook.compuware.com
Editor's Notes

  1. Why did I do this talk? Well, this is it.
  2. In other words, from a cluster perspective efficiency means using every available resource, not being idle.
  3. I could simply add more map and reduce slots and try to pound the cluster. But that might not be good for all jobs, and furthermore, at some point I will run into load-average issues: too much scheduling overhead becomes counterproductive.
  4. We want to figure out which jobs are running and which ones occupy most of my cluster but at the same time don’t really consume its resources. E.g. we can compare elapsed time vs. CPU time used by a job.
  5. We can do the same on a per-user or per-pool basis. By using these two methods we quickly figure out which job/user occupies the cluster but is not running optimally. We will then look at those more closely.
  6. What do hotspot analysis tools do? If you are a developer you know what a profiler does: it tells you where you spend most of your time and CPU. The problem is that profilers cannot be run distributed and they have a horrible impact on performance; they also distort hotspots when the hotspot is really a fast method called billions of times. In other words, profilers are not useful for Hadoop. Then there are CPU samplers. Better for Hadoop, not so much impact, but again distributed is hard. Samplers also miss context, in the sense that they look at thread stack traces without the context of what’s going on. And then there are modern APM solutions that provide the best of both worlds and then some. These solutions can deliver the value of a profiler and a sampler without the overhead, can be distributed, and provide context.
  7. You can use these to look at the high-level hotspots of a job. E.g. this was a job that ran for 6 hours total across 10 servers in EC2. This does not show me every little detail; I don’t care about that. But it shows me the big hotspots, and for those it gives me detail: e.g. that blue block accounts for 9 hours out of 65 hours of accounted time.
  8. I can also go the other way around, by the way. Let’s say I see that my cluster is spending a lot of time waiting. I can easily figure out which jobs are running, of course, but better: I can simply run a hotspot analysis to check what my task JVMs are doing, and then have the APM solution tell me which job and user this is.
  9. Add Number of Tasks per Job, Job Percentage Tracking.
  10. The map phase and the reduce phase overlap in time. Looking at slots, the reducer is not using the full cluster, but it also can’t: reducing cannot scale as much as mapping. We also see that the reduce phase drops off at the end for the last hour or so. So while mapping consumes a lot more time, reducing is a bottleneck, so every optimization there will count twice! Let’s keep that in mind.
  11. From 58 hours of mapping time down to 48 hours.
  12. One was the already-mentioned regex. Another was that we initialized a SimpleDateFormat for every observation, i.e. per map call. That was a big issue, because not only was it creating the object each time, it was getting the locale, reading the resource bundle, calculating the current date and much more. Why did the dev do it? Because SimpleDateFormat is not thread-safe, so you cannot simply make it static. Anyway, this single thing amounted to about half of our CPU usage! A third thing was that we were parsing data, among other things numbers. An empty string is not a number and thus leads to a NumberFormatException, which we handled. However, the simple fact that millions of these exceptions were thrown and caught amounted to 10% of our CPU time. We fixed these three simple issues, and our reduce phase was 6 times faster. To put it in perspective, it went from 3 hours to 30 minutes on top of the map phase!
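Both per-record fixes from this note can be sketched in plain Java. The ThreadLocal works around SimpleDateFormat's lack of thread safety without allocating per record, and the empty-string guard avoids the exception-per-record cost; the date pattern and the -1 sentinel are made-up placeholders, not details from the actual job:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class HotspotFixes {
    // SimpleDateFormat is not thread-safe, so it cannot simply be made
    // static; a ThreadLocal gives each mapper thread one reusable
    // instance instead of constructing one per observation.
    private static final ThreadLocal<SimpleDateFormat> FMT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static long parseDateMillis(String s) {
        try {
            return FMT.get().parse(s).getTime();
        } catch (ParseException e) {
            return -1L; // hypothetical sentinel for this sketch
        }
    }

    // Guard against the known empty-string case up front instead of
    // paying for a NumberFormatException thrown and caught per record.
    public static double parseOrDefault(String s, double fallback) {
        if (s == null || s.isEmpty()) {
            return fallback;
        }
        try {
            return Double.parseDouble(s);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```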
  13. The files we were working on comprised 5 minutes of data, i.e. ~500MB uncompressed and 50MB compressed. Our average map time was only about 3–5 minutes. While that is not horrible, it still means we have considerable startup overhead. Map time came down from 2:35 to 2:30, which isn’t much, but the actual job time did not change at all and remained at a little over three hours.
  14. First of all, we see that before and after we are fully CPU-bound. It’s not easy to see here, but utilization actually improved: we were at 95–97% for the mapping phase before and are now at 98–99%. Really awesome.