Somebody once said that Hadoop is a way of running highly unperformant code at scale. In this talk I want to show how we can change that and make MapReduce jobs more performant: how to analyze them at scale and optimize the job itself, instead of just tinkering with Hadoop options. The result is a much better utilized cluster and jobs that run in a fraction of the original time, running performant code at scale!

When speaking about Hadoop, people usually consider only scale; yet, looked at closely, a Hadoop cluster very often runs highly unperformant jobs. By examining the performance characteristics of the jobs themselves and optimizing and tuning those, far better results can be achieved. Examples include small changes that cut jobs down from 15 hours to 2 hours without adding any hardware. The concepts and techniques explained in the talk are applicable regardless of which tool is used to identify the performance characteristics. What matters is that by applying the performance analysis and optimization techniques we have long used on other applications, we can make Hadoop jobs much more effective and performant! Attendees will be able to understand these techniques and apply them to their MapReduce, Pig, Hive, or other MapReduce-based jobs.
4. Effectiveness vs. Efficiency
• Effective: adequate to accomplish a purpose; producing the intended or expected result1
• Efficient: performing or functioning in the best possible manner with the least waste of time and effort1
…and resources
1) http://www.dailyblogtips.com/effective-vs-efficient-difference/
5. An Efficient Hadoop Cluster
• Is effective: gets the job done (in time)
• Is highly utilized when active (unused resources are wasted resources)
6. What is an efficient Hadoop Job?
…efficiency is a measurable concept, quantitatively determined by the ratio of output to input…
• same output in less time
• less resource usage with same output in the same time
• more output with same resources in the same time
Efficient jobs are effective without adding more hardware!
12. Pushing the Boundaries – High Utilization
• Figure out Spill and Shuffle Bottlenecks
• Remove Idle Times, Wait Times, Sync Times
• Hotspot Analysis Tools can pinpoint those Items quickly
19. Performance Optimization
1. Identify Bounding Resource
2. Optimize and reduce its usage
3. Identify new Bounding Resource
Hot Spot Analysis Tools are again the best way to go
28. Map Reduce Run Comparison
(Chart: only 10% of mapping CPU; 3 reducers running)
29. Conclusion
• Understand your bottleneck!
• Understand the bounding resource
• Small fixes can have huge yields… but they require tools
30. What else did we find?
• Short mappers due to small files
– High merge time due to a large number of spills
– Too much data shuffled: add a Combiner, but…
• Tried task JVM reuse
– Nearly no effect?
– 5% less map time, but…?
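A combiner pre-aggregates map output on each mapper before it crosses the network, which is exactly how it cuts the shuffle volume mentioned above. The sketch below simulates the idea in plain Java (no Hadoop dependency; class and key names are illustrative, not from the original job): it compares how many key/value pairs would be shuffled with and without map-side combining.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Raw map output: one (word, 1) pair per token, as a naive mapper emits.
    static List<Map.Entry<String, Long>> mapOutput(List<String> tokens) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String t : tokens) out.add(new AbstractMap.SimpleEntry<>(t, 1L));
        return out;
    }

    // Combiner: sum values per key locally, before anything is shuffled.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> combined = new HashMap<>();
        for (Map.Entry<String, Long> p : pairs)
            combined.merge(p.getKey(), p.getValue(), Long::sum);
        return combined;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a", "a", "b", "c");
        List<Map.Entry<String, Long>> raw = mapOutput(tokens);
        Map<String, Long> combined = combine(raw);
        // 6 pairs would be shuffled without a combiner, only 3 with one.
        System.out.println(raw.size() + " -> " + combined.size());
    }
}
```

In a real job the same reducer class often doubles as the combiner via `job.setCombinerClass(...)`, provided the reduce function is associative and commutative, which is the "but…" above: not every reducer qualifies.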
31. Why did the reuse not help?
(Chart: map phase over; 5 more reducers shuffling)
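For reference, the task JVM reuse tried above is, in Hadoop 1.x, controlled by a single property (a sketch of the relevant config fragment; the value -1 means "reuse the JVM for an unlimited number of tasks of the same job"):

```xml
<!-- mapred-site.xml (Hadoop 1.x): fork one JVM and reuse it for all
     tasks of a job instead of starting a fresh JVM per task. This
     targets the startup overhead of short mappers on small files. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

As the measurements above show, this only shaves JVM startup cost; it does not fix the underlying small-files problem or the shuffle.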
38. Summary
• Drive up utilization
• Remove Blocks and Sync points
• Optimize Big Hotspots
39. Michael Kopp, Technology Strategist
michael.kopp@compuware.com
@mikopp
apmblog.compuware.com
javabook.compuware.com
Editor's Notes
Why did I do this talk? Well, this is it.
In other words, from a cluster perspective efficiency means using every resource available, not being idle.
I could simply add more map and reduce slots and try to pound the cluster. But that might not be good for all jobs, and furthermore at some point I will run into load-average issues: too much scheduling, which becomes counterproductive.
We want to figure out which jobs are running and which occupy most of my cluster while not actually consuming its resources. E.g. we can compare elapsed time vs. CPU time used by a job.
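The elapsed-time vs. CPU-time comparison mentioned here boils down to a simple ratio: how much of the slot time a job occupies is actually spent on CPU. A minimal sketch (the numbers and names are made up for illustration):

```java
public class CpuUtilization {
    // Fraction of the occupied slot time the job actually spent on CPU.
    static double cpuRatio(double cpuHours, double wallHours, int slots) {
        return cpuHours / (wallHours * slots);
    }

    public static void main(String[] args) {
        // Hypothetical job: occupies 10 slots for 6 wall-clock hours but
        // only burns 12 CPU-hours -> 0.2, i.e. 20% utilization. A job
        // like this occupies the cluster without consuming it, so it is
        // a prime candidate for a closer look.
        System.out.println(cpuRatio(12, 6, 10));
    }
}
```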
We can do the same on a per-user or per-pool basis. Using these two methods we quickly figure out which job or user occupies the cluster but is not running optimally. We will then look at those more closely.
What do hotspot analysis tools do? If you are a developer you know what a profiler does: it tells you where you spend most of your time and CPU. The problem is that profilers cannot be run distributed, they have a horrible impact on performance, and they distort hotspots when the hotspot is really a fast method called billions of times. In other words, profilers are not useful for Hadoop. Then there are CPU samplers. Better for Hadoop, less impact, but again running them distributed is hard. Samplers also miss context, in the sense that they look at thread stack traces without knowing what is going on. And then there are modern APM solutions, which provide the best of both worlds and then some: they can deliver the value of a profiler and a sampler without the overhead, can be distributed, and provide context.
You can use these to look at high-level hotspots of a job. E.g. this was a job that ran for 6 hours total across 10 servers in EC2. This does not show me every little detail, and I don't care about that. But it shows me the big hotspots, and for those it gives me detail, e.g. that blue block: 9 hours out of 65 hours of accounted time.
I can also go the other way around, by the way. Let's say I see that my cluster is spending a lot of time waiting. I can easily figure out which jobs are running, of course, but better, I can simply run a hotspot analysis to check what my task JVMs are doing, and then have the APM solution tell me which job and user this is.
Add Number of Tasks per Job, Job Percentage Tracking.
The map phase and the reduce phase take the same amount of time. Looking at the slots, the reducers are not using the full cluster, but they also can't: reducing cannot scale as much as mapping. We also see that the reduce phase drops off for the last hour or so. So while mapping consumes a lot more time, reducing is a bottleneck, and every optimization there will count twice! Let's keep that in mind.
From 58h of Mapping Time to 48 hours
One was the already mentioned regex. Another was that we initialized a SimpleDateFormat for every observation, i.e. every map call. That was a big issue, because not only was it creating the object each time, it was getting the locale, reading the resource bundle, calculating the current date and much, much more. Why did the developer do it? Because SimpleDateFormat is not thread safe, so you cannot simply make it static. Anyway, this single thing amounted to about half of our CPU usage! A third thing was that we were parsing data, among other things numbers. An empty string is not a number and thus leads to a NumberFormatException, which we handled. However, the simple fact that millions of these exceptions were thrown and caught amounted to 10% of our CPU time. We fixed these three simple issues, and our reduce phase was 6 times faster. To put it in perspective, it went from 3 hours to 30 minutes on top of the map phase!
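The two CPU fixes described here can be sketched in plain Java (class and method names are illustrative, not from the original job): a ThreadLocal gives each task thread its own reusable SimpleDateFormat, since the class is not thread safe but is expensive to construct per call, and a cheap pre-check on the input string avoids throwing and catching millions of NumberFormatExceptions for blank fields.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ParseFixes {
    // One SimpleDateFormat per thread: thread safe, and constructed once
    // per task thread instead of once per map() call.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

    static Date parseDate(String s) throws ParseException {
        return FORMAT.get().parse(s);
    }

    // Exception-light parse: return a default for blank or clearly
    // non-numeric input instead of throwing a NumberFormatException.
    static long parseLongOrDefault(String s, long dflt) {
        if (s == null || s.isEmpty()) return dflt;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ((c < '0' || c > '9') && !(i == 0 && c == '-')) return dflt;
        }
        try {
            return Long.parseLong(s); // still guards overflow and "-" alone
        } catch (NumberFormatException e) {
            return dflt;
        }
    }
}
```

The pre-check costs a few character comparisons per field, which is orders of magnitude cheaper than constructing and unwinding an exception on every empty value.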
The files we were working on comprised 5 minutes of data, i.e. ~500 MB uncompressed and 50 MB compressed. Our average map time was only about 3-5 minutes. While that is not horrible, it still means we have considerable startup overhead. Map time came down from 2:35 to 2:30, which isn't much, but the actual job time did not change at all and remained at a little over three hours.
First of all we see that before and after we are fully CPU bound. It's actually not easy to see here, but utilization improved: we were at 95-97% for the mapping phase before and are now at 98-99%. Really awesome.