Somebody once said that Hadoop is a way of running highly unperformant code at scale. In this talk I want to show how we can change that and make MapReduce jobs more performant: how to analyze them at scale and optimize the job itself, instead of just tinkering with Hadoop options. The result is a much better utilized cluster and jobs that run in a fraction of the original time, running performant code at scale!

When speaking about Hadoop, people usually consider only scale; yet, looked at closely, a Hadoop cluster very often runs highly unperformant jobs. By examining the performance characteristics of the jobs themselves and optimizing and tuning those, far better results can be achieved. Examples include small changes that cut jobs down from 15 hours to 2 hours without adding any hardware. The concepts and techniques explained in the talk are applicable regardless of which tool is used to identify the performance characteristics. What matters is that by applying the performance analysis and optimization techniques we have long used on other applications, we can make Hadoop jobs much more effective and performant! Attendees will be able to understand these techniques and apply them to their MapReduce, Pig, Hive, or other MapReduce-based jobs.
4. Effectiveness vs. Efficiency
• Effective: adequate to accomplish a purpose; producing the intended or expected result1
• Efficient: performing or functioning in the best possible manner with the least waste of time and effort1
…and resources
1) http://www.dailyblogtips.com/effective-vs-efficient-difference/
5. An Efficient Hadoop Cluster
• Is effective: gets the job done (in time)
• Is highly utilized when active (unused resources are wasted resources)
6. What is an efficient Hadoop Job?
…efficiency is a measurable concept, quantitatively determined by the ratio of output to input…
• same output in less time
• less resource usage with same output in the same time
• more output with same resources in the same time
Efficient jobs are effective without adding more hardware!
12. Pushing the Boundaries – High Utilization
• Figure out Spill and Shuffle Bottlenecks
• Remove Idle Times, Wait Times, Sync Times
• Hotspot Analysis Tools can pinpoint those Items quickly
19. Performance Optimization
1. Identify Bounding Resource
2. Optimize and reduce its usage
3. Identify new Bounding Resource
Hot Spot Analysis Tools are again the best way to go
28. Map Reduce Run Comparison
(Chart: only 10% of mapping CPU; 3 reducers running)
29. Conclusion
• Understand your bottleneck!
• Understand the bounding resource
• Small fixes can have huge yields… but they require tools
30. What else did we find?
• Short mappers due to small files
– High merge time due to a large number of spills
– Too much data shuffled: add a Combiner, but…
• Tried task JVM reuse
– Nearly no effect?
– 5% less map time, but…?
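A combiner pre-aggregates map output on each mapper before it crosses the network, which is exactly how it cuts the shuffle volume mentioned above. The sketch below simulates the idea in plain Java (no Hadoop dependency; class and key names are illustrative, not from the original job): it compares how many key/value pairs would be shuffled with and without map-side combining.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Raw map output: one (word, 1) pair per token, as a naive mapper emits.
    static List<Map.Entry<String, Long>> mapOutput(List<String> tokens) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String t : tokens) out.add(new AbstractMap.SimpleEntry<>(t, 1L));
        return out;
    }

    // Combiner: sum values per key locally, before anything is shuffled.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> combined = new HashMap<>();
        for (Map.Entry<String, Long> p : pairs)
            combined.merge(p.getKey(), p.getValue(), Long::sum);
        return combined;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a", "a", "b", "c");
        List<Map.Entry<String, Long>> raw = mapOutput(tokens);
        Map<String, Long> combined = combine(raw);
        // 6 pairs would be shuffled without a combiner, only 3 with one.
        System.out.println(raw.size() + " -> " + combined.size());
    }
}
```

In a real job the same reducer class often doubles as the combiner via `job.setCombinerClass(...)`, provided the reduce function is associative and commutative, which is the "but…" above: not every reducer qualifies.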
31. Why did the reuse not help?
(Chart: map phase over; 5 more reducers shuffling)
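For reference, the task JVM reuse tried above is, in Hadoop 1.x, controlled by a single property (a sketch of the relevant config fragment; the value -1 means "reuse the JVM for an unlimited number of tasks of the same job"):

```xml
<!-- mapred-site.xml (Hadoop 1.x): fork one JVM and reuse it for all
     tasks of a job instead of starting a fresh JVM per task. This
     targets the startup overhead of short mappers on small files. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

As the measurements above show, this only shaves JVM startup cost; it does not fix the underlying small-files problem or the shuffle.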
38. Summary
• Drive up utilization
• Remove Blocks and Sync points
• Optimize Big Hotspots
39. Michael Kopp, Technology Strategist
michael.kopp@compuware.com
@mikopp
apmblog.compuware.com
javabook.compuware.com
Editor's Notes
Why did I do this talk? Well, this is it.
In other words, from a cluster perspective efficiency means using every resource available, not being idle.
I could simply add more map and reduce slots and try to pound the cluster. But that might not be good for all jobs, and furthermore at some point I will run into load-average issues: too much scheduling, which becomes counterproductive.
We want to figure out which jobs are running and which occupy most of my cluster while not actually consuming its resources. E.g. we can compare elapsed time vs. CPU time used by a job.
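The elapsed-time vs. CPU-time comparison mentioned here boils down to a simple ratio: how much of the slot time a job occupies is actually spent on CPU. A minimal sketch (the numbers and names are made up for illustration):

```java
public class CpuUtilization {
    // Fraction of the occupied slot time the job actually spent on CPU.
    static double cpuRatio(double cpuHours, double wallHours, int slots) {
        return cpuHours / (wallHours * slots);
    }

    public static void main(String[] args) {
        // Hypothetical job: occupies 10 slots for 6 wall-clock hours but
        // only burns 12 CPU-hours -> 0.2, i.e. 20% utilization. A job
        // like this occupies the cluster without consuming it, so it is
        // a prime candidate for a closer look.
        System.out.println(cpuRatio(12, 6, 10));
    }
}
```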
We can do the same on a per-user or per-pool basis. Using these two methods we quickly figure out which job or user occupies the cluster but is not running optimally. We will then look at those more closely.
What do hotspot analysis tools do? If you are a developer you know what a profiler does: it tells you where you spend most of your time and CPU. The problem is that profilers cannot be run distributed, they have a horrible impact on performance, and they distort hotspots when the hotspot is really a fast method called billions of times. In other words, profilers are not useful for Hadoop. Then there are CPU samplers. Better for Hadoop, less impact, but again running them distributed is hard. Samplers also miss context, in the sense that they look at thread stack traces without knowing what is going on. And then there are modern APM solutions, which provide the best of both worlds and then some: they can deliver the value of a profiler and a sampler without the overhead, can be distributed, and provide context.
You can use these to look at high-level hotspots of a job. E.g. this was a job that ran for 6 hours total across 10 servers in EC2. This does not show me every little detail, and I don't care about that. But it shows me the big hotspots, and for those it gives me detail, e.g. that blue block: 9 hours out of 65 hours of accounted time.
I can also go the other way around, by the way. Let's say I see that my cluster is spending a lot of time waiting. I can easily figure out which jobs are running, of course, but better, I can simply run a hotspot analysis to check what my task JVMs are doing, and then have the APM solution tell me which job and user this is.
Add Number of Tasks per Job, Job Percentage Tracking.
The map phase and the reduce phase take the same amount of time. Looking at the slots, the reducers are not using the full cluster, but they also can't: reducing cannot scale as much as mapping. We also see that the reduce phase drops off for the last hour or so. So while mapping consumes a lot more time, reducing is a bottleneck, and every optimization there will count twice! Let's keep that in mind.
From 58h of Mapping Time to 48 hours
One was the already mentioned regex. Another was that we initialized a SimpleDateFormat for every observation, i.e. every map call. That was a big issue, because not only was it creating the object each time, it was getting the locale, reading the resource bundle, calculating the current date and much, much more. Why did the developer do it? Because SimpleDateFormat is not thread safe, so you cannot simply make it static. Anyway, this single thing amounted to about half of our CPU usage! A third thing was that we were parsing data, among other things numbers. An empty string is not a number and thus leads to a NumberFormatException, which we handled. However, the simple fact that millions of these exceptions were thrown and caught amounted to 10% of our CPU time. We fixed these three simple issues, and our reduce phase was 6 times faster. To put it in perspective, it went from 3 hours to 30 minutes on top of the map phase!
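The two CPU fixes described here can be sketched in plain Java (class and method names are illustrative, not from the original job): a ThreadLocal gives each task thread its own reusable SimpleDateFormat, since the class is not thread safe but is expensive to construct per call, and a cheap pre-check on the input string avoids throwing and catching millions of NumberFormatExceptions for blank fields.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ParseFixes {
    // One SimpleDateFormat per thread: thread safe, and constructed once
    // per task thread instead of once per map() call.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

    static Date parseDate(String s) throws ParseException {
        return FORMAT.get().parse(s);
    }

    // Exception-light parse: return a default for blank or clearly
    // non-numeric input instead of throwing a NumberFormatException.
    static long parseLongOrDefault(String s, long dflt) {
        if (s == null || s.isEmpty()) return dflt;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ((c < '0' || c > '9') && !(i == 0 && c == '-')) return dflt;
        }
        try {
            return Long.parseLong(s); // still guards overflow and "-" alone
        } catch (NumberFormatException e) {
            return dflt;
        }
    }
}
```

The pre-check costs a few character comparisons per field, which is orders of magnitude cheaper than constructing and unwinding an exception on every empty value.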
The files we were working on comprised 5 minutes of data, i.e. ~500 MB uncompressed and 50 MB compressed. Our average map time was only about 3-5 minutes. While that is not horrible, it still means we have considerable startup overhead. Map time came down from 2:35 to 2:30, which isn't much, but the actual job time did not change at all and remained at a little over three hours.
First of all we see that before and after we are fully CPU bound. It's actually not easy to see here, but utilization improved: we were at 95-97% for the mapping phase before and are now at 98-99%. Really awesome.