This session will answer frequently asked questions about Hadoop, and share proven ways you can overcome challenges in deploying, managing, and tuning Hadoop environments. The discussion topics will include Hadoop operations, configuration management, upgrades and lifecycle management, monitoring and managing power and heat, and Hadoop performance tuning, testing, and optimization. The presenters will also discuss how rapid Hadoop deployment makes life easier for administrators, and talk about Crowbar, an open source Operations Framework.
2. Intros and Bios
• Joey Jablonski
– Dell Principal Solution Architect
– http://www.linkedin.com/in/joeyjablonski
– http://mergingbusinessandit.com
• Vin Sharma
– Intel Open Source Enterprise Strategist
– http://www.linkedin.com/in/c1f3rt3xt
– http://www.intel.com/opensource
3. Agenda
• Why Hadoop is difficult for IT to operate
• How the right tools can make this easier
– Deployment & Configuration with Dell Crowbar
– Monitoring and Management with Dell Barclamps
– Performance Tuning with Intel HiTune
5. Operational Challenges
• Deployment
– Complex because of scale (60 nodes to 1000 nodes)
– Cumbersome because of high-touch processes
• Configuration & Tuning
– Error-prone configuration management
– State management
• Monitoring and Management
– Complex troubleshooting and diagnostics
– No proactive notification of problems
• Performance Optimization
– Limitations of traditional tools
7. Three aspects of revolutionary clouds
Two Sides of Cloud: Ecosystem + API; Cloud = Operations
[Diagram: HW, SW, and Ops combine into CloudOps. Stack labels: APIs, Cloud Ops, O/S, Physical. Side label: Black Box.]
8. Images vs. Layers: Overview
Images: Single Unit
• Configuration + Integrations + Applications + Utilities + Operating System
Layers: Stacked Pieces
• Configuration
• Integrations
• Application Foo
• Application Bar
• Utilities
• Operating System
9. Images vs. Layers: Lifecycle
Images: Replacement
[Diagram: each version is a complete unit, Config on top of I+A+U+O/S, and a new unit replaces the old one.]
Layers: Upgrade
[Diagram: stacked layers Config, I, Foo, Bar, U, OS; only the Bar layer changes, from Bar v1 to Bar v2.]
12. Cloud = Ops
We have capable hardware & software; the real question is how are we going to operate it as a service?
• This is CloudOps
• Software mindset to infrastructure
• Software is constantly changing
• Fluid resources instead of servers
• Manual touch is unacceptable
Ultimately, all the rules for operating the data center become encoded as automation software.
[Diagram: HW, SW, and Ops combine into Cloud Ops.]
14. Platform Selection
Dell PowerEdge C2100 for Hadoop based on Intel® Xeon®
Dell PowerEdge C2100
• Designed with big data in mind
• Compact 2U form factor
• 2-socket 6-core
• Intel® Xeon® 5620 processor
• High performance memory system
• Expansive disk storage
Recommended Configuration
• Intel Xeon Processor 5600 series
• 4-6 1TB or 2TB 7200 RPM SATA drives
• 12-24GB DDR3 R-ECC RAM
• 1-2 dual-port 1GigE
• Linux kernel 2.6.30 or later
• Sun Java 6u14 or later
• Hadoop version 0.20.x or later
Intel Whitepaper: “Optimizing Hadoop Deployments” (http://software.intel.com/file/31124)
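The software minimums in the recommended configuration can be checked automatically before a node joins the cluster. The sketch below is illustrative only: the function names (`parse_dotted`, `parse_java`, `check_node`) are hypothetical, and it assumes version strings in the usual `uname -r`, `java -version`, and `hadoop version` styles (for example `2.6.32-71.el6`, `1.6.0_26`, `0.20.2`).

```python
import re

# Minimums copied from the recommended configuration on this slide.
MIN_KERNEL = (2, 6, 30)   # Linux kernel 2.6.30
MIN_JAVA = (6, 14)        # Java 6 update 14
MIN_HADOOP = (0, 20)      # Hadoop 0.20.x


def parse_dotted(version):
    """Turn '2.6.32-71.el6' into (2, 6, 32), ignoring any suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def parse_java(version):
    """Turn '1.6.0_26' into (6, 26): major release and update number."""
    match = re.match(r"1\.(\d+)\.\d+(?:_(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised Java version string: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def check_node(kernel, java, hadoop):
    """Return a list of problems; an empty list means the node passes."""
    problems = []
    if parse_dotted(kernel) < MIN_KERNEL:
        problems.append(f"kernel {kernel} is older than 2.6.30")
    if parse_java(java) < MIN_JAVA:
        problems.append(f"Java {java} is older than 6u14")
    if parse_dotted(hadoop) < MIN_HADOOP:
        problems.append(f"Hadoop {hadoop} is older than 0.20")
    return problems
```

Calling `check_node("2.6.32-71.el6", "1.6.0_26", "0.20.2")` returns an empty list, while an older node such as `check_node("2.6.18-194.el5", "1.6.0_10", "0.19.1")` reports one problem per component. How the version strings are gathered from each node is left to the deployment tooling.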
15. So what seems to be the problem?
• Dataflow and high-level abstraction make it difficult to understand runtime behaviors
• Large distributed system makes it difficult to correlate concurrent performance-related activities
16. HiTune: Hadoop Performance Analyzer
• Collects metrics from each node
• Aggregates data using Chukwa
• Analyzes results using Hadoop
• Generates reports for visualization
• System metrics (CPU, Disk I/O, Network I/O, Memory)
• Hadoop metrics (NameNode, DataNode, JobTracker, TaskTracker, JVM metrics)
• Dataflow-based statistics (Job, Map Tasks, Reduce Tasks, thread dump for M/R)
• Summary view of a single job
• Summary view by comparing multiple jobs
Apache 2.0 License
17. HiTune Architecture
• Tracker
– Lightweight agent running on each node in the Hadoop cluster
– Sysstat, Hadoop logs and metrics, Java instrumentation
• Aggregation engine
– Merges the results of all the trackers in a distributed fashion
• Analysis engine
– Generates reports based on data flow model
[Diagram labels: Task, Sampler, Tracker, Aggregation engine, Analysis engine, Specification file, Dataflow diagram]
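The tracker, aggregation, and analysis stages can be pictured with a toy pipeline. This is a minimal sketch under stated assumptions, not HiTune's actual code or API: the sample record layout (`time`, `stage`, `cpu`) and the function names `aggregate` and `analyze` are invented for illustration, and everything runs in one process rather than in a distributed fashion.

```python
from collections import defaultdict


def aggregate(per_node_samples):
    """Merge the sample lists reported by every tracker into one
    time-ordered list (a stand-in for the aggregation engine)."""
    merged = [s for samples in per_node_samples.values() for s in samples]
    return sorted(merged, key=lambda s: s["time"])


def analyze(samples):
    """Average CPU utilisation per dataflow stage across the cluster
    (a stand-in for one report from the analysis engine)."""
    totals = defaultdict(lambda: [0.0, 0])  # stage -> [sum, count]
    for s in samples:
        totals[s["stage"]][0] += s["cpu"]
        totals[s["stage"]][1] += 1
    return {stage: total / count for stage, (total, count) in totals.items()}


# Hypothetical samples, keyed by the node whose tracker collected them.
per_node = {
    "node1": [{"time": 0, "stage": "map", "cpu": 80.0},
              {"time": 2, "stage": "shuffle", "cpu": 20.0}],
    "node2": [{"time": 1, "stage": "map", "cpu": 60.0},
              {"time": 3, "stage": "shuffle", "cpu": 40.0}],
}

report = analyze(aggregate(per_node))
```

Grouping samples by dataflow stage rather than by node is what lets a report relate low-level metrics back to the map, shuffle, and reduce phases of a job.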
18. Case Study
[Diagram: MapReduce dataflow. Partitioned Input → Map Tasks (map, spill) → Reduce Tasks (shuffle/copier, merge sort, reduce) → Aggregated Output. Legend: streaming dataflow, sequential dataflow.]
Terasort with zlib
• Large gap between end of map and end of shuffle
• No CPU, I/O, or network bandwidth bottlenecks
• Adding copiers does not help: “shuffle fetchers busy percent” stays at 100
• Copier threads are idle 80% of the time, waiting for memory merge threads
• Memory merge threads are busy mostly due to compression
Terasort with LZO
• Changing compression codec to LZO closes the gap
• Improves job running time by 2.3x
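The slide does not say which setting was changed. As a hedged illustration, the sketch below assumes the codec in question is the intermediate (map output) compression codec, uses the Hadoop 0.20.x property names `mapred.compress.map.output` and `mapred.map.output.compression.codec`, and assumes the separately installed hadoop-lzo library provides `com.hadoop.compression.lzo.LzoCodec`. The helper name `compression_properties` is hypothetical; it only generates the XML fragment to place in `mapred-site.xml`.

```python
import xml.etree.ElementTree as ET


def compression_properties(codec_class):
    """Build a <configuration> fragment that enables map output
    compression with the given codec class."""
    configuration = ET.Element("configuration")
    settings = [
        ("mapred.compress.map.output", "true"),
        ("mapred.map.output.compression.codec", codec_class),
    ]
    for name, value in settings:
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# Assumed codec class from the hadoop-lzo project.
fragment = compression_properties("com.hadoop.compression.lzo.LzoCodec")
```

Whether LZO helps a given job depends on the workload, so a change like this is worth verifying by comparing job reports before and after.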
19. Have at it
• Pull Crowbar
– https://github.com/dellcloudedge/crowbar
• Pull HiTune
– https://github.com/HiTune/HiTune
• Hadoop Operations (10 min)
– Struggles and Challenges (Dell)
• Operations Framework (25 min)
– DevOps-inspired operations framework (Dell)
– Crowbar (Dell)
– Monitoring and Management (Intel)
– Power & Cooling (Dell)
• Hadoop Lifecycle Management (10 min)
– Performance Testing - HiTune (Intel)
– Hadoop Tuning (Intel)
For NoSQL data warehouses using Hadoop, you can see the benefit of modern servers versus legacy. On these two tests, the Xeon 5600-based server cluster significantly outperformed the legacy server cluster, and offered many more features and greater energy efficiency than the older model.
It pays to optimize around the right hardware. Legacy servers will forgo a lot of performance and energy efficiency, potentially limiting the SLA, number of users, and amount of data that can be processed for analysis.
• Intel® Xeon® 5600 improves Hadoop workload performance
• Choosing an optimized server board can reduce power consumption
• Use Intel® X25-E SATA SSDs to improve performance
Software & configurations:
• Use latest Linux kernel
• Turn on Intel® Hyper-Threading
• Optimize Hadoop configuration
• Tuning may be different for different workload types