The document describes a Hadoop cluster management system used at Yahoo. It provides a modular, distributed, workflow-based system for managing Hadoop clusters. The system allows for seamless management of clusters, including break fixing, upgrades, and ensuring consistency and efficiency. It introduces a proactive self-healing model to identify and fix underperforming nodes. The system includes a WebUI and a command-line utility for administration and user management.
2. What it is…
§ Workflow-based system for cluster management.
§ Completely modular and distributed design.
§ Has its own JMX-based library (can also be used to monitor other
services on the cluster).
§ Fully controllable from the WebUI.
§ Has a command-line utility for ad hoc administration.
3. What it does…
§ Manage clusters.
§ Break fixing.
§ Upgrades OS seamlessly.
§ Consistency/efficiency of clusters.
§ Proactive self-healing model.
§ User Management.
4. Manage Clusters
§ It has a well-defined workflow to manage clusters.
§ No or minimal human intervention required.
§ Keeps up the efficiency of the cluster.
§ Keeps track of missing/bad blocks on the system.
§ Well-defined WebUI and command-line utility.
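
The slides do not say how the missing/bad block counts are collected. As a minimal sketch, assuming they come from the JSON that Hadoop daemons serve at their `/jmx` HTTP endpoint, the counts could be read as shown here. The function name `block_health` and the sample payload are invented for illustration, and the bean and attribute names (NameNode `FSNamesystem`, `MissingBlocks`, `CorruptBlocks`) should be checked against the Hadoop version in use.

```python
import json

# Bean that carries filesystem-level block counters on a NameNode
# (assumed name; verify against your Hadoop version).
FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def block_health(jmx_json):
    """Extract missing/corrupt block counts from a /jmx-style JSON string.

    Returns None if the FSNamesystem bean is not present in the payload.
    """
    payload = json.loads(jmx_json)
    for bean in payload.get("beans", []):
        if bean.get("name") == FSNAMESYSTEM_BEAN:
            return {
                "missing": bean.get("MissingBlocks", 0),
                "corrupt": bean.get("CorruptBlocks", 0),
            }
    return None


# Invented sample payload, shaped like a /jmx response.
sample = json.dumps({
    "beans": [
        {"name": FSNAMESYSTEM_BEAN, "MissingBlocks": 3, "CorruptBlocks": 1},
    ]
})

print(block_health(sample))  # {'missing': 3, 'corrupt': 1}
```

A workflow step could call a function like this on a schedule and raise an alert whenever either count is non-zero.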
11. Fixing bad/mal-performing nodes
These errors can lead to SLA misses or job failures.
§ Takes care of nodes blacklisted by the JobTracker (JT).
§ Handles errors such as high load average and wrong network speed.
§ Parses system logs at frequency X (through workflows) and looks for
known patterns.
§ Visits each node multiple times a day and checks the health of the node.
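
The checks listed above can be sketched as a small health probe. This is an illustration under stated assumptions, not the system's actual code: the regular expressions, the load threshold, the expected link speed, and the names `scan_log_lines` and `check_node` are all hypothetical.

```python
import re

# Hypothetical log patterns; a real deployment would maintain its own list.
ERROR_PATTERNS = {
    "disk_error": re.compile(r"I/O error", re.IGNORECASE),
    "oom": re.compile(r"Out of memory", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Count how many log lines match each known pattern."""
    counts = {name: 0 for name in ERROR_PATTERNS}
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def check_node(load_avg, link_speed_mbps, log_lines,
               max_load=20.0, expected_speed_mbps=1000):
    """Return a list of problem labels found on one node (empty if healthy)."""
    problems = []
    if load_avg > max_load:
        problems.append("high_load")
    if link_speed_mbps != expected_speed_mbps:
        problems.append("wrong_network_speed")
    for name, count in scan_log_lines(log_lines).items():
        if count > 0:
            problems.append(name)
    return problems


# Invented sample input for one node visit.
sample_log = [
    "kernel: end_request: I/O error, dev sdb, sector 12345",
    "sshd: Accepted publickey for user",
]
print(check_node(35.2, 100, sample_log))
# ['high_load', 'wrong_network_speed', 'disk_error']
```

Running a probe like this on every node several times a day would produce the list of candidate hosts for the break-fix workflow.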
12. Upgrade OS
§ Upgrades and rolls back the OS seamlessly.
§ Upgrades run on heavily used production clusters.
13. Consistency & efficiency of clusters
§ Keeps track of cluster MapReduce (MR) capacity.
§ Proactively fixes sick nodes that could cause issues.
14. Introducing the proactive self-healing system
Let me lay the groundwork for it.
§ Wounded hosts, called Set A: hosts that have issues but are still in
service (with degraded services), and which can cause SLA misses and
job execution issues (as we have seen in the past).
§ Fractured hosts, called Set B: hosts already in the break-fix cycle and
being fixed.
§ All grid hosts, called Set X: every host on the grid, including the healthy ones.
§ Sets A and B are subsets of Set X.
§ To find wounded hosts, we have to scan the entire infrastructure once a
day.
§ Calculate the difference between Set A and Set B (hosts in A that are
not in B) to get the wounded hosts that actually need service.
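
The set reasoning above can be sketched in Python. This is a minimal illustration with invented host names, not the system's implementation. It computes the hosts that are wounded but not already in the break-fix cycle, which is the set difference A − B. The symmetric difference is printed alongside for comparison, since it would also include hosts in B that the scan did not flag as wounded.

```python
def hosts_needing_service(wounded, in_break_fix, all_hosts):
    """Return wounded hosts (Set A) that are not already being fixed (Set B)."""
    wounded = set(wounded)
    in_break_fix = set(in_break_fix)
    all_hosts = set(all_hosts)

    # Sets A and B must be subsets of Set X, the full grid inventory.
    unknown = (wounded | in_break_fix) - all_hosts
    if unknown:
        raise ValueError(f"hosts not in grid inventory: {sorted(unknown)}")

    return wounded - in_break_fix


# Hypothetical host names.
set_x = {"node01", "node02", "node03", "node04", "node05"}
set_a = {"node02", "node03"}   # wounded, found by the daily scan
set_b = {"node03", "node04"}   # already in the break-fix cycle

needs_service = hosts_needing_service(set_a, set_b, set_x)
print(sorted(needs_service))   # ['node02']
print(sorted(set_a ^ set_b))   # ['node02', 'node04']
```

Here node03 is skipped because it is already being fixed, and node04 appears only in the symmetric difference because it is in break fix without having been flagged as wounded.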
17. User Management
§ We have one of the most complex and secure environments.
§ User access and management is a complex task, due to the
number of users, security constraints, and complexity involved in
provisioning access.
§ Provisioning a single request requires changes in multiple places.
§ A well-defined, workflow-based system in which 100% automation is
achieved.
§ A great help during system audits and compliance checks.