SlideShare uma empresa Scribd logo
1 de 20
Hadoop Cluster Management



                 Dheeraj Kapur
                 Principal Engineer, Yahoo!
                 dheerajk@yahoo-inc.com
What it is…
§  Workflow based system for cluster management.
§  Completely modular & distributed design.
§  Has its own JMX based library(can be used to monitor other
    services on cluster).
§  Fully controllable from WebUI.
§  Has command line utility for adhoc administration.




                                    2
What it does…
§    Manage clusters.
§    Break fixing.
§    Upgrades OS seamlessly.
§    Consistency/efficiency of clusters.
§    Proactive self-healing Model.
§    User Management.




                                       3
Manage Clusters
§    Its has well defined workflow to manage clusters.
§    No/Minimal human intervention required.
§    Keep up efficiency of cluster.
§    Keep track of Missing/Bad blocks on system.
§    Well defined WebUI and Command line utility




                                      4
System Overview




                  5
Workflow




           6
Contd..




          7
Command Line Utility




                   8
Web Interface




                9
Web Interface contd…




                  10
fixing bad/mal-performing nodes
These errors can lead to SLA miss or Job failures
§  Takes care of Blacklisted JT nodes.
§  Errors like high load average, wrong network speed.
§  Parse system logs at X frequency (thru workflows) and look for
    patterns.
§  Visit each node multiple times in a day and check health of node.




                                   11
Upgrade OS
§  Upgrade & rollback OS seamlessly.
§  Upgrading on production, heavily used clusters.




                                   12
Consistency & efficiency of clusters
§  Keep track of cluster MR capacity
§  Proactive Fixing of sick nodes, which can cause potential issues.




                                   13
Introducing Proactive self-healing system

Let me set the ground for it.
§  Wounded hosts Called Set A - Hosts having issues, but still in service
    (with degraded services), Which can cause potential SLA misses and
    job execution issues.(which we have seen in past)
§  Fractured Hosts Called Set B - Hosts already in Break fix cycle and
    getting fixed
§  All grid hosts Called Set X - all grid hosts healthy + fine
§  Set A & B are sub-set of set X
§  to find wounded hosts we have to scan entire infrastructure once a
    day.
§  Calculate Symmetric difference b/w Set A & B, we will get actual
    wounded hosts needs service.




                                    14
Proactive self-healing contd….
                  All Grid Hosts - X




                  Set A    Set B




                          15
Proactive self-healing contd….




                   16
User Management
§  We have one of the most complex and secure environment.
§  User access and management is a complex task, due to the
    number of users, security constraints and complexity involved in
    provisioning access.
§  Single request provisioning requires change at multiple places.
§  Well defined workflow based system, where 100% automation is
    achieved.
§  Great help during system audit and compliance.




                                   17
Q&A




 18
Thank You



    19
Sessions will resume at 4:30pm




                             Page 20

Mais conteúdo relacionado

Mais procurados

Ora10g Rac Best Practices
Ora10g Rac Best PracticesOra10g Rac Best Practices
Ora10g Rac Best Practices
vasanthkp
 

Mais procurados (20)

PostgreSQL Replication in 10 Minutes - SCALE
PostgreSQL Replication in 10  Minutes - SCALEPostgreSQL Replication in 10  Minutes - SCALE
PostgreSQL Replication in 10 Minutes - SCALE
 
Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3
 
MySQL's new Secure by Default Install -- All Things Open October 20th 2015
MySQL's new Secure by Default Install -- All Things Open October 20th 2015MySQL's new Secure by Default Install -- All Things Open October 20th 2015
MySQL's new Secure by Default Install -- All Things Open October 20th 2015
 
Windows Server 2012 R2 Hyper-V Replica
Windows Server 2012 R2 Hyper-V ReplicaWindows Server 2012 R2 Hyper-V Replica
Windows Server 2012 R2 Hyper-V Replica
 
GlassFish v2 Clustering
GlassFish v2 ClusteringGlassFish v2 Clustering
GlassFish v2 Clustering
 
Ora10g Rac Best Practices
Ora10g Rac Best PracticesOra10g Rac Best Practices
Ora10g Rac Best Practices
 
My two cents about Mysql backup
My two cents about Mysql backupMy two cents about Mysql backup
My two cents about Mysql backup
 
SQL Server vs Postgres
SQL Server vs PostgresSQL Server vs Postgres
SQL Server vs Postgres
 
Built-in Replication in PostgreSQL
Built-in Replication in PostgreSQLBuilt-in Replication in PostgreSQL
Built-in Replication in PostgreSQL
 
What's New in Postgres Plus Advanced Server 9.3
What's New in Postgres Plus Advanced Server 9.3What's New in Postgres Plus Advanced Server 9.3
What's New in Postgres Plus Advanced Server 9.3
 
MySQL Backup and Recovery Essentials
MySQL Backup and Recovery EssentialsMySQL Backup and Recovery Essentials
MySQL Backup and Recovery Essentials
 
Online MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackupOnline MySQL Backups with Percona XtraBackup
Online MySQL Backups with Percona XtraBackup
 
Sql server 2012 ha dr nova
Sql server 2012 ha dr novaSql server 2012 ha dr nova
Sql server 2012 ha dr nova
 
PGPool-II Load testing
PGPool-II Load testingPGPool-II Load testing
PGPool-II Load testing
 
PostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverPostgreSQL Scaling And Failover
PostgreSQL Scaling And Failover
 
Essential Linux Commands for DBAs
Essential Linux Commands for DBAsEssential Linux Commands for DBAs
Essential Linux Commands for DBAs
 
ProxySQL para mysql
ProxySQL para mysqlProxySQL para mysql
ProxySQL para mysql
 
MySQL Tuning
MySQL TuningMySQL Tuning
MySQL Tuning
 
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group ReplicationPercona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
Percona XtraDB Cluster vs Galera Cluster vs MySQL Group Replication
 
Sql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_finalSql server 2012 ha dr 24_hop_final
Sql server 2012 ha dr 24_hop_final
 

Destaque

빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
hslkdfjs
 

Destaque (20)

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
빠르고쉬운대출『LG777』.『XYZ』무자본창업 운전자보험만기환급
 
New Use Cases for DAM in the Enterprise
New Use Cases for DAM in the EnterpriseNew Use Cases for DAM in the Enterprise
New Use Cases for DAM in the Enterprise
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
 
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
01_HTML - 작심10시간! 나만의 웹사이트 기획하고 만들기
 
Tailings dump recovery concept
Tailings dump recovery conceptTailings dump recovery concept
Tailings dump recovery concept
 
Polymer optical fibers
Polymer optical fibersPolymer optical fibers
Polymer optical fibers
 
SAP Cloud for Service
SAP Cloud for ServiceSAP Cloud for Service
SAP Cloud for Service
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
GIS for Infrastructure Management
GIS for Infrastructure ManagementGIS for Infrastructure Management
GIS for Infrastructure Management
 
Real-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping ContainersReal-time, Sensor-based Monitoring of Shipping Containers
Real-time, Sensor-based Monitoring of Shipping Containers
 
Chem Lab Report (1)
Chem Lab Report (1)Chem Lab Report (1)
Chem Lab Report (1)
 
Designing your Product as a Platform
Designing your Product as a PlatformDesigning your Product as a Platform
Designing your Product as a Platform
 
Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?Hadoop & Greenplum: Why Do Such a Thing?
Hadoop & Greenplum: Why Do Such a Thing?
 
High-Density Wireless Networks for Auditoriums
High-Density Wireless Networks for AuditoriumsHigh-Density Wireless Networks for Auditoriums
High-Density Wireless Networks for Auditoriums
 
Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services Airport Billing System for Aviation and Non-Aviation Services
Airport Billing System for Aviation and Non-Aviation Services
 
Web Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI ToolWeb Services Automated Testing via SoapUI Tool
Web Services Automated Testing via SoapUI Tool
 
Spend Analysis In 60 Seconds
Spend Analysis In 60 SecondsSpend Analysis In 60 Seconds
Spend Analysis In 60 Seconds
 
Surgical induced astigmatism
Surgical induced astigmatismSurgical induced astigmatism
Surgical induced astigmatism
 
Best practice strategies to clean up and maintain your database with Hether G...
Best practice strategies to clean up and maintain your database with Hether G...Best practice strategies to clean up and maintain your database with Hether G...
Best practice strategies to clean up and maintain your database with Hether G...
 

Semelhante a Hadoop Cluster Management

1 introduction
1 introduction1 introduction
1 introduction
Mohd Arif
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
Vinay Kumar Chella
 

Semelhante a Hadoop Cluster Management (20)

Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Highly Available Load Balanced Galera MySql Cluster
Highly Available Load Balanced  Galera MySql ClusterHighly Available Load Balanced  Galera MySql Cluster
Highly Available Load Balanced Galera MySql Cluster
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
VMworld 2014: Virtualizing Databases
VMworld 2014: Virtualizing DatabasesVMworld 2014: Virtualizing Databases
VMworld 2014: Virtualizing Databases
 
1 introduction
1 introduction1 introduction
1 introduction
 
Run Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin OrchestrateRun Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin Orchestrate
 
Run Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin OrchestrateRun Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin Orchestrate
 
Run Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin OrchestrateRun Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin Orchestrate
 
Run Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin OrchestrateRun Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin Orchestrate
 
Run Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin OrchestrateRun Book Automation with PlateSpin Orchestrate
Run Book Automation with PlateSpin Orchestrate
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the Coin
 
How netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloudHow netflix manages petabyte scale apache cassandra in the cloud
How netflix manages petabyte scale apache cassandra in the cloud
 
Practice and challenges from building IaaS
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaS
 
Resilience Testing
Resilience Testing Resilience Testing
Resilience Testing
 
Bishwambar Linux Admin
Bishwambar Linux AdminBishwambar Linux Admin
Bishwambar Linux Admin
 
Ansible for networks
Ansible for networksAnsible for networks
Ansible for networks
 
Ansible: How to Get More Sleep and Require Less Coffee
Ansible: How to Get More Sleep and Require Less CoffeeAnsible: How to Get More Sleep and Require Less Coffee
Ansible: How to Get More Sleep and Require Less Coffee
 
Next-Gen Decision Making in Under 2ms
Next-Gen Decision Making in Under 2msNext-Gen Decision Making in Under 2ms
Next-Gen Decision Making in Under 2ms
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
 
IaaS - Virtualization_Cambridge.pdf
IaaS - Virtualization_Cambridge.pdfIaaS - Virtualization_Cambridge.pdf
IaaS - Virtualization_Cambridge.pdf
 

Mais de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Hadoop Cluster Management

  • 1. Hadoop Cluster Management Dheeraj Kapur Principal Engineer, Yahoo! dheerajk@yahoo-inc.com
  • 2. What it is… §  Workflow based system for cluster management. §  Completely modular & distributed design. §  Has its own JMX based library(can be used to monitor other services on cluster). §  Fully controllable from WebUI. §  Has command line utility for adhoc administration. 2
  • 3. What it does… §  Manage clusters. §  Break fixing. §  Upgrades OS seamlessly. §  Consistency/efficiency of clusters. §  Proactive self-healing Model. §  User Management. 3
  • 4. Manage Clusters §  Its has well defined workflow to manage clusters. §  No/Minimal human intervention required. §  Keep up efficiency of cluster. §  Keep track of Missing/Bad blocks on system. §  Well defined WebUI and Command line utility 4
  • 11. fixing bad/mal-performing nodes These errors can lead to SLA miss or Job failures §  Takes care of Blacklisted JT nodes. §  Errors like high load average, wrong network speed. §  Parse system logs at X frequency (thru workflows) and look for patterns. §  Visit each node multiple times in a day and check health of node. 11
  • 12. Upgrade OS §  Upgrade & rollback OS seamlessly. §  Upgrading on production, heavily used clusters. 12
  • 13. Consistency & efficiency of clusters §  Keep track of cluster MR capacity §  Proactive Fixing of sick nodes, which can cause potential issues. 13
  • 14. Introducing Proactive self-healing system Let me set the ground for it. §  Wounded hosts Called Set A - Hosts having issues, but still in service (with degraded services), Which can cause potential SLA misses and job execution issues.(which we have seen in past) §  Fractured Hosts Called Set B - Hosts already in Break fix cycle and getting fixed §  All grid hosts Called Set X - all grid hosts healthy + fine §  Set A & B are sub-set of set X §  to find wounded hosts we have to scan entire infrastructure once a day. §  Calculate Symmetric difference b/w Set A & B, we will get actual wounded hosts needs service. 14
  • 15. Proactive self-healing contd…. All Grid Hosts - X Set A Set B 15
  • 17. User Management §  We have one of the most complex and secure environment. §  User access and management is a complex task, due to the number of users, security constraints and complexity involved in provisioning access. §  Single request provisioning requires change at multiple places. §  Well defined workflow based system, where 100% automation is achieved. §  Great help during system audit and compliance. 17
  • 19. Thank You 19
  • 20. Sessions will resume at 4:30pm Page 20