Oath has one of the largest footprints of Hadoop, with tens of thousands of jobs run every day. Reliability and consistency are key here. With 50k+ nodes, a considerable number of nodes will have disk, memory, network, or slowness issues at any time. Hosts with such issues serving or running jobs can dramatically increase the run times of tightly SLA-bound jobs, frustrating users and the support team who must debug them.
We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disks: in our experience, disks are the most troublesome and fragile components, especially high-density disks. Because of the huge scale and the monetary impact of slow-performing disks, we took on the challenge of building a system to predict worn-out disks and take them out before they become performance bottlenecks and hit jobs' SLAs. The task sounds simple: look for symptoms of hard-drive failure and take the drives out, right? It is not so straightforward when we are talking about 200k+ disk drives. Just collecting such huge amounts of data periodically and reliably is a small challenge compared to analyzing these datasets and predicting bad disks. For each disk we have the reallocated sector count, reported uncorrectable errors, command timeouts, and uncorrectable sector count; on top of that, each hard-disk model has its own interpretation of these statistics. DHEERAJ KAPUR, Principal Engineer, Oath, and SWETHA BANAGIRI
Challenge
● Oath has one of the largest footprints of Hadoop/Storm software frameworks
● Computing environment includes 50,000+ nodes
● Nodes spread across ~40 clusters
● The largest Hadoop cluster comprises >5k nodes
● SLA driven, time sensitive jobs
● To operate and meet SLAs, we require 90 Mbps of throughput per disk
Impact of Disk Failures
● Performance degradation
● Data corruption
● Shuffle slowness results in pipeline failures
● Task slowness as a result of datanode slowness or replication failures
● Job slowness becomes a critical performance bottleneck
● A huge bottleneck when speculative execution can't be turned on
Factors causing disk failures
● External Factors - Temperature, Power Outages
● Internal Factors - File Corruption, Drive read instability, Aging
● Prone to mechanical failure because of moving parts
Proactive better than Reactive
● Avoids a bad disk being the performance bottleneck
● Avoids running tight SLA bound jobs on a bad node
● Avoids pipeline failure and block corruption
● Reduces revenue loss due to SLA misses
● With the DFP system enabled across the clusters, hosts have higher uptime
Elastic Stack 1/3
● Centralized Data Collection System
● Master-slave, push-based architecture
● The master redirects documents to the data nodes
● Data is pushed as JSON documents using Python code
● All documents are stored within an index
● Each key in a JSON document is called a field
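A minimal sketch of the push path described above: building a JSON document and POSTing it to an Elasticsearch index over the document API. The index name, field names, and endpoint host are illustrative assumptions, not the actual pipeline.

```python
import json
import urllib.request

def build_doc(hostname, disk_model, smart_stats):
    # Each key in the JSON document becomes a field in the index.
    # Field names here are illustrative, not the real schema.
    doc = {"host": hostname, "disk_model": disk_model}
    doc.update(smart_stats)
    return doc

def push_doc(doc, es_host="localhost", es_port=9200, index="disk_smart"):
    # POST the document to <host>:<port>/<index>/_doc (Elasticsearch document API).
    req = urllib.request.Request(
        f"http://{es_host}:{es_port}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

doc = build_doc("node001", "ST4000DM000", {"smart_5_raw": 0, "smart_187_raw": 2})
```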
Elastic Stack 2/3
● Data is distributed across the data nodes, each housing a number of shards under a single index
● APIs are used to store and retrieve data
● With Kibana as the frontend, building a dashboard to visualize the collected data is easy

curl -XGET "<hostname>:<port>/<index_name>/_search?pretty"
S.M.A.R.T. Stats 1/2
● Self-Monitoring, Analysis and Reporting Technology
● Reports internal information about a drive
● A drive either fails immediately or shows some symptoms before it fails
● These symptoms are recorded by the S.M.A.R.T. tool
● S.M.A.R.T. stats are inconsistent from hard drive to hard drive
S.M.A.R.T. Stats 2/2
The following S.M.A.R.T. stats are used for prediction:
SMART 5 Reallocated_Sector_Count
SMART 187 Reported_Uncorrectable_Errors
SMART 188 Command_Timeout
SMART 197 Current_Pending_Sector_Count
SMART 198 Offline_Uncorrectable
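These five attributes can be pulled from `smartctl -A` output. A sketch of the extraction step, assuming the standard smartctl attribute-table layout; the sample text below is illustrative, and real output varies by drive model:

```python
# IDs of the five S.M.A.R.T. attributes used for prediction.
PREDICTION_IDS = {5, 187, 188, 197, 198}

def parse_smart(text):
    """Return {attribute_id: (attribute_name, raw_value)} for the prediction IDs."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit() and int(parts[0]) in PREDICTION_IDS:
            # Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
            # UPDATED WHEN_FAILED RAW_VALUE -- the last column is the raw value.
            stats[int(parts[0])] = (parts[1], int(parts[-1]))
    return stats

sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       0
"""
stats = parse_smart(sample)
```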
Pre-processing the data
● Data collected from various nodes falls under different disk models
● Each node is grouped based on the disk model to which its drive belongs
● Data is ignored when all five stats are 0
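The grouping and filtering steps above can be sketched as follows; record and field names are illustrative assumptions:

```python
from collections import defaultdict

# The five S.M.A.R.T. raw values used for prediction, in a fixed order.
STATS = ["smart_5", "smart_187", "smart_188", "smart_197", "smart_198"]

def preprocess(records):
    """Group node records by disk model; skip rows where all five stats are 0."""
    by_model = defaultdict(list)
    for rec in records:
        values = [rec.get(s, 0) for s in STATS]
        if any(values):  # data is ignored when all five stats are 0
            by_model[rec["disk_model"]].append(values)
    return dict(by_model)

records = [
    {"disk_model": "A", "smart_5": 0, "smart_187": 0, "smart_188": 0,
     "smart_197": 0, "smart_198": 0},  # dropped: all five stats are zero
    {"disk_model": "A", "smart_5": 8, "smart_187": 1, "smart_188": 0,
     "smart_197": 0, "smart_198": 0},
    {"disk_model": "B", "smart_5": 0, "smart_187": 0, "smart_188": 0,
     "smart_197": 2, "smart_198": 2},
]
grouped = preprocess(records)
```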
Labelling the data
● A very important and cumbersome task
● Labelled ~4000 nodes across the disk models
● Nodes are classified as Good, Fair, or Bad
● High values for a S.M.A.R.T. stat mean the node is bad
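One way to turn the raw stats into Good/Fair/Bad labels is simple thresholding. The cut-offs below are hypothetical, not the ones actually used; since each disk model interprets the stats differently, real cut-offs would be chosen per model:

```python
def label_node(stats, fair_at=1, bad_at=10):
    """Classify a node from its five raw S.M.A.R.T. values.

    fair_at / bad_at are hypothetical per-model cut-offs for illustration only.
    """
    worst = max(stats)  # high values for any stat indicate a bad node
    if worst >= bad_at:
        return "Bad"
    if worst >= fair_at:
        return "Fair"
    return "Good"
```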
Feed Forward Neural Network
● Fully connected
● 4-layer deep neural network model
● 'adam' optimizer used for backpropagation
● Three hidden layers use the 'relu' activation function
● Output layer uses the 'sigmoid' activation function
● Loss is calculated using 'binary_crossentropy'
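The components named above correspond to a Keras-style model. As a self-contained illustration, here is the same architecture sketched as a numpy forward pass: three 'relu' hidden layers, a 'sigmoid' output, and binary cross-entropy loss. Layer widths and the random weights are placeholder assumptions, and the 'adam' training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 4-layer fully connected net: 5 S.M.A.R.T. inputs -> three relu hidden
# layers -> 1 sigmoid output. Hidden widths of 16 are placeholder assumptions.
sizes = [5, 16, 16, 16, 1]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

x = np.array([[0.0, 2.0, 0.0, 1.0, 0.0]])  # one node's five raw stats
p = forward(x)
loss = binary_crossentropy(np.array([[1.0]]), p)
```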