Oath has one of the largest footprints of Hadoop, with tens of thousands of jobs run every day. Reliability and consistency are key here. With 50k+ nodes, a considerable number of nodes will have disk, memory, network, or slowness issues at any time. Hosts with such issues serving or running jobs can dramatically increase the run times of tightly SLA-bound jobs, frustrating users and the support team who must debug them.
We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disks: in our experience, disks are the most troublesome and fragile components, especially high-density disks. Because of the huge scale and the monetary impact of slow-performing disks, we took on the challenge of building a system to predict worn-out disks and take them out before they become performance bottlenecks and hit jobs' SLAs. The task sounds simple: look for symptoms of hard-drive failure and take the drives out, right? It is not so straightforward when we are talking about 200k+ disk drives. Just collecting such huge amounts of data periodically and reliably is a small challenge compared to analyzing these datasets and predicting bad disks. For each disk we have the reallocated sector count, reported uncorrectable errors, command timeouts, and uncorrectable sector count; on top of that, each hard-disk model has its own interpretation of these statistics. DHEERAJ KAPUR, Principal Engineer, Oath, and SWETHA BANAGIRI
Challenge
● Oath has one of the largest footprints of Hadoop/Storm software frameworks
● Computing environment includes 50,000+ nodes
● Nodes spread across ~40 clusters
● The largest Hadoop cluster comprises >5k nodes
● SLA driven, time sensitive jobs
● To operate and meet SLAs, we require 90 Mbps of throughput per disk
Impact of Disk Failures
● Performance degradation
● Data corruption
● Shuffle slowness results in pipeline failures
● Task slowness as a result of datanode slowness or replication failures
● Job slowness becomes a critical performance bottleneck
● A huge bottleneck when speculative execution can't be turned on
Factors causing disk failures
● External Factors - Temperature, Power Outages
● Internal Factors - File Corruption, Drive read instability, Aging
● Prone to mechanical failure because of moving parts
Proactive better than Reactive
● Avoids a bad disk being the performance bottleneck
● Avoids running tight SLA bound jobs on a bad node
● Avoids pipeline failure and block corruption
● Reduces revenue loss due to SLA misses
● With the DFP system enabled across the clusters, hosts have higher uptime
Elastic Stack 1/3
● Centralized Data Collection System
● Master-slave, push-based architecture
● The master redirects documents to the data nodes
● Data is pushed as JSON documents using Python code
● All documents are stored within an index
● Each key in a JSON document is called a field
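A minimal sketch of the push path described above: building a JSON document and POSTing it to an Elasticsearch index over the document API. The index name, field names, and endpoint host are illustrative assumptions, not the actual pipeline.

```python
import json
import urllib.request

def build_doc(hostname, disk_model, smart_stats):
    # Each key in the JSON document becomes a field in the index.
    # Field names here are illustrative, not the real schema.
    doc = {"host": hostname, "disk_model": disk_model}
    doc.update(smart_stats)
    return doc

def push_doc(doc, es_host="localhost", es_port=9200, index="disk_smart"):
    # POST the document to <host>:<port>/<index>/_doc (Elasticsearch document API).
    req = urllib.request.Request(
        f"http://{es_host}:{es_port}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

doc = build_doc("node001", "ST4000DM000", {"smart_5_raw": 0, "smart_187_raw": 2})
```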
Elastic Stack 2/3
● Data is distributed across the data nodes, each housing a number of shards under a single index
● APIs are used to store and retrieve data
● With Kibana as the frontend, building a dashboard to visualize the collected data is easy

curl -XGET "<hostname>:<port>/<index_name>/_search?pretty"
S.M.A.R.T. Stats 1/2
● Self-Monitoring, Analysis and Reporting Technology
● Reports internal information about a drive
● A drive either fails immediately or shows some symptoms before it fails
● These symptoms are recorded by the S.M.A.R.T. tool
● S.M.A.R.T. stats are inconsistent from hard drive to hard drive
S.M.A.R.T. Stats 2/2
The following S.M.A.R.T. stats are used for prediction:
SMART 5 Reallocated_Sector_Count
SMART 187 Reported_Uncorrectable_Errors
SMART 188 Command_Timeout
SMART 197 Current_Pending_Sector_Count
SMART 198 Offline_Uncorrectable
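These five attributes can be pulled from `smartctl -A` output. A sketch of the extraction step, assuming the standard smartctl attribute-table layout; the sample text below is illustrative, and real output varies by drive model:

```python
# IDs of the five S.M.A.R.T. attributes used for prediction.
PREDICTION_IDS = {5, 187, 188, 197, 198}

def parse_smart(text):
    """Return {attribute_id: (attribute_name, raw_value)} for the prediction IDs."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit() and int(parts[0]) in PREDICTION_IDS:
            # Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
            # UPDATED WHEN_FAILED RAW_VALUE -- the last column is the raw value.
            stats[int(parts[0])] = (parts[1], int(parts[-1]))
    return stats

sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       0
"""
stats = parse_smart(sample)
```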
Pre-processing the data
● Data collected from various nodes falls under different disk models
● Each node is grouped based on the disk model to which its drive belongs
● Data is ignored when all five stats are 0
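The grouping and filtering steps above can be sketched as follows; record and field names are illustrative assumptions:

```python
from collections import defaultdict

# The five S.M.A.R.T. raw values used for prediction, in a fixed order.
STATS = ["smart_5", "smart_187", "smart_188", "smart_197", "smart_198"]

def preprocess(records):
    """Group node records by disk model; skip rows where all five stats are 0."""
    by_model = defaultdict(list)
    for rec in records:
        values = [rec.get(s, 0) for s in STATS]
        if any(values):  # data is ignored when all five stats are 0
            by_model[rec["disk_model"]].append(values)
    return dict(by_model)

records = [
    {"disk_model": "A", "smart_5": 0, "smart_187": 0, "smart_188": 0,
     "smart_197": 0, "smart_198": 0},  # dropped: all five stats are zero
    {"disk_model": "A", "smart_5": 8, "smart_187": 1, "smart_188": 0,
     "smart_197": 0, "smart_198": 0},
    {"disk_model": "B", "smart_5": 0, "smart_187": 0, "smart_188": 0,
     "smart_197": 2, "smart_198": 2},
]
grouped = preprocess(records)
```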
Labelling the data
● A very important and cumbersome task
● Labelled ~4000 nodes across the disk models
● Nodes are classified as Good, Fair, or Bad
● High values for a S.M.A.R.T. stat mean the node is bad
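One way to turn the raw stats into Good/Fair/Bad labels is simple thresholding. The cut-offs below are hypothetical, not the ones actually used; since each disk model interprets the stats differently, real cut-offs would be chosen per model:

```python
def label_node(stats, fair_at=1, bad_at=10):
    """Classify a node from its five raw S.M.A.R.T. values.

    fair_at / bad_at are hypothetical per-model cut-offs for illustration only.
    """
    worst = max(stats)  # high values for any stat indicate a bad node
    if worst >= bad_at:
        return "Bad"
    if worst >= fair_at:
        return "Fair"
    return "Good"
```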
Feed Forward Neural Network
● Fully connected
● 4-layer deep neural network model
● 'adam' optimizer used for backpropagation
● Three hidden layers use the 'relu' activation function
● Output layer uses the 'sigmoid' activation function
● Loss is calculated using 'binary_crossentropy'
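The components named above correspond to a Keras-style model. As a self-contained illustration, here is the same architecture sketched as a numpy forward pass: three 'relu' hidden layers, a 'sigmoid' output, and binary cross-entropy loss. Layer widths and the random weights are placeholder assumptions, and the 'adam' training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 4-layer fully connected net: 5 S.M.A.R.T. inputs -> three relu hidden
# layers -> 1 sigmoid output. Hidden widths of 16 are placeholder assumptions.
sizes = [5, 16, 16, 16, 1]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

x = np.array([[0.0, 2.0, 0.0, 1.0, 0.0]])  # one node's five raw stats
p = forward(x)
loss = binary_crossentropy(np.array([[1.0]]), p)
```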