3. Putting wings on the Elephant!
Pritam Damania
Software Engineer
April 2, 2014
4. Agenda
1 Background
2 Major Issues in I/O path
3 Read Improvements
4 Write Improvements
5 Lessons learnt
5. High level Messages Architecture
[Diagram: the Application Server writes Messages to the HBase cluster; HBase acknowledges each write]
6. HBase Cluster Physical Layout
▪ Multiple clusters/cells for messaging
▪ 20 servers/rack; 5 or more racks per cluster
▪ Each rack: one node with a ZooKeeper Peer and a control role, plus 19x nodes each running Region Server, Data Node, and Task Tracker
▪ Control roles by rack:
  Rack #1: HDFS Namenode
  Rack #2: Standby Namenode
  Rack #3: Job Tracker
  Rack #4: HBase Master
  Rack #5: Backup HBase Master
12. Disk Skew
[Diagram: replication pipeline across three Datanodes; each buffers writes in the OS page cache before flushing to a single Disk]
• HDFS block size: 256MB
• Each HDFS block resides on a single disk
• Fsync of 256MB hits a single disk
13. Disk Skew - Sync File Range
[Diagram: block file written on the Linux filesystem as a stream of 64k chunks; sync_file_range is issued every 1MB, with a single fsync at the end]
▪ sync_file_range(SYNC_FILE_RANGE_WRITE)
▪ Initiates an async write (sketch below)
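A minimal sketch of this write pattern, assuming a local Linux filesystem; the chunk size and the write_block name are illustrative, not actual datanode code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a block file in 64k chunks, kicking off asynchronous
 * write-back with sync_file_range() after every 1MB, and making
 * the whole block durable with a single fsync() at the end. */
int write_block(const char *path, size_t total)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    char buf[64 * 1024];
    memset(buf, 0, sizeof(buf));

    off_t written = 0, last_sync = 0;
    while ((size_t)written < total) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            break;
        written += sizeof(buf);
        if (written - last_sync >= 1 << 20) {
            /* Initiate async write-back of the last 1MB; returns
             * without waiting for the data to reach disk. */
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    fsync(fd);   /* the only blocking durability point */
    close(fd);
    return 0;
}

Spreading write-back across the file this way avoids a single fsync pushing 256MB at one disk all at once.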
14. High IOPS
• Messages workload is random read
• Small preads (~4KB) on datanodes
• Two iops for each pread (see the sketch below)
[Diagram: a pread against the Datanode first reads the checksum from the checksum file, then the data from the block file]
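A sketch of why each pread costs two disk iops, assuming the 4-byte-checksum-per-4096-byte-chunk layout from the next slide and ignoring the real HDFS meta-file header; names are illustrative:

#include <stdint.h>
#include <unistd.h>

#define CHUNK 4096   /* data bytes covered by one checksum */
#define CSUM  4      /* bytes per checksum */

/* Serving one client pread takes two disk reads:
 * iop #1 against the checksum file, iop #2 against the block file. */
ssize_t datanode_pread(int block_fd, int meta_fd,
                       void *buf, size_t len, off_t off)
{
    uint32_t crc;
    off_t chunk = off / CHUNK;

    /* iop #1: read the chunk's checksum */
    if (pread(meta_fd, &crc, CSUM, chunk * CSUM) != CSUM)
        return -1;

    /* iop #2: read the data itself */
    ssize_t n = pread(block_fd, buf, len, off);
    /* ... verify the data against crc (omitted) ... */
    return n;
}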
15. High IOPS - Inline Checksums
[Diagram: HDFS block laid out as repeating 4096-byte data chunks, each followed by its 4-byte checksum]
• Checksums inline with data
• Single iop for pread (see the offset sketch below)
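With checksums interleaved into the block file, the same read becomes a single pread at a translated offset. A sketch of the offset arithmetic for the 4096 + 4 layout above, as a simplification of the real format:

#include <stdint.h>

#define CHUNK 4096              /* data bytes per chunk */
#define CSUM  4                 /* checksum bytes per chunk */
#define SLOT  (CHUNK + CSUM)    /* on-disk bytes per chunk */

/* Map a logical (data-only) offset to its on-disk offset when each
 * 4096-byte chunk is immediately followed by its 4-byte checksum. */
int64_t disk_offset(int64_t logical)
{
    return (logical / CHUNK) * SLOT + logical % CHUNK;
}

/* Example: logical offset 8192 (start of chunk 2) lives at
 * 2 * 4100 = 8200 on disk; one pread of 4100 bytes returns the
 * chunk and its checksum together -- a single iop. */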
16. High IOPS - Results
[Charts: number of Puts and Gets above one second; Put average time; Get average time]
17. HBase Locality - HDFS Favored Nodes
▪ Each region’s data on 3 specific datanodes
▪ Locality preserved on failure
▪ Favored nodes persisted at the HBase layer
[Diagram: RegionServer co-located with its local Datanode]
18. HBase Locality - Solution
• Persisting the info in the NameNode is complicated
• Region directory:
▪ /HBASE/<tablename>/<regionname>/cf1/…
▪ /HBASE/<tablename>/<regionname>/cf2/…
• Build a histogram of block locations in the directory
• Pick the lowest-frequency datanode to delete from (see the sketch below)
[Chart: block counts (0–10000) across datanodes D1–D4]
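A toy sketch of the histogram step, not actual HBase code; the fixed-size table and function name are illustrative:

#include <string.h>

#define MAX_NODES 64

/* Count how often each datanode appears among the region's block
 * replica locations, then return the least-frequent node -- the
 * candidate whose replica to delete when a block is over-replicated
 * relative to the favored nodes. */
int pick_replica_to_delete(const char *locations[], int nlocs,
                           const char *nodes[], int *nnodes)
{
    int count[MAX_NODES] = {0};
    *nnodes = 0;

    for (int i = 0; i < nlocs; i++) {
        int j;
        for (j = 0; j < *nnodes; j++)
            if (strcmp(locations[i], nodes[j]) == 0)
                break;
        if (j == *nnodes) {
            if (*nnodes == MAX_NODES)
                continue;              /* table full in this toy sketch */
            nodes[(*nnodes)++] = locations[i];
        }
        count[j]++;
    }

    int min = 0;
    for (int j = 1; j < *nnodes; j++)
        if (count[j] < count[min])
            min = j;
    return min;   /* index into nodes[] of the lowest-frequency node */
}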
21. HBase WAL
[Diagram: the Regionserver pipelines WAL packets to three Datanodes; each write lands in the OS page cache, not the Disk]
• Packets never hit disk
• > 1s outliers!
22. Instrumentation
1. Write to OS cache
2. Write to TCP buffers
3. sync_file_range(SYNC_FILE_RANGE_WRITE)
Steps 1 & 3 show > 1s outliers! (timing sketch below)
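A sketch of what that instrumentation might look like in C; the macro and step names are illustrative:

#include <stdio.h>
#include <time.h>

/* Time one step of the write path and log it if it exceeds 1s. */
#define TIMED_STEP(name, stmt) do {                                  \
    struct timespec t0_, t1_;                                        \
    clock_gettime(CLOCK_MONOTONIC, &t0_);                            \
    stmt;                                                            \
    clock_gettime(CLOCK_MONOTONIC, &t1_);                            \
    double s_ = (t1_.tv_sec - t0_.tv_sec) +                          \
                (t1_.tv_nsec - t0_.tv_nsec) / 1e9;                   \
    if (s_ > 1.0)                                                    \
        fprintf(stderr, "outlier: %s took %.2fs\n", name, s_);       \
} while (0)

/* Usage, mirroring the three instrumented steps:
 *   TIMED_STEP("1. write to OS cache", write(fd, buf, len));
 *   TIMED_STEP("2. write to TCP buffers", send(sock, buf, len, 0));
 *   TIMED_STEP("3. sync_file_range",
 *              sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE));
 */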
24. Interesting Observations
• write(2) outliers correlated with busy disk
• Reproducible by artificially stressing the disk:
dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000
25. Test Program
[Diagram: test file written on the Linux filesystem, sync_file_range every 1MB]
• Aligned 64k writes: no outliers!
• Alternating 63k + 1k writes: outliers reproduced! (sketch below)
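A sketch of the test program, assuming Linux; the key point is that 63k writes are not page-aligned, so a write can touch a page that sync_file_range() already put under writeback:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Issue one write and report it if it stalls for more than 1s. */
static void timed_write(int fd, const char *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (write(fd, buf, len) < 0)
        return;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (s > 1.0)
        printf("outlier: %zu-byte write took %.2fs\n", len, s);
}

/* Write `total` bytes as alternating writes of size a and b
 * (a + b = 64k), with sync_file_range() every 1MB.
 *   a = 64k, b = 0  : page-aligned writes -> no outliers
 *   a = 63k, b = 1k : consecutive writes share a 4k page, so a write
 *                     can hit a page under writeback -> outliers */
static void run(int fd, size_t a, size_t b, size_t total)
{
    static char buf[64 * 1024];
    memset(buf, 0, sizeof(buf));
    off_t done = 0, last = 0;

    while ((size_t)done < total) {
        timed_write(fd, buf, a);
        if (b)
            timed_write(fd, buf, b);
        done += a + b;
        if (done - last >= 1 << 20) {
            sync_file_range(fd, last, done - last, SYNC_FILE_RANGE_WRITE);
            last = done;
        }
    }
}

Running the 63k + 1k variant while the disk is stressed (e.g. with the dd command above) is what reproduces the stalls.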
26. Some suspects
• Too many dirty pages
• Linux stable pages
• Kernel trace points revealed stable pages as the culprit
27. Stable Pages
[Diagram: during WriteBack, the kernel checksums an OS page before handing it to a persistent store (a device with integrity checking)]
• If a page is re-dirtied while under writeback, the device checksum no longer matches: checksum error
• Solution – lock pages under writeback
28. Explanation of Write Outliers
[Diagram: a 4k WAL write dirties an OS page; sync_file_range puts that page under WriteBack to the persistent store; the next WAL write to the same page is blocked until writeback completes]