Facebook Messages uses HBase running on HDFS to store messaging data and metadata. Key reasons for choosing HBase include high write throughput, good random read performance, horizontal scalability, and tight integration with HDFS. Small messages, message metadata, and the search index live in HBase, while large messages and attachments are stored in Haystack. Typical deployments span multiple clusters (cells), each with several racks for redundancy. The system handles tens of billions of read and write operations per day and is continually optimized for performance and reliability.
3. Why we chose HBase
▪ High write throughput
▪ Good random read performance
▪ Horizontal scalability
▪ Automatic failover
▪ Strong consistency
▪ Benefits of HDFS
  ▪ Fault tolerant, scalable, checksums, MapReduce
▪ Internal dev & ops expertise
4. What do we store in HBase
▪ HBase
  ▪ Small messages
  ▪ Message metadata (thread/message indices)
  ▪ Search index
▪ Haystack (our photo store)
  ▪ Attachments
  ▪ Large messages
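The HBase/Haystack split is essentially a size-based routing decision: keep small payloads next to the metadata in HBase, and push large blobs to Haystack. A minimal sketch of that decision, with a hypothetical 1 MB cutoff (the actual threshold is not given here):

    // Sketch: route message bodies/attachments by size (threshold is hypothetical).
    import java.nio.charset.StandardCharsets;

    public class StoreRouter {
        enum Store { HBASE, HAYSTACK }

        // Payloads under the threshold live in HBase with the metadata;
        // larger blobs go to Haystack and only a pointer is kept in HBase.
        static final int HAYSTACK_THRESHOLD_BYTES = 1 << 20; // assumption: 1 MB

        static Store chooseStore(byte[] payload) {
            return payload.length < HAYSTACK_THRESHOLD_BYTES ? Store.HBASE : Store.HAYSTACK;
        }

        public static void main(String[] args) {
            byte[] smallMessage = "Hi!".getBytes(StandardCharsets.UTF_8);
            byte[] attachment = new byte[5 << 20]; // e.g., a 5 MB photo
            System.out.println("small message -> " + chooseStore(smallMessage)); // HBASE
            System.out.println("attachment    -> " + chooseStore(attachment));   // HAYSTACK
        }
    }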
5. HBase-HDFS System Overview
[Diagram: Database layer: HBase, with a Master, a Backup Master, and many Region Servers. Storage layer: HDFS, with a Namenode, a Secondary Namenode, and many Datanodes. Coordination service: a ZooKeeper quorum of ZK peers.]
6. Facebook Messages Architecture
[Diagram: Clients (front end, MTA, etc.) ask the User Directory Service "What's the cell for this user?" and are routed to one of several cells (Cell 1, Cell 2, Cell 3). Each cell has Application Servers backed by an HBase/HDFS/ZooKeeper cluster holding messages, metadata, and the search index; attachments are served from Haystack.]
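The cell routing above is a directory lookup: given a user, find the cell whose application servers own that user's data. A minimal, hypothetical in-memory sketch of that step (the real User Directory Service is a separate system):

    // Sketch: user -> cell lookup, standing in for the User Directory Service.
    import java.util.HashMap;
    import java.util.Map;

    public class UserDirectory {
        private final Map<Long, String> userToCell = new HashMap<>();

        void assign(long userId, String cell) { userToCell.put(userId, cell); }

        // The front end asks "what's the cell for this user?" and then talks to
        // an application server in that cell, which owns the user's HBase data.
        String cellFor(long userId) { return userToCell.getOrDefault(userId, "cell-1"); }

        public static void main(String[] args) {
            UserDirectory dir = new UserDirectory();
            dir.assign(42L, "cell-2");
            System.out.println("user 42 -> " + dir.cellFor(42L)); // cell-2
            System.out.println("user 7  -> " + dir.cellFor(7L));  // default cell-1
        }
    }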
7. Typical Cluster Layout
▪ Multiple clusters/cells for messaging
▪ 20 servers/rack; 5 or more racks per cluster
▪ Controllers (master/ZooKeeper) spread across racks
[Rack layout: Racks #1-#5, each hosting a ZooKeeper peer plus one controller role (HDFS Namenode, Backup Namenode, Job Tracker, HBase Master, Backup HBase Master respectively); the remaining 19 servers in each rack each run a Region Server, Data Node, and Task Tracker.]
8. Facebook Messages: Quick Stats
▪ 6B+ messages/day
▪ Traffic to HBase:
  ▪ 75+ billion R+W ops/day
  ▪ At peak: 1.5M ops/sec
  ▪ ~55% read vs. 45% write ops
  ▪ Avg write op inserts ~16 records across multiple column families
9. Facebook Messages: Quick Stats (contd.)
▪ 2PB+ of online data in HBase (6PB+ with replication; excludes backups)
  ▪ message data, metadata, search index
▪ All data LZO compressed
▪ Growing at 250TB/month
10. Facebook Messages: Quick Stats (contd.)
Timeline:
▪ Started in Dec 2009
▪ Roll out started in Nov 2010
▪ Fully rolled out by July 2011 (migrated 1B+ accounts from legacy messages!)
While in production:
▪ Schema changes: not once, but twice!
▪ Implemented & rolled out HFile V2 and numerous other optimizations in an upward-compatible manner!
11. Shadow Testing: Before Rollout
▪ Both product and infrastructure were changing.
▪ Shadowed the old messages product + chat while the new one was under development
[Diagram: the Messages V1 Frontend and Chat Frontend send production traffic to Application Servers backed by Cassandra, MySQL, and ChatLogger; data was backfilled from MySQL for a percentage of users and their production traffic was mirrored to a shadow HBase/HDFS cluster.]
12. Shadow Testing: After Rollout
▪ Shadows the new version of the Messages product.
▪ All backend changes go through the shadow cluster before a prod push
[Diagram: the new Messages Frontend drives the production Application Servers and HBase/HDFS cluster; the same traffic is mirrored to a second set of Application Servers backed by the shadow HBase/HDFS cluster.]
13. Backup/Recovery (V1)
• During the early phase, we were concerned about potential bugs in HBase.
• Off-line backups: written to HDFS via Scribe
• Recovery tools; testing of recovery tools
• Double logging from the AppServer & HBase to reduce the probability of data loss
[Diagram: the Messages Frontend and Application Servers log via Scribe to HDFS in the local datacenter, where backups are merged & deduped, stored locally, and copied to a remote datacenter's HDFS; the HBase/HDFS cluster logs into the same pipeline.]
14. Backups (V2)
▪ Now does periodic HFile-level backups.
▪ Working on:
  ▪ Moving to HFile + commit-log based backups to be able to recover to finer-grained points in time
  ▪ Avoiding the need to log data to Scribe
  ▪ Zero-copy (hard link based) fast backups
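The appeal of HFile + commit-log based backups is point-in-time recovery: restore the latest HFile-level backup, then replay only the log edits up to the chosen recovery point. A minimal sketch of that idea, with illustrative structures that are not the actual backup pipeline:

    // Sketch: point-in-time restore = load HFile backup, then replay log edits
    // with timestamps up to the recovery point. Illustrative only.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class PointInTimeRestore {
        record LogEdit(long ts, String key, String value) {}

        static TreeMap<String, String> restore(TreeMap<String, String> hfileBackup,
                                               List<LogEdit> commitLog,
                                               long recoverToTs) {
            TreeMap<String, String> table = new TreeMap<>(hfileBackup);
            for (LogEdit e : commitLog) {
                if (e.ts() <= recoverToTs) table.put(e.key(), e.value()); // replay
            }
            return table;
        }

        public static void main(String[] args) {
            TreeMap<String, String> backup = new TreeMap<>();
            backup.put("msg1", "hello");
            List<LogEdit> log = new ArrayList<>();
            log.add(new LogEdit(100, "msg2", "world"));
            log.add(new LogEdit(200, "msg3", "oops")); // after the recovery point
            System.out.println(restore(backup, log, 150)); // {msg1=hello, msg2=world}
        }
    }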
15. Messages Schema & Evolution
▪ "Actions" (data) column family is the source of truth
  ▪ Log of all user actions (addMessage, markAsRead, etc.)
▪ Metadata (thread index, message index, search index), etc. in other column families
▪ Metadata portion of the schema underwent 3 changes:
  ▪ Coarse-grained snapshots (early development; rollout up to 1M users)
  ▪ Hybrid (up to full rollout – 1B+ accounts; 800M+ active)
  ▪ Fine-grained metadata (after rollout)
▪ MapReduce jobs against production clusters!
  ▪ Ran in a throttled way
  ▪ Heavy use of HBase bulk import features
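To make the schema concrete, here is a minimal sketch of how a single addMessage action could fan out into the "Actions" column family plus derived metadata families; the family and qualifier names are illustrative, not the production schema:

    // Sketch: one addMessage action fans out into the Actions CF (source of truth)
    // plus derived metadata CFs. CF/qualifier names are illustrative.
    import java.util.ArrayList;
    import java.util.List;

    public class MessageWrite {
        record Cell(String family, String qualifier, String value) {}

        static List<Cell> addMessage(long actionId, long threadId, long msgId, String body) {
            List<Cell> cells = new ArrayList<>();
            // Source of truth: the raw action log.
            cells.add(new Cell("actions", "a:" + actionId, "addMessage:" + msgId));
            // Derived metadata kept in other column families.
            cells.add(new Cell("thread_index", "t:" + threadId + ":" + msgId, "unread"));
            cells.add(new Cell("message_index", "m:" + msgId, body));
            cells.add(new Cell("search_index", "kw:" + body.split(" ")[0], String.valueOf(msgId)));
            return cells; // in production a single write averages ~16 such cells
        }

        public static void main(String[] args) {
            addMessage(1, 77, 1001, "lunch tomorrow?").forEach(System.out::println);
        }
    }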
16. Write Path Overview
[Diagram: a Region Server hosts many Regions; each Region has a Memstore per ColumnFamily plus HFiles in HDFS; every write is appended and synced to a Write-Ahead Log in HDFS before being applied to the Memstore.]
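A minimal sketch of the write path shown above: the edit is appended and synced to the write-ahead log first, and only then applied to the in-memory memstore, so a crash can replay the log. Class and method names are illustrative, not HBase internals:

    // Sketch of the write path: append+sync to the WAL first, then apply to the
    // memstore. Names are illustrative, not HBase's actual classes.
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.TreeMap;

    public class WritePath {
        private final TreeMap<String, String> memstore = new TreeMap<>(); // sorted in memory
        private final BufferedWriter wal;

        WritePath(Path walPath) throws IOException {
            this.wal = Files.newBufferedWriter(walPath);
        }

        void put(String rowKey, String value) throws IOException {
            wal.write(rowKey + "=" + value + "\n");
            wal.flush();                 // stands in for append + sync to HDFS
            memstore.put(rowKey, value); // only now is the edit visible to reads
        }

        public static void main(String[] args) throws IOException {
            WritePath rs = new WritePath(Files.createTempFile("wal", ".log"));
            rs.put("user42:msg1001", "lunch tomorrow?");
            System.out.println(rs.memstore);
        }
    }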
17. Flushes: Memstore -> HFile
[Diagram: a ColumnFamily's Memstore is flushed to a new HFile.]
Data in an HFile is sorted and has a block index for efficient retrieval.
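A minimal sketch of what a flush produces: the sorted memstore is written out in blocks, and the first key of each block goes into a block index so a read can jump straight to the right block. Block size and format are illustrative:

    // Sketch of a flush: write the sorted memstore as blocks and keep a block
    // index of (first key -> block number). Format and block size are illustrative.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class Flush {
        static final int ENTRIES_PER_BLOCK = 2; // tiny, for demonstration

        record HFile(List<List<String>> blocks, TreeMap<String, Integer> blockIndex) {}

        static HFile flush(TreeMap<String, String> memstore) {
            List<List<String>> blocks = new ArrayList<>();
            TreeMap<String, Integer> index = new TreeMap<>();
            List<String> current = new ArrayList<>();
            for (Map.Entry<String, String> e : memstore.entrySet()) { // already sorted
                if (current.isEmpty()) index.put(e.getKey(), blocks.size());
                current.add(e.getKey() + "=" + e.getValue());
                if (current.size() == ENTRIES_PER_BLOCK) { blocks.add(current); current = new ArrayList<>(); }
            }
            if (!current.isEmpty()) blocks.add(current);
            return new HFile(blocks, index);
        }

        public static void main(String[] args) {
            TreeMap<String, String> memstore = new TreeMap<>();
            memstore.put("a", "1"); memstore.put("b", "2"); memstore.put("c", "3");
            HFile hfile = flush(memstore);
            System.out.println(hfile.blockIndex()); // {a=0, c=1}
            System.out.println(hfile.blocks());
        }
    }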
18. Read Path Overview
[Diagram: a Get against a Region is served from the ColumnFamily's Memstore and its HFiles.]
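A minimal sketch of serving a Get: consult the memstore first, then HFiles from newest to oldest, returning the first hit. (Real HBase merges versions across all stores; this single-version stand-in is simplified.)

    // Sketch of the read path: a Get consults the memstore, then HFiles from
    // newest to oldest, and returns the most recent value found. Simplified.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class ReadPath {
        final TreeMap<String, String> memstore = new TreeMap<>();
        final List<TreeMap<String, String>> hfilesNewestFirst = new ArrayList<>();

        String get(String rowKey) {
            String v = memstore.get(rowKey);
            if (v != null) return v;                  // freshest data wins
            for (TreeMap<String, String> hfile : hfilesNewestFirst) {
                v = hfile.get(rowKey);                // in reality: block index + disk seek
                if (v != null) return v;
            }
            return null;
        }

        public static void main(String[] args) {
            ReadPath cf = new ReadPath();
            TreeMap<String, String> oldHFile = new TreeMap<>();
            oldHFile.put("msg1", "old body");
            cf.hfilesNewestFirst.add(oldHFile);
            cf.memstore.put("msg2", "new body");
            System.out.println(cf.get("msg1")); // old body (from the HFile)
            System.out.println(cf.get("msg2")); // new body (from the memstore)
        }
    }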
19. Compactions
[Diagram: within a ColumnFamily, multiple HFiles are compacted (merged) into fewer, larger HFiles.]
20. Reliability: Early work
▪ HDFS sync support for durability of transactions
▪ Multi-CF transaction atomicity
▪ Several bug fixes in log recovery
▪ New block placement policy in HDFS
  ▪ To reduce the probability of data loss
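The slide only says the new placement policy reduces the probability of data loss; one way to read that (an assumption, not a description of the actual policy) is to constrain each file's replicas to a small node group, so that far fewer combinations of simultaneous machine failures can wipe out all replicas of some block. A sketch of that idea:

    // Sketch: constrained replica placement. Each file's blocks draw replicas from
    // a small, deterministic node group instead of the whole cluster. Illustrative.
    import java.util.ArrayList;
    import java.util.List;

    public class ConstrainedPlacement {
        static final int GROUP_SIZE = 5;   // assumption: nodes per placement group
        static final int REPLICATION = 3;

        // Pick REPLICATION distinct nodes from the file's node group.
        static List<Integer> chooseReplicas(String fileName, int blockIdx, int clusterSize) {
            int groupStart = Math.floorMod(fileName.hashCode(), clusterSize);
            List<Integer> replicas = new ArrayList<>();
            for (int i = 0; i < REPLICATION; i++) {
                replicas.add((groupStart + (blockIdx + i) % GROUP_SIZE) % clusterSize);
            }
            return replicas;
        }

        public static void main(String[] args) {
            for (int b = 0; b < 3; b++) {
                System.out.println("block " + b + " -> nodes " + chooseReplicas("hfile-123", b, 100));
            }
        }
    }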
21. Availability: Early Work
▪ Common reasons for unavailability:
  ▪ S/W upgrades
    ▪ Solution: rolling upgrades
  ▪ Schema changes
    ▪ Applications need new column families, or need to change settings for a CF
    ▪ Solution: online "alter table"
  ▪ Load balancing or cluster restarts took forever
    ▪ Upon investigation: stuck waiting for compactions to finish
    ▪ Solution: interruptible compactions!
22. Performance: Early Work
▪ Read optimizations:
  ▪ Seek optimizations for rows with a large number of cells
  ▪ Bloom filters
    ▪ minimize HFile lookups
  ▪ Timerange hints on HFiles (great for temporal data)
  ▪ Multigets
  ▪ Improved handling of compressed HFiles
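A minimal sketch of how bloom filters and timerange hints save HFile lookups: before touching a file, check whether its bloom filter could contain the row and whether its time range overlaps the query. The bloom filter here is a simplified stand-in (a plain set), not a real bloom filter:

    // Sketch: skip HFiles whose bloom filter rules out the row or whose time range
    // doesn't overlap the query. The "bloom filter" here is a simplified stand-in.
    import java.util.List;
    import java.util.Set;

    public class HFilePruning {
        record HFileMeta(String name, Set<String> bloomKeys, long minTs, long maxTs) {
            boolean mightContain(String row) { return bloomKeys.contains(row); } // no false negatives
        }

        static long lookups = 0;

        static void get(List<HFileMeta> hfiles, String row, long fromTs, long toTs) {
            for (HFileMeta f : hfiles) {
                boolean timeOverlap = f.maxTs() >= fromTs && f.minTs() <= toTs;
                if (!timeOverlap || !f.mightContain(row)) continue; // pruned: no disk seek
                lookups++;
                System.out.println("seek into " + f.name());
            }
        }

        public static void main(String[] args) {
            List<HFileMeta> hfiles = List.of(
                new HFileMeta("hfile-old", Set.of("msg1"), 0, 100),
                new HFileMeta("hfile-new", Set.of("msg1", "msg2"), 101, 200));
            get(hfiles, "msg1", 150, 250); // only hfile-new is consulted
            System.out.println("HFile lookups: " + lookups);
        }
    }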
23. Performance: Compactions
▪ Critical for read performance
▪ Old algorithm ("Was"):
  #1. Start from the newest file (file 0); include the next file if:
      size[i] < size[i-1] * C   (good!)
  #2. Always compact at least 4 files, even if rule #1 isn't met.
▪ Solution ("Now"):
  #1. Compact at least 4 files, but only if eligible files are found.
  #2. Also, new file selection based on summation of sizes (see the sketch below):
      size[i] < (size[0] + size[1] + … + size[i-1]) * C
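A sketch of the "Now" selection rule as stated above: walk files from newest to oldest, keep including a file while its size stays below C times the sum of the sizes already selected, and only compact if at least four files qualify. The ratio C is an assumed value; the slide does not give it:

    // Sketch of the "Now" selection rule from the slide: include file i only if
    // size[i] < C * (size[0] + ... + size[i-1]); compact only if >= MIN_FILES qualify.
    import java.util.ArrayList;
    import java.util.List;

    public class CompactionSelection {
        static final double C = 2.0;      // assumption: the ratio is not given on the slide
        static final int MIN_FILES = 4;   // from the slide

        static List<Long> selectForCompaction(List<Long> sizesNewestFirst) {
            List<Long> selected = new ArrayList<>();
            long sumSoFar = 0;
            for (long size : sizesNewestFirst) {
                if (!selected.isEmpty() && size >= sumSoFar * C) break; // run of eligible files ends
                selected.add(size);
                sumSoFar += size;
            }
            return selected.size() >= MIN_FILES ? selected : List.of(); // skip if too few
        }

        public static void main(String[] args) {
            // Small recent flushes plus one huge old file: the huge file is left alone.
            System.out.println(selectForCompaction(List.of(10L, 12L, 9L, 11L, 5000L)));
            // Too few eligible files: no compaction triggered.
            System.out.println(selectForCompaction(List.of(10L, 12L, 5000L)));
        }
    }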
24. Performance: Compactions
▪ More problems!
  ▪ Read performance dips during peak
  ▪ Major compaction storms
  ▪ Large compactions bottleneck
▪ Enhancements/fixes:
  ▪ Staggered major compactions
  ▪ Multi-threaded compactions; separate queues for small & big compactions (sketched below)
  ▪ Aggressive off-peak compactions
[Diagram: a Small Compaction queue and a Big Compaction queue, each holding pending compactions.]
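A minimal sketch of the small/big queue split: compaction requests above a size threshold go to a dedicated pool so one large compaction cannot hold up the steady stream of small ones. The threshold and pool sizes are illustrative:

    // Sketch: two compaction queues so big compactions can't starve small ones.
    // Threshold and pool sizes are illustrative.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class CompactionQueues {
        static final long BIG_COMPACTION_BYTES = 1L << 30; // assumption: 1 GB

        private final ExecutorService smallCompactions = Executors.newFixedThreadPool(2);
        private final ExecutorService bigCompactions = Executors.newFixedThreadPool(1);

        void submit(String region, long totalBytes) {
            Runnable work = () ->
                System.out.println(Thread.currentThread().getName() + " compacting " + region);
            (totalBytes >= BIG_COMPACTION_BYTES ? bigCompactions : smallCompactions).execute(work);
        }

        public static void main(String[] args) throws InterruptedException {
            CompactionQueues queues = new CompactionQueues();
            queues.submit("region-7", 200L << 20); // 200 MB -> small queue
            queues.submit("region-3", 5L << 30);   // 5 GB   -> big queue
            queues.smallCompactions.shutdown();
            queues.bigCompactions.shutdown();
            queues.smallCompactions.awaitTermination(5, TimeUnit.SECONDS);
            queues.bigCompactions.awaitTermination(5, TimeUnit.SECONDS);
        }
    }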
25. Metrics, metrics, metrics…
▪ Initially, only had coarse-level overall metrics (get/put latency/ops; block cache counters).
▪ Slow query logging
▪ Added per-column-family stats for:
  ▪ ops counts, latency
  ▪ block cache usage & hit ratio
  ▪ memstore usage
  ▪ on-disk file sizes
  ▪ file counts
  ▪ bytes returned, bytes flushed, compaction statistics
  ▪ stats by block type (data blocks vs. index blocks vs. bloom blocks, etc.)
  ▪ bloom filter stats
26. Metrics (contd.)
▪ HBase Master statistics:
  ▪ Number of region servers alive
  ▪ Number of regions
  ▪ Load balancing statistics
  ▪ …
▪ All stats stored in Facebook's Operational Data Store (ODS).
▪ Lots of ODS dashboards for debugging issues
  ▪ Side note: ODS is planning to use HBase for storage pretty soon!
27. Need to keep up as data grows on you!
▪ Rapidly iterated on several new features while in production:
  ▪ Block indexes up to 6GB per server! Cluster startup taking longer and longer. Block cache hit ratio on the decline.
    ▪ Solution: HFile V2
      ▪ Multi-level block index (sketched below), sharded bloom filters
  ▪ Network pegged after restarts
    ▪ Solution: locality on full & rolling restarts
  ▪ High disk utilization during peak
    ▪ Solution: several "seek" optimizations to reduce disk IOPS
      ▪ Lazy seeks (use time hints to avoid seeking into older HFiles)
      ▪ Special bloom filter for deletes to avoid additional seeks
      ▪ Utilize off-peak IOPS to do more aggressive compactions during off-peak hours
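A hedged sketch of the multi-level block index idea behind HFile V2: keep only a small root index resident in memory, and load the relevant leaf index chunk on demand instead of one huge flat index. The two-level structure here is a simplification:

    // Sketch of a two-level block index: a small root index in memory points to
    // leaf index chunks that are loaded on demand. Simplified illustration.
    import java.util.List;
    import java.util.TreeMap;

    public class MultiLevelIndex {
        // Root level: first key of each leaf index chunk -> chunk id (kept in memory).
        final TreeMap<String, Integer> rootIndex = new TreeMap<>();
        // Leaf chunks: first key of each data block -> block offset (loaded on demand).
        final List<TreeMap<String, Long>> leafChunks;

        MultiLevelIndex(List<TreeMap<String, Long>> leafChunks) {
            this.leafChunks = leafChunks;
            for (int i = 0; i < leafChunks.size(); i++) {
                rootIndex.put(leafChunks.get(i).firstKey(), i);
            }
        }

        Long findBlockOffset(String rowKey) {
            var rootEntry = rootIndex.floorEntry(rowKey);
            if (rootEntry == null) return null;
            TreeMap<String, Long> leaf = leafChunks.get(rootEntry.getValue()); // "load" leaf chunk
            var leafEntry = leaf.floorEntry(rowKey);
            return leafEntry == null ? null : leafEntry.getValue();
        }

        public static void main(String[] args) {
            TreeMap<String, Long> chunk0 = new TreeMap<>(); chunk0.put("a", 0L); chunk0.put("f", 4096L);
            TreeMap<String, Long> chunk1 = new TreeMap<>(); chunk1.put("m", 8192L); chunk1.put("t", 12288L);
            MultiLevelIndex index = new MultiLevelIndex(List.of(chunk0, chunk1));
            System.out.println(index.findBlockOffset("g")); // 4096 (block starting at "f")
            System.out.println(index.findBlockOffset("p")); // 8192 (block starting at "m")
        }
    }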
28. Scares & Scars!
▪ Not without our share of scares and incidents:
  ▪ s/w bugs (e.g., deadlocks, incompatible LZO used for bulk-imported data, etc.)
    ▪ found an edge-case bug in log recovery as recently as last week!
  ▪ performance spikes every 6 hours (even off-peak)!
    ▪ cleanup of HDFS's recycle bin was sub-optimal! Needed a code and config fix.
  ▪ transient rack switch failures
  ▪ ZooKeeper leader election took more than 10 minutes when one member of the quorum died. Fixed in a more recent version of ZK.
  ▪ HDFS Namenode – SPOF
  ▪ flapping servers (repeated failures)
29. Scares & Scars! (contd.)
▪ Sometimes, tried things which hadn't been tested in dark launch!
  ▪ Added a rack of servers to help with a performance issue
    ▪ Pegged top-of-rack network bandwidth!
    ▪ Had to add the servers at a much slower pace. Very manual.
    ▪ Intelligent load balancing needed to make this more automated.
▪ A high % of issues caught in shadow/stress testing
▪ Lots of alerting mechanisms in place to detect failure cases
  ▪ Automated recovery for a lot of the common ones
  ▪ Treat alerts on the shadow cluster as high-priority too!
▪ Sharding the service across multiple HBase cells also paid off
30. Future Work
▪ Reliability, availability, scalability!
▪ Lots of new use cases on top of HBase in the works.
▪ HDFS Namenode HA
▪ Recovering gracefully from transient issues
▪ Fast hot backups
▪ Delta encoding in the block cache
▪ Replication
▪ Performance (HBase and HDFS)
▪ HBase as a service / multi-tenancy
▪ Features: coprocessors, secondary indices
31. Acknowledgements
Work of a lot of people spanning HBase, HDFS, migrations, the backup/recovery pipeline, and ops!
Karthik Ranganathan Mikhail Bautin
Nicolas Spiegelberg Pritam Damania
Aravind Menon Prakash Khemani
Dhruba Borthakur Doug Porter
Hairong Kuang Alex Laslavic
Jonathan Gray Kannan Muthukkaruppan
Rodrigo Schmidt + Apache HBase
Amitanand Aiyer + Interns
Guoqiang (Jerry) Chen + Messages FrontEnd Team
Liyin Tang + Messages AppServer Team
Madhu Vaidya + many more folks…