6. Monthly data volume prior to launch
15B x 1,024 bytes = 14TB
120B x 100 bytes = 11TB
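(Checking the arithmetic: 15 x 10^9 x 1,024 bytes ≈ 15.4 x 10^12 bytes, and 120 x 10^9 x 100 bytes = 12 x 10^12 bytes; the 14TB and 11TB figures above line up if read as binary terabytes, i.e. TiB.)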
7. Messaging Data
▪ Small/medium sized data --> HBase
  ▪ Message metadata & indices
  ▪ Search index
  ▪ Small message bodies
▪ Attachments and large messages --> Haystack
  ▪ Used for our existing photo/video store
8. Open Source Stack
▪ Memcached --> App Server Cache
▪ ZooKeeper --> Small Data Coordination Service
▪ HBase --> Database Storage Engine
▪ HDFS --> Distributed FileSystem
▪ Hadoop --> Asynchronous Map-Reduce Jobs
9. Our architecture
[Architecture diagram] Clients (Front End, MTA, etc.) ask the User Directory Service "What's the cell for this user?" and are routed to that user's cell (Cell 1, Cell 2, Cell 3, ...). Each cell runs Application Servers on top of an HBase/HDFS/ZK cluster that stores messages, metadata, and the search index; attachments are stored in Haystack.
11. HBase in a nutshell
• distributed, large-scale data store
• efficient at random reads/writes
• initially modeled after Google's BigTable
• open source project (Apache)
12. When to use HBase?
▪ storing large amounts of data
▪ need high write throughput
▪ need efficient random access within large data sets
▪ need to scale gracefully with data
▪ for structured and semi-structured data
▪ don't need full RDBMS capabilities (cross-table transactions, joins, etc.)
13. HBase Data Model
• An HBase table is:
  • a sparse, three-dimensional array of cells, indexed by: RowKey, ColumnKey, Timestamp/Version
  • sharded into regions along an ordered RowKey space
• Within each region:
  • data is grouped into column families
  ▪ sort order within each column family: Row Key (asc), Column Key (asc), Timestamp (desc)
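To make the addressing concrete, here is a minimal client-side sketch using the HBase Java API; the table name "messages" and the family/qualifier names are invented for illustration and are not from the talk.

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("messages"))) {   // hypothetical table
            // A cell is addressed by (RowKey, ColumnFamily:Qualifier, Timestamp/Version).
            Put put = new Put(Bytes.toBytes("user123"));            // RowKey
            put.addColumn(Bytes.toBytes("meta"),                    // column family
                          Bytes.toBytes("subject"),                 // column qualifier
                          17L,                                      // explicit timestamp/version
                          Bytes.toBytes("hello"));                  // cell value
            table.put(put);

            // Read back the newest version of that cell.
            Get get = new Get(Bytes.toBytes("user123"));
            get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("subject"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("subject"));
            System.out.println(Bytes.toString(value));
        }
    }
}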
14. Example: Inbox Search
• Schema
  • Key: RowKey: userid, Column: word, Version: MessageID
  • Value: auxiliary info (like offset of word in message)
• Data is stored sorted by <userid, word, messageID>:
  User1:hi:17 -> offset1
  User1:hi:16 -> offset2
  User1:hello:16 -> offset3
  User1:hello:2 -> offset4
  ...
  User2:...
  ...
• Can efficiently handle queries like (see the sketch below):
  - Get top N messageIDs for a specific user & word
  - Typeahead query: for a given user, get words that match a prefix
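A hedged sketch of these two query patterns with the HBase Java client; the column family name "s" and the table handle are assumptions, since the slide only fixes row = userid, column = word, version = messageID.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class InboxSearchQueries {
    static final byte[] FAMILY = Bytes.toBytes("s");    // assumed column family for the search index

    // Top-N messageIDs for one (user, word): versions come back newest-first.
    static void topN(Table table, String user, String word, int n) throws IOException {
        Get get = new Get(Bytes.toBytes(user));
        get.addColumn(FAMILY, Bytes.toBytes(word));
        get.setMaxVersions(n);                           // return up to n versions of this cell
        for (Cell cell : table.get(get).rawCells()) {
            long messageId = cell.getTimestamp();        // the version field carries the messageID
            byte[] offset = CellUtil.cloneValue(cell);   // auxiliary info (e.g. word offset)
            System.out.println(messageId + " -> " + Bytes.toString(offset));
        }
    }

    // Typeahead: all of this user's indexed words starting with a given prefix.
    static List<Cell> typeahead(Table table, String user, String prefix) throws IOException {
        Get get = new Get(Bytes.toBytes(user));
        get.addFamily(FAMILY);
        get.setFilter(new ColumnPrefixFilter(Bytes.toBytes(prefix)));
        return table.get(get).listCells();               // one cell per matching word
    }
}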
15. HBase System Overview
[Diagram] Database layer: HBASE, with one active Master, a Backup Master, and many Region Servers. Coordination service: a ZooKeeper quorum of ZK peers. Storage layer: HDFS, with a Namenode, a Secondary Namenode, and many Datanodes.
16. HBase Overview
[Diagram] An HBASE Region Server hosts many regions (Region #1, Region #2, ...). Within a region, each column family (ColumnFamily #1, ColumnFamily #2, ...) has a memstore (an in-memory data structure) that is flushed to HFiles in HDFS; edits are also appended to a Write Ahead Log in HDFS.
17. HBase Overview
• Very good at random reads/writes
• Write path
  • sequential write/sync to commit log
  • update memstore
• Read path
  • lookup memstore & persistent HFiles
  • HFile data is sorted and has a block index for efficient retrieval
• Background chores
  • flushes (memstore -> HFile)
  • compactions (group of HFiles merged into one)
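To picture that path, here is a toy model of one region (a conceptual sketch for illustration only, not HBase's actual implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual sketch of one region's write/read path.
class RegionSketch {
    private final List<String> commitLog = new ArrayList<>();                    // stands in for the WAL in HDFS
    private final NavigableMap<String, String> memstore = new TreeMap<>();       // sorted in-memory buffer
    private final List<NavigableMap<String, String>> hfiles = new ArrayList<>(); // each map stands in for a sorted, immutable HFile
    private static final int FLUSH_THRESHOLD = 1_000;

    void put(String key, String value) {
        commitLog.add(key + "=" + value);          // 1. sequential append/sync to the commit log
        memstore.put(key, value);                  // 2. update the memstore
        if (memstore.size() >= FLUSH_THRESHOLD) {  // background chore: flush memstore -> new "HFile"
            hfiles.add(new TreeMap<>(memstore));
            memstore.clear();
        }
    }

    String get(String key) {
        String v = memstore.get(key);              // check the memstore first...
        if (v != null) return v;
        for (int i = hfiles.size() - 1; i >= 0; i--) {   // ...then persistent HFiles, newest first
            v = hfiles.get(i).get(key);            // a real HFile is consulted via its block index
            if (v != null) return v;
        }
        return null;
    }

    // A compaction would merge several of these "HFiles" into one, applying
    // deletes/TTLs/overwrites so reads touch fewer files.
}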
19. Horizontal scalability
▪ HBase & HDFS are elastic by design
▪ Multiple table shards (regions) per physical server
▪ On node additions
  ▪ Load balancer automatically reassigns shards from overloaded nodes to new nodes
  ▪ Because the filesystem underneath is itself distributed, data for reassigned regions is instantly servable from the new nodes
▪ Regions can be dynamically split into smaller regions
  ▪ Pre-sharding is not necessary
  ▪ Splits are near instantaneous!
20. Automatic Failover
▪ Node failures automatically detected by HBase Master
▪ Regions on a failed node are distributed evenly among surviving nodes
▪ Multiple regions/server model avoids need for substantial overprovisioning
▪ HBase Master failover
  ▪ 1 active, rest standby
  ▪ When the active master fails, a standby automatically takes over
21. HBase uses HDFS
We get the benefits of HDFS as a storage system for free:
▪ Fault tolerance (block-level replication for redundancy)
▪ Scalability
▪ End-to-end checksums to detect and recover from corruptions
▪ Map-Reduce for large-scale data processing
▪ HDFS already battle-tested inside Facebook
  ▪ running petabyte-scale clusters
  ▪ lots of in-house development and operational experience
22. Simpler Consistency Model
▪ HBase's strong consistency model
  ▪ simpler for a wide variety of applications to deal with
  ▪ client gets the same answer no matter which replica the data is read from
▪ Eventual consistency: tricky for applications fronted by a cache
  ▪ replicas may heal eventually during failures
  ▪ but stale data could remain stuck in the cache
23. Typical Cluster Layout
▪ Multiple clusters/cells for messaging
  ▪ 20 servers/rack; 5 or more racks per cluster
▪ Controllers (master/ZooKeeper) spread across racks
[Rack layout] 5 racks of 20 servers each. Each rack hosts one ZooKeeper Peer and one controller daemon (Rack #1: HDFS Namenode, Rack #2: Backup Namenode, Rack #3: Job Tracker, Rack #4: HBase Master, Rack #5: Backup Master); the remaining servers in each rack (19x) each run a Region Server, Data Node, and Task Tracker.
26. Goal of Zero Data Loss/Correctness
▪ sync support added to hadoop-20 branch
  ▪ for keeping the transaction log (WAL) in HDFS
  ▪ to guarantee durability of transactions
▪ Row-level ACID compliance
▪ Enhanced HDFS's Block Placement Policy:
  ▪ Original: rack aware, but minimally constrained
  ▪ Now: placement of replicas constrained to configurable node groups
  ▪ Result: data loss probability reduced by orders of magnitude
27. Availability/Stability improvements
▪ HBase master rewrite: region assignments using ZK
▪ Rolling restarts: doing software upgrades without downtime
▪ Interruptible compactions: prioritize availability over minor perf gains
▪ Timeouts on client-server RPCs
▪ Staggered major compactions to avoid compaction storms
28. Performance Improvements
▪ Compactions
  ▪ critical for read performance
  ▪ improved compaction algorithm
  ▪ delete/TTL/overwrite processing in minor compactions
▪ Read optimizations:
  ▪ seek optimizations for rows with a large number of cells
  ▪ Bloom filters to minimize HFile lookups (see the sketch after this list)
  ▪ timerange hints on HFiles (great for temporal data)
  ▪ improved handling of compressed HFiles
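For the Bloom filter item, a minimal sketch of enabling a row-level Bloom filter on a column family via the HBase admin API; the table and family names are invented, and this only illustrates the feature, not Facebook's actual configuration.

import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class BloomFilterSetup {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Admin admin = conn.getAdmin()) {
            HColumnDescriptor cf = new HColumnDescriptor("meta");   // hypothetical family
            cf.setBloomFilterType(BloomType.ROW);    // per-HFile Bloom filter keyed on row, so point
                                                     // reads can skip HFiles that cannot contain the row
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("messages"));
            table.addFamily(cf);
            admin.createTable(table);
        }
    }
}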
29. Operational Experiences
▪ Dark launch:
  ▪ shadow traffic on test clusters for continuous, at-scale testing
  ▪ experiment/tweak knobs
  ▪ simulate failures, test rolling upgrades
▪ Constant (pre-sharding) region count & controlled rolling splits
▪ Administrative tools and monitoring
  ▪ alerts (HBCK, memory alerts, perf alerts, health alerts)
  ▪ auto detecting/decommissioning misbehaving machines
  ▪ dashboards
▪ Application-level backup/recovery pipeline
30. Working within the Apache community
▪ Growing with the community
  ▪ started with a stable, healthy project
  ▪ in-house expertise in both HDFS and HBase
  ▪ increasing community involvement
▪ Undertook massive feature improvements with community help
  ▪ HDFS 0.20-append branch
  ▪ HBase Master rewrite
▪ Continually interacting with the community to identify and fix issues
  ▪ e.g., large responses (2GB RPC)
33. Move messaging data from MySQL to HBase
▪ In MySQL, inbox data was kept normalized
  ▪ a user's messages are stored across many different machines
▪ Migrating a user is basically one big join across tables spread over many different machines
▪ Multiple terabytes of data (for over 500M users)
▪ Cannot pound 1000s of production UDBs to migrate users
34. How we migrated
▪ Periodically, get a full export of all the users' inbox data in MySQL
▪ Then use the bulk loader to import the above into a migration HBase cluster
▪ To migrate users (per-user flow sketched below):
  ▪ Since users may continue to receive messages during migration:
    ▪ double-write (to the old and new systems) during the migration period
  ▪ Get a list of all recent messages (since the last MySQL export) for the user
  ▪ Load the new messages into the migration HBase cluster
  ▪ Perform the join operations to generate the new data
  ▪ Export it and upload it into the final cluster
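A sketch of that per-user flow; every type and method name below is a hypothetical stand-in, since the slides only give the sequence of steps.

import java.util.List;

// Hypothetical sketch of the per-user migration flow; these interfaces do not
// exist in the real system, they only stand in for the steps listed above.
class UserMigrationSketch {
    interface MySqlInboxes   { List<byte[]> messagesSince(long userId, long sinceTimestamp); }
    interface MigrationHBase { void load(long userId, List<byte[]> messages); byte[] joinInboxData(long userId); }
    interface FinalHBase     { void upload(long userId, byte[] joinedInbox); }
    interface DoubleWriter   { void enable(long userId); }

    MySqlInboxes mysql;
    MigrationHBase migrationCluster;   // already holds the bulk-loaded MySQL export
    FinalHBase finalCluster;
    DoubleWriter doubleWriter;
    long lastFullExportTime;

    void migrateUser(long userId) {
        doubleWriter.enable(userId);                     // users keep receiving messages: write to old + new systems
        List<byte[]> recent = mysql.messagesSince(userId, lastFullExportTime);  // messages newer than the last export
        migrationCluster.load(userId, recent);           // load them next to the bulk-loaded data
        byte[] joined = migrationCluster.joinInboxData(userId);   // the "one big join" into the new schema
        finalCluster.upload(userId, joined);             // export and upload into the final cluster
    }
}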
36. Facebook Insights Goes Real-Time
▪ Recently launched real-time analytics for social plugins on top of HBase
▪ Publishers get real-time distribution/engagement metrics:
  ▪ # of impressions, likes
  ▪ analytics by
    ▪ domain, URL, demographics
    ▪ over various time periods (the last hour, day, all-time)
▪ Makes use of HBase capabilities like:
  ▪ efficient counters (read-modify-write increment operations)
  ▪ TTL for purging old data
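A minimal sketch of those two capabilities with the HBase Java client; the row, family, and qualifier names are invented for illustration, and the Table handle is assumed to be open.

import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RealtimeAnalyticsSketch {
    // Atomic server-side read-modify-write increment of one counter cell,
    // e.g. impressions for a given domain.
    static long countImpression(Table table, String domain) throws IOException {
        return table.incrementColumnValue(
                Bytes.toBytes(domain),           // row key: the domain/URL being tracked
                Bytes.toBytes("hourly"),         // hypothetical column family for hourly stats
                Bytes.toBytes("impressions"),    // counter qualifier
                1L);                             // amount to add
    }

    // TTL is a column-family setting: cells older than the TTL are purged
    // automatically at flush/compaction time, so old data cleans itself up.
    static HColumnDescriptor hourlyFamily() {
        HColumnDescriptor cf = new HColumnDescriptor("hourly");
        cf.setTimeToLive(7 * 24 * 3600);         // keep hourly counters for one week
        return cf;
    }
}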
37. Future Work
It is still early days...!
▪ Namenode HA (AvatarNode)
▪ Fast hot-backups (Export/Import)
▪ Online schema & config changes
▪ Running HBase as a service (multi-tenancy)
▪ Features (like secondary indices, batching hybrid mutations)
▪ Cross-DC replication
▪ Lots more performance/availability improvements