2. History
Stats
How We Store Data
Challenges
Mistakes We Made
Tips / Patterns
Future
Moral of the Story
3. 2008 – Flurry Analytics for Mobile Apps
Sharded MySQL, or
HBase!
Launched on 0.18.1 with a 3 node cluster
Great community
Now running 0.94.5 (+ patches)
2 data centers with 2 clusters each
Bidirectional replication
4. 1000 slave nodes per cluster
32 GB RAM, 4 drives (1 or 2 TB), 1 GigE, dual quad-core * 2 HT = 16 procs
DataNode, TaskTracker, RegionServer (11 GB), 5 Mappers, 2 Reducers
~30 tables, 250k regions, 430TB (after LZO)
2 big tables are about 90% of that
▪ 1 wide table: 3 CF, 4 billion rows, up to 1MM cells per row
▪ 1 tall table: 1 CF, 1 trillion rows, most 1 cell per row
5. 12 physical nodes
5 region servers with 20GB heaps on each
1 table - 8 billion small rows - 500GB (LZO)
All in block cache (after 20 minute warmup)
100k-1MM QPS - 99.9% Reads
2ms mean, 99% <10ms
25 ms GC pause every 40 seconds
slow after compaction
6. DAO for Java apps
Requires:
▪ writeRowIndex / readRowIndex
▪ readKeyValue / writeRowContents
Provides:
▪ save / delete
▪ streamEntities / pagination
▪ MR input formats on entities (rather than Result)
Uses HTable or asynchbase
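The shape of such a DAO might look like the following. This is a minimal in-memory sketch, not Flurry's actual code: the method names (writeRowIndex, readRowIndex, writeRowContents, readKeyValue, save, delete, streamEntities) come from the slide, but the signatures, the Codec interface, and the TreeMap backing store are assumptions standing in for HTable/asynchbase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-entity codec: the methods the slide says an entity must provide.
interface Codec<T> {
    byte[] writeRowIndex(T entity);              // entity -> row key
    T readRowIndex(byte[] rowKey);               // row key -> entity skeleton
    byte[] writeRowContents(T entity);           // entity -> cell value(s)
    T readKeyValue(byte[] rowKey, byte[] value); // row + value -> full entity
}

// Hypothetical DAO built on that codec. A sorted map stands in for an HBase
// table (lexicographic row-key order), so scans come back in key order.
class EntityDao<T> {
    private final Codec<T> codec;
    private final TreeMap<String, byte[]> table = new TreeMap<>();

    EntityDao(Codec<T> codec) { this.codec = codec; }

    public void save(T entity) {
        table.put(new String(codec.writeRowIndex(entity)),
                  codec.writeRowContents(entity));
    }

    public void delete(T entity) {
        table.remove(new String(codec.writeRowIndex(entity)));
    }

    public T get(byte[] rowKey) {
        byte[] value = table.get(new String(rowKey));
        return value == null ? null : codec.readKeyValue(rowKey, value);
    }

    // Scan the whole table, decoding each row into an entity (rather than
    // handing callers a raw Result, as the slide notes for MR input formats).
    public List<T> streamEntities() {
        List<T> out = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : table.entrySet())
            out.add(codec.readKeyValue(e.getKey().getBytes(), e.getValue()));
        return out;
    }
}
```

The payoff of this split is that application code and MapReduce jobs see typed entities, while key-format details live in one codec per entity.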
7. Change row key format
DAO supports both formats
1. Create new table
2. Writes to both
3. Migrate existing
4. Validate
5. Switch reads to new table
6. Write to (only) new table
7. Drop old table
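The dual-write window (steps 2–6) can be sketched as a store that knows both key formats and flips reads once the backfill is validated. A hypothetical in-memory sketch; the key transforms and HashMap-backed tables are stand-ins, not the real row-key formats:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the migration's dual-write phase: every write goes
// to both the old- and new-format tables (step 2); once the backfill is
// validated, a flag moves reads to the new table (step 5). HashMaps stand in
// for the two HBase tables.
class MigratingStore {
    private final Map<String, String> oldTable = new HashMap<>();
    private final Map<String, String> newTable = new HashMap<>();
    private volatile boolean readFromNew = false;

    // Assumed key transforms; real ones would reorder or re-encode the row key.
    private String oldKey(String id) { return "old:" + id; }
    private String newKey(String id) { return "new:" + id; }

    public void put(String id, String value) {
        oldTable.put(oldKey(id), value);   // step 2: write both formats
        newTable.put(newKey(id), value);
    }

    public String get(String id) {
        return readFromNew ? newTable.get(newKey(id))
                           : oldTable.get(oldKey(id));
    }

    public void switchReadsToNew() { readFromNew = true; }  // step 5
}
```

Because the DAO hides both formats behind one interface, each step is a config flip rather than an application change, and the old table can be dropped only after validation passes.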
8. Bottlenecks (not horizontally scalable)
HMaster (e.g. HLog cleaning falls behind creation [HBASE-9208])
NameNode
▪ Disable table / shutdown => many HDFS files at once
▪ Scan table directory => slow region assignments
ZooKeeper (HBase replication)
JobTracker (heap)
META region
9. Too many regions (250k)
Max region size 256 MB -> 1 GB -> 5 GB
Slow reassignments on failure
Slow hbck recovery
Lots of META queries / big client cache
▪ Soft refs can exacerbate
Slow rolling restarts
More failures (common and otherwise)
Zombie RS
10. Latency long tail
HTable write buffer flush
GC pauses
RegionServer failure
(See The Tail at Scale – Jeff Dean, Luiz André Barroso)
11. Shared cluster for MapReduce and live queries
IO bound requests hog handler threads
Even cached reads get slow
RegionServer falls behind, stays behind
If the cluster goes down, it takes a while to come back
12. HDFS-5042 Completed files lost after power failure
ZOOKEEPER-1277 servers stop serving when lower 32 bits of zxid roll over
ZOOKEEPER-1731 Unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
13. Small region size -> many regions
Nagle’s algorithm
Trying to solve a crisis you don’t understand
(hbck fixSplitParents)
Setting up replication
Custom backup / restore
CopyTable OOM
Verification
14. Compact data matters (even with compression)
Block cache, network not compressed
Avoid random reads on non cached tables (duh!)
Write cell fragments, combine at read time to avoid doing random reads
compact later - coprocessor?
can lead to large rows
▪ probabilistic counter
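The cell-fragment pattern above can be sketched as an append-only counter: each increment writes its own cell (no read-modify-write), reads sum the fragments, and a later compaction pass collapses them (the job the slide suggests for a coprocessor). A hypothetical in-memory version; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the write-fragments pattern: increments append a new
// cell instead of doing a random read before each write; the read path
// combines fragments, and compact() collapses a row to one cell.
class FragmentCounter {
    // Each row holds a list of fragment cells, one per increment.
    private final Map<String, List<Long>> rows = new HashMap<>();

    public void increment(String row, long delta) {
        rows.computeIfAbsent(row, k -> new ArrayList<>()).add(delta); // write-only path
    }

    public long read(String row) {
        long sum = 0;
        for (long v : rows.getOrDefault(row, List.of())) sum += v; // combine at read time
        return sum;
    }

    public void compact(String row) {
        long sum = read(row);              // collapse all fragments...
        List<Long> one = new ArrayList<>();
        one.add(sum);
        rows.put(row, one);                // ...into a single cell
    }

    public int fragments(String row) {
        return rows.getOrDefault(row, List.of()).size();
    }
}
```

The trade-off is the one the slide notes: rows grow until compaction runs, so unbounded increment rates need a cap or an approximation such as a probabilistic counter.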
15. HDFS HA
Snapshots (see how it works with 100k regions on 1000 servers)
2000 node clusters
test those bottlenecks
larger regions, larger HDFS blocks, larger HLogs
More (independent) clusters
Load aware balancing?
Separate RPC priorities for workloads
0.96
16. Scaled 1000x and more on the same DB
If you’re on the edge you need to understand your system
Monitor
Open Source
Load test
Know your load
Disk or Cache (or SSDs?)