Apache HBase has been widely adopted at many enterprises. In this talk we will cover a few war stories from troubleshooting, tuning, and fixing problems with HBase clusters. We will also cover some of the best practices, tools, utilities, and lessons learned from evaluating deployments at different organizations.
4. Intro to HBase
• Fault Tolerant
• Horizontally Scalable
• Real-time Random read-write access to data
stored in HDFS
• Millions of queries / second
• Support for transactions at a single row level
• Bloom filters
• Automatic Sharding
• Implemented in Java
5. Data Model
• Data is stored in Tables
• Tables contain rows
– Rows are referenced by a unique key - Rowkey
• Rows are made of columns which are grouped
in column families
• Rows are sorted
• Everything is stored as a sequence of bytes
• All entries are versioned and timestamped
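The data model above can be illustrated with a toy in-memory table. This is only a sketch: `TinyTable` and its methods are hypothetical names made up for illustration, not part of any HBase API, and the real storage engine is far more involved.

```python
from bisect import insort


class TinyTable:
    """Toy model of the HBase data layout (hypothetical, for illustration only)."""

    def __init__(self):
        # Cells are kept sorted by (row, family, qualifier, -timestamp), so rows
        # come back in byte order and the newest version of a cell comes first.
        self._cells = []

    def put(self, row: bytes, family: bytes, qualifier: bytes, ts: int, value: bytes):
        # Everything is bytes; the timestamp is negated to sort newest-first.
        insort(self._cells, (row, family, qualifier, -ts, value))

    def get(self, row: bytes, family: bytes, qualifier: bytes, max_versions: int = 1):
        # Return up to max_versions (timestamp, value) pairs, newest first.
        hits = [
            (-neg_ts, value)
            for r, f, q, neg_ts, value in self._cells
            if (r, f, q) == (row, family, qualifier)
        ]
        return hits[:max_versions]

    def scan_rows(self):
        # Distinct row keys in sorted order, as a scan would visit them.
        rows = []
        for r, *_ in self._cells:
            if not rows or rows[-1] != r:
                rows.append(r)
        return rows
```

Writing the same cell twice with different timestamps keeps both versions, and a `get` returns the newest one unless more versions are requested.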
10. HBase API
• API is simple
• Operations
– Get, Put, Delete, Scan, MapReduce
• Connection
• Create this instance only once per application and
share it during its runtime
• HTable
– ZooKeeper
• hbase:meta
11. Column Families
• All columns that are accessed together need to be
grouped into a Column Family
• No need to access or load data that is not used
• Settings are defined at the column family level,
such as
– compression, version retention policy, cache priority
– Understand the data and its access patterns, then
group columns into families accordingly
• Column Family and Column Qualifiers are stored
as bytes
– Avoid being verbose
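A rough back-of-the-envelope sketch of why short names matter: the family and qualifier names are stored with every cell, so their length is multiplied by the number of cells. The function names and the example family/qualifier names below are hypothetical, and the estimate ignores compression and block encoding, which reduce the cost on disk.

```python
def name_bytes_per_cell(family: bytes, qualifier: bytes) -> int:
    # Bytes spent on the family and qualifier names in a single cell.
    return len(family) + len(qualifier)


def total_name_bytes(family: bytes, qualifier: bytes, num_cells: int) -> int:
    # The names are repeated for every cell, so the cost scales linearly.
    return name_bytes_per_cell(family, qualifier) * num_cells


# Hypothetical example: one million cells with verbose vs. short names.
verbose = total_name_bytes(b"customer_details", b"first_name", 1_000_000)
short = total_name_bytes(b"d", b"fn", 1_000_000)
saved = verbose - short  # bytes saved before any compression or encoding
```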
13. HBase Compactions
• HDFS does not support updates
– HFiles are immutable
– New HFiles are created
• Minor Compactions
– Small HFiles are merged into larger HFiles
– Deletes are not applied
• Major Compactions
– All HFiles within a column family are merged into a
single HFile
– Deletes are applied
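The difference between the two compaction types can be sketched with a toy model. This is a simplified illustration, not HBase's actual compaction code: each "HFile" is just a list of `(row, timestamp, value)` tuples, a value of `None` stands in for a delete marker, and all names are hypothetical.

```python
TOMBSTONE = None  # stand-in for a delete marker in this toy model


def minor_compact(hfiles):
    """Merge several small files into one sorted file, keeping delete markers."""
    cells = [cell for hfile in hfiles for cell in hfile]
    # Sort by row, then newest timestamp first.
    return sorted(cells, key=lambda c: (c[0], -c[1]))


def major_compact(hfiles):
    """Merge all files into one and apply deletes, dropping the markers."""
    merged = minor_compact(hfiles)
    deleted_at = {}  # row -> timestamp of the newest delete marker seen
    result = []
    for row, ts, value in merged:
        if value is TOMBSTONE:
            deleted_at[row] = max(ts, deleted_at.get(row, ts))
            continue  # the marker itself is not written out
        if row in deleted_at and ts <= deleted_at[row]:
            continue  # data at or before the delete is dropped
        result.append((row, ts, value))
    return result
```

In this model a minor compaction only reduces the number of files, while a major compaction is the point where deleted data actually disappears.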
18. Pre-Splitting
• Region splitting
– A region grows until it needs to be split
– A region is served by only one Region Server at a time
• Pre-split a table into regions at table creation
time
– Uniformly distribute write load across region servers
– Understand the keyspace
• Risk of uneven load distribution
• Auto splitting
– ConstantSizeRegionSplitPolicy
– IncreasingToUpperBoundRegionSplitPolicy
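As a sketch of what "understand the keyspace" means in practice, the split points for a table can be computed up front when the row keys are known to be uniformly distributed. The example below assumes fixed-width lowercase hexadecimal row keys (for instance, a hash prefix); the function names are hypothetical, and a skewed keyspace would need different split points.

```python
from bisect import bisect_right


def hex_split_points(num_regions: int, key_width: int = 8) -> list:
    """Evenly spaced split keys for a keyspace of fixed-width hex strings."""
    if num_regions < 2:
        return []
    space = 16 ** key_width
    return [
        format(i * space // num_regions, f"0{key_width}x")
        for i in range(1, num_regions)
    ]


def region_for_key(split_points: list, row_key: str) -> int:
    """Index of the region that would hold row_key (start keys are inclusive)."""
    return bisect_right(split_points, row_key)


splits = hex_split_points(4, key_width=2)  # 3 split points give 4 regions
```

If the real keys do not follow the assumed distribution, some regions receive most of the writes, which is the uneven-load risk noted above.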
19. Bulk Loading
• Native API
– Disable WAL
• MapReduce job to generate HFiles
– Load using the completebulkload / ImportTsv tools
• Loads into relevant region
– Faster than going through normal write path
• No writes to WAL and Memstore
• No flushing and compacting
20. Troubleshooting
• ulimit -n / ulimit -u
– Limits on the number of open files (-n) and processes (-u)
• HBase is a database and needs to open a large number
of files
• dfs.datanode.max.transfer.threads
• Network
• OS Parameters
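A quick way to check the open-file limit from the first bullet is to read it from inside a process running as the HBase user. The sketch below uses Python's standard `resource` module (Unix only). The `minimum` threshold of 10240 is an illustrative assumption, not a value taken from this talk; pick a value suited to the cluster.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits reported by `ulimit -n` for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_too_low(soft: int, minimum: int = 10240) -> bool:
    """True if the soft limit is finite and below the assumed minimum."""
    return soft != resource.RLIM_INFINITY and soft < minimum


soft, hard = open_file_limits()
if is_limit_too_low(soft):
    print(f"open-file soft limit {soft} looks low for a RegionServer host")
```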
23. Tuning
• Heavy Writes
– Flushes, compactions, and splits increase I/O and degrade
cluster performance
• Keep region sizes larger
• Keep HFile size large
• Heavy Sequential Reads
• Higher block size
• Avoid block caching on the table
• Heavy Random Reads
• Larger block cache
• Lower MemStore limit
• Smaller block size
24. Apache Phoenix
• SQL over HBase
– Compiles into HBase scans
– Orchestrates parallel execution
– Aggregate queries
• JDBC API over the native HBase API
• Salt buckets, pre-splitting
• Trafodion
– Transactional SQL on HBase
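The idea behind the salt buckets mentioned above can be sketched as follows: a small prefix derived from a hash of the row key is prepended, so that sequential keys spread across several pre-split regions instead of all landing in one. This is a generic illustration, not Phoenix's implementation; CRC32 is used here only because it is in the standard library, and `salted_key` is a hypothetical name.

```python
import zlib


def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prefix row_key with a one-byte bucket number derived from its hash."""
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in one byte (1..256)")
    # CRC32 stands in for the hash function; this choice is an assumption.
    salt = zlib.crc32(row_key) % buckets
    return bytes([salt]) + row_key


# Sequential keys (e.g. timestamps) end up with different one-byte prefixes.
keys = [f"2015-06-01T00:00:{s:02d}".encode() for s in range(60)]
prefixes = sorted({salted_key(k, 8)[0] for k in keys})
```

The trade-off is on the read side: a range scan over the original key order has to fan out to every bucket and merge the results.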
25. Hannibal
• Monitor and maintain HBase Clusters
• How well regions are balanced over the
cluster
• How well regions are split for each table
• How regions evolve over time
• How long compactions take
• Integration with HUE