HBase adoption continues to grow rapidly, driven by customer success and ongoing innovation. HBase, with its horizontal scalability, high reliability, and deep integration with Hadoop ecosystem tools, offers enterprise developers a rich platform on which to build their next-generation applications. In this workshop we will explore HBase SQL capabilities, deep Hadoop ecosystem integrations, and deployment & management best practices.
Apache HBase is a NoSQL database built natively on Hadoop and HDFS.
HBase scales horizontally, so you can store and manage huge datasets with great performance and low cost.
HBase caches hot data in memory so data access happens in milliseconds.
HBase offers a flexible schema: you decide your schema at read or write time, so HBase is great for dealing with messy, multi-structured data.
HBase offers both SQL and NoSQL APIs: NoSQL via HBase's native interface, or SQL via Apache Phoenix, a SQL layer that runs on top of HBase.
Finally, because HBase is native to Hadoop, data in HBase can be processed in MapReduce, Tez or any of the dozens of other tools in the Hadoop analytics world.
HBase is used by some of the biggest web companies, like Facebook, who use it for their Messages and Nearby Friends features, and eBay, who use it for search indexing.
If you're new to HBase and want to learn more, check out hortonworks.com/hadoop/hbase.
See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
Table == Sorted map of maps (like an OrderedDictionary or TreeMap. It’s all just bytes!)
Access by coordinates: rowkey, column family, column qualifier, timestamp
Basic KV operations: GET, PUT, DELETE
Complex query: SCAN over rowkey range (remember, ordered rowkeys. *this* is schema design)
INCREMENT, APPEND, CheckAnd{Put,Delete} (server-side atomic. Requires a lock; can be contentious)
NO: secondary indices, joins, multi-row transactions
Column-Family oriented.
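The "sorted map of maps" model above can be sketched in plain Python. This is an illustrative toy, not HBase client code: the class and method names are invented for this example, but the coordinates (rowkey, family, qualifier, timestamp) and the GET/PUT/DELETE/SCAN behavior mirror the bullets above.

```python
import time

class ToyTable:
    """Toy model of an HBase table: a sorted map of maps, keyed by bytes."""

    def __init__(self):
        # rowkey -> family -> qualifier -> {timestamp: value}; everything is bytes
        self.rows = {}

    def put(self, rowkey, family, qualifier, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1000)
        cells = (self.rows.setdefault(rowkey, {})
                          .setdefault(family, {})
                          .setdefault(qualifier, {}))
        cells[ts] = value

    def get(self, rowkey, family, qualifier):
        # Return the newest version, mirroring HBase's default GET behavior.
        cells = self.rows.get(rowkey, {}).get(family, {}).get(qualifier, {})
        if not cells:
            return None
        return cells[max(cells)]

    def delete(self, rowkey):
        self.rows.pop(rowkey, None)

    def scan(self, start, stop):
        # SCAN over a rowkey range: rows come back in sorted rowkey order.
        # This is why rowkey design *is* schema design.
        for rowkey in sorted(self.rows):
            if start <= rowkey < stop:
                yield rowkey, self.rows[rowkey]

t = ToyTable()
t.put(b"user#001", b"info", b"name", b"alice")
t.put(b"user#002", b"info", b"name", b"bob")
t.put(b"user#010", b"info", b"name", b"carol")
print(t.get(b"user#002", b"info", b"name"))              # b'bob'
print([row for row, _ in t.scan(b"user#001", b"user#010")])
```

Note how the scan is start-inclusive and stop-exclusive over sorted rowkeys, so choosing rowkeys that group related records together is what makes range queries cheap.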
Records ordered by rowkey (write-side sort, application feature)
Contiguous sequences of rows partitioned into Regions
Regions automatically distributed around the cluster ((mostly) hands-free partition management)
Regions automatically split when they grow too large (split by size (bytes), on row boundary)
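The partitioning rules above can be sketched with a toy splitter. This is an illustrative model of the idea, not HBase internals: the function name and threshold are invented, but it shows contiguous runs of sorted rowkeys being cut into regions by size, always on a row boundary.

```python
def split_into_regions(sorted_rows, max_region_bytes):
    """Toy region partitioner.

    sorted_rows: list of (rowkey, size_in_bytes), already sorted by rowkey.
    Returns a list of regions, each a contiguous list of rowkeys.
    """
    regions, current, current_bytes = [], [], 0
    for rowkey, size in sorted_rows:
        if current and current_bytes + size > max_region_bytes:
            regions.append(current)      # split on a row boundary, by size
            current, current_bytes = [], 0
        current.append(rowkey)
        current_bytes += size
    if current:
        regions.append(current)
    return regions

rows = [(f"row{i:03d}".encode(), 40) for i in range(10)]  # 10 rows, 40 bytes each
regions = split_into_regions(rows, max_region_bytes=100)
print(len(regions))    # 5 regions of 2 rows each
print(regions[0])      # [b'row000', b'row001']
```

Because each region is a contiguous key range, a balancer can move whole regions between servers without any row-level reshuffling, which is what makes partition management (mostly) hands-free.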
To start off we'll talk about how HBase High Availability has gotten substantially better over the past 18 months.
From the beginning, HBase offered 2 levels of protection to ensure high availability.
First, HBase partitions data across multiple nodes, making each node responsible for ranges of the overall dataset held within HBase. Before HBase HA, if you lost a node you only lost access to the data on that node; all other data in the database could still be read and written. This is indicated with point (1) here.
Second, HBase stores all its data in HDFS so that data is highly available and if a node is truly lost, all HBase needs to do is spend a few minutes recovering that data on one of the remaining nodes. That's indicated with point (2).
But what happens during that recovery process? During the few minutes it takes to recover, data on that node can't be read or written; it's unavailable. For many apps this is acceptable: a lot of HBase production applications have met 99.9% uptime with this system.
But some applications need better HA guarantees, which led to HBase HA.
HBase HA adds a 3rd layer of protection by replicating data to multiple regionservers in the cluster.
With HBase HA you have primary regionservers and standby regionservers; each key range is held on more than one server, so even if you lose a single server, all its data is still available for reads.
HBase HA uses an HA model called timeline consistent read replicas.
With HBase HA all writes are still handled exclusively by the primary, so you still get strong consistency for updates and operations like increments.
Replication is done asynchronously, so data in standby regionservers may be stale relative to data in the primary. Usually the replicas will catch up in less than a second, but if the system is busy they could lag the primary by several seconds.
HBase clients now have the ability to decide if they need strong consistency or if they are willing to sacrifice strong consistency on reads for better availability. This can be done on a per get or per scan basis.
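The timeline-consistent read-replica model described above can be sketched as a toy simulation. This is not the HBase client API (the class and function names here are invented for illustration); it just shows writes going exclusively to the primary, asynchronous replication lagging behind, and the client choosing strong or timeline consistency per read.

```python
class Primary:
    """Toy primary regionserver: handles all writes, keeps a replication log."""
    def __init__(self):
        self.data, self.log = {}, []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))    # shipped to standbys asynchronously

class Standby:
    """Toy standby replica: applies the primary's log, possibly with a lag."""
    def __init__(self):
        self.data = {}

    def apply(self, log, upto):
        # Replication is async: only entries [0, upto) have arrived so far.
        for key, value in log[:upto]:
            self.data[key] = value

def read(key, primary, standby, consistency="STRONG"):
    if consistency == "STRONG":
        return primary.data.get(key)     # always the latest write
    return standby.data.get(key)         # TIMELINE: possibly stale, but available

p, s = Primary(), Standby()
p.write(b"k", b"v1")
p.write(b"k", b"v2")
s.apply(p.log, upto=1)                   # standby has only seen the first write
print(read(b"k", p, s, "STRONG"))        # b'v2'
print(read(b"k", p, s, "TIMELINE"))      # b'v1' (stale but still readable)
```

The payoff is visible in the last two lines: if the primary for a key range goes down, a timeline read still succeeds against the standby, trading a bounded staleness window for availability.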
A lot of HBase applications are read-heavy, and with HBase HA it's straightforward to achieve four 9s of availability for these sorts of applications. Overall, HBase HA is a great addition for any mission-critical apps on Hadoop.