2. HBase
HBase is a distributed, column-oriented data store built on top of HDFS.
HBase is an Apache open-source project whose goal is to provide storage for
Hadoop distributed computing.
HBase does not support a structured query language like SQL; in fact, HBase
isn’t a relational data store at all. HBase applications are written in Java, much
like a typical Apache Hadoop MapReduce application.
5. HBase Components
MasterServer:
Assigns regions to the region servers, using Apache ZooKeeper to help with
this task.
Handles load balancing of the regions across region servers: it unloads
busy servers and shifts regions to less occupied servers.
The HBase HMaster performs DDL operations (creating and deleting tables) and
assigns regions to the region servers.
6. It assigns regions to the Region Servers on startup and re-assigns regions
to Region Servers during recovery and load balancing.
It provides an interface for creating, deleting and updating tables.
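The assignment and load-balancing duties described above can be illustrated with a toy model. This is only a sketch under simple assumptions: the function names `assign_regions` and `rebalance` are hypothetical, load is measured purely by region count, and the real HMaster balancer weighs far more factors.

```python
def assign_regions(regions, servers):
    """Startup-style assignment: hand out regions round-robin."""
    assignment = {server: [] for server in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment


def rebalance(assignment):
    """Move regions from the busiest server to the least busy one
    until their region counts differ by at most one."""
    balanced = {server: list(regions) for server, regions in assignment.items()}
    while True:
        busiest = max(balanced, key=lambda s: len(balanced[s]))
        idlest = min(balanced, key=lambda s: len(balanced[s]))
        if len(balanced[busiest]) - len(balanced[idlest]) <= 1:
            return balanced
        balanced[idlest].append(balanced[busiest].pop())
```

Calling `rebalance` on a skewed assignment returns a new mapping and leaves the input untouched, which makes the before and after states easy to compare.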
7. Region Server:
Many regions are assigned to a Region Server, which is responsible for
handling, managing, and executing read and write operations on that set of
regions.
Region servers communicate with the client and handle data-related operations.
Region:
A region contains all the rows between the start key and the end key assigned
to that region. HBase tables can be divided into a number of regions in such a
way that all the columns of a column family are stored in one region. Each region
contains its rows in sorted order.
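Because each region covers a contiguous, sorted key range, finding the region for a row key is a binary search over the region boundaries. The sketch below assumes a hypothetical `RegionMap` class with string keys, an inclusive start key and an exclusive end key per region.

```python
import bisect


class RegionMap:
    """Maps row keys to region indexes using sorted split keys."""

    def __init__(self, split_keys):
        # Each split key is the start key of the next region.
        self.split_keys = sorted(split_keys)

    def region_for(self, row_key):
        # Region i holds keys with start_i <= key < end_i.
        return bisect.bisect_right(self.split_keys, row_key)

    def key_range(self, index):
        # "" marks the open start of the first region, None the open end of the last.
        start = self.split_keys[index - 1] if index > 0 else ""
        end = self.split_keys[index] if index < len(self.split_keys) else None
        return (start, end)
```

With split keys `["g", "p"]` there are three regions: keys before `g`, keys from `g` up to `p`, and keys from `p` onward.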
8. The store contains a memstore and HFiles. The memstore works like a cache
in memory: anything written to HBase is stored there first. Later,
the memstore is flushed and its data is saved in HFiles as blocks.
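The write path just described (write to the memstore first, flush to HFiles later) can be sketched as follows. The `Store` class and its `flush_threshold` parameter are illustrative assumptions; here the flush is triggered by entry count, which is a simplification chosen to keep the example short.

```python
class Store:
    """Toy store: an in-memory memstore plus a list of immutable 'HFiles'."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.hfiles = []  # each HFile is a sorted list of (key, value); newest last

    def put(self, key, value):
        # Writes always land in the memstore first.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memstore contents as a sorted file, then clear it.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Check the memstore first, then HFiles from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None
```

Reads consult the memstore before any HFile, so the most recent write for a key wins even when an older value has already been flushed.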
9. ZooKeeper:
ZooKeeper is an open-source project that provides services such as maintaining
configuration information, naming, and distributed synchronization.
Clients use ZooKeeper to locate region servers and then communicate with them directly.
Every Region Server, along with the HMaster server, sends a heartbeat at a
regular interval to ZooKeeper, which checks which servers are alive and available.
It also provides server failure notifications so that
recovery measures can be executed.
10. There is also an inactive HMaster, which acts as a backup for the active one. If the active
server fails, the inactive one comes to the rescue.
The active HMaster sends heartbeats to ZooKeeper while the inactive
HMaster listens for notifications about the active HMaster. If the active HMaster
fails to send a heartbeat, its session is deleted and the inactive HMaster
becomes active.
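The heartbeat and failover behaviour can be modelled with a small simulation. The `Coordinator` class below is a hypothetical stand-in for ZooKeeper's session tracking, not its actual API, and time is passed in explicitly so the example runs without waiting.

```python
class Coordinator:
    """Toy stand-in for ZooKeeper session tracking and master election."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}   # server name -> time of last heartbeat
        self.active_master = None

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def expire_sessions(self, now):
        # Drop every server whose last heartbeat is older than the timeout.
        dead = [s for s, t in self.last_heartbeat.items()
                if now - t > self.session_timeout]
        for server in dead:
            del self.last_heartbeat[server]
        return dead

    def elect_master(self, candidates, now):
        # Keep the current master while its session is alive;
        # otherwise promote the first candidate that still has one.
        self.expire_sessions(now)
        if self.active_master not in self.last_heartbeat:
            alive = [m for m in candidates if m in self.last_heartbeat]
            self.active_master = alive[0] if alive else None
        return self.active_master
```

If `master-a` stops sending heartbeats while `master-b` keeps going, the next call to `elect_master` after the timeout promotes `master-b`.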
11. Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a
sequential manner.
That means one has to search the entire dataset even for the simplest of jobs.
A huge dataset, when processed, results in another huge dataset, which must
also be processed sequentially.
At this point, a new solution is needed to access any point of data in a single
unit of time (random access).
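The difference between sequential access and random access can be shown with a small comparison. This is an illustrative sketch only: `sequential_lookup` and `build_index` are hypothetical helpers, and an in-memory dictionary stands in for a keyed store.

```python
def sequential_lookup(records, wanted_key):
    """Scan records in order; return (value, number of records examined)."""
    steps = 0
    for key, value in records:
        steps += 1
        if key == wanted_key:
            return value, steps
    return None, steps


def build_index(records):
    """Build a key -> value index so a lookup no longer needs a full scan."""
    return {key: value for key, value in records}


records = [(f"row{i}", i) for i in range(1000)]
value, steps = sequential_lookup(records, "row999")
index = build_index(records)
```

Finding the last record sequentially examines every record before it, while the indexed lookup `index["row999"]` goes straight to the value.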
12. Hadoop Random Access Database
Applications such as HBase, Cassandra, Dynamo, and MongoDB are some of
the databases that store huge amounts of data and access the data in a
random manner.
13. What Is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
HBase has a data model similar to Google’s Bigtable, designed to provide
quick random access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumers
read and access the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
14. HBase Data Model
Data stored in HBase is located by its "rowkey".
The rowkey is like a primary key in a relational database.
Records in HBase are stored in sorted order, according to rowkey.
Data in a row is grouped into Column Families. Each Column Family has one or more Columns.
The Columns in a family are stored together in a low-level storage file known as an HFile.
15. Data Model (Region)
Tables are divided into sequences of rows, by key range, called regions.
These regions are then assigned to nodes in the cluster called
"RegionServers".
16. Data Model (column family)
A column is identified by the column family name concatenated with the
column qualifier (the column name) using a colon.
E.g. Personaldata:name.
Column families are mapped to storage files and are stored in separate files,
which can also be accessed separately.
17. Data Model (cell)
Data is stored in the cells of HBase tables.
A cell is identified by the combination of row, column family, and column qualifier, and contains a
value and a timestamp.
The key consists of the row key, column family name, column name, and
timestamp.
The entire cell, with the added structural information, is called a KeyValue.
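The cell structure above can be sketched as a small in-memory table. The `KeyValue` tuple and `Table` class here are hypothetical illustrations, not the HBase client API; integer timestamps are assumed, and the newest version of a cell is returned on read.

```python
from collections import namedtuple

# One cell: the key parts (row, family, qualifier, timestamp) plus the value.
KeyValue = namedtuple("KeyValue", ["row", "family", "qualifier", "timestamp", "value"])


class Table:
    """Toy table that keeps cells sorted by key, newest version first."""

    def __init__(self):
        self.cells = []

    def put(self, row, column, value, timestamp):
        # A column is addressed as "family:qualifier".
        family, qualifier = column.split(":", 1)
        self.cells.append(KeyValue(row, family, qualifier, timestamp, value))
        self.cells.sort(key=lambda kv: (kv.row, kv.family, kv.qualifier, -kv.timestamp))

    def get(self, row, column):
        family, qualifier = column.split(":", 1)
        for kv in self.cells:
            if (kv.row, kv.family, kv.qualifier) == (row, family, qualifier):
                return kv.value  # first match is the newest version
        return None

    def scan(self):
        # Cells come back in sorted key order.
        return list(self.cells)
```

Writing the same row and column twice with different timestamps keeps both versions, and `get` returns the one with the larger timestamp.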
18. Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns
of data, rather than as rows of data. In short, they have column families.
Row-Oriented Database                 | Column-Oriented Database
It is suitable for Online Transaction | It is suitable for Online Analytical
Processing (OLTP).                    | Processing (OLAP).
Such databases are designed for a     | Column-oriented databases are designed
small number of rows and columns.     | for huge tables.
The following image shows column
families in a column-oriented
database:
19. Normal Row Representation
ID   NAME  SALARY
101  ABC   100
102  PQR   200
Normal Column Representation
ID      101  102
NAME    ABC  PQR
SALARY  100  200
HBase Representation (CF = name of the column family)
ROW  COLUMN+CELL (COLUMNFAMILY:NAME)  TIMESTAMP  VALUE
101  COLUMN=CF:NAME                   T1         VALUE=ABC
101  COLUMN=CF:SALARY                 T2         VALUE=100
102  COLUMN=CF:NAME                   T3         VALUE=PQR
102  COLUMN=CF:SALARY                 T4         VALUE=200
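The three representations can be derived from the same records. The helper names `to_columns` and `to_hbase_cells` are hypothetical, and the timestamps are simple placeholder counters (T1, T2, ...) as on the slide.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def to_hbase_cells(rows, key_field, family="CF"):
    """Turn row dicts into (rowkey, 'family:qualifier', timestamp, value) cells."""
    cells = []
    timestamp = 0
    for row in rows:
        for field, value in row.items():
            if field == key_field:
                continue  # the key becomes the rowkey, not a cell
            timestamp += 1
            cells.append((str(row[key_field]), f"{family}:{field}",
                          f"T{timestamp}", str(value)))
    return cells


rows = [
    {"ID": 101, "NAME": "ABC", "SALARY": 100},
    {"ID": 102, "NAME": "PQR", "SALARY": 200},
]
```

Each non-key field of a row becomes its own cell, which is why two records with two fields each produce four cells.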
20. HBase Shell Commands
To start the shell at the command line: $ hbase shell
Getting the status of the system: hbase(main):002:0> status
Creating the table: hbase(main):005:0> create 'emp', 'personal details', 'professional details'
Describe the table: hbase(main):017:0> describe 'emp'
List the tables present: hbase(main):001:0> list
Inserting data into the table: hbase(main):018:0> put 'emp', '1', 'personal details:name', 'Ram'
21. Viewing records inserted in the table: hbase(main):023:0> scan 'emp'
Getting the record from the HBase table: hbase(main):026:0> get 'emp', '1'
Getting a specific column from the record: hbase(main):002:0> get 'emp', '1', {COLUMN => 'personal details:name'}
Dropping a table. We can drop a table by first disabling it and then
dropping it:
hbase(main):016:0> disable 'emp'
hbase(main):017:0> drop 'emp'
22. Application of HBase
It is used whenever there is a need for write-heavy applications.
HBase is used whenever we need to provide fast random access to
available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase
internally.