This project compares HBase and Cassandra using YCSB. It is a data storage and management project carried out at the National College of Ireland.
Comparison Between HBase and Cassandra using YCSB
Nisheet Mahajan
X16133099
Data Storage and Management
INTRODUCTION
In today's era, with users generating massive amounts of data, a database management
system that gives faster and simpler access to billions of records is required. Big data is
characterized by three V's (dimensions): volume, velocity and variety. Volume refers to the
amount of data an organization has stored, velocity to the speed with which data is
generated and analyzed, and variety to the different kinds of data generated in the real
world. Volume and variety cause many problems for an organization in the RDBMS
(Relational Database Management System) world. As a result, new systems for storage and
management have been introduced, including Cassandra, Voldemort, HBase and others,
collectively referred to as NoSQL databases. They can index and store massive data sets
and serve user requests against them, and they trade off strict consistency for other
properties that are often more useful.
The basic concern for any data-related system is high performance. NoSQL databases
perform better than relational database management systems (RDBMS) in various cases.
Researchers have been looking for a database that is well optimized for their use cases. To
evaluate the performance of cloud databases on a common set of workloads, Yahoo
presented a framework named YCSB (Yahoo! Cloud Serving Benchmark). YCSB is an
extensible workload generator: it executes a set of core workloads and lets users define
new workloads to test particular aspects of a system. It is commonly used to compare and
benchmark multiple systems. The two NoSQL databases considered in this work are
HBase and Cassandra. Both provide durability by logging all write operations to a file, and
both use replication to prevent data loss when cluster nodes fail.
Characteristics of HBase:
HBase is a sparse, distributed, consistent, sorted map. Data is indexed by row key, column
family and timestamp. Data is stored in tables organized into rows, and each row has a
unique row key for identification. The basic features of HBase are:
HBase is linearly scalable, with automatic failover support.
It provides consistent reads and writes, with an easy-to-use Java API for clients.
It replicates data across clusters.
Column Families :
These are defined when the table schema is created, as they are not easy to modify
afterwards.
Column Qualifier :
A column family can contain a very large number of column qualifiers, each of which is
treated as a byte array.
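The data model described above can be pictured as a nested, sorted map. The following is a minimal Python sketch of that idea only, not HBase's actual storage code; the class name SparseSortedTable and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class SparseSortedTable:
    """Toy model: row key -> (family, qualifier) -> timestamp -> value."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # Rows are sparse: a cell exists only if something was written to it.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)][timestamp] = value

    def get(self, row_key, family, qualifier):
        # Return the value with the highest timestamp, or None if absent.
        versions = self.rows.get(row_key, {}).get((family, qualifier), {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        # Row keys are visited in sorted order, start inclusive, stop exclusive.
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key


if __name__ == "__main__":
    table = SparseSortedTable(["info"])
    table.put(b"row2", "info", b"name", b"first", timestamp=1)
    table.put(b"row2", "info", b"name", b"second", timestamp=2)
    table.put(b"row1", "info", b"age", b"30", timestamp=1)
    print(table.get(b"row2", "info", b"name"))
    print(list(table.scan(b"row1", b"row3")))
```

Row keys, qualifiers and values are bytes here to mirror the byte-array treatment mentioned above; qualifiers can be added freely, while an unknown column family is rejected.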
Characteristics of Cassandra :
Cassandra is an open source NoSQL database offering operational simplicity, linear scale
performance and easy data distribution with continuous availability. Cassandra can run
across many machines with no single point of failure. It has a peer-to-peer architecture,
which avoids the problems of a master-slave design.
✓ Linear scale performance: adding nodes produces a corresponding increase in performance.
✓ Continuous availability: redundancy of both data and node function gives
constant uptime.
✓ Transparent fault detection and recovery: a failed node can be restored or
replaced.
✓ Flexible and dynamic data model: the data types support fast writes and reads.
✓ Strong data protection: a commit log design protects against data loss, and built-in
security with backup/restore keeps data protected and safe.
✓ Tunable data consistency
✓ Multi-data center replication
✓ Column families:
Column families hold the columns that an application defines. A column family is
either static (a fixed set of column names defined in advance) or dynamic (column
names supplied by the application as data is written). There are a few types of
column family, namely: standard (one primary key), composite (multiple primary
keys), expiring (deleted after a set time) and counter (keeping track of the
number of occurrences of an event).
✓ CQL:
CQL (Cassandra Query Language) is the primary and default interface to Cassandra,
and it simplifies data modelling.
Cassandra is designed to handle workloads across many nodes without any single point of
failure. The nodes in a cluster are independent and interconnected. Each node can
accept read and write requests, irrespective of where the data is located in the cluster.
Cassandra has become a reliable choice for business and technical stakeholders.
The validator (the data type of a column value) and the comparator (the data type of a
column name) are two settings in Cassandra that are defined when a column family is created.
Database Architecture
HBase :
An HBase table is divided into regions, which are served by region servers. Each region is
divided by column family into Stores, which are saved as files in HDFS.
The client library, the master server and the region servers are the major components of
HBase.
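To illustrate how a table divided into regions can route a request, here is a minimal Python sketch. It assumes each region covers a contiguous, sorted range of row keys, which is how HBase regions are laid out; the RegionMap class, the split keys and the server names are hypothetical and do not reflect the HBase client API.

```python
import bisect


class RegionMap:
    """Toy lookup from a row key to the server holding its region."""

    def __init__(self, split_keys, servers):
        # n sorted split keys divide the key space into n + 1 regions.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        # A region starts at its split key (inclusive) and ends at the next one.
        index = bisect.bisect_right(self.split_keys, row_key)
        return self.servers[index]


if __name__ == "__main__":
    regions = RegionMap([b"g", b"p"], ["rs1", "rs2", "rs3"])
    for key in (b"apple", b"g", b"zebra"):
        print(key, "->", regions.locate(key))
```

Because row keys are sorted, a binary search over the region boundaries is enough to find the right region server.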
HBase Master :
The master server is responsible for monitoring all RegionServers, for assigning regions
to them and for load balancing across them. A cluster can run multiple HBase masters,
with only one active at a time. Metadata changes also go through the Master.
RegionServer:
A machine running a region server is considered a worker node, and the region server is
the implementation of the worker role. It runs on a DataNode and is responsible for
splitting and compacting regions. Multiple region servers run in a cluster.
ZooKeeper:
ZooKeeper lets distributed processes coordinate with each other through a shared
hierarchical namespace. In HBase, ZooKeeper is responsible for:
✓ Providing the availability status of RegionServers
✓ Ensuring a single active HMaster in the cluster
✓ Providing the location of the "-ROOT-" table
✓ Selecting a new HMaster if the active HMaster fails
An HBase client first contacts ZooKeeper to locate the servers it needs. HBase relies on
ZooKeeper to coordinate between clients and the master.
Cassandra:
Cassandra has a node-based, fault-tolerant, scalable and consistent architecture. The node
is the lowest level of a Cassandra cluster, and each node represents a single instance.
Nodes, racks and data centers together make up the Cassandra architecture. It is a
shared-nothing environment with no central controller.
Data Partitioning:
Cassandra is a partitioned distributed database: it divides data evenly across the nodes
of its cluster.
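One common way to spread data evenly across nodes is consistent hashing on a token ring, which is the approach Cassandra's partitioners take. The sketch below is a simplified illustration under stated assumptions: MD5 stands in for the partitioner's hash function, and the TokenRing class, node names and number of tokens per node are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class TokenRing:
    """Toy consistent-hash ring with several tokens per node."""

    def __init__(self, nodes, tokens_per_node=8):
        self._ring = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        # The owner is the first token clockwise from the key's position.
        index = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = TokenRing(["node1", "node2", "node3"])
    for key in ("user1", "user2", "user3"):
        print(key, "->", ring.owner(key))
```

A useful property of this scheme is that adding a node only moves the keys that the new node takes over; every other key keeps its owner.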
Data Replication:
Multiple nodes in a cluster act as replicas for a given piece of data. If some replicas
respond with an out-of-date value, Cassandra returns the most recent value to the client.
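Returning the most recent value can be sketched as comparing write timestamps across replica responses. This is a simplified illustration, not Cassandra's implementation; the Versioned type, the resolve_read function and the replica names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # write time; a larger number means a newer write


def resolve_read(responses):
    """Pick the newest value and list the replicas that returned older data.

    `responses` maps a replica name to the Versioned value it returned.
    """
    latest = max(responses.values(), key=lambda v: v.timestamp)
    stale = sorted(
        name for name, v in responses.items() if v.timestamp < latest.timestamp
    )
    return latest, stale


if __name__ == "__main__":
    responses = {
        "node1": Versioned("old address", 1),
        "node2": Versioned("new address", 5),
        "node3": Versioned("new address", 5),
    }
    latest, stale = resolve_read(responses)
    print(latest.value, stale)
```

The list of stale replicas is the kind of information a repair step would need in order to bring out-of-date copies up to date.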
Keyspaces:
A keyspace is the container that stores the column families and their data.
Node & data center:
A node is the place where data is stored; a data center is a collection of related nodes.
Cluster:
A cluster contains one or more data centers.
Mem-Table:
A mem-table is a memory-resident data structure. A single column family can have
multiple mem-tables.
Commit Log :
The crash-recovery mechanism in Cassandra.
SSTable:
A disk file to which data is flushed from the mem-table when its contents reach a
threshold value.
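The way the commit log, mem-table and SSTable fit together on a write can be sketched as follows. This is a toy in-memory model under simplifying assumptions (lists and dictionaries stand in for files on disk, and the flush threshold counts keys instead of bytes); the ToyStore class and its method names are hypothetical.

```python
class ToyStore:
    """Toy write path: commit log first, then mem-table, flushed to SSTables."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = {}     # in-memory, most recent writes
        self.sstables = []     # immutable sorted tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged before anything else
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the mem-table as a sorted, immutable table.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            if key in table:
                return table[key]
        return None

    def recover(self):
        # After a crash the mem-table is lost; rebuild it from the commit log.
        self.memtable = dict(self.commit_log)


if __name__ == "__main__":
    store = ToyStore(flush_threshold=2)
    store.write("a", "1")
    store.write("b", "2")   # reaches the threshold and triggers a flush
    store.write("a", "3")
    store.memtable = {}     # simulate a crash that wipes memory
    store.recover()
    print(store.read("a"), store.read("b"))
```

Writing to the commit log before the mem-table is what makes recovery possible: anything not yet flushed to an SSTable can be replayed from the log.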
Bloom filter:
Bloom filters are quick, probabilistic algorithms for checking whether an element is a
member of a set. They can report false positives but never false negatives. They are a
special kind of cache and are accessed after every query.
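A Bloom filter's membership test can be sketched in a few lines. The following is a minimal, illustrative Python version, not Cassandra's implementation; the bit-array size, the number of hash functions and the use of SHA-256 with a seed prefix are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    bloom = BloomFilter()
    bloom.add("row-42")
    print(bloom.might_contain("row-42"))
    print(bloom.might_contain("row-99"))
```

A negative answer is always correct, which is what lets a store skip disk files that cannot hold the requested key.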
Comparison between HBase and Cassandra in terms of Scalability, Availability and
Reliability:
The analysis is based on the CAP theorem :
•Scalability:
Cassandra and HBase are both scalable databases. Column families are described in
advance, but new columns can easily be added on the fly. Cassandra achieves linear
scalability by adding nodes to the cluster, and the framework is designed so that
the cluster makes use of the newly added resources.
•Availability:
Cassandra is the better choice in terms of availability. It has a distributed node
architecture in which data is replicated across the nodes, so if any node fails
Cassandra can still generate a response, which makes it highly available.
•Reliability:
HBase has features such as Hadoop support and range-based row scans, which make it more
reliable than Cassandra. HBase satisfies the consistency and partition tolerance
properties of the CAP theorem, and it is strongly consistent.
Performance Test Plan
Physical Machine
Processor : 2.40 GHz Intel Core i5 (64-bit)
Number of Cores : 2
Memory: 8GB
Operating system : Microsoft Windows 7 Professional
Virtualization Software : VirtualBox 5.1.12
HBase virtual machine -
Operating system : Ubuntu (64 bit)
Memory : 4GB
Processor :1
Cassandra Virtual Machine –
Operating system : Ubuntu (64 bit)
Memory : 4GB
Processor :1
Benchmarking Application –
Yahoo! Cloud Serving Benchmark, 0.11.0
Evaluation and Results:
The tests were run against the HBase and Cassandra databases using two YCSB core
workloads, workload A and workload D. Each test was run three times, and the reported
figure is the average of the three sets of results.
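For orientation, the two workloads differ in their operation mix: in the standard YCSB core workload definitions, workload A is 50% reads and 50% updates, and workload D is 95% reads and 5% inserts. The sketch below models only that mix and the three-run averaging; it is not YCSB code, and the function names and the sample run values are hypothetical.

```python
import random

# Operation proportions of the standard YCSB core workloads A and D.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "D": {"read": 0.95, "insert": 0.05},
}


def generate_operations(workload, count, seed=0):
    """Draw `count` operation names according to the workload's mix."""
    rng = random.Random(seed)
    mix = WORKLOADS[workload]
    return rng.choices(list(mix), weights=list(mix.values()), k=count)


def average_runs(runs):
    """Average one metric over repeated runs of the same test."""
    return sum(runs) / len(runs)


if __name__ == "__main__":
    ops = generate_operations("D", 50000)
    print("reads:", ops.count("read"), "inserts:", ops.count("insert"))
    # Hypothetical latencies from three repeated runs of one test point.
    print("average:", average_runs([100.0, 110.0, 120.0]))
```

This mix is why the read-operation counts in the result tables are roughly half of the record count for workload A and roughly 95% of it for workload D.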
YCSB Workload A (Read Evaluation)
In the read operations, there is a small drop in average latency for HBase between 40000
and 60000 read operations, and the average latency for HBase rises between 60000 and
80000, as shown in the graph.
[Figure: Read Operations against Read Average Latency, Workload A. X-axis: Read Operations; Y-axis: Read Average Latency; series: HBase, Cassandra.]
Workload A
HBase                                      Cassandra
Read operations   Read average latency     Read operations   Read average latency
25135             491.02                   25033             685.64
50030             437.71                   50142             562.19
75030             812.25                   74938             583.13
100204            688.81                   99511             547.56
149929            515.68                   149704            568.72
The average latency for HBase then rises between 60000 and 80000 read operations, as
seen from the graph above. For Cassandra, the average latency falls between 40000 and
60000 read operations.
YCSB Workload D (Read Evaluation)
In workload D, the graph shows that the average latency for HBase is increasing, whereas
for Cassandra it is decreasing.
[Figure: Read Operations against Read Average Latency, Workload D. X-axis: Read Operations; Y-axis: Read Average Latency; series: HBase, Cassandra.]
Workload D
HBase                                      Cassandra
Read operations   Read average latency     Read operations   Read average latency
47429             332.95                   47517             605.02
95066             312.76                   95039             548.78
142621            341.20                   142689            516.91
189955            335.17                   189883            476.41
284994            442.95                   285269            503.77
For the insert operations, the average latency for HBase falls and then rises, as
depicted in the graph, whereas for Cassandra it drops and then holds an
approximately constant value.
Records VS Throughput Workload A
[Figure: Records against Throughput, Workload A. X-axis: Records; Y-axis: Throughput; series: HBase, Cassandra.]
Workload A
HBase                    Cassandra
Records   Throughput     Records   Throughput
50000     1464.30        50000     1363.42
100000    1617.39        100000    1735.36
150000    895.23         150000    1719.52
200000    1082.96        200000    1841.37
300000    1429.33        300000    1813.50
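One way to put numbers on the Workload A throughput figures is to compute the mean and the coefficient of variation (standard deviation divided by the mean) for each database. The sketch below uses only the values from the Workload A table; the choice of the coefficient of variation as a stability measure is an assumption of this example and not part of the original test plan.

```python
from statistics import mean, stdev

# Throughput values copied from the Workload A table.
THROUGHPUT_A = {
    "HBase": [1464.30, 1617.39, 895.23, 1082.96, 1429.33],
    "Cassandra": [1363.42, 1735.36, 1719.52, 1841.37, 1813.50],
}


def summarize(values):
    """Return the mean and the coefficient of variation of a series."""
    average = mean(values)
    return {"mean": average, "cv": stdev(values) / average}


SUMMARY_A = {db: summarize(values) for db, values in THROUGHPUT_A.items()}

if __name__ == "__main__":
    for db, stats in SUMMARY_A.items():
        print(f"{db}: mean={stats['mean']:.2f}, cv={stats['cv']:.3f}")
```

A lower coefficient of variation means throughput varied less across record counts; the same calculation can be applied to the other result tables.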
As the graph shows, throughput changes as the number of records increases.
Records VS Throughput Workload D
[Figure: Records against Throughput, Workload D. X-axis: Records; Y-axis: Throughput; series: HBase, Cassandra.]
Workload D
HBase                    Cassandra
Records   Throughput     Records   Throughput
50000     2074.51        50000     1436.30
100000    2554.66        100000    1665.72
150000    2324.68        150000    1780.98
200000    2456.54        200000    1961.93
300000    1952.34        300000    1901.99
Conclusion:
Both databases have their own capabilities and are used for storing and accessing data.
Each has its own advantages and disadvantages and is efficient in its own field, but
from the results above Cassandra appears to be more efficient than HBase. Cassandra's
performance stayed roughly constant across operations and was not much affected by
latency or the number of records. On the basis of these results, Cassandra appears to be
more stable than HBase.