3. Definition of Cassandra
Apache Cassandra™ is a free
Distributed…
High performance…
Extremely scalable…
Fault tolerant (i.e. no single point of failure)…
post-relational database solution. Cassandra can serve both as
a real-time datastore (the “system of record”) for
online/transactional applications and as a read-intensive
database for business intelligence systems.
5. Architecture Overview
Cassandra was designed with the understanding that
system/hardware failures can and do occur
Peer-to-peer, distributed system
All nodes the same
Data partitioned among all nodes in the cluster
Custom data replication to ensure fault tolerance
Read/Write-anywhere design
6. Architecture Overview
Nodes communicate with one another through the Gossip
protocol, which exchanges information across the cluster every
second
A commit log is used on each node to capture write activity;
data durability is assured
Data is also written to an in-memory structure (a memtable) and
then flushed to disk as an SSTable once the memory structure is full
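A minimal Python sketch of this write path (the class and threshold names are illustrative, not Cassandra's actual internals):

class Node:
    """Toy model of Cassandra's write path: commit log -> memtable -> SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []     # append-only log capturing every write (durability)
        self.memtable = {}       # in-memory structure holding recent writes
        self.sstables = []       # immutable "on-disk" structures, modeled as lists
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first, for durability
        self.memtable[key] = value            # 2. then the in-memory memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. spill to disk once "full"

    def flush(self):
        # The memtable is written out as a sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

node = Node()
for i in range(4):
    node.write(f"row{i}", i)
print(len(node.sstables), "SSTable(s) flushed")  # 1 flush; one row still in memtable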
7. Architecture Overview
The schema used in Cassandra is modeled after Google
Bigtable: a row-oriented structure of columns
A keyspace is akin to a database in the RDBMS world
A column family is similar to an RDBMS table but is more
flexible/dynamic
A row in a column family is indexed by its key; other columns
may be indexed as well
[Figure: Customer column family (columns ID, Name, SSN, DOB) within the Portfolio keyspace]
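Assuming a locally running node and the DataStax Python driver (neither is part of the original slides), the Portfolio/Customer structure above could be created like so:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])   # assumes a local Cassandra node
session = cluster.connect()

# A keyspace is akin to a database in the RDBMS world.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS portfolio
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# A column family (a "table" in CQL) is similar to an RDBMS table,
# with rows indexed by their key.
session.execute("""
    CREATE TABLE IF NOT EXISTS portfolio.customer (
        id   int PRIMARY KEY,
        name text,
        ssn  text,
        dob  text
    )
""")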
9. System Architecture
Partitioning
How data is partitioned across nodes
Replication
How data is duplicated across nodes
Cluster Membership
How nodes are added to and removed from the cluster
10. Partitioning
• Nodes are logically structured in a ring topology.
• The hashed value of the key associated with a data item
is used to assign it to a node in the ring.
• Hash values wrap around after a maximum value, which gives
the ring its structure.
• Lightly loaded nodes move position on the ring to alleviate
highly loaded nodes.
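A sketch of this hash-ring assignment in Python; the node names, hash function, and ring size are all illustrative choices:

import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 32      # hash values wrap here, closing the ring

def token(key: str) -> int:
    # Hash the key, then wrap the value into the ring's range.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

# Four nodes placed on the ring at the token of their own name.
nodes = sorted((token(f"node-{i}"), f"node-{i}") for i in range(4))

def owner(key: str) -> str:
    # A data item belongs to the first node at or after its token (wrapping).
    idx = bisect_right([tok for tok, _ in nodes], token(key)) % len(nodes)
    return nodes[idx][1]

print(owner("customer:42"))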
11. Replication
Each data item is replicated at N (replication factor)
nodes.
Different Replication Policies
◦ Rack Unaware – replicate data at the N-1 nodes that succeed the
coordinator on the ring
◦ Rack Aware – uses ZooKeeper to elect a leader, which tells
nodes the ranges they are replicas for
◦ Datacenter Aware – similar to Rack Aware, but the leader is
chosen at the datacenter level instead of the rack level
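A sketch of rack-unaware placement under the same kind of toy ring as above (node names are again made up):

import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 32
ring = sorted(
    (int(hashlib.md5(f"node-{i}".encode()).hexdigest(), 16) % RING_SIZE, f"node-{i}")
    for i in range(4))

def replicas(key: str, n: int = 3) -> list:
    """Rack-unaware placement: the key's coordinator plus its n-1 ring successors."""
    t = int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE
    idx = bisect_right([tok for tok, _ in ring], t) % len(ring)
    return [ring[(idx + i) % len(ring)][1] for i in range(n)]

print(replicas("customer:42"))   # coordinator followed by two successors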
13. Gossip Protocols
• Network communication protocols inspired by real-life
rumour spreading.
• Periodic, pairwise, inter-node communication.
• Low-frequency communication ensures low cost.
• Random selection of peers.
• Example – Node A wishes to search for a pattern in data
(a sketch follows this list)
– Round 1 – Node A searches locally and then gossips with node B.
– Round 2 – Nodes A and B gossip with C and D.
– Round 3 – Nodes A, B, C, and D gossip with 4 other nodes…
• This round-by-round doubling makes the protocol very robust.
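A toy simulation of the doubling behaviour (node count and peer selection are illustrative):

import random

nodes = [f"node-{i}" for i in range(16)]
informed = {"node-0"}            # Node A starts out knowing the rumour

rounds = 0
while len(informed) < len(nodes):
    rounds += 1
    # Each informed node gossips with one randomly chosen peer per round.
    for peer in [random.choice(nodes) for _ in informed]:
        informed.add(peer)
    print(f"round {rounds}: {len(informed)} of {len(nodes)} nodes informed")
# The informed set roughly doubles each round, so N nodes are
# typically reached in O(log N) rounds.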
14. Gossip Protocols
• A variety of gossip protocols exist
– Dissemination protocols
• Event dissemination: multicasts events via gossip; high latency might
cause network strain.
• Background data dissemination: continuous gossip about information
regarding participating nodes
– Anti-entropy protocols
• Used to repair replicated data by comparing and reconciling
differences. Cassandra uses this type of protocol to repair data
among replicas, as in the sketch below.
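A toy anti-entropy pass; Cassandra actually compares Merkle trees over data ranges, whereas this sketch hashes values directly and lets the newer timestamp win:

import hashlib

def digest(value) -> str:
    return hashlib.md5(repr(value).encode()).hexdigest()

# Two replicas of the same data; entries are (value, timestamp) pairs.
replica_a = {"k1": ("v1", 2), "k2": ("old", 1)}
replica_b = {"k1": ("v1", 2), "k2": ("new", 3)}

# Compare digests per key and reconcile mismatches: newest timestamp wins.
for key in replica_a:
    if digest(replica_a[key]) != digest(replica_b[key]):
        newest = max(replica_a[key], replica_b[key], key=lambda v: v[1])
        replica_a[key] = replica_b[key] = newest

print(replica_a["k2"], replica_b["k2"])   # both ('new', 3) after repair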
15. Cluster Management
Uses Scuttlebutt (a gossip protocol) to manage
nodes.
Uses gossip for node membership and to transmit
system control state.
A node's failure state is given by a variable ‘phi’, which
expresses how likely the node is to have failed (a
suspicion level) instead of a simple binary value (up/down).
This type of system is known as an Accrual Failure
Detector.
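A simplified phi calculation, assuming exponentially distributed heartbeat intervals (Cassandra's actual detector is more involved; this sketch keeps only the core idea):

import math
import time

class AccrualFailureDetector:
    """Reports a suspicion level phi rather than a binary up/down verdict."""

    def __init__(self):
        self.intervals = []          # recent heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals = (self.intervals + [now - self.last_heartbeat])[-100:]
        self.last_heartbeat = now

    def phi(self) -> float:
        # phi = -log10(probability that a heartbeat this overdue would
        # still arrive), assuming exponentially distributed intervals.
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = time.time() - self.last_heartbeat
        return (elapsed / mean) * math.log10(math.e)

d = AccrualFailureDetector()
d.heartbeat(); time.sleep(0.1); d.heartbeat()
time.sleep(0.3)
print(d.phi())   # grows the longer the node stays silent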
16. Why Cassandra?
Gigabyte to Petabyte scalability
Linear performance gains through adding nodes
No single point of failure
Easy replication / data distribution
Multi-data center and Cloud capable
No need for separate caching layer
Tunable data consistency
Flexible schema design
Data Compression
CQL language (like SQL)
Support for key languages and platforms
No need for special hardware or software
17. Big Data Scalability
Capable of comfortably scaling to petabytes
New nodes = Linear performance increases
Add new nodes online
[Figure: going from 1 node to 2 doubles throughput; capabilities scale linearly as nodes 1 through 4 are added]
18. No Single Point of Failure
All nodes the same
Customized replication affords tunable data
redundancy
Read/write from any node
Can replicate data among different physical data
center racks
19. Easy Replication / Data Distribution
Transparently handled by Cassandra
Multi-data center capable
Exploits all the benefits of Cloud computing
Able to do hybrid Cloud/On-premise setup
20. No Need for Caching Software
Peer-to-peer architecture removes need for special
caching layer and the programming that goes with it
The database cluster uses the memory from all
participating nodes to cache the data assigned to each
node
No inconsistencies between a memory cache and the
database are encountered
[Figure: traditional architecture, with application servers sending writes to a database server and reads to separate memcached servers]
21. Tunable Data Consistency
Choose between strong and eventual consistency (from all
replicas to any single node responding) depending on the need
Can be done on a per-operation basis, and for both
reads and writes
Handles Multi-data center operations
Write consistency levels: Any, One, Quorum, Local_Quorum, Each_Quorum, All
Read consistency levels: One, Quorum, Local_Quorum, Each_Quorum, All
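Per-operation consistency with the DataStax Python driver (the driver, addresses, and table are assumptions for illustration, not part of the original slides):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('portfolio')

# Strong consistency for this write: a quorum of replicas must acknowledge.
write = SimpleStatement(
    "INSERT INTO customer (id, name) VALUES (1, 'Alice')",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write)

# Eventual consistency for this read: any single replica may answer.
read = SimpleStatement(
    "SELECT * FROM customer WHERE id = 1",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(read).one())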
22. Flexible Schema
Dynamic schema design allows for much more flexible
data storage than a rigid RDBMS
Handles structured, semi-structured, and unstructured
data. Counters also supported
No offline/downtime for schema changes
Supports primary and secondary indexes
[Figure: Customer column family (columns ID, Name, SSN, DOB) within the Portfolio keyspace]
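For example, an online schema change plus a secondary index, issued through the same hypothetical driver session as earlier (the column and index are made up for illustration):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('portfolio')

# Schema changes are applied online; no downtime for the cluster.
session.execute("ALTER TABLE customer ADD email text")

# A secondary index makes a non-key column queryable.
session.execute("CREATE INDEX ON customer (name)")
rows = session.execute("SELECT * FROM customer WHERE name = 'Alice'")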
23. Data Compression
Uses Google’s Snappy data compression algorithm
Compresses data on a per column family level
Internal tests at DataStax show up to 80%+
compression of raw data
No performance penalty (and some increases in
overall performance due to less physical I/O)!
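A hedged example of enabling Snappy for a single column family; the sstable_compression option name applies to older Cassandra versions and differs in newer ones:

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('portfolio')

# Enable Snappy compression for one column family; data is compressed
# on disk, reducing physical I/O on reads.
session.execute("""
    ALTER TABLE customer
    WITH compression = {'sstable_compression': 'SnappyCompressor'}
""")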
24. CQL Language
Very similar to RDBMS SQL syntax
Create objects via DDL (e.g. CREATE…)
Core DML commands supported: INSERT, UPDATE,
DELETE
Query data with SELECT
SELECT *
FROM USERS
WHERE STATE = 'TX';
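The same commands issued through the DataStax Python driver, assuming the USERS table from the slide exists with an indexed STATE column (both assumptions for illustration):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('portfolio')

# Core DML commands, matching the slide's USERS example.
session.execute("INSERT INTO users (id, name, state) VALUES (1, 'Bob', 'TX')")
session.execute("UPDATE users SET name = 'Robert' WHERE id = 1")
for row in session.execute("SELECT * FROM users WHERE state = 'TX'"):
    print(row)
session.execute("DELETE FROM users WHERE id = 1")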
25. Read Operation
The client sends a read to the Cassandra cluster. The closest replica
(Replica A) returns the actual result, while digest queries go to the
other replicas (B and C), which return digest responses. If the digests
differ, a read repair reconciles the replicas before the result is
returned to the client, as in the sketch below.
* Figure taken from the slides of Avinash Lakshman and Prashant Malik
(authors of the Cassandra paper).
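A sketch of the digest comparison and read repair (replica values and timestamps are invented):

import hashlib

def digest(value) -> str:
    return hashlib.md5(repr(value).encode()).hexdigest()

# Three replicas; entries are (data, timestamp) pairs. C holds stale data.
replicas = {"A": ("alice", 5), "B": ("alice", 5), "C": ("alic", 4)}

result = replicas["A"]   # closest replica returns the full result
digests = {n: digest(v) for n, v in replicas.items() if n != "A"}

# Read repair: if any digest differs from the result's, reconcile the
# replicas with the newest value before answering the client.
if any(d != digest(result) for d in digests.values()):
    newest = max(replicas.values(), key=lambda v: v[1])
    for name in replicas:
        replicas[name] = newest

print(replicas["C"])     # ('alice', 5) after repair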
26. Facebook Inbox Search
• Cassandra was developed to address this problem.
• Cassandra was tested on a 150-node cluster storing
50+ TB of user message data.
• The per-user index of all messages can be searched in two ways:
– Term search: search by a keyword
– Interactions search: search by a user id
Latency Stat | Search Interactions | Term Search
Min          | 7.69 ms             | 7.78 ms
Median       | 15.69 ms            | 18.27 ms
Max          | 26.13 ms            | 44.41 ms
27. Comparison with MySQL
• MySQL, > 50 GB of data
Writes average: ~300 ms
Reads average: ~350 ms
• Cassandra, > 50 GB of data
Writes average: 0.12 ms
Reads average: 15 ms
• Stats provided by the authors using Facebook data.