This document introduces Apache Cassandra, a distributed column-oriented NoSQL database. It discusses Cassandra's architecture, data model, query language (CQL), and how to install and run Cassandra. Key points covered include Cassandra's linear scalability, high availability and fault tolerance. The document also demonstrates how to use the nodetool utility and provides guidance on backing up and restoring Cassandra data.
2. GÖKHAN ATIL
➤ Database Administrator
➤ Oracle ACE Director (2016), ACE (2011)
➤ 10g/11g and R12 Oracle Certified Professional (OCP)
➤ Co-author of Expert Oracle Enterprise Manager 12c
➤ Founding Member and Vice President of TROUG
➤ Blogger (since 2008) gokhanatil.com
➤ Twitter: @gokhanatil
3. INTRODUCTION TO APACHE CASSANDRA
➤ What is Apache Cassandra? Why use it?
➤ Cassandra Architecture
➤ Cassandra Query Language (CQL)
➤ Cassandra Data Modeling
➤ How to install and run Cassandra?
➤ Cassandra nodetool
➤ Backup and Recovery
5. WHAT IS APACHE CASSANDRA? WHY USE IT?
➤ Fast, distributed (column-family NoSQL) database: high availability, linear scalability, high performance
➤ Fault tolerant on Commodity Hardware
➤ Multi-Data Center Support
➤ Easy to operate
➤ Proven: CERN, Netflix, eBay, GitHub, Instagram, Reddit
6. HIGH AVAILABILITY: CAP THEOREM AND CASSANDRA
[Figure: CAP triangle (Consistency, Availability, Partition Tolerance), contrasted with an RDBMS and its ACID guarantees (Atomicity, Consistency, Isolation, Durability)]
13. WRITE PATH (NODE)
➤ Logging data in the commit log
➤ Writing data to the memtable
➤ Flushing the memtable to immutable SSTables (Sorted String Tables)
[Figure: a write goes to the commit log (disk) and the memtable (memory); the memtable is flushed to SSTables on disk, and SSTables are merged by compaction]
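The three write-path steps can be illustrated with a toy model. This is only a sketch, not Cassandra's storage engine: the `ToyNode` class, the JSON file formats, and the row-count flush threshold are assumptions made for illustration.

```python
import json
import os
import tempfile


class ToyNode:
    """Toy model of one node's write path (hypothetical, for illustration only)."""

    def __init__(self, data_dir, memtable_limit=3):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit  # assumed flush threshold (row count)
        self.memtable = {}                    # in memory, keyed by partition key
        self.sstables = []                    # paths of immutable files on disk
        self.commit_log_path = os.path.join(data_dir, "commitlog.jsonl")

    def write(self, key, value):
        # 1. Append to the commit log first, so the write survives a crash.
        with open(self.commit_log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable, sorted by key, to a new file that is never modified.
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as out:
            json.dump(sorted(self.memtable.items()), out)
        self.sstables.append(path)
        self.memtable.clear()
        # The flushed rows are on disk now, so the log can be truncated.
        open(self.commit_log_path, "w").close()


node = ToyNode(tempfile.mkdtemp())
for i, email in enumerate(["c@example.com", "a@example.com",
                           "b@example.com", "d@example.com"]):
    node.write(email, {"phone": str(i)})

print(len(node.sstables), sorted(node.memtable))  # 1 ['d@example.com']
```

The third write fills the memtable and triggers a flush, so the fourth row is the only one left in memory.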
15. READ PATH (NODE)
[Figure: read path through the memtable, row (read) cache, bloom filter (answers "maybe" or "no"), partition key cache, partition summary, partition index, and SSTable, split between memory and disk]
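The bloom filter's "maybe or no" answer is what lets a read skip SSTables that cannot contain the key. The sketch below is a simplified illustration under stated assumptions: `ToyBloomFilter` and `read` are hypothetical names, the caches and index structures are left out, and the first match is returned instead of merging rows by timestamp.

```python
import hashlib


class ToyBloomFilter:
    """Small bloom filter: answers "no" (definitely absent) or "maybe"."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def read(key, memtable, sstables):
    """Return (row, sstables_read). sstables is a list of (bloom, rows), newest first."""
    if key in memtable:
        return memtable[key], 0
    sstables_read = 0
    for bloom, rows in sstables:
        if not bloom.might_contain(key):
            continue            # definite "no": skip this SSTable entirely
        sstables_read += 1      # "maybe": the SSTable has to be consulted
        if key in rows:
            return rows[key], sstables_read
    return None, sstables_read


memtable = {"d@example.com": "memtable row"}
sstables = []
for rows in ({"a@example.com": "row A"}, {"b@example.com": "row B"}):
    bloom = ToyBloomFilter()
    for key in rows:
        bloom.add(key)
    sstables.append((bloom, rows))

print(read("d@example.com", memtable, sstables))  # ('memtable row', 0)
print(read("a@example.com", memtable, sstables))  # ('row A', 1)
```

A bloom filter never reports "no" for a key that was added, so a present row is always found; a false "maybe" only costs an extra lookup.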
16. CONSISTENCY LEVELS
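➤ Formula for strong consistency: R + W > N (R = replicas that must answer a read, W = replicas that must acknowledge a write, N = replication factor)
➤ ANY (write only): at least one node
➤ ONE, TWO, THREE: at least one/two/three replica nodes
➤ QUORUM: a quorum (N/2 + 1, rounded down) of replica nodes across all datacenters
➤ LOCAL_QUORUM: a quorum (N/2 + 1, rounded down) of replica nodes in the same datacenter as the coordinator
➤ ALL: all replica nodes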
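The formula can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names `quorum` and `is_strongly_consistent` are illustrative and not part of any Cassandra API.

```python
def quorum(replicas):
    """Quorum size: N/2 + 1, using integer (floor) division."""
    return replicas // 2 + 1


def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """R + W > N: every read set overlaps every write set in at least one replica."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                    # 2
print(is_strongly_consistent(q, q, rf))     # True  (QUORUM reads + QUORUM writes)
print(is_strongly_consistent(1, 1, rf))     # False (ONE reads + ONE writes)
print(is_strongly_consistent(1, rf, rf))    # True  (ONE reads + ALL writes)
```

With a replication factor of 3, QUORUM on both sides touches 2 + 2 = 4 > 3 replicas, so at least one replica in every read has seen the latest write.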
18. CASSANDRA QUERY LANGUAGE (CQL)
➤ Create a Keyspace (Database):
create keyspace demo with replication = { 'class' :
'SimpleStrategy', 'replication_factor' :1 };
➤ Remove a keyspace:
drop keyspace demo;
➤ Select a keyspace to operate:
use demo;
19. CASSANDRA QUERY LANGUAGE (CQL)
➤ Create a table:
create table demo.democlients ( email text, name text,
phone text, primary key (email, name));
➤ Alter a table:
alter table democlients add money int;
➤ Remove a table:
drop table democlients;
➤ Remove all rows in a table:
truncate table democlients;
➤ In this table, email is the partition key and name is the clustering key.
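The roles of the two key parts can be pictured with a small in-memory model: the partition key selects a partition, and the clustering key orders rows inside it. The `ToyTable` class below is a hypothetical illustration, not how Cassandra stores data.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of primary key (email, name): email partitions, name clusters."""

    def __init__(self):
        self.partitions = defaultdict(dict)   # email -> {name: row}

    def insert(self, email, name, phone):
        # Same primary key means the same row, so a second insert overwrites it.
        self.partitions[email][name] = {"email": email, "name": name, "phone": phone}

    def select(self, email, descending=False):
        # Rows of one partition come back ordered by the clustering key.
        rows = self.partitions.get(email, {})
        return [rows[name] for name in sorted(rows, reverse=descending)]


table = ToyTable()
table.insert("team@example.com", "Zeynep", "111")
table.insert("team@example.com", "Ali", "222")
table.insert("team@example.com", "Ali", "333")      # overwrites the previous Ali row
table.insert("other@example.com", "Mehmet", "444")

print([row["name"] for row in table.select("team@example.com")])
# ['Ali', 'Zeynep']
print([row["name"] for row in table.select("team@example.com", descending=True)])
# ['Zeynep', 'Ali']
```

This is why a query that fixes the partition key can sort by the clustering key cheaply, while a query on name alone has to scan every partition.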
20. CASSANDRA QUERY LANGUAGE (CQL)
➤ Retrieve rows:
select * from democlients where name='Gokhan Atil'
ALLOW FILTERING; -- or create a secondary index
➤ Retrieve distinct values:
select DISTINCT email from democlients;
➤ Limit the number of rows returned:
select * from democlients LIMIT 1;
➤ Sort the result:
select * from democlients where email='gokhan@gokhanatil.com'
ORDER by name DESC;
21. CASSANDRA QUERY LANGUAGE (CQL)
➤ Retrieve the results in the JSON format:
select JSON * from democlients;
➤ Insert a row:
insert into democlients (email, name, phone) values
('gokhan@gokhanatil.com','Gokhan Atil','542') IF NOT EXISTS;
➤ Insert a row with TTL (Time to live - seconds):
insert into democlients (email, name, phone) values
('info@gokhanatil.com','Information','542') USING TTL 10;
22. CASSANDRA QUERY LANGUAGE (CQL)
➤ Update records:
update democlients set phone='535' where
email='gokhan@gokhanatil.com' and
name='Gokhan Atil' IF EXISTS;
➤ Update records with a condition:
update democlients set money=20 where
email='gokhan@gokhanatil.com' and name='Gokhan Atil'
IF phone='542';
➤ Delete rows:
delete from democlients where email='gokhan@gokhanatil.com'
and name='Gokhan Atil' IF EXISTS;
23. CASSANDRA QUERY LANGUAGE (CQL)
➤ Delete row with a condition:
delete from democlients where email='gokhan@gokhanatil.com'
and name='Gokhan Atil' IF money > 10;
➤ Delete columns in a row:
delete money from democlients where email='gokhan@gokhanatil.com'
and name='Gokhan Atil';
24. CASSANDRA DATA MODELING
➤ Query-Driven Data Modeling
➤ Spread data evenly across the cluster
➤ Use Denormalization
➤ Be careful about using secondary indexes
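The advice to spread data evenly comes down to the choice of partition key. The sketch below compares a high-cardinality key (email) with a low-cardinality one (country). It rests on assumptions: MD5 stands in for the partitioner only because it ships with Python, and `token % len(nodes)` stands in for real token ranges; the node names and sample rows are made up.

```python
import hashlib
from collections import Counter


def owner(partition_key, nodes):
    """Map a partition key to a node (simplified stand-in for the token ring)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


nodes = ["node1", "node2", "node3"]

# 3000 sample rows; nine out of ten share the same country code.
rows = [(f"user{i}@example.com", "TR" if i % 10 else "US") for i in range(3000)]

by_email = Counter(owner(email, nodes) for email, _ in rows)
by_country = Counter(owner(country, nodes) for _, country in rows)

print("partitioned by email:  ", dict(by_email))
print("partitioned by country:", dict(by_country))
```

Partitioning by email lands roughly a third of the rows on each node, while partitioning by country puts at least 2700 of the 3000 rows on a single node and leaves at least one node empty.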
26. HOW TO INSTALL AND RUN CASSANDRA CLUSTER?
➤ Make sure you have JDK (8u40 or newer) installed
➤ Download apache-cassandra-VERSION-bin.tar.gz
➤ Extract the file to a folder
➤ Create data and logs directories in the Cassandra folder
➤ Edit the configuration file (conf/cassandra.yaml): name the cluster, set the listen address and the data and log directory locations, and enable authentication and authorization
➤ Run bin/cassandra
27. HOW TO INSTALL AND RUN CASSANDRA CLUSTER?
➤ Use Docker to pull the latest image:
docker pull cassandra
➤ Run it as standalone:
docker run --name cas1 -p 9042:9042 \
  -e CASSANDRA_CLUSTER_NAME=MyCluster -d cassandra
➤ Connect using cqlsh:
docker exec -it cas1 cqlsh
➤ Run nodetool (e.g. to check status):
docker exec -it cas1 nodetool status
29. CASSANDRA NODETOOL
➤ Get a quick summary of the node:
nodetool info
➤ Get version of Cassandra:
nodetool version
30. CASSANDRA NODETOOL
➤ Get status of the cluster/keyspace:
nodetool status <keyspace_name>
➤ View the network statistics of the node:
nodetool netstats
➤ Get statistics for a table (newer releases name this command tablestats):
nodetool cfstats <keyspace_name.table_name>
31. CASSANDRA NODETOOL
➤ Repair a node (for example, weekly during off-peak hours):
nodetool repair
➤ Clean up keys that no longer belong to a node:
nodetool cleanup
➤ Start a major compaction process:
nodetool compact
➤ Check the compaction process:
nodetool compactionstats
32. CASSANDRA NODETOOL
➤ Decommission a live node (run it on the node that is leaving the cluster):
nodetool decommission
➤ Remove a dead node from the cluster:
nodetool removenode <node_UUID>
➤ Take a snapshot (for backup):
nodetool snapshot
➤ Remove previous snapshots:
nodetool clearsnapshot
34. BACKUP AND RECOVERY
➤ Back up a cluster:
1. Take a snapshot of each node.
2. Move the snapshots to other storage (for example, an S3 bucket).
3. Clear the snapshots on the nodes.
➤ Restore node(s):
1. Make sure the schema exists.
2. Truncate the table.
3. Copy the most recent snapshots to a directory whose path ends in "keyspace/tablename", then run:
sstableloader -d <nodeip> keyspace/tablename
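The backup and restore steps above can be collected into a small helper that only builds the command lines and prints them, without running anything. This is a sketch under assumptions: the function names, the snapshot tag "nightly", the staging directory "/restore", and the node address are hypothetical, and the copy to remote storage is left out because it depends on the site.

```python
import posixpath
import shlex


def backup_commands(keyspace, tag):
    """Commands for backup steps 1 and 3, to be run on every node.

    Step 2 (moving the snapshot files to other storage) happens between them.
    """
    return [
        ["nodetool", "snapshot", "-t", tag, keyspace],   # step 1: take a tagged snapshot
        ["nodetool", "clearsnapshot", "-t", tag],        # step 3: remove it after the copy
    ]


def restore_command(node_ip, staging_dir, keyspace, table):
    """sstableloader expects a directory whose path ends in keyspace/tablename."""
    load_dir = posixpath.join(staging_dir, keyspace, table)
    return ["sstableloader", "-d", node_ip, load_dir]


for cmd in backup_commands("demo", "nightly"):
    print(shlex.join(cmd))
print(shlex.join(restore_command("10.0.0.1", "/restore", "demo", "democlients")))
# nodetool snapshot -t nightly demo
# nodetool clearsnapshot -t nightly
# sstableloader -d 10.0.0.1 /restore/demo/democlients
```

Tagging the snapshot makes it possible to clear exactly the snapshot that was copied, instead of every snapshot on the node.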
35. BUILD A BACKUP NODE
➤ Use multi-DC replication:
CREATE KEYSPACE "MyKeyspace"
WITH replication = {
'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3, 'datacenter2' : 1 };
[Figure: client, a datacenter with RF=3, and a backup node used for snapshots]
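It can help to work out what the replication map above means for each consistency level. The following sketch only does the arithmetic; the helper names are illustrative, and the datacenter names are the ones used in the CREATE KEYSPACE statement.

```python
replication = {"datacenter1": 3, "datacenter2": 1}


def quorum(replicas):
    """Quorum size: N/2 + 1, using integer (floor) division."""
    return replicas // 2 + 1


total_replicas = sum(replication.values())
global_quorum = quorum(total_replicas)
local_quorums = {dc: quorum(rf) for dc, rf in replication.items()}


def quorum_survives_loss_of(datacenter):
    """Can QUORUM still be met if every replica in one datacenter is down?"""
    live = total_replicas - replication[datacenter]
    return live >= global_quorum


print(total_replicas)                           # 4
print(global_quorum)                            # 3
print(local_quorums)                            # {'datacenter1': 2, 'datacenter2': 1}
print(quorum_survives_loss_of("datacenter2"))   # True
print(quorum_survives_loss_of("datacenter1"))   # False
```

With this layout, losing the single-replica backup datacenter still leaves three live replicas, enough for QUORUM, and clients in datacenter1 using LOCAL_QUORUM need only two local replicas and never wait on the backup node.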