2. Who am I and What do we do?
•Adam Hutson
•Data Architect of DataScale -> www.datascale.io
•DataStax MVP for Apache Cassandra
•DataScale provides hosted data platforms as a service
•Offering Cassandra & Spark, with more to come
•Currently hosted in Amazon or Azure
3. Fun Fact
•DataScale was purchased by DataStax last week
•It was publicly made official today … Surprise!
5. What is Big Data?
Small Data - flat file, script-based
Medium Data - single server; typical RDBMS; ACID
Big Data - multiple servers; replication = lag; sharding = management headache;
expensive machines; no longer ACID; lots of people to maintain
6. Cassandra Overview
Distributed database management system
Peer-to-peer design (no master, no slaves)
Can run on commodity machines
No single point of failure
Has linear scalability
Is a cluster of equal machines arranged in a ring of hash (token) ranges
Chooses Availability & Partition Tolerance over Consistency; is eventually consistent
Data is replicated automatically
7. Cassandra Origins
Based on Amazon’s Dynamo and Google’s BigTable
Created at Facebook for the Inbox search system in 2007
Facebook open-sourced it on Google Code in 2008
Became an Apache Incubator project in 2009
By 2010, it graduated to a top-level project at Apache
Apache Cassandra can be run completely royalty-free
DataStax offers a licensed/corporate version with additional tools/integrations
8. Why Cassandra over traditional RDBMS?
RDBMS Pros:
Single Machine
ACID guarantees
Scales vertically (bigger machine)
RDBMS Cons:
Growing past Single Machine
Scale horizontal = Replication lag
Sharding = Complicated codebase
Failover = on-call headache
9. Why Cassandra over traditional RDBMS?
Cassandra Pros:
Scale horizontal
Shard-less
Failure indifferent
CAP: AP with tunable C
Cassandra Cons:
Eventually Consistent
10. ACID
Atomicity = All or nothing transactions
Consistency = Guarantees committed transaction state
Isolation = Transactions are independent
Durability = Committed data is never lost
A nice warm, fuzzy feeling that transactions are perfect.
Cassandra follows the A, I, & D in ACID, but not the C; it is eventually consistent
In NoSQL, we make trade-offs to serve the greatest need
11. CAP Theorem
Consistency = all nodes see the same data at the same time
Availability = every request gets a response about whether it succeeded or failed
Partition Tolerance = the system continues to operate despite partition failures
You have to choose 2: CA, CP, or AP
Cassandra chooses AP; To be highly available in a network partition
13. Architecture Terms
Nodes: where data is stored; typically a logical machine
Data Center: collection of nodes; assigned replication; logical workload grouping;
should not span physical location
Cluster: one or more data center(s); can span physical location
Commit Log: first stop for any write; provides durability
Table: collection of ordered columns fetched by row
SSTable: sorted string table; immutable file; append only; sequentially stored
14. Architecture Components
Gossip: peer-to-peer communication protocol; discovers/shares data & locations
Partitioner: determines how to distribute data to replicas
Replication Factor: determines how many replicas to maintain
Replica Placement Strategy: determines which node(s) will contain replicas
Snitch: defines groups of nodes into data centers & racks that replication uses
15. What is a Node?
Just a small part of a big system
Represents a single machine (Server/VM/Container)
Has a JVM running the Cassandra Java process
Can run anywhere (RaspberryPi/laptop/cloud/on-premise)
Responsible for writing & reading its data
Typically 3,000-5,000 transactions/sec per core & 1-3 TB of data
16. Cluster
A cluster is a bunch of nodes that together are responsible for all the data.
Each node is responsible for a different range of the data, aka the token ranges.
A cluster can hold token values from -2^63 to 2^63-1.
The ring starts with the smallest number and circles clockwise to the largest.
17. We are using 0-99 to represent tokens, because -2^63 to 2^63-1 is too hard to visualize.
Node 1 is responsible for tokens 76-0
Node 2 is responsible for tokens 1-25
Node 3 is responsible for tokens 26-50
Node 4 is responsible for tokens 51-75
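To make the wraparound concrete, here is a toy sketch (illustrative Python, not Cassandra code) that maps a token in this 0-99 ring to the node that owns it:
def owning_node(token):
    # Ranges taken from the diagram above; Node 1's range wraps around 99 -> 0
    if 1 <= token <= 25:
        return 2
    if 26 <= token <= 50:
        return 3
    if 51 <= token <= 75:
        return 4
    return 1  # tokens 76-99 and 0

owning_node(67)  # -> Node 4
owning_node(87)  # -> Node 1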
18. Replication
Replication is when we store replicas of data on multiple nodes to ensure reliability
and fault tolerance. The total number of replicas is the replication factor.
The cluster just shown had the most basic replication factor of 1 (RF=1).
Each node was responsible for only its own data.
What happens if a node is lost/corrupt/offline? → We need replicas.
Change the replication factor to 2 (RF=2)
Each node will be responsible for its own data and its neighbor's data (see the sketch below for how RF is set).
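The replication factor is set per keyspace. Here is a minimal sketch of what that looks like through the Python driver introduced later in this deck (the keyspace name demo is made up for illustration):
from cassandra.cluster import Cluster

cluster = Cluster()          # defaults to localhost
session = cluster.connect()

# RF=2: every row in this keyspace is stored on 2 nodes, as in the next diagram
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")

cluster.shutdown()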
19. We are using 0-99 to represent tokens, because -2^63 to 2^63-1 is too hard to visualize.
Replication Factor = 2
Node 1 is responsible for tokens 76-0 & 1-25
Node 2 is responsible for tokens 1-25 & 26-50
Node 3 is responsible for tokens 26-50 & 51-75
Node 4 is responsible for tokens 51-75 & 76-0
20. Replication with Node Failure
Sooner or later, we’re going to lose a node. With replication, we’re covered.
Node 1 goes offline for maintenance, or someone turned off that rack.
What if I need a row whose token falls in a range Node 1 was responsible for (e.g., token 87)?
Node 4 saves the day because it has a replica of Node 1’s range.
21. Consistency
We want our data to be consistent.
We have to accept that being Available and Partition Tolerant is more important
to us than being strictly consistent.
Consistency is tunable though. We can choose to have certain queries have
stronger consistency than others.
The client can specify a Consistency Level for each read or write query issued.
22. Consistency Level
ONE: only one replica has to acknowledge the read or write
QUORUM: a majority (more than half) of the replicas have to acknowledge the read or write
ALL: all of the replicas have to acknowledge the read or write
With Multiple Data Centers:
LOCAL_ONE: only one replica in the local data center
LOCAL_QUORUM: a majority of the replicas in the local data center (the quorum arithmetic is sketched below)
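For reference, a quorum is a strict majority of the replica count; a tiny sketch of the arithmetic (illustrative, not part of the driver):
def quorum(replication_factor):
    # More than half of the replicas must acknowledge
    return replication_factor // 2 + 1

quorum(3)  # -> 2
quorum(5)  # -> 3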
23. Consistency & Replication working together
Scenario: 4 Nodes with Replication Factor of 3
Desire High Write Volume/Speed, Low Read frequency
A write at Consistency Level ONE waits for 1 replica to acknowledge; replication syncs the other 2 in the background.
A read at Consistency Level QUORUM reads from 2 of the 3 replicas.
Desire High Read Volume/Speed, Low Write frequency
A write at Consistency Level QUORUM waits for 2 replicas to acknowledge; replication syncs the other 1.
A read at Consistency Level ONE reads from 1 replica.
24. Peer-to-peer
Cassandra is not Client/Server or Master/Slave or Read-Write/Read-Only.
No routing data to shards; no holding leader elections, no split brains.
In Cassandra, every node is a peer to every other node.
Every instance of Cassandra is running the same java process.
Each node is independent and completely replaceable.
25. Gossip
One node tells its neighbor nodes
Its neighbor nodes tell their neighbor nodes
Their neighbor nodes tell more neighbor nodes
Soon, every node knows about every other node’s business.
“Node X is down”
“Node Y is joining”
“There’s a new data center”
“The east coast data center is gone”
26. Hinted Handoff
Failed writes will happen. When a write happens, the nodes try to get the new
data to all of the replicas.
If one of the replica nodes is offline, then the other replica nodes are going to
remember what data the down node was supposed to receive, aka keep hints.
When that node appears online again, then the other nodes that kept hints are
going to handoff those hints to the newly online node.
Hints are kept for a tunable period, defaults to 3 hours.
27. Write Path
Nothing to do with which node the data will be written to, and everything to do with
what happens internally in the node to get the data persisted.
When the data enters the Cassandra java process, the data is written to 2 places:
1. First appended to the Commit Log file.
2. Second to the MemTable.
The Commit Log is immediately durable and is persisted to disk.
The MemTable is a representation of the data in RAM.
Afterwards, an acknowledgement of the write goes back to the client.
28. Write Path (cont’d)
Once the MemTable fills up, it flushes its data to disk as an SSTable.
If the node crashes before the data is flushed, then the Commit Log is replayed to re-populate the MemTable.
29. Read Path
The idea is that it looks in the MemTable first, then in the SSTable.
The MemTable has the most recent partitions of data.
The SSTable files are sequential, but can get really, really big.
Partition Index file keeps track of partitions and the offset of their locations in the
SSTable, but they too can get large.
Summary Index file keeps track of offsets in Partition Index.
30. Read Path (cont’d)
We can go faster by using the Key Cache, an in-memory index of partitions and their
offsets in the SSTable. It skips the Partition & Summary Index files, but only works on
previously requested keys.
But which SSTable/Partition/Summary file should be looked at? Bloom Filters are a
probabilistic data structure that can say the key you're looking for is "definitely not
there" or "maybe there". The false-positive "maybes" are rare, but tunable.
31. Deleting Data
Data is never deleted in Cassandra. When a column value in a row, or an entire
row is requested to be deleted, Cassandra actually writes additional data that
marks the column or row with a timestamp that says it’s deleted. This is called a
tombstone.
Whenever a read occurs, it will skip over the tombstoned data and not return it.
Skipping over tombstones still incurs I/O though. There's even a metric that will tell
you the average number of tombstones being read. So we'll need to remove them from the
SSTables at some point.
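As an illustration, a delete issued through the Python driver (covered later) is just another write that produces a tombstone; this sketch assumes the time_series table from the Python examples below:
# Cassandra writes a tombstone for this partition instead of removing data in place
session.execute(
    "DELETE FROM time_series WHERE source_id = '123' AND date = '2016-09-01'"
)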
32. Compaction
Compaction is the act of merging SSTables together. But why → Housekeeping.
SSTables are immutable, so we can never update the file to change a column’s
value.
If your client writes the same data 3 times, there will be 3 entries in potentially 3
different SSTables. (This assumes the MemTable flushed the data in-between
writes).
Reads have to read all 3 SSTables and compare the write timestamps to get the
correct value.
33. Compaction (cont’d)
What happens is that Cassandra will compact the SSTables by reading in those 3
SSTables and writing out a new SSTable with only the single entry. The 3 older
SSTables will get deleted at that point.
Compaction is when tombstones are purged too. Tombstones are kept around
long enough so that we don’t get “phantom deletes” though (tunable period).
Compaction will keep your future seeks on disk low.
There’s a whole algorithm on when Compaction runs, but it’s automatically setup
by default.
34. Repairs
Repairs are needed because, over time, distributed data naturally gets out of sync
across its replicas. Repairs just make sure that all your nodes are consistent.
Repairs happen at two times.
1. After each read, there is a (tunable) chance that a repair will occur. When a
client requests to read a particular key, a background process will gather all
the data from all the replicas, and update all the replicas to be consistent.
2. At scheduled times that are manually controlled by an admin.
35. Failure Recovery
Sometimes nodes go down due to maintenance, or a real catastrophe.
Cassandra will keep track of down nodes with gossip. Hints are automatically
held for a (tunable) period. So when/if the node comes back online, the other
nodes will tell it what it missed.
If the node doesn’t come back online, you have to create a new node to replace it.
Assign it the same tokens as the lost node, and the other nodes will stream the
necessary data to it from replicas.
36. Scaling
Scaling is when you add more capacity to the cluster. Typically, this is when you
add more nodes.
You create a new node(s) and add it to the cluster.
A new node will join the ring and take responsibility for a part of the token ranges.
While it’s joining, other nodes will stream data for the token ranges it will own.
Once it’s fully joined, the node will start participating in normal operations.
38. Python - Getting Started
Install the python driver via pip:
pip install cassandra-driver
In a .py file, create a cluster & session object and connect to it:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
The cluster object represents a Cassandra cluster; with no arguments it points at localhost.
The session object will manage all connection pooling.
Only create a session object once in your codebase and reuse it throughout.
If not, that repeated initial connection establishment will eventually become a bottleneck.
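A minimal sketch of that reuse pattern (the module and function names are just illustrative):
# cassandra_session.py - build the cluster & session once, share them everywhere
from cassandra.cluster import Cluster

_cluster = Cluster()          # localhost by default
_session = _cluster.connect()

def get_session():
    # Every caller shares the same session and its connection pool
    return _session
Any other module then just does: from cassandra_session import get_session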
39. Specifying a Cluster
Localhost is great for sandboxing, but soon a real cluster with real IPs will be needed.
cluster = Cluster(['54.211.95.95', '52.90.150.156', '52.87.198.119'])
This is the set of IPs from a demo cluster I’ve created.
40. Authorization
Every cluster should have at least PlainTextAuthProvider enabled.
import cassandra.auth
my_auth_provider = cassandra.auth.PlainTextAuthProvider(
username='adamhutson',
password='P@$$w0rD'
)
There are other methods of authorization, but this is the most common.
Add the above snippet before you create your Cluster object.
Then pass the my_auth_provider object to the Cluster's auth_provider keyword argument.
cluster = Cluster(auth_provider=my_auth_provider)
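Putting this slide together with the previous one, you would typically pass both the contact points and the auth provider when building the Cluster; a short sketch reusing the demo IPs and credentials above:
cluster = Cluster(
    ['54.211.95.95', '52.90.150.156', '52.87.198.119'],
    auth_provider=my_auth_provider
)
session = cluster.connect()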
41. Keyspace Selection
There are 3 ways to specify a keyspace to use:
1. session = cluster.connect('my_keyspace')
2. session.set_keyspace('my_keyspace')
3. session.execute('select * from my_keyspace.my_table')
It doesn’t matter which way you choose to go, just be consistent with your selection.
Personally, I use choice #2, as I can run it before I interact with the database.
It keeps me from pinning myself to a single keyspace at session creation time.
It also keeps me from having to type out the keyspace name every time I write a DML statement.
42. Simple Statement
The first thing most will want to do is select some data out of Cassandra. Let’s select from a demo
time_series table (the same one the next slide prepares a statement against).
session.set_keyspace('training')
rows = session.execute("SELECT source_id, date, event_time, event_value "
                       "FROM time_series "
                       "WHERE source_id = '123' AND date = '2016-09-01'")
for row in rows:
    print(row.source_id, row.date, row.event_time, row.event_value)
43. Prepared/Bound Statement
Every time we run that Simple Statement from above, Cassandra has to compile the query.
What if you’re going to run the same select repeatedly?
That compile time will become a bottleneck.
session.set_keyspace('training')
prepared_stmt = session.prepare("SELECT source_id, date, event_time, event_value "
                                "FROM time_series "
                                "WHERE source_id = ? AND date = ?")
bound_stmt = prepared_stmt.bind(['123', '2016-09-01'])
rows = session.execute(bound_stmt)
for row in rows:
    print(row.source_id, row.date, row.event_time, row.event_value)
44. Batch Statement
A batch groups operations so that they are applied atomically. You can also specify a BatchType.
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement

insert_user = session.prepare("INSERT INTO users (name, age) VALUES (?, ?)")
batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
for (name, age) in users_to_insert:  # users_to_insert: a list of (name, age) tuples
    batch.add(insert_user, (name, age))
session.execute(batch)
Be careful. This can potentially be inserting to token ranges all over the cluster. Best practice is to use
batches for inserting multiple rows into the same partition.
45. Consistency Level
Consistency Level can be specified as a session-wide default, and also per statement.
Just need to import the necessary library and set it.
This setting will remain with the session until you destroy the object or set it to a different CL.
from cassandra import ConsistencyLevel
session.default_consistency_level = ConsistencyLevel.ONE
There are a bunch of session-level options that you can specify.
Most of the same options are also available at the Statement level (a quick sketch follows below).
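For a per-statement override (rather than the session-wide default above), here is a minimal sketch using SimpleStatement against the same time_series table from the earlier examples:
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# This one read requires a quorum of replicas, regardless of the session default
query = SimpleStatement(
    "SELECT source_id, date, event_time, event_value FROM time_series "
    "WHERE source_id = '123' AND date = '2016-09-01'",
    consistency_level=ConsistencyLevel.QUORUM
)
rows = session.execute(query)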
46. Shutdown
This is so simple, but so important. Finish every script with the following:
cluster.shutdown()
If you don’t do this at the end of your python file, you will leak connections on the server side.
I’ve done it, and it was completely embarrassing. Learn from my mistakes. Don’t forget it.