Apache Cassandra is a scalable, fault-tolerant database that has found its way into more than 25% of the Fortune 100 and continues to enjoy significant adoption in the marketplace. In this talk we'll introduce you to Cassandra, explore some of its internals, and discuss CQL (the SQL-like query language for Cassandra). We'll finish by talking about how some companies are using it for services you probably interact with in your daily life. You'll leave with all the tools you need to start exploring Cassandra on your own.
1. Introduction to Apache Cassandra
Luke Tillman (@LukeTillman)
Language Evangelist at DataStax
2. Who are you?!
•Evangelist with a focus on the .NET Community
•Long-time Developer
•Recently presented at Cassandra Summit 2014 with Microsoft
•Very Recent Denver Transplant
3. DataStax and Cassandra
•DataStax Enterprise
–Apache Cassandra, now with more QA!
–Easy integrations with Solr, Apache Spark, Hadoop
•Dev and Ops Tooling
–DevCenter IDE, OpsCenter
•Open source drivers
–Java, C#, Python, C++, Ruby, NodeJS
4. •Unlimited, free use of DataStax Enterprise
•No limit on number of nodes or other hidden restrictions
•If you’re a startup, it’s free.
•Requirements:
–< $2M annual revenue, < $20M capital raised
www.datastax.com/startups
5. Agenda
1. What is Cassandra?
2. How does it work?
3. Cassandra Query Language (CQL)
4. Who’s using it?
5. Questions
7. What is Cassandra?
•A Linearly Scaling and Fault Tolerant Distributed Database
•Fully Distributed
–Data spread over many nodes
–All nodes participate in a cluster
–All nodes are equal
–No SPOF (shared nothing)
8. What is Cassandra?
•Linearly Scaling
–Have More Data? Add more nodes.
–Need More Throughput? Add more nodes.
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
9. What is Cassandra?
•Fault Tolerant
–Nodes Down != Database Down
–Datacenter Down != Database Down
10. What is Cassandra?
•Fully Replicated
•Clients write local
•Data syncs across WAN
•Replication Factor per DC
[Diagram: clients write to their local DC; data syncs across the WAN between the US and Europe DCs]
11. Cassandra and the CAP Theorem
•The CAP Theorem limits what distributed systems can do
•Consistency
•Availability
•Partition Tolerance
•Limits? “Pick 2 out of 3”
12. Cassandra and the CAP Theorem
Consistency
•When I ask the same question to any part of the system, I should get the same answer
[Illustration] Q: “Is he guilty yet?” A: “No.” / “No.” / “No.” → Consistent
13. Cassandra and the CAP Theorem
Consistency
•When I ask the same question to any part of the system, I should get the same answer
[Illustration] Q: “Is he guilty yet?” A: “No.” / “Yes.” / “Yes.” → Not Consistent
14. Cassandra and the CAP Theorem
Availability
•When I ask a question, I will get an answer
[Illustration] Q: “Is he guilty yet?” A: “Yes.” → Available
15. Cassandra and the CAP Theorem
Availability
•When I ask a question, I will get an answer
[Illustration] Q: “Is he guilty yet?” A: “I don’t know, we have to wait for Dreamy to wake up.” → Not Available
16. Cassandra and the CAP Theorem
Partition Tolerance
•I can ask questions even when the system is having intra-system communication problems.
[Illustration] Team Tyrion and Team Cersei can’t communicate. Q: “Is he guilty yet?” A: “No.” → Tolerant
17. Cassandra and the CAP Theorem
Partition Tolerance
•I can ask questions even when the system is having intra-system communication problems.
[Illustration] Team Tyrion and Team Cersei can’t communicate. Q: “Is he guilty yet?” A: “I’m not sure without asking them and we’re not speaking (I’m pretty sure that one helped kill my sister).” → Not Tolerant
18. Cassandra and the CAP Theorem
•Cassandra is an AP system that is Eventually Consistent
[Illustration] Q: “Is he guilty yet?” A: “No.” / “Wait, he’s going to take the black. Yes.” / “No.” → Eventually Consistent
19. Cassandra and the CAP Theorem
•Cassandra is an AP system that is Eventually Consistent
[Illustration] Q: “Is he guilty yet?” A: “Yes.” / “Yes.” / “Yes.” → Eventually Consistent
21. Two knobs control Cassandra fault tolerance
•Replication Factor (server side)
–How many copies of the data should exist?
[Diagram: client writes A to a four-node ring; with RF=3, copies of A are stored on three of the four nodes]
22. Two knobs control Cassandra fault tolerance
•Consistency Level (client side)
–How many replicas do we need to hear from before we acknowledge?
[Diagram: Write A at CL=QUORUM, the coordinator waits for a majority of the three replicas to ack; Write A at CL=ONE, it acks after a single replica responds]
23. Consistency Levels
•Applies to both Reads and Writes (i.e. is set on each query)
•ONE – one replica from any DC
•LOCAL_ONE – one replica from local DC
•QUORUM – a majority (more than half) of replicas across all DCs
•LOCAL_QUORUM – a majority of replicas in the local DC
•ALL – all replicas
•TWO – two replicas from any DC
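The quorum arithmetic behind these levels is simple enough to sketch. This is an illustrative calculation, not Cassandra source code; the function names are my own:

```python
# Sketch (not Cassandra source): quorum size and the number of down
# replicas a consistency level can tolerate, for a given replication factor.

def quorum(rf: int) -> int:
    """A quorum is a majority of replicas: more than half."""
    return rf // 2 + 1

def tolerated_failures(rf: int, required_acks: int) -> int:
    """How many replicas can be down while the level is still met."""
    return rf - required_acks

rf = 3
print(quorum(rf))                           # 2 replicas must ack at QUORUM
print(tolerated_failures(rf, quorum(rf)))   # QUORUM survives 1 down replica
print(tolerated_failures(rf, 1))            # CL=ONE survives 2 down replicas
```

Note that at RF=3, ALL requires every replica and so tolerates zero failures.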
24. Consistency Level and Speed
•How many replicas we need to hear from can affect how quickly we can read and write data in Cassandra
[Diagram: Read A at CL=QUORUM; replicas ack in 5 μs, 12 μs, and 12 μs, so the quorum is met without waiting for the slow replica at 300 μs]
25. Consistency Level and Availability
•Consistency Level choice affects availability
•For example, QUORUM can tolerate one replica being down and still be available (in RF=3)
[Diagram: Read A at CL=QUORUM with RF=3; two replicas answering A=2 satisfy the quorum even with one replica down]
26. Consistency Level and Eventual Consistency
•Cassandra is an AP system that is Eventually Consistent so replicas may disagree
•Column values are timestamped
•In Cassandra, Last Write Wins (LWW)
[Diagram: Read A at CL=QUORUM; one replica still holds the older A=1, but the newer timestamped A=2 wins]
Christos from Netflix: “Eventual Consistency != Hopeful Consistency” https://www.youtube.com/watch?v=lwIA8tsDXXE
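Last Write Wins is easy to demonstrate. A toy reconciliation, assuming each replica returns a (value, timestamp) pair; the data and function name are hypothetical, not the driver API:

```python
# Sketch of last-write-wins reconciliation across replica responses.
# Each response is a (value, timestamp) pair; the newest timestamp wins.

def resolve(replica_responses):
    """Return the value carrying the newest timestamp (LWW)."""
    return max(replica_responses, key=lambda vt: vt[1])[0]

# One replica missed the latest update and still holds A=1.
responses = [("A=2", 200), ("A=1", 100), ("A=2", 200)]
print(resolve(responses))  # A=2 — the stale A=1 loses
```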
27. Writes in the cluster
•Fully distributed, no SPOF
•Node that receives a request is the Coordinator for request
•Any node can act as Coordinator
[Diagram: client sends Write A (CL=ONE) to one node, which acts as the Coordinator Node for the request]
28. Writes in the cluster – Data Distribution
•Partition Key determines node placement
Partition Key: id
•id='pmcfadin': lastname='McFadin'
•id='jhaddad': firstname='Jon', lastname='Haddad'
•id='ltillman': firstname='Luke', lastname='Tillman'
CREATE TABLE users (
  id text,
  firstname text,
  lastname text,
  PRIMARY KEY (id)
);
29. Writes in the cluster – Data Distribution
•The Partition Key is hashed using a consistent hashing function (Murmur3) and the output is used to place the data on a node
•The data is also replicated to RF-1 other nodes
[Diagram: Partition Key id='ltillman' is hashed with Murmur3 to token A and placed on the owning node; with RF=3 it is replicated to two more nodes in the ring]
30. Hashing – Back to Reality
•Back in reality, Partition Keys actually hash to 128 bit numbers
•Nodes in Cassandra own token ranges (i.e. hash ranges)
Range | Start | End
A | 0xC000000..1 | 0x0000000..0
B | 0x0000000..1 | 0x4000000..0
C | 0x4000000..1 | 0x8000000..0
D | 0x8000000..1 | 0xC000000..0

Partition Key id='ltillman' → Murmur3 → 0xadb95e99da887a8a4cb474db86eb5769
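Token-range ownership can be sketched in a few lines. This toy uses md5 as a stand-in for Murmur3 (both produce 128-bit output, but md5 is not the real partitioner) and hard-codes the four equal quarter-ring ranges from the table above:

```python
# Toy token-range lookup on a 128-bit ring split into four equal ranges
# owned by nodes B, C, D, A (wrapping), as in the table above.
# md5 stands in for Murmur3; it is NOT Cassandra's real partitioner.
import hashlib

def token_of(key: str) -> int:
    """Hash a partition key to a 128-bit token."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

def owner(key: str) -> str:
    """Map the token to the node owning that quarter of the ring."""
    quarter = 2 ** 126  # ring size 2**128 split four ways
    return ["B", "C", "D", "A"][token_of(key) // quarter]

print(owner("ltillman"))  # deterministic: the same key always lands
print(owner("ltillman"))  # on the same node
```

With RF=3, the data would additionally be replicated to the next RF-1 nodes around the ring.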
31. Writes on a single node
•Client makes a write request
[Diagram: client sends UPDATE users SET firstname = 'Luke' WHERE id = 'ltillman' to a node, shown with its Memory and Disk]
32. Writes on a single node
•Data is appended to the Commit Log
•Cassandra writes are FAST due to append-only (log-structured) storage
[Diagram: the write is appended to the Commit Log on Disk: id='ltillman', firstname='Luke']
33. Writes on a single node
•Data is written to Memtable
[Diagram: the write is added to the Memtable for Users in Memory (id='ltillman', firstname='Luke', lastname='Tillman'); the Commit Log entry stays on Disk]
34. Writes on a single node
•Server acknowledges to client
[Diagram: with the write in the Commit Log and the Memtable, the server acknowledges to the client]
35. Writes on a single node
•Once Memtable is full, data is flushed to disk as SSTable (Sorted String Table)
[Diagram: the Memtable for Users is flushed to the Data Directory on Disk as SSTable #2 for Users, joining SSTable #1 and other SSTables]
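The whole write path fits in a small toy model. This is an illustrative sketch, not Cassandra code; class and field names are my own:

```python
# Toy write path: append to the commit log, update the memtable,
# flush to an immutable sorted SSTable when the memtable fills up.

class ToyNode:
    def __init__(self, flush_at=2):
        self.commit_log = []   # append-only, sequential disk writes
        self.memtable = {}     # in-memory rows, keyed by partition key
        self.sstables = []     # immutable, sorted "files" on disk
        self.flush_at = flush_at

    def write(self, key, columns):
        self.commit_log.append((key, columns))             # 1. durability first
        self.memtable.setdefault(key, {}).update(columns)  # 2. memtable
        if len(self.memtable) >= self.flush_at:            # 3. flush when full
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}
        return "ack"                                       # 4. acknowledge

node = ToyNode()
node.write("ltillman", {"firstname": "Luke", "lastname": "Tillman"})
```

Every step is an append or an in-memory update, which is why the write path avoids random disk I/O.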
36. Compaction
•Compactions merge and unify data in our SSTables
•SSTables are immutable, so this is when we consolidate rows
SSTable #1: id='ltillman', firstname='Lucas' (timestamp=Older), lastname='Tillman'
SSTable #2: id='ltillman', firstname='Luke' (timestamp=Newer)
→ Compacted into SSTable #3: id='ltillman', firstname='Luke', lastname='Tillman'
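The merge rule (newest timestamp wins per column) can be sketched directly. The dict-of-dicts layout here is illustrative, not Cassandra's real file format:

```python
# Sketch of compaction: merge rows from several SSTables into one,
# keeping the newest-timestamped value for each column.

def compact(sstables):
    """sstables: list of {partition_key: {column: (value, timestamp)}}."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})
            for col, (value, ts) in columns.items():
                if col not in row or ts > row[col][1]:
                    row[col] = (value, ts)  # newer timestamp wins
    return merged

sstable1 = {"ltillman": {"firstname": ("Lucas", 1), "lastname": ("Tillman", 1)}}
sstable2 = {"ltillman": {"firstname": ("Luke", 2)}}
print(compact([sstable1, sstable2]))
# firstname='Luke' (newer) wins; lastname='Tillman' carries over
```

Reads apply the same merge across SSTables and the memtable, which is why fewer SSTables means faster reads.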
37. Reads in the cluster
•Same as writes in the cluster, reads are coordinated
•Any node can be the Coordinator Node
[Diagram: client sends Read A (CL=QUORUM) to one node, which acts as the Coordinator Node]
38. Reads on a single node
•Client makes a read request
[Diagram: client sends SELECT firstname, lastname FROM users WHERE id = 'ltillman' to a node, shown with its Memory and Disk]
39. Reads on a single node
•Data is read from (possibly multiple) SSTables and merged
•Reads in Cassandra are also FAST but are limited by Disk IO
SSTable #1: id='ltillman', firstname='Lucas' (timestamp=Older), lastname='Tillman'
SSTable #2: id='ltillman', firstname='Luke' (timestamp=Newer)
→ Merged result: firstname='Luke', lastname='Tillman'
40. Reads on a single node
•Any unflushed Memtable data is also merged
[Diagram: the Memtable for Users contributes unflushed data (firstname='Luke', lastname='Tillman') to the merge]
41. Reads on a single node
•Client gets acknowledgement with the data
[Diagram: the client receives the result: firstname='Luke', lastname='Tillman']
42. Compaction - Revisited
•Compactions merge and unify data in our SSTables, making them important to reads (fewer SSTables = less to read/merge)
SSTable #1: id='ltillman', firstname='Lucas' (timestamp=Older), lastname='Tillman'
SSTable #2: id='ltillman', firstname='Luke' (timestamp=Newer)
→ Compacted into SSTable #3: id='ltillman', firstname='Luke', lastname='Tillman'
44. Data Structures
•Keyspace is like RDBMS Database or Schema
•Like RDBMS, Cassandra uses Tables to store data
•Partitions can have one row (narrow) or multiple rows (wide)
Keyspace → Tables → Partitions → Rows
45. Schema Definition (DDL)
•Easy to define tables for storing data
•First part of Primary Key is the Partition Key
CREATE TABLE videos (
  videoid uuid,
  userid uuid,
  name text,
  description text,
  tags set&lt;text&gt;,
  added_date timestamp,
  PRIMARY KEY (videoid)
);
47. Clustering Columns
•Second part of Primary Key is Clustering Columns
•Clustering columns affect ordering of data (on disk)
•Multiple rows per partition
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
48. Clustering Columns – Wide Rows (Partitions)
•Use of Clustering Columns is where the term “Wide Rows” comes from
Partition videoid='0fe6a...':
•commentid='82be1...' (10/1/2014 9:36AM): userid='ac346...', comment='Awesome!'
•commentid='765ac...' (9/17/2014 7:55AM): userid='f89d3...', comment='Garbage!'
49. Inserts and Updates
•Use INSERT or UPDATE to add and modify data
•Both will overwrite data (no constraints like RDBMS)
•INSERT and UPDATE are functionally equivalent (both are upserts)
INSERT INTO comments_by_video (
videoid, commentid, userid, comment)
VALUES (
'0fe6a...', '82be1...', 'ac346...', 'Awesome!');
UPDATE comments_by_video
SET userid = 'ac346...', comment = 'Awesome!'
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
50. TTL and Deletes
•Can specify a Time to Live (TTL) in seconds when doing an INSERT or UPDATE
•Use DELETE statement to remove data
•Can optionally specify columns to remove part of a row
INSERT INTO comments_by_video ( ... )
VALUES ( ... )
USING TTL 86400;
DELETE FROM comments_by_video
WHERE videoid = '0fe6a...' AND commentid = '82be1...';
51. Querying
•Use SELECT to get data from your tables
•Always include Partition Key and optionally Clustering Columns
•Can use ORDER BY and LIMIT
•Use range queries (for example, by date) to slice partitions
SELECT * FROM comments_by_video
WHERE videoid = 'a67cd...'
LIMIT 10;
52. Cassandra Data Modeling
•Requires a different mindset than RDBMS modeling
•Know your data and your queries up front
•Queries drive a lot of the modeling decisions (i.e. “table per query” pattern)
•Denormalize/Duplicate data at write time to do as few queries as possible come read time
•Remember, disk is cheap and writes in Cassandra are FAST
53. Cassandra Data Modeling – A Quick Example
•Users need to be looked up by a unique Id, but when logging in, need to look them up by email address
•Some data is duplicated (email, userid) but that’s OK
CREATE TABLE users (
  userid uuid,
  firstname text,
  lastname text,
  email text,
  PRIMARY KEY (userid)
);
CREATE TABLE users_by_email (
email text,
password text,
userid uuid,
PRIMARY KEY (email)
);
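The write-time denormalization this implies can be sketched with plain dicts standing in for the two tables; the user data here is made up for illustration:

```python
# Sketch of the "table per query" pattern: the application writes the same
# user to both tables so each read is a single-partition lookup.
# Plain dicts stand in for the users and users_by_email tables.

users = {}           # keyed by userid -> lookup by unique Id
users_by_email = {}  # keyed by email  -> lookup at login

def create_user(userid, firstname, lastname, email, password):
    """One logical create = two physical writes (denormalized)."""
    users[userid] = {"firstname": firstname, "lastname": lastname,
                     "email": email}
    users_by_email[email] = {"password": password, "userid": userid}

create_user("u-1", "Luke", "Tillman", "luke@example.com", "s3cret")
print(users_by_email["luke@example.com"]["userid"])  # u-1
```

Two writes at create time buy two cheap single-key reads later, which is the trade Cassandra modeling favors.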
56. Some Common Use Case Categories
•Product Catalogs and Playlists
•Internet of Things (IoT) and Sensor Data
•Messaging (emails, IMs, alerts, comments)
•Recommendation and Personalization
•Fraud Detection
•Time series and temporal ordered data
http://planetcassandra.org/apache-cassandra-use-cases/
57. The “Slide Heard Round the World”
•From Cassandra Summit 2014, got a lot of attention
•75,000+ nodes
•10s of PBs of data
•Millions ops/s
•One of the largest known Cassandra deployments
58. Spotify
•Streaming music web service
•> 24,000,000 music tracks
•> 50TB of data in Cassandra
Why Cassandra?
•Was PostgreSQL, but hit scaling problems
•Multi Datacenter Availability
•Integration with Spark for data processing and analytics
Usage
•Catalog
•User playlists
•Artists following
•Radio Stations
•Event notifications
http://planetcassandra.org/blog/interview/spotify-scales-to-the-top-of-the-charts-with-apache-cassandra-at-40k-requestssecond/
59. eBay
•Online auction site
•> 250TB of data, dozens of nodes, multiple data centers
•> 6 billion writes, > 5 billion reads per day
Why Cassandra?
•Low latency, high scale, multiple data centers
•Suited for graph structures using wide rows
Usage
•Building next generation of recommendation engine
•Storing user activity data
•Updating models of user interests in real time
http://planetcassandra.org/blog/5-minute-c-interview-ebay/
60. FullContact
•Contact management: from multiple sources, sync, de-dupe, APIs available
•2 clusters, dozens of nodes, running in AWS
•Based here in Denver
Why Cassandra?
•Migrated from MongoDB after running into scaling issues
•Operational simplicity
•Resilience and Availability
Usage
•Person API (search by email, Twitter handle, Facebook, or phone)
•Searched data from multiple sources (ingested by Hadoop M/R jobs)
•Resolved profiles
http://planetcassandra.org/blog/fullcontact-readies-their-search-platform-to-scale-moves-from-mongodb-to-apache-cassandra/
61. Instagram
•Photo-sharing, video-sharing and social networking service
•Originally AWS (Now Facebook data centers?)
•> 20k writes/second, >15k reads/second
Why Cassandra?
•Migrated from Redis (problems keeping everything in memory)
•No painful “sharding” process
•75% reduction in costs
Usage
•Auditing information – security, integrity, spam detection
•News feed (“inboxes” or activity feed)
–Likes, Follows, etc.
http://planetcassandra.org/blog/instagram-making-the-switch-to-cassandra-from-redis-75-instasavings/
Summit 2014 Presentation: https://www.youtube.com/watch?v=_gc94ITUitY
62. Netflix
•TV and Movie streaming service
•2,700+ nodes across more than 90 clusters
•4 Datacenters
•> 1 Trillion operations per day
Why Cassandra?
•Migrated from Oracle
•Massive amounts of data
•Multi datacenter, No SPOF
•No downtime for schema changes
Usage
•Everything! (Almost – 95% of DB use)
•Example: Personalization
–What titles do you play?
–What do you play before/after?
–Where did you pause?
–What did you abandon watching after 5 minutes?
http://planetcassandra.org/blog/case-study-netflix/
Summit 2014 Presentation: https://www.youtube.com/watch?v=RMSNLP_ORg8&index=43&list=UUvP-AXuCr-naAeEccCfKwUA