4. #Cassandra13
The Spotify backend
• Around 4000 servers in 4 datacenters
• Volumes
- We have ~ 12 soccer fields of music
- Streaming ~ 4 Wikipedias/second
- ~ 24 000 000 active users
5. #Cassandra13
The Spotify backend
• Specialized software powering Spotify
- ~ 70 services
- Mostly Python, some Java
- Small, simple services, each responsible for a single task
6. #Cassandra13
Storage needs
• Used to be a pure PostgreSQL shop
• Postgres is awesome, but...
- Poor cross-site replication support
- Write master failure requires manual intervention
- Sharding throws most relational advantages out the window
8. #Cassandra13
Cassandra @ Spotify
• We started using Cassandra 2+ years ago
- ~ 24 services use it by now
- ~ 300 Cassandra nodes
- ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
• So we screwed up
• A lot
10. #Cassandra13
Read repair
• Repairs inconsistencies from outages during regular read operations
• With RR, all reads request hash digests from all nodes
• The result is still returned as soon as enough nodes have replied
• If there is a mismatch, perform a repair
11. #Cassandra13
Read repair
• Useful factoid: read repair is performed across all data centers
• So in a multi-DC setup, all reads result in requests being sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair (see the example below)
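For reference, read repair probabilities can be tuned per column family; a minimal cassandra-cli sketch, assuming a hypothetical column family name and the attribute names used around the 1.1 era (verify against your version):

    update column family playlist_head
      with read_repair_chance = 0.0
      and dclocal_read_repair_chance = 0.1;

This keeps read repair within the local data center most of the time instead of fanning every read out to every DC.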
12. #Cassandra13
Row cache
• Cassandra can be configured to cache entire data rows in RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
13. #Cassandra13
Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate the entire row
• Don't use it unless you understand all your use cases
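For reference, the row cache is enabled per column family on top of a global cache size; a minimal sketch assuming a hypothetical column family and roughly 1.1-era settings:

    update column family playlist_head with caching = 'rows_only';

together with a non-zero row_cache_size_in_mb in cassandra.yaml. Given the caveats above, measure the hit rate before trusting it.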
15. #Cassandra13
Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
• NO!
• Compression disables a bunch of fast paths, slowing down fast reads
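For reference, compression is a per-column-family option; a minimal cassandra-cli sketch, assuming a hypothetical column family and the 1.x attribute names:

    update column family playlist_head
      with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

Whether the I/O saved outweighs the lost read fast paths depends on the workload, so benchmark with your own data.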
17. #Cassandra13
Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But when you keep writing to the same columns over a long period of time, the rows spread over more SSTables
• And performance falls off a cliff
• We've seen clusters where reads touch a dozen SSTables on average
• nodetool cfhistograms is your friend
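For reference, the invocation looks roughly like this (keyspace and column family names are placeholders):

    nodetool -h <host> cfhistograms <keyspace> <column family>

The SSTables column is a histogram of how many SSTables recent reads had to touch; a long tail there is the symptom described above.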
18. #Cassandra13
Performance worse over time
• CASSANDRA-5514
• Every SSTable stores the first/last column of the SSTable
• Time series-like data is effectively partitioned
19. #Cassandra13
Few cross continent clusters
• Few cross-continent Cassandra users
• We are kind of on our own when it comes to some problems
• CASSANDRA-5148
• Disable TCP nodelay
• Reduced packet count by 20%
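For reference, the knob that grew out of CASSANDRA-5148 lives in cassandra.yaml (setting name as shipped in later releases, so check your version):

    inter_dc_tcp_nodelay: false

Leaving nodelay off lets the kernel coalesce small cross-DC packets (Nagle's algorithm), which is consistent with the ~20% packet reduction above.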
21. #Cassandra13
How not to upgrade Cassandra
• Very few total cluster outages
- Clusters have been up and running since the early 0.7 days, through rolling upgrades, expansions, full hardware replacements etc.
• Never lost any data!
- No matter how spectacularly Cassandra fails, it has never written bad data
- Immutable SSTables FTW
22. #Cassandra13
Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 → 0.8.6
• Everyone claimed rolling upgrade would work
- It did not
• One would expect 0.8.6 to have this fixed
• Patched Cassandra and rolled it out a day later
• Takeaways:
- ALWAYS try rolling upgrades in a testing environment
- Don't believe what people on the Internet tell you
24. #Cassandra13
Upgrade from 0.8 to 1.0
• We tried upgrading in test env, worked fine
• Worked fine in production...
• Except the last cluster
• All data gone
• Many keys per SSTable → corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrub data → fixed
• Takeaway: ALWAYS test upgrades using production data
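For reference, the scrub mentioned above rewrites a node's SSTables and is run per node; a minimal invocation (keyspace and column family arguments are placeholders):

    nodetool -h <host> scrub <keyspace> <column family>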
25. #Cassandra13
Upgrade from 1.0 to 1.1
• After the previous upgrades, we did all the tests with production data and everything worked fine...
• Until we redid it in production, and we had reports of missing rows
• Scrub → restart made them reappear
• This was in December, have not been able to reproduce
• PEBKAC?
• Takeaway: ?
29. #Cassandra13
What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Network hiccup
• Major GC
• Reality
30. #Cassandra13
What happens if one node is slow?
• Coordinator has a request queue
• If a node goes down completely, gossip will notice quickly and drop the node
• But what happens if a node is just super slow?
32. #Cassandra13
What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the coordinator on every node in the cluster fills up
• And the entire cluster stops accepting requests
• No single point of failure?
33. #Cassandra13
What happens if one node is slow?
• Solution: Partitioner awareness in client
• Max 3 nodes go down
• Available in Astyanax
35. #Cassandra13
How not to delete data
How is data deleted?
• SSTables are immutable, we can't remove the data
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other write
36. #Cassandra13
How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
• Once a tombstone is the only value for a specific column, the tombstone can go away
• Still need grace time to handle node downtime (see the example below)
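For reference, the grace time is the per-column-family gc_grace setting (gc_grace_seconds in CQL); a minimal cassandra-cli sketch with a hypothetical column family and the 10-day default:

    update column family playlist_head with gc_grace = 864000;

Tombstones younger than this are kept so that a node that was down can still learn about the delete via repair.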
37. #Cassandra13
How not to delete data
• Tombstones can only be deleted once all non-tombstone values have been deleted
• Tombstones can only be deleted if all values for the specified row are being compacted together
• If you're using SizeTiered compaction, 'old' rows will rarely get deleted
38. #Cassandra13
How not to delete data
• Tombstones are a problem even when using levelled compaction
• In theory, 90% of all rows should live in a single SSTable
• In production, we've found that only 50-80% of all reads hit only one SSTable
• In fact, frequently updated columns will exist in most levels, causing tombstones to stick around
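For reference, the compaction strategy is itself a per-column-family setting; switching to levelled compaction in cassandra-cli looks roughly like this (column family name and SSTable size are placeholders):

    update column family playlist_head
      with compaction_strategy = 'LeveledCompactionStrategy'
      and compaction_strategy_options = {sstable_size_in_mb: 160};

As the bullets above note, this helps but does not make tombstones on hot rows go away.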
39. #Cassandra13
How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will rarely get deleted
• The problem is much worse for «popular» rows
• Avoid schemas that delete data!
41. #Cassandra13
TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
• Noooooo...
• (Overwritten data could theoretically bounce back)
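For reference, a TTL is attached per column at write time; a minimal cassandra-cli sketch with purely hypothetical names:

    set some_cf[utf8('some-key')][utf8('some-column')] = utf8('some-value') with ttl = 86400;

Even though the expiry is known up front, the expired column is still carried through compaction much like a tombstone, which is why this isn't the free lunch it looks like.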
43. #Cassandra13
The Playlist service
Our most complex service
• ~ 1 billion playlists
• 40 000 reads per second
• 22 TB of compressed data
45. #Cassandra13
The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtimes, huge scalability problems
• Perfect test case for Cassandra!
47. #Cassandra13
Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed versioning system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature... said no one ever
48. #Cassandra13
Playlist data model
• Sequence of changes
• The changes are the authoritative data
• Everything else is optimization
• Cassandra is pretty neat for storing this kind of stuff
• Can use consistency level ONE safely
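A minimal sketch of how such an append-only change sequence might be laid out as a wide row, purely as an illustration (not necessarily Spotify's actual schema; all names are hypothetical):

    create column family playlist_changes
      with comparator = LongType
      and key_validation_class = UTF8Type
      and default_validation_class = UTF8Type;

    set playlist_changes[utf8('some-playlist-id')][long(1)] = utf8('add track A');
    set playlist_changes[utf8('some-playlist-id')][long(2)] = utf8('move track A to position 0');

Each change is appended as a new column keyed by an increasing sequence number, so writes never overwrite each other and reading the row back in column order replays the playlist's history.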
50. #Cassandra13
Tombstone hell
• The HEAD column family stores the sequence ID of the latest revision of each playlist
• 90% of all reads go to HEAD
• mlock
52. #Cassandra13
Tombstone hell
• Noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
  get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• Copy SSTables to a development machine for investigation
• The Cassandra tool sstable2json showed that the row contained 600 000 tombstones!
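For reference, that kind of offline inspection works on copied SSTable files directly (the Data.db filename here is hypothetical):

    sstable2json playlist_head-hc-1234-Data.db > playlist_head.json

Deleted columns show up in the JSON output with a "d" deletion marker, which makes counting tombstones in a row straightforward.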
54. #Cassandra13
Tombstone hell
• We expected tombstones would be deleted after 30 days
• Nope, all tombstones from the last 1.5 years were still there
• Revelation: rows existing in 4+ SSTables never have tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
• Major compaction (CF size cut in half)
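For reference, a major compaction is kicked off per node and column family with nodetool (names here are placeholders):

    nodetool -h <host> compact <keyspace> playlist_head

It merges every SSTable for the column family into one, which is why the redundant tombstones finally went away.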
55. #Cassandra13
Zombie tombstones
• Ran major compaction manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds again?!?
56. #Cassandra13
Repair vs major compactions
A repair between the major compactions "resurrected" the tombstones :(
New solution:
• Repairs during Monday-Friday
• Major compaction Saturday-Sunday
A (by now) well-known Cassandra anti-pattern:
Don't use Cassandra to store queues
57. #Cassandra13
Cassandra counters
• There are lots of places in the Spotify UI where we count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that sounds suitable
• Is this awesome?
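For reference, counters live in their own column family and are bumped with incr; a minimal cassandra-cli sketch with hypothetical names:

    create column family follower_counts
      with default_validation_class = CounterColumnType
      and comparator = UTF8Type;

    incr follower_counts[utf8('some-playlist-id')][utf8('followers')] by 1;
    get follower_counts[utf8('some-playlist-id')];

A counter column family can only hold counters, which is why it gets a column family of its own.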
60. #Cassandra13
How not to fail
• Treat Cassandra as a utility belt
• Flash
Lots of one-off solutions:
• Weekly major compactions
• Delete all SSTables and recreate from scratch every day
• Memlock frequently used SSTables in RAM
61. #Cassandra13
Lessons
• Cassandra read performance is heavily dependent on the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns make read performance slowly decrease
• Making benchmarks close to useless
62. #Cassandra13
Lessons
• Avoid repeatedly writing data to the same row over very long spans of time
• Avoid deleting data
• If you're working at scale, you'll need to know how Cassandra works under the hood
• nodetool cfhistograms is your friend
63. #Cassandra13
Lessons
• There are still various esoteric problems with large-scale Cassandra installations
• Debugging them is really interesting
• If you agree with the above statements, you should totally come work with us