Cassandra EU 2012 - Putting the X Factor into Cassandra

PUTTING THE X FACTOR INTO
CASSANDRA:
ADVENTURES IN COUNTING
MALCOLM BOX, CTO, TELLYBUG

1

INTRO
Malcolm Box, CTO & Co-Founder

@malcolmbox

malcolm@tellybug.com

http://tellybug.com

2

WHAT I’M TALKING ABOUT
How we started using Cassandra

How we use it to power the X Factor and Britain’s Got Talent apps

Counting - harder than you might think

What we learnt along the way

4

THE CHALLENGE
10-12 Million people watching these shows

TV tells them to buzz/clap/score....

....Servers melt

Design goals to handle 10K interactions/s

5

ROLL BACK 1 YEAR
We’d won BGT 2011 - our ﬁrst big talent show

Existing MySQL/Django/Python stack

Back of envelope calculations....oh dear

Needed something quickly that could cope with anticipated load

6

OUR FIRST CASSANDRA SCHEMA
create column family vote_log with comment='Log of votes'

and comparator='UTF8Type' and key_validation_class='UUIDType'

and default_validation_class='UTF8Type'

and column_metadata = [

{column_name:'ipaddr', validation_class:'AsciiType'},

{column_name:'poll', validation_class:'LongType'},

{column_name:'choice', validation_class:'LongType'},

{column_name:'idtoken', validation_class:'UTF8Type'},

{column_name:'count', validation_class:'LongType'}];

7

WHAT WE LEARNT
Cassandra scales beautifully for writes

Cassandra has no single point of failure

....but it’s not hard to make it fail

Ad-hoc questions and reporting were going to be much slower

8

OPERATIONS
BGT 2011 was a write only DB

Ignored failures

One cluster, one AZ

Backup to MySQL

9

Over 1 Million app
downloads

Over 260 Million boos/claps

X FACTOR 2011

10

IMPLEMENTING X FACTOR WITH CASSANDRA
Counting

Social network

No longer write-only

11

WHAT ARE MY FRIENDS DOING?
Scale makes this hard

10K changes/s

Which ones are relevant to which users?

When new users (and their social graph) can arrive at any time

12

SOLUTION
New Column Family - user activity

Maps user to their interactions

Write problem nicely randomised and thus ideal for Cassandra

Read problem!

13

COUNTING - HARDER THAN IT LOOKS
Everyone can count

But we need to count really fast

And distribute the results to all the clients

14

DISTRIBUTED COUNTING
“Memcache does counters”

“OK, how about sharding?”

“Well, I hear Cassandra 0.8 has counters”

15

ASIDE - THINGS THAT CAN’T COUNT #3
cache.set('key', 1)

cache.decr('key', 1)

>>> 0L

cache.decr('key', 1)

>>> 0L

cache.incr('key', -1)

>>> 4294967295L

cache.incr('key', 1)

>>> 4294967296L

16

SINGLE BOX LIMITS

17

SINGLE BOX LIMITS
We have a single value

17

SINGLE BOX LIMITS

Everything needs to read and write that value -
from multiple servers

17

SINGLE BOX LIMITS

Everything needs to read and write that value -
from multiple servers

EC2 limits

Single Memcache server runs out of
network I/O

What then?

17

CASSANDRA HAS COUNTERS
New (at the time) feature in Cassandra 0.8

Special column type - CounterColumnType as the validator

Distributed 64 bit counter, with eventual consistency

CL.ONE writes recommended to avoid implicit reads impacting performance

Reads tot up values from replicas to give value

Simple functionality

incr()/decr(), get()

18

CAN CASSANDRA COUNT?

19

Yes, But....

19

Yes, But....

Performance can be an issue

Switch off replicate_on_write, tune RF & cluster size

19

Yes, But....



Not scalable for single counter

Scales as function of RF up to 4 nodes

Above that ... you’re out of luck

Best we achieved is ~10K/s increments to single counter value on EC2 m1.large instances

19

Yes, But....



Not scalable for single counter

Scales as function of RF up to 4 nodes

Above that ... you’re out of luck

Best we achieved is ~10K/s increments to single counter value on EC2 m1.large instances

What do you do if an operation fails?

19

COUNTING AT SCALE WITH CASSANDRA
Write throughput to a single counter is limited

We were inside the performance limit, so writes could go to Cassandra

No way to scale within Cassandra (yet)

Reads have a serious performance overhead

We used sharded counters in memcached with source of truth in Cassandra

Few reads from Cass = much more predictable performance

20

OPERATIONS
Cassandra GUIs & mgmt consoles still in infancy

Hard to ﬁgure out what is going wrong when performance suffered

Analytics (and backup) still via dump to MySQL

Flexible, well understood

Single cluster, single AZ

21

WHERE WE WERE AFTER X FACTOR
Cassandra as a source of truth in production

Mainly write load

Memcached layer on top

Simple operations

No backups :(

22

BEYOND X FACTOR
Dancing on Ice - harder counting

Britain’s Got Talent 2012 - more social

Backups

Data integrity

23

DATA CONSISTENCY
There’s no referential integrity

So is the data in the database self-consistent?

Or do you have a bug somewhere?

How do you validate the data?

Truth + 1

24

BACKUPS
Backing up a cluster isn’t easy

Restoring can be harder...

25

CONCLUSION
Cassandra saved our bacon :)

Scales to insane write loads

Reads are easier to scale in memcached

Beware of limitations on “hot” values

Migrating functionality gradually let us learn the operational aspects

There are lots of interesting failure scenarios at scale

26

TODO
Scale-up/Scale-down of a cluster

Better monitoring and operations

Analytics using Hadoop

27

ANY QUESTIONS?

We’re hiring - if you want to work on wicked scaling problems and reach millions of
users, get in touch!

malcolm@tellybug.com

@malcolmbox

28

Cassandra EU 2012 - Putting the X Factor into Cassandra

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a Cassandra EU 2012 - Putting the X Factor into Cassandra

Semelhante a Cassandra EU 2012 - Putting the X Factor into Cassandra (20)

Mais de Acunu

Mais de Acunu (20)

Último

Último (20)

Cassandra EU 2012 - Putting the X Factor into Cassandra

Notas do Editor