Learning Cassandra

Dave Gardner
@davegardnerisme

What I’m going to cover

• How to NoSQL
• Cassandra basics (Dynamo and
Bigtable)
• How to use the data model in
real life

How to NoSQL

1. Find data store that doesn’t use SQL
2. Anything
3. Cram all the things into it
4. Triumphantly blog this success
5. Complain a month later when it
bursts into flames
http://www.slideshare.net/rbranson/how-do-i-cassandra/4

Choosing NoSQL

“NoSQL DBs trade off traditional
features to better support new and
emerging use cases”

http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems

Choosing Cassandra: Tradeoffs

Trading more widely used, tested and
documented software
(MySQL first OS release 1998)

for a relatively immature product
(Cassandra first open-sourced in 2008)

Choosing Cassandra: Tradeoffs

Trading ad-hoc querying
(SQL join, group by, having, order)

for a rich data model with limited
ad-hoc querying ability
(Cassandra makes you denormalise)

Choosing NoSQL

“they say … I can’t decide between this project and
this project even though they look nothing like each
other. And the fact that you can’t decide indicates that
you don’t actually have a problem that requires
them.”

Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip

What do we get in return?

Proven horizontal scalability

Cassandra scales reads and writes
linearly as new nodes are added

Netflix benchmark: linear scaling

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html


High availability

Cassandra is fault-resistant with
tunable consistency levels


Low latency, solid
performance

Cassandra has very good write
performance

Performance benchmark *

http://blog.cubrid.org/dev-platform/nosql-benchmarking/

* Add a pinch of salt


Operational simplicity

Homogeneous cluster, no “master”
node, no SPOF


Rich data model

Cassandra is more than simple key-
value – columns, composites,
counters, secondary indexes

How to NoSQL version 2

Learn about each solution

• What tradeoffs are you making?
• How is it designed?
• What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html

Amazon Dynamo + Google Big Table

Amazon Dynamo          Google Big Table
Consistent hashing     Columnar
Vector clocks *        SSTable storage
Gossip protocol        Append-only
Hinted handoff         Memtable
Read repair            Compaction

* not in Cassandra

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
http://labs.google.com/papers/bigtable-osdi06.pdf

The dynamo paper
[Diagram: a client and a ring of six nodes, #1 to #6; tokens are integers from 0 to 2^127]

The dynamo paper
[Diagram: a client, a coordinator node and a ring of six nodes, #1 to #6; labelled "consistent hashing"]
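As a rough illustration of the ring above, the sketch below hashes a row key to a token between 0 and 2^127 and walks clockwise to pick the replica nodes. The use of MD5, the node names, the even token spacing and the helper names are assumptions made for this example, not Cassandra's actual implementation.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # tokens are integers from 0 to 2^127


def token_for(key):
    """Hash a row key onto the ring (MD5 is an assumption for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def replicas_for(key, ring, rf):
    """Return the rf nodes found by walking clockwise from the key's token.

    `ring` is a list of (token, node_name) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Six evenly spaced hypothetical nodes, mirroring the six-node ring diagram.
RING = sorted((i * TOKEN_SPACE // 6, f"node{i + 1}") for i in range(6))

if __name__ == "__main__":
    print(replicas_for("f97be9cc-5255-4578-8813-76701c0945bd", RING, 3))
```

Any node can do this lookup, which is why any node can act as coordinator for a client request.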

Consistency levels

How many replicas must respond to
declare success?

Consistency levels: read operations

Level           Description
ONE             1st response
QUORUM          N/2 + 1 replicas
LOCAL_QUORUM    N/2 + 1 replicas in local data centre
EACH_QUORUM     N/2 + 1 replicas in each data centre
ALL             All replicas

http://wiki.apache.org/cassandra/API#Read

Consistency levels: write operations

Level           Description
ANY             One node, including hinted handoff
ONE             One node
QUORUM          N/2 + 1 replicas
LOCAL_QUORUM    N/2 + 1 replicas in local data centre
EACH_QUORUM     N/2 + 1 replicas in each data centre
ALL             All replicas

http://wiki.apache.org/cassandra/API#Write
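The N/2 + 1 rule uses integer division, so with N = 3 replicas a quorum is 2. The sketch below works that out and also checks the usual overlap condition (read replicas + write replicas > N) under which a read is guaranteed to touch at least one replica that saw the latest write. The function names are made up for this example, and ANY and the data-centre-aware levels are deliberately left out.

```python
def quorum(n):
    """N/2 + 1 replicas, using integer division."""
    return n // 2 + 1


def replicas_required(level, n):
    """How many of the n replicas must respond at a given level.

    Only ONE, QUORUM and ALL are modelled in this sketch.
    """
    table = {"ONE": 1, "QUORUM": quorum(n), "ALL": n}
    if level not in table:
        raise ValueError(f"level not modelled in this sketch: {level}")
    return table[level]


def read_overlaps_write(read_level, write_level, n):
    """True when every read set must share a replica with every write set."""
    needed = replicas_required(read_level, n) + replicas_required(write_level, n)
    return needed > n


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, quorum(n))
    print(read_overlaps_write("QUORUM", "QUORUM", 3))
    print(read_overlaps_write("ONE", "ONE", 3))
```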

The dynamo paper
[Diagram: a client, a coordinator node and a ring of six nodes, #1 to #6; RF = 3, CL = One]

The dynamo paper
[Diagram: a client, a coordinator node and a ring of six nodes, #1 to #6; RF = 3, CL = Quorum]

The dynamo paper
[Diagram: a client, a coordinator node and a ring of six nodes, #1 to #6; RF = 3, CL = One, plus a hint]

The dynamo paper
[Diagram: a client, a coordinator node and a ring of six nodes, #1 to #6; RF = 3, CL = One, with read repair]
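Read repair can be pictured as: compare what the replicas returned, answer with the newest version, and push that version back to any replica that was behind. The sketch below is a toy model under that reading; the replica names and the (value, timestamp) response shape are assumptions for illustration, and it ignores how replicas are actually compared in practice.

```python
def read_with_repair(responses):
    """Pick the newest value and list the replicas that need repairing.

    `responses` maps a replica name to a (value, timestamp) pair.
    Returns (newest_value, sorted_list_of_stale_replicas).
    """
    newest_value, newest_ts = max(responses.values(), key=lambda vt: vt[1])
    stale = sorted(
        name for name, (_, ts) in responses.items() if ts < newest_ts
    )
    return newest_value, stale


if __name__ == "__main__":
    responses = {
        "node1": ("bar", 1000),
        "node2": ("zing", 1001),
        "node3": ("bar", 1000),
    }
    print(read_with_repair(responses))
```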

The big table paper

• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829

The big table paper

[Diagram: a column is a name and a value, plus a timestamp]

The big table paper

[Diagram: three columns side by side, each with a name and a value]

We can have millions of columns *

* theoretically up to 2 billion

The big table paper

[Diagram: a row is a row key followed by its columns, each with a name and a value]

The big table paper

[Diagram: a column family holds rows, each a row key followed by its columns]

We can have billions of rows

The big table paper

[Diagram: a write goes to the commit log (disk) and the memtable (memory); the memtable is flushed on a time/size trigger to immutable SSTables on disk]
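The write path in the diagram can be mimicked with a toy in-memory model: every write is appended to a commit log and buffered in a memtable, the memtable is flushed as a sorted, immutable SSTable once it reaches a size threshold, and compaction merges SSTables. The class below is only a sketch; its name, the size-only flush trigger and the use of tuples for "disk" are all assumptions for illustration.

```python
class ToyStore:
    """Toy model of commit log, memtable, SSTables and compaction."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted tuples, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Sort the memtable and freeze it as a new SSTable."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one; newer values overwrite older ones."""
        if not self.sstables:
            return
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]
```

Before compaction a read may have to look in several SSTables; afterwards the same key is found in one.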

Data model basics: conflict resolution

Per-column timestamp-based conflict
resolution
{                     {
  column: foo,          column: foo,
  value: bar,           value: zing,
  timestamp: 1000       timestamp: 1001
}                     }

http://cassandra.apache.org/

Data model basics: conflict resolution

Per-column timestamp-based conflict
resolution
{                     {
  column: foo,          column: foo,
  value: bar,           value: zing,
}                     }

The column with the bigger timestamp (value: zing) wins
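A minimal sketch of per-column, timestamp-based resolution, using the foo/bar/zing columns with timestamps 1000 and 1001 from the earlier slide. The dictionary layout and function names are assumptions for illustration, and how this sketch treats equal timestamps is arbitrary.

```python
def resolve(a, b):
    """Return whichever version of a column has the bigger timestamp.

    On equal timestamps this sketch simply returns b.
    """
    return a if a["timestamp"] > b["timestamp"] else b


def merge_rows(row_a, row_b):
    """Merge two versions of a row, resolving each column independently.

    Each row maps a column name to {"value": ..., "timestamp": ...}.
    """
    merged = dict(row_a)
    for name, column in row_b.items():
        merged[name] = resolve(merged[name], column) if name in merged else column
    return merged


if __name__ == "__main__":
    older = {"foo": {"value": "bar", "timestamp": 1000}}
    newer = {"foo": {"value": "zing", "timestamp": 1001}}
    print(merge_rows(older, newer))
```

Because resolution is per column, two writers touching different columns of the same row do not conflict at all.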


Data model basics: column ordering

Columns ordered at time of writing,
according to Column Family schema
{                     {
  column: zebra,        column: badger,
  value: foo,           value: foo,
}                     }


Data model basics: column ordering

Columns ordered at time of writing,
according to Column Family schema
{
  badger: foo,
  zebra: foo
}
with AsciiType column schema
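To picture "ordered at time of writing": each insert places the column at its sorted position, so reads never need to sort. The sketch below uses Python's plain string comparison as a stand-in for an ASCII comparator; the class and method names are hypothetical.

```python
import bisect


class OrderedRow:
    """Toy row that keeps its column names sorted as they are written."""

    def __init__(self):
        self.names = []    # column names, always kept in sorted order
        self.values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)  # place the name in sorted position
        self.values[name] = value

    def columns(self):
        """Return (name, value) pairs in stored order; no sorting at read time."""
        return [(name, self.values[name]) for name in self.names]


if __name__ == "__main__":
    row = OrderedRow()
    row.insert("zebra", "foo")
    row.insert("badger", "foo")
    print(row.columns())
```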


Key point

Each “query” can be answered from a
single slice of disk

(once compaction has finished)

Data modeling – 1000ft introduction

• Start from your queries and work
backwards
• Denormalise in the application
(store data more than once)

http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906

Pattern 1: not using the value

Storing that user X is in bucket Y

Row key: f97be9cc-5255-457…
Column name: foo
Value: 1 (we don’t really care about this)

https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58


Q: is user X in bucket foo?
A: single column fetch

f97be9cc-5255-4578-8813-76701c0945bd
  bar: 1
  foo: 1
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
  baz: 1
  zoo: 1
503778bc-246f-4041-ac5a-fd944176b26d
  aaa: 1

Q: which buckets is user X in?
A: column slice fetch

f97be9cc-5255-4578-8813-76701c0945bd
  bar: 1
  foo: 1
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
  baz: 1
  zoo: 1
503778bc-246f-4041-ac5a-fd944176b26d
  aaa: 1
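The two questions map onto two access shapes, modelled here with nested dictionaries standing in for the column family (row key, then column name, then value). The data is taken from the slides; the function names are made up for this sketch, and no real client library is involved.

```python
# Row key -> {column name: value}; the value is a placeholder we ignore.
USERS = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
    "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
}


def is_in_bucket(user, bucket):
    """Single column fetch: does this row have a column with this name?"""
    return bucket in USERS.get(user, {})


def buckets_for(user):
    """Column slice fetch: every column name in the row, in sorted order."""
    return sorted(USERS.get(user, {}))


if __name__ == "__main__":
    user = "f97be9cc-5255-4578-8813-76701c0945bd"
    print(is_in_bucket(user, "foo"))
    print(buckets_for(user))
```

Both answers come from a single row, which is the point of the pattern: the column name carries the information and the value is never read.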


We could also use expiring columns to
automatically delete columns N seconds
after insertion

UPDATE users
USING TTL 3600
SET 'foo' = 1
WHERE KEY =
'f97be9cc-5255-4578-8813-76701c0945bd'

Pattern 2: counters

Real-time analytics to count
clicks/impressions of ads in hourly
buckets

Row key: 1
Column name: 2011103015-click
Value: 34

https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php

Pattern 2: counters

Increment by 1 using CQL

UPDATE ads
SET '2011103015-impression'
= '2011103015-impression' + 1
WHERE KEY = '1'

Pattern 2: counters

Q: how many clicks/impressions for ad 1
over time range?
A: column slice fetch, between column X and Y

1
  2011103015-click: 1
  2011103015-impression: 3434
  2011103016-click: 12
  2011103016-impression: 5411
  2011103017-click: 2
  2011103017-impression: 345
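Because the column names start with the hour bucket, they sort chronologically, and a time range becomes a contiguous slice between two column names. The sketch below reproduces that with a sorted list and binary search; the counts are the ones on the slide, while the function names and the filter-by-suffix step are assumptions for illustration.

```python
import bisect

# Columns for ad 1, already in sorted (column name) order.
AD_1 = [
    ("2011103015-click", 1),
    ("2011103015-impression", 3434),
    ("2011103016-click", 12),
    ("2011103016-impression", 5411),
    ("2011103017-click", 2),
    ("2011103017-impression", 345),
]


def column_slice(columns, start, finish):
    """Return the columns whose names fall between start and finish, inclusive."""
    names = [name for name, _ in columns]
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return columns[lo:hi]


def total(columns, start, finish, kind):
    """Sum the counters of one kind ("click" or "impression") within a slice."""
    return sum(
        count
        for name, count in column_slice(columns, start, finish)
        if name.endswith("-" + kind)
    )


if __name__ == "__main__":
    print(total(AD_1, "2011103015-click", "2011103016-impression", "click"))
```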

Pattern 3: time series

Store canonical reference of impressions
and clicks

Row key: 20111030
Column name: <time UUID> (Cassandra can order columns by time)
Value: {json}

http://rubyscale.com/2011/basic-time-series-with-cassandra/

Pattern 4: object properties as columns

Store user properties such as name,
email, etc.

Row key: f97be9cc-5255-457…
Column name: name
Value: Bob Foo-Bar

http://www.wehaveyourkidneys.com/adPerformance.php?ad=1

Anti-pattern 1: read-before-write

Instead store as independent columns
and mutate individually

(see pattern 4)
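One way to see why read-before-write hurts: if a whole object is stored as a single serialised value, two writers that both read it and then write it back can silently undo each other's change, whereas independent columns can each be written without reading first. The sketch below simulates both cases with plain dictionaries; the property values other than the name from pattern 4 are invented for the example.

```python
import json


def blob_updates():
    """Anti-pattern: read the whole object, change one field, write it all back."""
    store = {"user": json.dumps({"name": "Bob Foo-Bar", "email": "old@example.com"})}
    # Two writers read the same snapshot before either writes.
    snapshot_a = json.loads(store["user"])
    snapshot_b = json.loads(store["user"])
    snapshot_a["name"] = "Robert Foo-Bar"
    store["user"] = json.dumps(snapshot_a)
    snapshot_b["email"] = "new@example.com"
    store["user"] = json.dumps(snapshot_b)  # overwrites writer A's name change
    return json.loads(store["user"])


def column_updates():
    """Independent columns: each writer mutates only its own column, no read."""
    store = {"user": {"name": "Bob Foo-Bar", "email": "old@example.com"}}
    store["user"]["name"] = "Robert Foo-Bar"
    store["user"]["email"] = "new@example.com"
    return store["user"]


if __name__ == "__main__":
    print(blob_updates())
    print(column_updates())
```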

Anti-pattern 2: super columns

Friends don’t let friends use super
columns.

http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/

Anti-pattern 3: OPP

The Order Preserving Partitioner
unbalances your load and makes your
life harder

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Recap: Data modeling

• Think about the queries, work
backwards
• Don’t overuse single rows; try to
spread the load
• Don’t use super columns
• Ask on IRC! #cassandra

There’s more: Brisk

Integrated Hadoop distribution (without
HDFS installed). Run Hive and Pig queries
directly against Cassandra

DataStax offer this functionality in their
“Enterprise” product

http://www.datastax.com/products/enterprise

Hive: SQL-like interface to Hadoop

CREATE EXTERNAL TABLE tempUsers
(userUuid string, segmentId string, value string)
STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.columns.mapping" = ":key,:column,:value",
"cassandra.cf.name" = "users"
);

SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;

In conclusion

• Cassandra is founded on sound design
principles
• The data model is incredibly powerful
• CQL and a new breed of clients are
making it easier to use
• Hadoop integration means we can
analyse data directly from a Cassandra
cluster
• There is a strong community and
multiple companies offering
professional support

Thanks
looking for a job?

Learn more about Cassandra
meetup.com/Cassandra-London
Sample ad-targeting project on Github
https://github.com/davegardnerisme/we-have-your-kidneys

Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations
