2. What I’m going to cover
• How to NoSQL
• Cassandra basics (dynamo and
big table)
• How to use the data model in
real life
3. How to NoSQL
1. Find data store that doesn’t use SQL
2. Anything
3. Cram all the things into it
4. Triumphantly blog this success
5. Complain a month later when it
bursts into flames
http://www.slideshare.net/rbranson/how-do-i-cassandra/4
4. Choosing NoSQL
“NoSQL DBs trade off traditional
features to better support new and
emerging use cases”
http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
5. Choosing Cassandra: Tradeoffs
More widely used, tested and
documented software
MySQL first OS release 1998
For a relatively immature product
Cassandra first open-sourced in 2008
6. Choosing Cassandra: Tradeoffs
Ad-hoc querying
SQL join, group by, having, order
For a rich data model with limited
ad-hoc querying ability
Cassandra makes you denormalise
7. Choosing NoSQL
“they say … I can’t decide between this project and
this project even though they look nothing like each
other. And the fact that you can’t decide indicates that
you don’t actually have a problem that requires
them.”
Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
8. What do we get in return?
Proven horizontal scalability
Cassandra scales reads and writes
linearly as new nodes are added
9. Netflix benchmark: linear scaling
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
10. What do we get in return?
High availability
Cassandra is fault-resistant with
tunable consistency levels
11. What do we get in return?
Low latency, solid
performance
Cassandra has very good write
performance
12. Performance benchmark *
http://blog.cubrid.org/dev-platform/nosql-benchmarking/
* Add a pinch of salt
13. What do we get in return?
Operational simplicity
Homogeneous cluster, no “master”
node, no SPOF
14. What do we get in return?
Rich data model
Cassandra is more than simple key-
value – columns, composites,
counters, secondary indexes
15. How to NoSQL version 2
Learn about each solution
• What tradeoffs are you making?
• How is it designed?
• What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
16. Amazon Dynamo + Google Big Table
Amazon Dynamo:
• Consistent hashing
• Vector clocks *
• Gossip protocol
• Hinted handoff
• Read repair
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Google Big Table:
• Columnar
• SSTable storage
• Append-only
• Memtable
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
* not in Cassandra
17. The dynamo paper
[Diagram: a ring of six nodes, #1 to #6, with a client outside the ring. Tokens are integers from 0 to 2^127.]
18. The dynamo paper
[Diagram: the six-node ring, #1 to #6, with the client connected to a coordinator node. Label: consistent hashing.]
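The ring on these two slides can be sketched in a few lines of Python. This is a minimal sketch, not Cassandra's implementation: it assumes an MD5-derived token folded into the 0 to 2^127 range, six evenly spaced nodes, and replicas placed on the next nodes clockwise. The names `token_for`, `Ring` and `replicas_for` are made up for illustration.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # tokens are integers from 0 to 2^127


def token_for(key: str) -> int:
    """Hash a row key to a token (assumption: MD5 folded into the token space)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


class Ring:
    def __init__(self, node_names):
        # Give each node an evenly spaced token around the ring.
        step = TOKEN_SPACE // len(node_names)
        self.nodes = list(node_names)
        self.tokens = [i * step for i in range(len(self.nodes))]

    def replicas_for(self, key: str, rf: int = 3):
        """Return the rf nodes holding key, walking clockwise from its token."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(rf)]


ring = Ring(["#1", "#2", "#3", "#4", "#5", "#6"])
print(ring.replicas_for("f97be9cc-5255-4578-8813-76701c0945bd"))
```

Any node can act as coordinator, because every node can compute the same answer from the key alone.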
20. Consistency levels: read operations
Level          Description
ONE            1st response
QUORUM         N/2 + 1 replicas
LOCAL_QUORUM   N/2 + 1 replicas in local data centre
EACH_QUORUM    N/2 + 1 replicas in each data centre
ALL            All replicas
http://wiki.apache.org/cassandra/API#Read
21. Consistency levels: write operations
Level          Description
ANY            One node, including hinted handoff
ONE            One node
QUORUM         N/2 + 1 replicas
LOCAL_QUORUM   N/2 + 1 replicas in local data centre
EACH_QUORUM    N/2 + 1 replicas in each data centre
ALL            All replicas
http://wiki.apache.org/cassandra/API#Write
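The quorum arithmetic in both tables can be checked with a short sketch. It assumes N is the replication factor and that N/2 means integer division; the helper names are illustrative. The second function encodes the usual rule of thumb that a read is guaranteed to overlap the latest write when R + W > N.

```python
def quorum(replication_factor: int) -> int:
    """QUORUM = N/2 + 1, using integer division."""
    return replication_factor // 2 + 1


def read_overlaps_write(n: int, r: int, w: int) -> bool:
    """True when every read replica set must share a node with every write set."""
    return r + w > n


for rf in (1, 3, 5, 6):
    print(f"RF={rf}: quorum is {quorum(rf)} replicas")

# With RF = 3, QUORUM reads plus QUORUM writes overlap; ONE plus ONE does not.
print(read_overlaps_write(3, quorum(3), quorum(3)))  # True
print(read_overlaps_write(3, 1, 1))                  # False
```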
22. The dynamo paper
[Diagram: the six-node ring with client and coordinator. RF = 3, CL = One.]
23. The dynamo paper
[Diagram: the six-node ring with client and coordinator. RF = 3, CL = Quorum.]
24. The dynamo paper
[Diagram: the six-node ring with client and coordinator. RF = 3, CL = One. Label: + hint.]
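Hinted handoff can be sketched as a coordinator that remembers writes for replicas it cannot reach. This is a toy model under stated assumptions: replicas are plain dictionaries, failure detection is a hand-maintained set, and the `Coordinator` class and its methods are hypothetical names, not Cassandra's API.

```python
from collections import defaultdict


class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas        # node name -> dict acting as its store
        self.down = set()               # nodes currently unreachable
        self.hints = defaultdict(list)  # node name -> writes waiting for it

    def write(self, targets, key, value):
        """Write to each target replica; keep a hint for any that is down.

        Returns the number of replicas that acknowledged the write, which the
        caller compares against the consistency level.
        """
        acks = 0
        for node in targets:
            if node in self.down:
                self.hints[node].append((key, value))
            else:
                self.replicas[node][key] = value
                acks += 1
        return acks

    def node_recovered(self, node):
        """Replay stored hints once the node is reachable again."""
        self.down.discard(node)
        for key, value in self.hints.pop(node, []):
            self.replicas[node][key] = value


coordinator = Coordinator({"#2": {}, "#3": {}, "#4": {}})
coordinator.down.add("#3")
acks = coordinator.write(["#2", "#3", "#4"], "foo", "bar")
print(acks >= 1)                  # CL = One is satisfied by the two live replicas
coordinator.node_recovered("#3")
print(coordinator.replicas["#3"])  # the hinted write has been delivered
```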
25. The dynamo paper
[Diagram: the six-node ring with client and coordinator. RF = 3, CL = One. Label: read repair.]
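Read repair can be sketched the same way. The version below is simplified on purpose: it contacts every replica synchronously and repairs before returning, whereas a real CL = One read would answer from the first response and repair afterwards. The `read_with_repair` function and the dictionary-based replicas are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

Column = Tuple[str, int]  # (value, timestamp)


def read_with_repair(replicas: Dict[str, Dict[str, Column]], name: str) -> Optional[str]:
    """Read one column from every replica, return the newest value, and push
    that value to any replica holding an older or missing copy."""
    copies = {node: store.get(name) for node, store in replicas.items()}
    present = [copy for copy in copies.values() if copy is not None]
    if not present:
        return None
    newest = max(present, key=lambda copy: copy[1])  # biggest timestamp
    for node, copy in copies.items():
        if copy != newest:
            replicas[node][name] = newest  # the "repair" write
    return newest[0]


replicas = {
    "#2": {"foo": ("zing", 1001)},
    "#3": {"foo": ("bar", 1000)},  # stale copy
    "#4": {},                      # missed the write entirely
}
print(read_with_repair(replicas, "foo"))  # zing
print(replicas["#3"], replicas["#4"])     # both now hold ('zing', 1001)
```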
26. The big table paper
• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
28. The big table paper
[Diagram: three columns side by side, each a name/value pair.]
We can have millions of columns *
* theoretically up to 2 billion
29. The big table paper
[Diagram: a row is a row key followed by its columns, each a name/value pair.]
30. The big table paper
Column Family
Row Key Column Column Column
Row Key Column Column Column
Row Key Column Column Column
we can have billions of rows
31. The big table paper
[Diagram: a write goes to the commit log (on disk) and to the memtable (in memory). The memtable is flushed to disk on a time/size trigger, producing immutable SSTables.]
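The write path on this slide can be sketched as a small in-memory model. This is an illustration under assumptions, not Cassandra's storage engine: the flush trigger is a simple entry count, SSTables are sorted tuples held in a list, and the `ColumnFamilyStore` class and its method names are hypothetical.

```python
class ColumnFamilyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted tables, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Sort the memtable and freeze it as a new SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        """Check the memtable, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, letting newer values win."""
        merged = {}
        for table in self.sstables:  # oldest first, so later tables overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


store = ColumnFamilyStore()
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 9), ("d", 4), ("e", 5)]:
    store.write(key, value)
print(len(store.sstables), store.read("a"))  # 2 9
store.compact()
print(store.sstables)
```

Before compaction a read may have to look in several SSTables; afterwards the row lives in one sorted run.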
32. Data model basics: conflict resolution
Per-column timestamp-based conflict
resolution
{
  column: foo,
  value: bar,
  timestamp: 1000
}
{
  column: foo,
  value: zing,
  timestamp: 1001
}
http://cassandra.apache.org/
33. Data model basics: conflict resolution
Per-column timestamp-based conflict
resolution
{
  column: foo,
  value: bar,
  timestamp: 1000
}
{
  column: foo,
  value: zing,
  timestamp: 1001 (bigger timestamp)
}
http://cassandra.apache.org/
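The rule on these two slides is small enough to write down directly. This is a minimal sketch using the slide's own values; the `Column` dataclass and `resolve` function are illustrative names, and the handling of equal timestamps (the second argument wins) is a simplification of this sketch rather than a statement about Cassandra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int


def resolve(a: Column, b: Column) -> Column:
    """Return the winning version of a column: the bigger timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only resolve two versions of the same column")
    # Simplification: on a tie this sketch just returns b.
    return a if a.timestamp > b.timestamp else b


old = Column("foo", "bar", 1000)
new = Column("foo", "zing", 1001)
print(resolve(old, new).value)  # zing
print(resolve(new, old).value)  # zing, argument order does not matter
```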
34. Data model basics: column ordering
Columns ordered at time of writing,
according to Column Family schema
{
  column: zebra,
  value: foo,
  timestamp: 1000
}
{
  column: badger,
  value: foo,
  timestamp: 1001
}
http://cassandra.apache.org/
35. Data model basics: column ordering
Columns ordered at time of writing,
according to Column Family schema
{
  badger: foo,
  zebra: foo
}
with AsciiType column schema
http://cassandra.apache.org/
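Ordering at write time can be sketched with a row that keeps its column names sorted as they arrive. This assumes plain string comparison as a stand-in for an AsciiType comparator; the `Row` class and its methods are hypothetical names for illustration.

```python
import bisect


class Row:
    """A row whose columns stay sorted by name as they are written."""

    def __init__(self):
        self._names = []   # column names, always in sorted order
        self._values = {}  # column name -> value

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)  # place the name in sorted position
        self._values[name] = value

    def columns(self):
        """All columns in comparator order."""
        return [(name, self._values[name]) for name in self._names]

    def slice(self, start, finish):
        """Columns with start <= name <= finish, read as one contiguous run."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


row = Row()
row.insert("zebra", "foo")   # written first
row.insert("badger", "foo")  # written second, but sorts first
print(row.columns())         # [('badger', 'foo'), ('zebra', 'foo')]
print(row.slice("a", "c"))   # [('badger', 'foo')]
```

Because the names are already in order, a slice is a contiguous range rather than a scan.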
36. Key point
Each “query” can be answered from a
single slice of disk
(once compaction has finished)
37. Data modeling – 1000ft introduction
• Start from your queries and work
backwards
• Denormalise in the application
(store data more than once)
http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906
38. Pattern 1: not using the value
Storing that user X is in bucket Y
Row key: f97be9cc-5255-457…
Column name: foo
Value: 1 (we don’t really care about this)
https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
39. Pattern 1: not using the value
Q: is user X in bucket foo?
A: single column fetch
f97be9cc-5255-4578-8813-76701c0945bd
  bar: 1
  foo: 1
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
  baz: 1
  zoo: 1
503778bc-246f-4041-ac5a-fd944176b26d
  aaa: 1
40. Pattern 1: not using the value
Q: which buckets is user X in?
A: column slice fetch
f97be9cc-5255-4578-8813-76701c0945bd
  bar: 1
  foo: 1
06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
  baz: 1
  zoo: 1
503778bc-246f-4041-ac5a-fd944176b26d
  aaa: 1
41. Pattern 1: not using the value
We could also use expiring columns to
automatically delete columns N seconds
after insertion
UPDATE users
USING TTL 3600
SET 'foo' = 1
WHERE KEY =
'f97be9cc-5255-4578-8813-76701c0945bd'
42. Pattern 2: counters
Real-time analytics to count
clicks/impressions of ads in hourly
buckets
Row key: 1
Column name: 2011103015-click
Value: 34
https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
43. Pattern 2: counters
Increment by 1 using CQL
UPDATE ads
SET '2011103015-impression'
= '2011103015-impression' + 1
WHERE KEY = '1'
44. Pattern 2: counters
Q: how many clicks/impressions for ad 1
over time range?
A: column slice fetch, between column X and Y
1
  2011103015-click: 1
  2011103015-impression: 3434
  2011103016-click: 12
  2011103016-impression: 5411
  2011103017-click: 2
  2011103017-impression: 345
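The hourly bucket naming and the time-range question can be sketched on the client side. The column names and counts below are taken from the slide; the `bucket` and `total` helpers are illustrative, and the filtering over a dictionary stands in for the column slice that Cassandra would perform.

```python
from datetime import datetime


def bucket(when: datetime, event: str) -> str:
    """Column name for an hourly bucket, e.g. 2011103015-click."""
    return f"{when:%Y%m%d%H}-{event}"


def total(row: dict, event: str, start: datetime, finish: datetime) -> int:
    """Sum one event type over the hourly buckets from start to finish inclusive."""
    lo, hi = bucket(start, event), bucket(finish, event)
    suffix = "-" + event
    return sum(
        count
        for name, count in row.items()
        if name.endswith(suffix) and lo <= name <= hi
    )


# Row for ad 1, as shown on the slide.
ad_1 = {
    "2011103015-click": 1,
    "2011103015-impression": 3434,
    "2011103016-click": 12,
    "2011103016-impression": 5411,
    "2011103017-click": 2,
    "2011103017-impression": 345,
}

print(bucket(datetime(2011, 10, 30, 15), "click"))                                   # 2011103015-click
print(total(ad_1, "click", datetime(2011, 10, 30, 15), datetime(2011, 10, 30, 16)))  # 13
```

Zero-padded year, month, day and hour make string order match time order, which is what lets a column slice answer the question.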
45. Pattern 3: time series
Store canonical reference of impressions
and clicks
Row key: 20111030
Column name: <time UUID>
Value: {json}
Cassandra can order columns by time
http://rubyscale.com/2011/basic-time-series-with-cassandra/
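A time UUID carries its own timestamp, which is what lets columns be ordered by time. The sketch below uses Python's standard `uuid` module to show the idea: a version 1 UUID stores a count of 100-nanosecond intervals since 15 October 1582. The `uuid_datetime` and `row_key` helpers are illustrative names, and deriving the daily row key from the UUID is an assumption of this sketch.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals from this date.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid_datetime(u: uuid.UUID) -> datetime:
    """Recover the UTC time embedded in a version 1 (time-based) UUID."""
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


def row_key(u: uuid.UUID) -> str:
    """Daily row key in the slide's format, e.g. 20111030."""
    return f"{uuid_datetime(u):%Y%m%d}"


events = [uuid.uuid1() for _ in range(3)]
ordered = sorted(events, key=lambda u: u.time)  # chronological order

for u in ordered:
    print(row_key(u), uuid_datetime(u).isoformat(), u)
```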
46. Pattern 4: object properties as columns
Store user properties such as name,
email, etc.
Row key: f97be9cc-5255-457…
Column name: name
Value: Bob Foo-Bar
http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
48. Anti-pattern 2: super columns
Friends don’t let friends use super
columns.
http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
49. Anti-pattern 3: OPP
The Order Preserving Partitioner
unbalances your load and makes your
life harder
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
50. Recap: Data modeling
• Think about the queries, work
backwards
• Don’t overuse single rows; try to
spread the load
• Don’t use super columns
• Ask on IRC! #cassandra
51. There’s more: Brisk
Integrated Hadoop distribution (without
HDFS installed). Run Hive and Pig queries
directly against Cassandra
DataStax offer this functionality in their
“Enterprise” product
http://www.datastax.com/products/enterprise
52. Hive: SQL-like interface to Hadoop
CREATE EXTERNAL TABLE tempUsers
(userUuid string, segmentId string, value string)
STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.columns.mapping" = ":key,:column,:value",
"cassandra.cf.name" = "users"
);
SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;
55. In conclusion
CQL and a new breed of
clients are making it easier
to use
56. In conclusion
Hadoop integration means we
can analyse data directly from
a Cassandra cluster
57. In conclusion
There is a strong community
and multiple companies
offering professional support
58. Thanks
looking for a job?
Learn more about Cassandra
meetup.com/Cassandra-London
Sample ad-targeting project on Github
https://github.com/davegardnerisme/we-have-your-kidneys
Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations
Editor's Notes
This is the way that NoSQL is often approached. A light-hearted take on both how people approach NoSQL and, to some extent, the tools themselves.
A better approach is to consider NoSQL in terms of tradeoffs