Learning Cassandra
Dave Gardner
@davegardnerisme
What I’m going to cover


   • How to NoSQL
   • Cassandra basics (dynamo and
     big table)
   • How to use the data model in
     real life
How to NoSQL

 1.    Find data store that doesn’t use SQL
 2.    Anything
 3.    Cram all the things into it
 4.    Triumphantly blog this success
 5.    Complain a month later when it
       bursts into flames
 http://www.slideshare.net/rbranson/how-do-i-cassandra/4
Choosing NoSQL


  “NoSQL DBs trade off traditional
  features to better support new and
  emerging use cases”

  http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
Choosing Cassandra: Tradeoffs


   We give up: more widely used, tested and
   documented software
   (MySQL first OS release 1998)

   In exchange for: a relatively immature product
   (Cassandra first open-sourced in 2008)
Choosing Cassandra: Tradeoffs


   We give up: ad-hoc querying
   (SQL join, group by, having, order)

   In exchange for: a rich data model with limited
   ad-hoc querying ability
   (Cassandra makes you denormalise)
Choosing NoSQL

“they say … I can’t decide between this project and
this project even though they look nothing like each
other. And the fact that you can’t decide indicates that
you don’t actually have a problem that requires
them.”

Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
What do we get in return?


   Proven horizontal scalability

   Cassandra scales reads and writes
   linearly as new nodes are added
Netflix benchmark: linear scaling




  http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
What do we get in return?


   High availability

   Cassandra is fault-resistant with
   tunable consistency levels
What do we get in return?


   Low latency, solid
   performance

   Cassandra has very good write
   performance
Performance benchmark *

  http://blog.cubrid.org/dev-platform/nosql-benchmarking/

  * Add pinch of salt
What do we get in return?


   Operational simplicity

   Homogeneous cluster, no “master”
   node, no SPOF
What do we get in return?


   Rich data model

   Cassandra is more than simple key-
   value – columns, composites,
   counters, secondary indexes
How to NoSQL version 2

 Learn about each solution

 • What tradeoffs are you making?
 • How is it designed?
 • What algorithms does it use?
 http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
Amazon Dynamo            +    Google Big Table

Consistent hashing            Columnar
Vector clocks *               SSTable storage
Gossip protocol               Append-only
Hinted handoff                Memtable
Read repair                   Compaction

* not in Cassandra

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
http://labs.google.com/papers/bigtable-osdi06.pdf
The dynamo paper
 [Diagram: a ring of six nodes, #1 to #6, with a client outside the ring.
 Tokens are integers from 0 to 2^127.]
The dynamo paper
 [Diagram: the six-node ring, #1 to #6. The client talks to a coordinator
 node. Label: consistent hashing.]
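A minimal Python sketch of the consistent hashing step, not Cassandra's actual implementation. It assumes MD5 hashing folded into the 0 to 2^127 token range and six evenly spaced node tokens; the names token_for, evenly_spaced_tokens and owner are made up for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # tokens are integers from 0 to 2**127


def token_for(key):
    """Hash a row key to a token on the ring (MD5 folded into the token range)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def evenly_spaced_tokens(node_count):
    """Give each node an equal share of the ring."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def owner(key, node_tokens):
    """Return the index of the node that owns the key.

    Assumption: a node owns the range that ends at its own token, so we take
    the first node token >= the key's token and wrap around past the last one.
    """
    index = bisect.bisect_left(node_tokens, token_for(key))
    return index % len(node_tokens)


tokens = evenly_spaced_tokens(6)
for key in ("f97be9cc", "06a6f1b0", "503778bc"):
    print(key, "-> node #%d" % (owner(key, tokens) + 1))
```

Because the owner depends only on the key's hash and the node tokens, any node can act as coordinator and work out where a key lives without asking a master.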
Consistency levels

 How many replicas must respond to
 declare success?
Consistency levels: read operations

  Level           Description
  ONE             1st response
  QUORUM          N/2 + 1 replicas
  LOCAL_QUORUM    N/2 + 1 replicas in local data centre
  EACH_QUORUM     N/2 + 1 replicas in each data centre
  ALL             All replicas


 http://wiki.apache.org/cassandra/API#Read
Consistency levels: write operations

  Level           Description
  ANY             One node, including hinted handoff
  ONE             One node
  QUORUM          N/2 + 1 replicas
  LOCAL_QUORUM    N/2 + 1 replicas in local data centre
  EACH_QUORUM     N/2 + 1 replicas in each data centre
  ALL             All replicas

 http://wiki.apache.org/cassandra/API#Write
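The "N/2 + 1" in the tables above can be worked through in a few lines of Python. This is only a sketch: it assumes N is the replication factor, that the division is integer division, and it covers just the single data centre levels. The function names are invented for the example.

```python
def quorum(replication_factor):
    """Replicas needed for QUORUM: N/2 + 1, using integer division."""
    return replication_factor // 2 + 1


def required_responses(level, replication_factor):
    """How many replicas must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError("level not covered by this sketch: " + level)


for rf in (1, 2, 3, 5):
    print("RF=%d quorum=%d" % (rf, quorum(rf)))
```

With RF = 3 a quorum is 2, so a quorum read or write still succeeds with one replica down.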
The dynamo paper
 [Diagram: six-node ring with client and coordinator. RF = 3, CL = One.]
The dynamo paper
 [Diagram: six-node ring with client and coordinator. RF = 3, CL = Quorum.]
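A rough Python sketch of how RF and CL interact on a six-node ring. It assumes replicas are the primary node plus the next nodes clockwise, and that a write succeeds once enough replicas acknowledge it. Nodes are numbered 0 to 5 here, and replicas_for and write_succeeds are hypothetical names.

```python
def replicas_for(primary, node_count, replication_factor):
    """Pick the primary node plus the next nodes clockwise round the ring."""
    return [(primary + step) % node_count for step in range(replication_factor)]


def write_succeeds(primary, node_count, replication_factor, required_acks, down=()):
    """True if enough live replicas acknowledge the write."""
    targets = replicas_for(primary, node_count, replication_factor)
    acks = [node for node in targets if node not in down]
    return len(acks) >= required_acks


# Six nodes, RF = 3: the key's primary node is 4, so replicas wrap round to 0.
print(replicas_for(4, 6, 3))

# CL = One needs 1 ack, CL = Quorum needs 2 acks when RF = 3.
print(write_succeeds(0, 6, 3, required_acks=1, down={1, 2}))
print(write_succeeds(0, 6, 3, required_acks=2, down={1, 2}))
```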
The dynamo paper
 [Diagram: six-node ring with client and coordinator. RF = 3, CL = One,
 + hint.]
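Hinted handoff can be sketched in Python as below. This is an illustration under assumptions rather than Cassandra's code: the Node class, the choice to keep hints on the coordinator, and the function names are all invented for the example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node, key, value) kept for replicas that were down


def coordinate_write(coordinator, replicas, key, value):
    """Write to every live replica; keep a hint for each one that is down."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica, key, value))
    return acks


def replay_hints(coordinator):
    """Deliver stored hints to any target that has come back up."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


coordinator = Node("#5")
n1, n2, n6 = Node("#1"), Node("#2"), Node("#6")
n2.up = False
acks = coordinate_write(coordinator, [n1, n2, n6], "foo", "bar")
print(acks, len(coordinator.hints))

n2.up = True
replay_hints(coordinator)
print(n2.data)
```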
The dynamo paper
 [Diagram: six-node ring with client and coordinator. RF = 3, CL = One,
 read repair.]
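A small Python sketch of the idea behind read repair, assuming each replica is a plain dict of key to (value, timestamp) and that the newest timestamp wins. The function name and data layout are hypothetical.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and push it to stale replicas.

    replicas is a list of dicts mapping key -> (value, timestamp).
    """
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[1])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # repair the out-of-date or missing copy
    return newest[0]


replica_set = [
    {"foo": ("bar", 1000)},
    {"foo": ("zing", 1001)},
    {},
]
print(read_with_repair(replica_set, "foo"))
print(replica_set)
```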
The big table paper

 •   Sparse "columnar" data model
 •   SSTable disk storage
 •   Append-only commit log
 •   Memtable (buffer and sort)
 •   Immutable SSTable files
 •   Compaction
 http://labs.google.com/papers/bigtable-osdi06.pdf
 http://www.slideshare.net/geminimobile/bigtable-4820829
The big table paper


 [Diagram: a column is a name and a value, plus a timestamp.]
The big table paper

 [Diagram: a series of columns, each with a name and a value.]

 We can have millions of columns *

 * theoretically up to 2 billion
The big table paper

 [Diagram: a row is a row key plus a set of columns, each with a name
 and a value.]
The big table paper

 [Diagram: a column family is a set of rows, each a row key plus its
 columns.]

 We can have billions of rows
The big table paper

 [Diagram: a write goes to the memtable (in memory) and the commit log
 (on disk). The memtable is flushed to an SSTable on disk on a time/size
 trigger. SSTables are immutable.]
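The write path can be sketched in Python as a toy store. It is a simplification under stated assumptions: the flush trigger is a column count rather than time or size, "disk" is just a list of tuples, the commit log is never trimmed, and the ColumnStore class is invented for the example.

```python
class ColumnStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer, sorted when flushed
        self.sstables = []     # immutable, sorted tuples standing in for disk files
        self.flush_threshold = flush_threshold

    def write(self, name, value):
        self.commit_log.append((name, value))
        self.memtable[name] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Sort the memtable and freeze it as a new SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, name):
        """Check the memtable first, then SSTables from newest to oldest."""
        if name in self.memtable:
            return self.memtable[name]
        for sstable in reversed(self.sstables):
            for column, value in sstable:
                if column == name:
                    return value
        return None

    def compact(self):
        """Merge all SSTables into one; newer values replace older ones."""
        if not self.sstables:
            return
        merged = {}
        for sstable in self.sstables:  # oldest first
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]


store = ColumnStore(flush_threshold=2)
for name, value in [("zebra", "1"), ("badger", "2"), ("zebra", "3"), ("aardvark", "4")]:
    store.write(name, value)
print(len(store.sstables), store.read("zebra"))
store.compact()
print(store.sstables)
```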
Data model basics: conflict resolution

 Per-column timestamp-based conflict
 resolution
 {                              {
     column: foo,                   column: foo,
     value: bar,                    value: zing,
     timestamp: 1000                timestamp: 1001
 }                              }
                                     bigger timestamp

 http://cassandra.apache.org/
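As a sketch, the rule on this slide fits in one small Python function. The dict layout mirrors the slide; the function name resolve and the choice to keep the first version on a tie are assumptions made for the example.

```python
def resolve(first, second):
    """Keep whichever version of a column has the bigger timestamp."""
    if second["timestamp"] > first["timestamp"]:
        return second
    return first  # assumption: on a tie this sketch keeps the first version


old = {"column": "foo", "value": "bar", "timestamp": 1000}
new = {"column": "foo", "value": "zing", "timestamp": 1001}

winner = resolve(old, new)
print(winner["value"])  # zing
```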
Data model basics: column ordering

 Columns ordered at time of writing,
 according to Column Family schema
 {                              {
     column: zebra,                 column: badger,
     value: foo,                    value: foo,
     timestamp: 1000                timestamp: 1001
 }                              }

 http://cassandra.apache.org/
Data model basics: column ordering

 Columns ordered at time of writing,
 according to Column Family schema
 {
     badger: foo,               with AsciiType column
     zebra: foo                 schema
 }


 http://cassandra.apache.org/
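Ordering at write time can be illustrated in Python with a sorted insert. This is a sketch only: it assumes plain string comparison stands in for an AsciiType comparator, and the Row class is hypothetical.

```python
import bisect


class Row:
    """Keeps column names sorted as they are written."""

    def __init__(self):
        self.names = []    # always in sorted order
        self.values = {}

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)  # place the name in its sorted position
        self.values[name] = value

    def columns(self):
        return [(name, self.values[name]) for name in self.names]


row = Row()
row.insert("zebra", "foo")    # written first
row.insert("badger", "foo")   # written second, but sorts first
print(row.columns())
```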
Key point

 Each “query” can be answered from a
 single slice of disk

 (once compaction has finished)
Data modeling – 1000ft introduction

 • Start from your queries and work
   backwards
 • Denormalise in the application
   (store data more than once)


 http://www.slideshare.net/mattdennis/cassandra-data-modeling
 http://blip.tv/datastax/data-modeling-workshop-5496906
Pattern 1: not using the value

 Storing that user X is in bucket Y

 Row key:                  f97be9cc-5255-457…
 Column name:              foo
 Value:                    1 (we don’t really care about this)


 https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
Pattern 1: not using the value

 Q: is user X in bucket foo?
 A: single column fetch

 f97be9cc-5255-4578-8813-76701c0945bd
    bar: 1
    foo: 1
 06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
    baz: 1
    zoo: 1
 503778bc-246f-4041-ac5a-fd944176b26d
    aaa: 1
Pattern 1: not using the value

 Q: which buckets is user X in?
 A: column slice fetch

 f97be9cc-5255-4578-8813-76701c0945bd
    bar: 1
    foo: 1
 06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
    baz: 1
    zoo: 1
 503778bc-246f-4041-ac5a-fd944176b26d
    aaa: 1
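Both questions can be mimicked in Python with a dict of rows, using the data from the slides. This is a sketch of the access pattern, not a client API; in_bucket and buckets_for are invented names.

```python
# Row key -> {column name: value}; the value is always 1 and carries no meaning.
users = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
    "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
}


def in_bucket(user, bucket):
    """Single column fetch: does the column exist in the user's row?"""
    return bucket in users.get(user, {})


def buckets_for(user):
    """Column slice fetch: every column name in the user's row, in order."""
    return sorted(users.get(user, {}))


user_x = "f97be9cc-5255-4578-8813-76701c0945bd"
print(in_bucket(user_x, "foo"))
print(buckets_for(user_x))
```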
Pattern 1: not using the value

 We could also use expiring columns to
 automatically delete columns N seconds
 after insertion

 UPDATE users
 USING TTL 3600
 SET 'foo' = 1
 WHERE KEY =
     'f97be9cc-5255-4578-8813-76701c0945bd'
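The effect of an expiring column can be sketched in Python with an explicit clock. Everything here is an assumption for illustration: the ExpiringRow class, passing the current time as a plain number of seconds, and hiding expired columns at read time.

```python
class ExpiringRow:
    def __init__(self):
        self.columns = {}  # name -> (value, expires_at or None)

    def set(self, name, value, now, ttl=None):
        """Store a column; with a TTL it expires ttl seconds after now."""
        expires_at = now + ttl if ttl is not None else None
        self.columns[name] = (value, expires_at)

    def get(self, name, now):
        """Return the value, or None if the column is missing or expired."""
        if name not in self.columns:
            return None
        value, expires_at = self.columns[name]
        if expires_at is not None and now >= expires_at:
            return None
        return value


row = ExpiringRow()
row.set("foo", 1, now=0, ttl=3600)
print(row.get("foo", now=1800))
print(row.get("foo", now=3600))
```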
Pattern 2: counters

 Real-time analytics to count
 clicks/impressions of ads in hourly
 buckets

 Row key:                  1
 Column name:              2011103015-click
 Value:                    34


 https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
Pattern 2: counters

 Increment by 1 using CQL

 UPDATE ads
 SET '2011103015-impression'
     = '2011103015-impression' + 1
 WHERE KEY = '1'
Pattern 2: counters

 Q: how many clicks/impressions for ad 1
 over time range?
 A: column slice fetch, between column X and Y

 1
     2011103015-click: 1
     2011103015-impression: 3434
     2011103016-click: 12
     2011103016-impression: 5411
     2011103017-click: 2
     2011103017-impression: 345
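The slice between column X and Y can be sketched in Python over the row shown above. It assumes column names compare as plain ASCII strings; column_slice and total are hypothetical helpers, and the trailing "~" is just a trick to close the range after the last hour.

```python
import bisect

ad_1 = {
    "2011103015-click": 1,
    "2011103015-impression": 3434,
    "2011103016-click": 12,
    "2011103016-impression": 5411,
    "2011103017-click": 2,
    "2011103017-impression": 345,
}


def column_slice(row, start, finish):
    """Return the columns whose names fall between start and finish inclusive."""
    names = sorted(row)
    low = bisect.bisect_left(names, start)
    high = bisect.bisect_right(names, finish)
    return [(name, row[name]) for name in names[low:high]]


def total(row, event, first_hour, last_hour):
    """Sum one event type across the hourly buckets in the range."""
    # "~" sorts after "-" and after every letter and digit in ASCII,
    # so last_hour + "~" sits just past that hour's columns.
    columns = column_slice(row, first_hour, last_hour + "~")
    return sum(value for name, value in columns if name.endswith("-" + event))


print(total(ad_1, "click", "2011103015", "2011103016"))
print(total(ad_1, "impression", "2011103015", "2011103016"))
```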
Pattern 3: time series

 Store canonical reference of impressions
 and clicks

 Row key:                    20111030
 Column name:                <time UUID>
 Value:                      {json}

 Cassandra can order columns by time


 http://rubyscale.com/2011/basic-time-series-with-cassandra/
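A short Python sketch of ordering by time UUID, using the standard library's version 1 UUIDs, which embed a timestamp. The one-row-per-day key follows the slide; time_ordered and the JSON payload are made up for the example.

```python
import uuid

row_key = "20111030"  # one row per day, as on the slide


def time_ordered(columns):
    """Sort (time UUID, value) pairs by the timestamp embedded in the UUID."""
    return sorted(columns, key=lambda column: column[0].time)


# Column name is a version 1 (time-based) UUID, value is a JSON string.
events = [(uuid.uuid1(), '{"type": "click"}') for _ in range(3)]
shuffled = list(reversed(events))

for name, value in time_ordered(shuffled):
    print(row_key, name, value)
```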
Pattern 4: object properties as columns

 Store user properties such as name,
 email, etc.

 Row key:                 f97be9cc-5255-457…
 Column name:             name
 Value:                   Bob Foo-Bar



 http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
Anti-pattern 1: read-before-write

 Avoid reading a whole object, changing it
 and writing it back. Instead store its properties
 as independent columns and mutate individually

 (see pattern 4)
Anti-pattern 2: super columns

 Friends don’t let friends use super
 columns.




 http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
Anti-pattern 3: OPP

 The Order Preserving Partitioner
 unbalances your load and makes your
 life harder



 http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
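The load imbalance can be illustrated with a toy Python comparison. It is heavily simplified and every detail is an assumption: four nodes, order-preserving placement by the key's first letter, and MD5 modulo the node count standing in for a random partitioner's token ranges.

```python
import hashlib
from collections import Counter

NODES = 4


def order_preserving_node(key):
    """Place a key by its first letter, with a to z split into equal ranges."""
    position = (ord(key[0].lower()) - ord("a")) % 26
    return position * NODES // 26


def random_node(key):
    """Place a key by its MD5 hash, which ignores key order."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


keys = ["user%04d" % i for i in range(1000)]
ordered_load = Counter(order_preserving_node(key) for key in keys)
hashed_load = Counter(random_node(key) for key in keys)

print("order preserving:", dict(ordered_load))
print("hashed:", dict(hashed_load))
```

Keys that share a prefix all land on one node when order is preserved, while hashing spreads them across the cluster.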
Recap: Data modeling

 • Think about the queries, work
   backwards
 • Don’t overuse single rows; try to
   spread the load
 • Don’t use super columns
 • Ask on IRC! #cassandra
There’s more: Brisk

 Integrated Hadoop distribution (without
 HDFS installed). Run Hive and Pig queries
 directly against Cassandra

 DataStax offer this functionality in their
 “Enterprise” product

 http://www.datastax.com/products/enterprise
Hive: SQL-like interface to Hadoop

CREATE EXTERNAL TABLE tempUsers
    (userUuid string, segmentId string, value string)
STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.columns.mapping" = ":key,:column,:value",
    "cassandra.cf.name" = "users"
    );


SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;
In conclusion


 Cassandra is founded on
 sound design principles
In conclusion


 The data model is incredibly
 powerful
In conclusion


 CQL and a new breed of
 clients are making it easier
 to use
In conclusion


 Hadoop integration means we
 can analyse data directly from
 a Cassandra cluster
In conclusion


 There is a strong community
 and multiple companies
 offering professional support
Thanks
                                          looking for a job?


Learn more about Cassandra
meetup.com/Cassandra-London
Sample ad-targeting project on Github
https://github.com/davegardnerisme/we-have-your-kidneys

Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations

Learning Cassandra

  • 2. What I’m going to cover • How to NoSQL • Cassandra basics (dynamo and big table) • How to use the data model in real life
  • 3. How to NoSQL 1. Find data store that doesn’t use SQL 2. Anything 3. Cram all the things into it 4. Triumphantly blog this success 5. Complain a month later when it bursts into flames http://www.slideshare.net/rbranson/how-do-i-cassandra/4
  • 4. Choosing NoSQL “NoSQL DBs trade off traditional features to better support new and emerging use cases” http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
  • 5. Choosing Cassandra: Tradeoffs More widely used, tested and documented software MySQL first OS release 1998 For a relatively immature product Cassandra first open-sourced in 2008
  • 6. Choosing Cassandra: Tradeoffs Ad-hoc querying SQL join, group by, having, order For a rich data model with limited ad-hoc querying ability Cassandra makes you denormalise
  • 7. Choosing NoSQL “they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.” Benjamin Black – NoSQL Tapes (at 30:15) http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
  • 8. What do we get in return? Proven horizontal scalability Cassandra scales reads and writes linearly as new nodes are added
  • 9. Netflix benchmark: linear scaling http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  • 10. What do we get in return? High availability Cassandra is fault-resistant with tunable consistency levels
  • 11. What do we get in return? Low latency, solid performance Cassandra has very good write performance
  • 12. Performance benchmark * http://blog.cubrid.org/dev-platform/nosql-benchmarking/ * Add pinch of salt
  • 13. What do we get in return? Operational simplicity Homogenous cluster, no “master” node, no SPOF
  • 14. What do we get in return? Rich data model Cassandra is more than simple key- value – columns, composites, counters, secondary indexes
  • 15. How to NoSQL version 2 Learn about each solution • What tradeoffs are you making? • How is it designed? • What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
  • 16. Amazon Dynamo + Google Big Table
      From Amazon Dynamo: consistent hashing, vector clocks *, gossip protocol, hinted handoff, read repair
      http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
      From Google Big Table: columnar, SSTable storage, append-only, memtable, compaction
      http://labs.google.com/papers/bigtable-osdi06.pdf
      * not in Cassandra
  • 17. The dynamo paper [Diagram: a ring of six nodes, #1 to #6, with a client outside the ring. Tokens are integers from 0 to 2^127.]
  • 18. The dynamo paper [Diagram: the same six-node ring. The client talks to a coordinator node; the ring is labelled "consistent hashing".]
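
The ring on slides 17 and 18 can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's actual code: the function names (`token_for`, `build_ring`, `replicas_for`) are made up for this example, the evenly spaced tokens are an assumption, and taking the next nodes clockwise as replicas is a simplification of the simplest placement strategy.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127  # token range quoted on slide 17


def token_for(key: str) -> int:
    """Hash a row key onto the ring (MD5 reduced to the 0..2**127 range)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def build_ring(node_names):
    """Assign evenly spaced tokens (an assumption) and return sorted (token, node) pairs."""
    step = TOKEN_SPACE // len(node_names)
    return sorted((i * step, name) for i, name in enumerate(node_names))


def replicas_for(ring, key, replication_factor=3):
    """Find the first node whose token is >= the key's token, then walk clockwise."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = build_ring(["node1", "node2", "node3", "node4", "node5", "node6"])
owners = replicas_for(ring, "f97be9cc-5255-4578-8813-76701c0945bd")
```

Any node can act as coordinator because every node can compute `replicas_for` locally; adding a node only moves the keys in one slice of the ring.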
  • 19. Consistency levels How many replicas must respond to declare success?
  • 20. Consistency levels: read operations
      Level          Description
      ONE            1st response
      QUORUM         N/2 + 1 replicas
      LOCAL_QUORUM   N/2 + 1 replicas in local data centre
      EACH_QUORUM    N/2 + 1 replicas in each data centre
      ALL            All replicas
      http://wiki.apache.org/cassandra/API#Read
  • 21. Consistency levels: write operations
      Level          Description
      ANY            One node, including hinted handoff
      ONE            One node
      QUORUM         N/2 + 1 replicas
      LOCAL_QUORUM   N/2 + 1 replicas in local data centre
      EACH_QUORUM    N/2 + 1 replicas in each data centre
      ALL            All replicas
      http://wiki.apache.org/cassandra/API#Write
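
The "N/2 + 1" in these tables uses integer division, where N is the replication factor. A small sketch of the arithmetic follows; the helper names are invented for this example, and the overlap rule (read replicas + write replicas > N) comes from the Dynamo paper rather than from the slides.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at QUORUM: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return write_replicas + read_replicas > replication_factor


rf = 3
# QUORUM writes plus QUORUM reads overlap on at least one replica.
quorum_overlaps = reads_see_latest_write(rf, quorum(rf), quorum(rf))
# ONE plus ONE does not guarantee overlap, so a read may return stale data.
one_overlaps = reads_see_latest_write(rf, 1, 1)
```

With RF = 3, QUORUM means 2 replicas, so one replica can be down and both reads and writes still succeed.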
  • 22. The dynamo paper [Diagram: six-node ring with client and coordinator. RF = 3, CL = One.]
  • 23. The dynamo paper [Diagram: six-node ring with client and coordinator. RF = 3, CL = Quorum.]
  • 24. The dynamo paper [Diagram: six-node ring with client and coordinator. RF = 3, CL = One, plus a hint.]
  • 25. The dynamo paper [Diagram: six-node ring with client and coordinator. RF = 3, CL = One, with read repair.]
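
Slides 22 to 24 can be modelled with a toy coordinator. This is a deliberately simplified sketch under stated assumptions: `Replica`, `write` and `replay_hints` are hypothetical names, hints are kept in a plain list, and the consistency level is reduced to a number of required acknowledgements.

```python
class Replica:
    """A toy replica: a name, an up/down flag and an in-memory key/value store."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}


def write(replicas, key, value, required_acks):
    """Coordinator sketch: fail fast if too few replicas are up, otherwise
    write to the live ones and keep a hint for each replica that is down."""
    live = [r for r in replicas if r.up]
    if len(live) < required_acks:
        return False, []
    for replica in live:
        replica.data[key] = value
    hints = [(r.name, key, value) for r in replicas if not r.up]
    return True, hints


def replay_hints(replicas, hints):
    """Deliver stored hints to replicas that have come back up."""
    by_name = {r.name: r for r in replicas}
    for name, key, value in hints:
        if by_name[name].up:
            by_name[name].data[key] = value


replicas = [Replica("node1"), Replica("node2"), Replica("node3", up=False)]
ok, hints = write(replicas, "foo", "bar", required_acks=1)  # RF = 3, CL = One
```

In this model a write at CL = One succeeds with one node down and leaves a hint behind, whereas the same write requiring all three acknowledgements would be rejected.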
  • 26. The big table paper • Sparse "columnar" data model • SSTable disk storage • Append-only commit log • Memtable (buffer and sort) • Immutable SSTable files • Compaction http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
  • 27. The big table paper [Diagram: a column is a name and a value, plus a timestamp.]
  • 28. The big table paper [Diagram: three columns side by side, each with a name and a value.] We can have millions of columns * (* theoretically up to 2 billion)
  • 29. The big table paper [Diagram: a row is a row key followed by its columns, each with a name and a value.]
  • 30. The big table paper [Diagram: a column family holds many rows, each a row key followed by its columns.] We can have billions of rows
  • 31. The big table paper [Diagram: a write goes to the commit log on disk and to the memtable in memory. The memtable is flushed on a time/size trigger to immutable SSTables on disk.]
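
The write path on slides 26 and 31 can be sketched as follows. This is a toy, single-node model: the class name `ColumnFamilyStore`, the size-only flush trigger and the in-memory "disk" are assumptions made for illustration.

```python
class ColumnFamilyStore:
    """Toy write path: commit log + memtable, flushed to immutable SSTables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory buffer, sorted when flushed
        self.sstables = []     # immutable sorted tuples, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        """Check the memtable, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]
```

Writes never modify an existing file, which is why they are cheap; the cost is that a read may have to consult several SSTables until compaction merges them.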
  • 32. Data model basics: conflict resolution Per-column timestamp-based conflict resolution { column: foo, value: bar, timestamp: 1000 } versus { column: foo, value: zing, timestamp: 1001 } http://cassandra.apache.org/
  • 33. Data model basics: conflict resolution Per-column timestamp-based conflict resolution { column: foo, value: bar, timestamp: 1000 } versus { column: foo, value: zing, timestamp: 1001 } (bigger timestamp) http://cassandra.apache.org/
  • 34. Data model basics: column ordering Columns ordered at time of writing, according to Column Family schema { column: zebra, value: foo, timestamp: 1000 } and { column: badger, value: foo, timestamp: 1001 } http://cassandra.apache.org/
  • 35. Data model basics: column ordering Columns ordered at time of writing, according to Column Family schema { badger: foo, zebra: foo } with AsciiType column schema http://cassandra.apache.org/
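
The rule on slides 32 and 33 can be sketched as a pure function. The dictionary layout mirrors the slide; the function names are made up for this example, and tie-breaking on equal timestamps is left out.

```python
def resolve(a, b):
    """Return whichever version of a column has the bigger timestamp."""
    return a if a["timestamp"] > b["timestamp"] else b


def merge_rows(row_a, row_b):
    """Merge two copies of a row column by column (rows map column name -> column)."""
    merged = dict(row_a)
    for name, column in row_b.items():
        merged[name] = resolve(merged[name], column) if name in merged else column
    return merged


older = {"column": "foo", "value": "bar", "timestamp": 1000}
newer = {"column": "foo", "value": "zing", "timestamp": 1001}
winner = resolve(older, newer)
```

Because resolution is per column, two clients updating different columns of the same row never conflict with each other.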
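
Keeping columns sorted at write time is what makes slices cheap. A minimal sketch, assuming ASCII column names compared as Python strings (which matches byte order for ASCII); the `Row` class and its methods are hypothetical.

```python
import bisect


class Row:
    """Toy row that keeps its column names sorted as they are written."""

    def __init__(self):
        self.names = []    # sorted column names
        self.values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)  # place the name in sorted position
        self.values[name] = value

    def get(self, name):
        """Single column fetch."""
        return self.values.get(name)

    def slice(self, start, finish):
        """Column slice: every column with start <= name <= finish, in order."""
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, finish)
        return [(n, self.values[n]) for n in self.names[lo:hi]]


row = Row()
row.insert("zebra", "foo")
row.insert("badger", "foo")
ordered = row.slice("a", "zzz")  # badger comes before zebra
```

A slice is a contiguous run of the sorted names, so the work is done once at write time and reads stay sequential.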
  • 36. Key point Each “query” can be answered from a single slice of disk (once compaction has finished)
  • 37. Data modeling – 1000ft introduction • Start from your queries and work backwards • Denormalise in the application (store data more than once) http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
  • 38. Pattern 1: not using the value Storing that user X is in bucket Y Row key: f97be9cc-5255-457… Column name: foo Value: 1 (we don’t really care about this) https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
  • 39. Pattern 1: not using the value Q: is user X in bucket foo? A: single column fetch
      f97be9cc-5255-4578-8813-76701c0945bd   bar: 1, foo: 1
      06a6f1b0-fcf2-41d9-8949-fe2d416bde8e   baz: 1, zoo: 1
      503778bc-246f-4041-ac5a-fd944176b26d   aaa: 1
  • 40. Pattern 1: not using the value Q: which buckets is user X in? A: column slice fetch
      f97be9cc-5255-4578-8813-76701c0945bd   bar: 1, foo: 1
      06a6f1b0-fcf2-41d9-8949-fe2d416bde8e   baz: 1, zoo: 1
      503778bc-246f-4041-ac5a-fd944176b26d   aaa: 1
  • 41. Pattern 1: not using the value We could also use expiring columns to automatically delete columns N seconds after insertion UPDATE users USING TTL = 3600 SET 'foo' = 1 WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
  • 42. Pattern 2: counters Real-time analytics to count clicks/impressions of ads in hourly buckets Row key: 1 Column name: 2011103015-click Value: 34 https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
  • 43. Pattern 2: counters Increment by 1 using CQL UPDATE ads SET '2011103015-impression' = '2011103015-impression' + 1 WHERE KEY = '1'
  • 44. Pattern 2: counters Q: how many clicks/impressions for ad 1 over time range? A: column slice fetch, between column X and Y
      Row key 1:
      2011103015-click: 1
      2011103015-impression: 3434
      2011103016-click: 12
      2011103016-impression: 5411
      2011103017-click: 2
      2011103017-impression: 345
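
The hourly-bucket counter pattern can be modelled in memory to show why the column naming matters. This is only a sketch: `bucket` and `AdCounters` are invented names, and a real deployment would issue the CQL increment from slide 43 instead of updating a local dictionary.

```python
from collections import Counter
from datetime import datetime


def bucket(ts: datetime, event: str) -> str:
    """Build a column name such as 2011103015-click (year, month, day, hour, event)."""
    return ts.strftime("%Y%m%d%H") + "-" + event


class AdCounters:
    """Toy counter column family: one row per ad, one counter column per hour and event."""

    def __init__(self):
        self.rows = {}

    def increment(self, ad_id, ts, event, by=1):
        self.rows.setdefault(ad_id, Counter())[bucket(ts, event)] += by

    def slice(self, ad_id, start, finish):
        """Return counters whose column name lies between start and finish, in order."""
        row = self.rows.get(ad_id, {})
        return [(name, row[name]) for name in sorted(row) if start <= name <= finish]


counters = AdCounters()
counters.increment("1", datetime(2011, 10, 30, 15), "click")
counters.increment("1", datetime(2011, 10, 30, 16), "click", by=12)
counters.increment("1", datetime(2011, 10, 30, 17), "click", by=2)
first_two_hours = counters.slice("1", "2011103015-click", "2011103016-impression")
```

Because the timestamp prefix sorts chronologically, a time range query becomes a single contiguous column slice on one row.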
  • 45. Pattern 3: time series Store canonical reference of impressions and clicks Row key: 20111030 Column name: <time UUID> Value: {json} Cassandra can order columns by time http://rubyscale.com/2011/basic-time-series-with-cassandra/
  • 46. Pattern 4: object properties as columns Store user properties such as name, email, etc. Row key: f97be9cc-5255-457… Column name: name Value: Bob Foo-Bar http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
  • 47. Anti-pattern 1: read-before-write Instead store as independent columns and mutate individually (see pattern 4)
  • 48. Anti-pattern 2: super columns Friends don’t let friends use super columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
  • 49. Anti-pattern 3: OPP The Order Preserving Partitioner unbalances your load and makes your life harder http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
  • 50. Recap: Data modeling • Think about the queries, work backwards • Don’t overuse single rows; try to spread the load • Don’t use super columns • Ask on IRC! #cassandra
  • 51. There’s more: Brisk Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra DataStax offer this functionality in their “Enterprise” product http://www.datastax.com/products/enterprise
  • 52. Hive: SQL-like interface to Hadoop CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string) STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( "cassandra.columns.mapping" = ":key,:column,:value", "cassandra.cf.name" = "users" ); SELECT segmentId, count(1) AS total FROM tempUsers GROUP BY segmentId ORDER BY total DESC;
  • 53. In conclusion Cassandra is founded on sound design principles
  • 54. In conclusion The data model is incredibly powerful
  • 55. In conclusion CQL and a new breed of clients are making it easier to use
  • 56. In conclusion Hadoop integration means we can analyse data directly from a Cassandra cluster
  • 57. In conclusion There is a strong community and multiple companies offering professional support
  • 58. Thanks looking for a job? Learn more about Cassandra meetup.com/Cassandra-London Sample ad-targeting project on Github https://github.com/davegardnerisme/we-have-your-kidneys Watch videos from Cassandra SF 2011 http://www.datastax.com/events/cassandrasf2011/presentations

Editor's Notes

  1. This is the way that NoSQL is often approached. A light-hearted take on both how people approach NoSQL and to some extent the tools themselves.
  2. A better approach is to consider NoSQL in terms of tradeoffs
  3. Sums it up
  4. 1st
  5. 2nd
  6. 3rd
  7. 4th
  8. 5th and last
  9. A better approach
  10. Last slide