Rethinking Topology in Cassandra


ApacheCon Europe
November 7, 2012

Eric Evans
eevans@acunu.com
@jericevans
DHT 101



DHT 101
partitioning

[Diagram: the keyspace as a ring, wrapping from Z around to A]

The keyspace, a namespace encompassing all possible keys
DHT 101
partitioning

[Diagram: the ring divided among nodes A, B, C, ... Y, Z]

The namespace is divided into N partitions (where N is the number of nodes). Partitions are
mapped to nodes and placed evenly throughout the namespace.
DHT 101
partitioning

[Diagram: Key = Aaa falls between Z and A on the ring of nodes A, B, C, ... Y, Z]

A record, stored by key, is positioned on the next node (working clockwise) from where it
sorts in the namespace
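
To make the placement rule concrete, here is a minimal Python sketch (illustrative only, not the talk's code; the tokens, node names, and hash function are assumptions):

    import bisect
    import hashlib

    # A sorted ring of (token, node) pairs; tokens are illustrative.
    ring = [(10, "A"), (35, "B"), (60, "C"), (85, "Z")]
    tokens = [t for t, _ in ring]

    def owner(key: str) -> str:
        """Hash the key onto the ring, then take the next node clockwise."""
        token = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
        i = bisect.bisect_right(tokens, token) % len(ring)  # wrap past the end
        return ring[i][1]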
DHT 101
replica placement

[Diagram: Key = Aaa stored on node A, with additional replicas on B and C, the next nodes clockwise]

Additional copies (replicas) are stored on other nodes; commonly the next N-1 nodes clockwise (where N is the replication factor), but anything deterministic will work.
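
Continuing the earlier sketch (again illustrative), replicas can be chosen by simply walking on clockwise from the primary:

    def replicas(key: str, n: int = 3) -> list[str]:
        """Return the primary plus the next n-1 nodes clockwise."""
        token = int(hashlib.md5(key.encode()).hexdigest(), 16) % 100
        start = bisect.bisect_right(tokens, token) % len(ring)
        return [ring[(start + k) % len(ring)][1] for k in range(n)]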
DHT 101
consistency

Consistency
Availability
Partition tolerance

With multiple copies comes a set of trade-offs, commonly articulated using the CAP theorem: at any given point, we can only guarantee two of Consistency, Availability, and Partition tolerance.
DHT 101
scenario: consistency level = one

[Diagram: write W acknowledged by replica A; the other two replicas, marked "?", may be unavailable]

Writing at consistency level ONE provides very high availability: only one of the 3 replica nodes needs to be up for the write to succeed.
DHT 101
scenario: consistency level = all

[Diagram: read R requires replies from all replicas; A shown up, the other two marked "?"]

If strong consistency is required, reads at consistency ALL can be used with writes performed at ONE. The trade-off is in availability: all 3 replica nodes must be up, else the read fails.
DHT 101
scenario: quorum write

[Diagram: write W acknowledged by replicas A and B; the third replica, marked "?", may be down; R+W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.
DHT 101
scenario: quorum read

[Diagram: read R answered by replicas B and C; the third replica, marked "?", may be down; R+W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.
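
As a worked example (a sketch assuming the 3-replica scenario above; this is just the slide's arithmetic):

    N = 3                 # replication factor
    QUORUM = N // 2 + 1   # floor(N/2) + 1 = 2
    R = W = QUORUM
    assert R + W > N      # 2 + 2 = 4 > 3: every read quorum overlaps every write quorum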
Awesome, yes?




Well...




Problem: Poor load distribution

Distributing Load

[Diagram: ring of nodes A, B, C, M, ... Y, Z]

B and C hold replicas of A
Distributing Load

[Diagram: ring of nodes A, B, C, M, ... Y, Z]

A and B hold replicas of Z
Distributing Load

[Diagram: ring of nodes A, B, C, M, ... Y, Z]

Z and A hold replicas of Y
Distributing Load

[Diagram: ring of nodes A, B, C, M, ... Y, Z]

Disaster strikes!
Distributing Load

[Diagram: the same ring with node A down]

Sets [Y,Z,A], [Z,A,B], and [A,B,C] all suffer the loss of A; this results in extra load on the neighboring nodes.
Distributing Load

[Diagram: ring with replacement node A1 taking over for A]

Solution: Replace/repair down node
Distributing Load

[Diagram: ring with replacement node A1; neighbors stream data to it]

Neighboring nodes are needed to stream missing data to A; this results in even more load on the neighboring nodes.
Problem: Poor data distribution

Distributing Data
[Diagram: ring of four nodes A, B, C, D, evenly spaced]

Ideal distribution of keyspace
Distributing Data
[Diagram: node E bootstrapped into the ring, bisecting one partition; five nodes, unevenly spaced]

Bootstrapping a node, bisecting one partition; Distribution is no longer ideal
Distributing Data
[Diagram: nodes A, B, C, D moved to new positions to rebalance the ring]

Moving existing nodes means moving corresponding data; Not ideal
Distributing Data
[Diagram: cluster doubled to eight nodes, A through H, bisecting all ranges]

Frequently cited alternative: Double the size of your cluster, bisecting all ranges
Virtual Nodes



In a nutshell...

[Diagram: a ring of many virtual nodes, with groups of them mapped to each of three physical hosts]

Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node
(host)
Benefits

• Operationally simpler (no token management)
• Better distribution of load
• Concurrent streaming involving all hosts
• Smaller partitions mean greater reliability
• Supports heterogeneous hardware
Strategies

• Automatic sharding
• Fixed partition assignment
• Random token assignment
Strategy
Automatic Sharding

• Partitions are split when data exceeds a threshold
• Newly created partitions are relocated to a host with lower data load (see the sketch below)
• Similar to sharding performed by Bigtable, or Mongo auto-sharding
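
A minimal sketch of the split-and-relocate idea (illustrative assumptions throughout: the threshold, data structures, and load metric are mine, not the talk's):

    THRESHOLD = 64  # max records per partition (illustrative)

    def maybe_split(partitions: dict[str, list], hosts: dict[str, set]) -> None:
        """Split oversized partitions; move each new half to the least-loaded host."""
        for name, records in list(partitions.items()):
            if len(records) > THRESHOLD:
                mid = len(records) // 2
                partitions[name + "'"] = records[mid:]  # new partition gets the upper half
                del records[mid:]
                # relocate the new partition to the host currently holding the least data
                target = min(hosts, key=lambda h: sum(len(partitions[p]) for p in hosts[h]))
                hosts[target].add(name + "'")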
Strategy
Fixed Partition Assignment

• Namespace divided into Q evenly-sized partitions
• Q/N partitions assigned per host (where N is the number of hosts)
• Joining hosts “steal” partitions evenly from existing hosts (see the sketch below)
• Used by Dynamo and Voldemort (described in the Dynamo paper as “strategy 3”)
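
A minimal sketch of the "steal evenly" step (Q stays fixed; host names and partition numbering are illustrative):

    def join(assignments: dict[str, list[int]], new_host: str) -> None:
        """When new_host joins, steal partitions from the most-loaded hosts until it holds ~Q/N."""
        existing = list(assignments)
        q = sum(len(parts) for parts in assignments.values())  # Q, the fixed partition count
        fair_share = q // (len(existing) + 1)
        assignments[new_host] = []
        while len(assignments[new_host]) < fair_share:
            donor = max(existing, key=lambda h: len(assignments[h]))
            assignments[new_host].append(assignments[donor].pop())

    cluster = {"h1": [0, 1, 2, 3], "h2": [4, 5, 6, 7], "h3": [8, 9, 10, 11]}  # Q = 12
    join(cluster, "h4")  # h4 ends up with Q/N = 3 partitions, one from each existing host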
Strategy
Random Token Assignment

• Each host assigned T random tokens
• T random tokens generated for joining hosts; new tokens divide existing ranges (see the sketch below)
• Similar to libketama; identical to Classic Cassandra when T=1
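
A minimal sketch of random token assignment (the token space and T are illustrative; with T=1 this degenerates to the classic one-token-per-node scheme):

    import random

    T = 256               # tokens per host, e.g. num_tokens: 256
    TOKEN_SPACE = 2**64   # illustrative token range

    def join_tokens(ring: dict[int, str], host: str, t: int = T) -> None:
        """Assign t random tokens to a joining host; each token splits an existing range."""
        assigned = 0
        while assigned < t:
            token = random.randrange(TOKEN_SPACE)
            if token not in ring:  # skip (vanishingly rare) collisions
                ring[token] = host
                assigned += 1

    ring: dict[int, str] = {}
    for h in ("host1", "host2", "host3"):
        join_tokens(ring, h)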
Considerations

1. Number of partitions
2. Partition size
3. How 1 changes with more nodes and data
4. How 2 changes with more nodes and data
Evaluating
Strategy        No. Partitions   Partition size
Random          O(N)             O(B/N)
Fixed           O(1)             O(B)
Auto-sharding   O(B)             O(1)

B ~ total data size, N ~ number of hosts
Evaluating
• Automatic sharding
  • Partition size constant (great)
  • Number of partitions scales linearly with data size (bad)
• Fixed partition assignment
• Random token assignment
Evaluating
• Automatic sharding
• Fixed partition assignment
  • Number of partitions is constant (good)
  • Partition size scales linearly with data size (bad)
  • Higher operational complexity (bad)
• Random token assignment
Evaluating
• Automatic sharding
• Fixed partition assignment
• Random token assignment
  • Number of partitions scales linearly with number of hosts (ok)
  • Partition size increases with more data; decreases with more hosts (good)
Cassandra



Configuration
conf/cassandra.yaml

    # Comma separated list of tokens,
    # (new installs only).
    initial_token: <token>,<token>,<token>

    or

    # Number of tokens to generate.
    num_tokens: 256

Two params control how tokens are assigned. The initial_token param now optionally accepts a comma-separated list, or (preferably) you can assign a numeric value to num_tokens.
Configuration
                                     nodetool info

      Token           :     (invoke with -T/--tokens to see all 256 tokens)
      ID              :     64090651-6034-41d5-bfc6-ddd24957f164
      Gossip active   :     true
      Thrift active   :     true
      Load            :     92.69 KB
      Generation No   :     1351030018
      Uptime (seconds):     45
      Heap Memory (MB):     95.16 / 1956.00
      Data Center     :     datacenter1
      Rack            :     rack1
      Exceptions      :     0
      Key Cache       :     size 240 (bytes), capacity 101711872 (bytes ...
      Row Cache       :     size 0 (bytes), capacity 0 (bytes), 0 hits, ...





To keep the output readable, nodetool info no longer displays tokens (when there is more than one) unless the -T/--tokens argument is passed.
Configuration
                                              nodetool ring
      Datacenter: datacenter1
      ==========
      Replicas: 2

      Address               Rack    Status State    Load         Owns     Token
                                                                          9022770486425350384
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -9182469192098976078
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -9054823614314102214
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8970752544645156769
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8927190060345427739
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8880475677109843259
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8817876497520861779
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8810512134942064901
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8661764562509480261
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8641550925069186492
      127.0.0.1             rack1   Up     Normal   97.24   KB   66.03%   -8636224350654790732
      ...
      ...





nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command.
Configuration
                                  nodetool status




      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      -- Address   Load    Tokens Owns   Host ID                               Rack
      UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
      UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
      UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1





New go-to command is nodetool status
Of note: since it is no longer practical to name a host by its token (it can have many), each host now has a unique ID. Note also the per-node token count in the Tokens column.
Migration
[Diagram: classic ring of three nodes A, B, C, one token each]
Migration
edit conf/cassandra.yaml and restart

    # Number of tokens to generate.
    num_tokens: 256

Step 1: Set num_tokens in cassandra.yaml, and restart node
Migration
convert to T contiguous tokens in existing ranges

[Diagram: each node's range (A, B, C) split into T contiguous tokens; placement unchanged]

This will cause the existing range to be split into T contiguous tokens. This results in no
change to placement
Migration
shuffle

[Diagram: the contiguous tokens randomly exchanged among A, B, C]

Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.
Shuffle

• Range transfers are queued on each host
• Hosts initiate transfer of ranges to self
• Pay attention to the logs!
Shuffle
                                          bin/shuffle
      Usage: shuffle [options] <sub-command>

      Sub-commands:
       create               Initialize a new shuffle operation
       ls                   List pending relocations
       clear                Clear pending relocations
       en[able]             Enable shuffling
       dis[able]            Disable shuffling

      Options:
       -dc, --only-dc               Apply only to named DC (create only)
       -tp, --thrift-port           Thrift port number (Default: 9160)
       -p,   --port                 JMX port number (Default: 7199)
       -tf, --thrift-framed         Enable framed transport for Thrift (Default: false)
       -en, --and-enable            Immediately enable shuffling (create only)
       -H,   --help                 Print help information
       -h,   --host                 JMX hostname or IP address (Default: localhost)
       -th, --thrift-host           Thrift hostname or IP address (Default: JMX host)




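
For example (an illustrative invocation, following the usage string above, where options precede the sub-command), to queue relocations and enable shuffling immediately:

    bin/shuffle -h 10.0.0.1 -en create

bin/shuffle ls can then be used to list the pending relocations.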
Performance



removenode
[Bar chart: removenode, Acunu Reflex / Cassandra 1.2 vs Cassandra 1.1; y-axis scale 0-400]

17 node cluster of EC2 m1.large instances, 460M rows
bootstrap
[Bar chart: bootstrap, Acunu Reflex / Cassandra 1.2 vs Cassandra 1.1; y-axis scale 0-500]

17 node cluster of EC2 m1.large instances, 460M rows
The End
• DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
• Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
• Overton, Sam. “Virtual Nodes Strategies.” Web.
• Overton, Sam. “Virtual Nodes: Performance Results.” Web.
• Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.

Mais conteúdo relacionado

Mais procurados

Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Amazon Web Services
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lakeMykola Zerniuk
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptxDori Waldman
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performancePostgreSQL-Consulting
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 

Mais procurados (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
SQLServer Database Structures
SQLServer Database Structures SQLServer Database Structures
SQLServer Database Structures
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
iceberg introduction.pptx
iceberg introduction.pptxiceberg introduction.pptx
iceberg introduction.pptx
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Destaque

Cassandra Virtual Node talk
Cassandra Virtual Node talkCassandra Virtual Node talk
Cassandra Virtual Node talkPatrick McFadin
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraEric Evans
 
Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseEric Evans
 
Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseEric Evans
 
Castle enhanced Cassandra
Castle enhanced CassandraCastle enhanced Cassandra
Castle enhanced CassandraEric Evans
 
Webinaire Business&Decision - Trifacta
Webinaire  Business&Decision - TrifactaWebinaire  Business&Decision - Trifacta
Webinaire Business&Decision - TrifactaVictor Coustenoble
 
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Eric Evans
 
Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Eric Evans
 
DataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTDataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTVictor Coustenoble
 
CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)Eric Evans
 
Cassandra by Example: Data Modelling with CQL3
Cassandra by Example:  Data Modelling with CQL3Cassandra by Example:  Data Modelling with CQL3
Cassandra by Example: Data Modelling with CQL3Eric Evans
 
CQL: SQL In Cassandra
CQL: SQL In CassandraCQL: SQL In Cassandra
CQL: SQL In CassandraEric Evans
 
It's not you, it's me: Ending a 15 year relationship with RRD
It's not you, it's me: Ending a 15 year relationship with RRDIt's not you, it's me: Ending a 15 year relationship with RRD
It's not you, it's me: Ending a 15 year relationship with RRDEric Evans
 
Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Eric Evans
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache CassandraEric Evans
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache CassandraEric Evans
 
Time series storage in Cassandra
Time series storage in CassandraTime series storage in Cassandra
Time series storage in CassandraEric Evans
 

Destaque (20)

Cassandra Virtual Node talk
Cassandra Virtual Node talkCassandra Virtual Node talk
Cassandra Virtual Node talk
 
Virtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
 
Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-case
 
Wikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-caseWikimedia Content API: A Cassandra Use-case
Wikimedia Content API: A Cassandra Use-case
 
Castle enhanced Cassandra
Castle enhanced CassandraCastle enhanced Cassandra
Castle enhanced Cassandra
 
Webinaire Business&Decision - Trifacta
Webinaire  Business&Decision - TrifactaWebinaire  Business&Decision - Trifacta
Webinaire Business&Decision - Trifacta
 
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)Time Series Data with Apache Cassandra (ApacheCon EU 2014)
Time Series Data with Apache Cassandra (ApacheCon EU 2014)
 
Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)Wikimedia Content API (Strangeloop)
Wikimedia Content API (Strangeloop)
 
Webinar Degetel DataStax
Webinar Degetel DataStaxWebinar Degetel DataStax
Webinar Degetel DataStax
 
DataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoTDataStax et Apache Cassandra pour la gestion des flux IoT
DataStax et Apache Cassandra pour la gestion des flux IoT
 
CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)CQL In Cassandra 1.0 (and beyond)
CQL In Cassandra 1.0 (and beyond)
 
DataStax Enterprise BBL
DataStax Enterprise BBLDataStax Enterprise BBL
DataStax Enterprise BBL
 
Cassandra by Example: Data Modelling with CQL3
Cassandra by Example:  Data Modelling with CQL3Cassandra by Example:  Data Modelling with CQL3
Cassandra by Example: Data Modelling with CQL3
 
CQL: SQL In Cassandra
CQL: SQL In CassandraCQL: SQL In Cassandra
CQL: SQL In Cassandra
 
It's not you, it's me: Ending a 15 year relationship with RRD
It's not you, it's me: Ending a 15 year relationship with RRDIt's not you, it's me: Ending a 15 year relationship with RRD
It's not you, it's me: Ending a 15 year relationship with RRD
 
Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)Rethinking Topology In Cassandra (ApacheCon NA)
Rethinking Topology In Cassandra (ApacheCon NA)
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
 
Lightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
 
Time Series Data with Apache Cassandra
Time Series Data with Apache CassandraTime Series Data with Apache Cassandra
Time Series Data with Apache Cassandra
 
Time series storage in Cassandra
Time series storage in CassandraTime series storage in Cassandra
Time series storage in Cassandra
 

Mais de Eric Evans

Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Eric Evans
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLEric Evans
 
NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?Eric Evans
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra ExplainedEric Evans
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra ExplainedEric Evans
 
Outside The Box With Apache Cassnadra
Outside The Box With Apache CassnadraOutside The Box With Apache Cassnadra
Outside The Box With Apache CassnadraEric Evans
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed DatabaseEric Evans
 
An Introduction To Cassandra
An Introduction To CassandraAn Introduction To Cassandra
An Introduction To CassandraEric Evans
 
Cassandra In A Nutshell
Cassandra In A NutshellCassandra In A Nutshell
Cassandra In A NutshellEric Evans
 

Mais de Eric Evans (9)

Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3
 
Cassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQLCassandra: Not Just NoSQL, It's MoSQL
Cassandra: Not Just NoSQL, It's MoSQL
 
NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?NoSQL Yes, But YesCQL, No?
NoSQL Yes, But YesCQL, No?
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Cassandra Explained
Cassandra ExplainedCassandra Explained
Cassandra Explained
 
Outside The Box With Apache Cassnadra
Outside The Box With Apache CassnadraOutside The Box With Apache Cassnadra
Outside The Box With Apache Cassnadra
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
 
An Introduction To Cassandra
An Introduction To CassandraAn Introduction To Cassandra
An Introduction To Cassandra
 
Cassandra In A Nutshell
Cassandra In A NutshellCassandra In A Nutshell
Cassandra In A Nutshell
 

Último

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Último (20)

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Virtual Nodes: Rethinking Topology in Cassandra

  • 1. Rethinking Topology in Cassandra ApacheCon Europe November 7, 2012 Eric Evans eevans@acunu.com @jericevans Wednesday, November 7, 12 1
  • 3. DHT 101 partitioning Z A Wednesday, November 7, 12 3 The keyspace, a namespace encompassing all possible keys
  • 4. DHT 101 partitioning Z A Y B C Wednesday, November 7, 12 4 The namespace is divided into N partitions (where N is the number of nodes). Partitions are mapped to nodes and placed evenly throughout the namespace.
  • 5. DHT 101 partitioning Z A Y Key = Aaa B C Wednesday, November 7, 12 5 A record, stored by key, is positioned on the next node (working clockwise) from where it sorts in the namespace
  • 6. DHT 101 replica placement Z A Y Key = Aaa B C Wednesday, November 7, 12 6 Additional copies (replicas) are stored on other nodes. Commonly the next N-1 nodes, but anything deterministic will work.
  • 7. DHT 101 consistency Consistency Availability Partition tolerance Wednesday, November 7, 12 7 With multiple copies comes a set of trade-offs commonly articulated using the CAP theorem; At any given point, we can only guarantee 2 of Consistency, Availability, and Partition tolerance.
  • 8. DHT 101 scenario: consistency level = one A W ? ? Wednesday, November 7, 12 8 Writing at consistency level ONE provides very high availability, only one in 3 member nodes need be up for write to succeed
  • 9. DHT 101 scenario: consistency level = all A R ? ? Wednesday, November 7, 12 9 If strong consistency is required, reads with consistency ALL can be used of writes performed at ONE. The trade-off is in availability, all 3 member nodes must be up, else the read fails.
  • 10. DHT 101 scenario: quorum write A W R+W > N B ? Wednesday, November 7, 12 10 Using QUORUM consistency, we only require floor((N/2)+1) nodes.
  • 11. DHT 101 scenario: quorum read ? R+W > N B R C Wednesday, November 7, 12 11 Using QUORUM consistency, we only require floor((N/2)+1) nodes.
  • 14. Problem: Poor load distribution Wednesday, November 7, 12 14
  • 15. Distributing Load Z A Y B C M Wednesday, November 7, 12 15 B and C hold replicas of A
  • 16. Distributing Load Z A Y B C M Wednesday, November 7, 12 16 A and B hold replicas of Z
  • 17. Distributing Load Z A Y B C M Wednesday, November 7, 12 17 Z and A hold replicas of Y
  • 18. Distributing Load Z A Y B C M Wednesday, November 7, 12 18 Disaster strikes!
  • 19. Distributing Load Z A Y B C M Wednesday, November 7, 12 19 Sets [Y,Z,A], [Z,A,B], [A,B,C] all suffer the loss of A; Results in extra load on neighboring nodes
  • 20. Distributing Load Z A A1 Y B C M Wednesday, November 7, 12 20 Solution: Replace/repair down node
  • 21. Distributing Load Z A A1 Y B C M Wednesday, November 7, 12 21 Solution: Replace/repair down node
  • 22. Distributing Load Z A A1 Y B C M Wednesday, November 7, 12 22 Neighboring nodes are needed to stream missing data to A; Results in even more load on neighboring nodes
  • 23. Problem: Poor data distribution Wednesday, November 7, 12 23
  • 24. Distributing Data A C D B Wednesday, November 7, 12 24 Ideal distribution of keyspace
  • 25. Distributing Data A E C D B Wednesday, November 7, 12 25 Bootstrapping a node, bisecting one partition; Distribution is no longer ideal
  • 26. Distributing Data A A E C D C D B B Wednesday, November 7, 12 26 Moving existing nodes means moving corresponding data; Not ideal
  • 27. Distributing Data A A E C D C D B B Wednesday, November 7, 12 27 Moving existing nodes means moving corresponding data; Not ideal
  • 28. Distributing Data A H E C D G F B Wednesday, November 7, 12 28 Frequently cited alternative: Double the size of your cluster, bisecting all ranges
  • 29. Distributing Data A H E C D G F B Wednesday, November 7, 12 29 Frequently cited alternative: Double the size of your cluster, bisecting all ranges
  • 31. In a nutshell... host host host Wednesday, November 7, 12 31 Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node (host)
  • 32. Benefits • Operationally simpler (no token management) • Better distribution of load • Concurrent streaming involving all hosts • Smaller partitions mean greater reliability • Supports heterogenous hardware Wednesday, November 7, 12 32
  • 33. Strategies • Automatic sharding • Fixed partition assignment • Random token assignment Wednesday, November 7, 12 33
  • 34. Strategy Automatic Sharding • Partitions are split when data exceeds a threshold • Newly created partitions are relocated to a host with lower data load • Similar to sharding performed by Bigtable, or Mongo auto-sharding Wednesday, November 7, 12 34
  • 35. Strategy Fixed Partition Assignment • Namespace divided into Q evenly-sized partitions • Q/N partitions assigned per host (where N is the number of hosts) • Joining hosts “steal” partitions evenly from existing hosts. • Used by Dynamo and Voldemort (described in Dynamo paper as “strategy 3”) Wednesday, November 7, 12 35
• 36. Strategy: Random Token Assignment
  • Each host is assigned T random tokens
  • T random tokens are generated for joining hosts; new tokens divide existing ranges
  • Similar to libketama; identical to classic Cassandra when T=1
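A sketch of random token assignment and the resulting key lookup (hashing, host names, and ring size are simplified for illustration; with T=1 this degenerates to the classic one-token-per-node ring):

    import bisect
    import random

    RING = 2**32

    def build_ring(hosts, t, seed=1):
        """Each host takes t random tokens; return parallel token/owner lists."""
        rng = random.Random(seed)
        entries = sorted((rng.randrange(RING), h) for h in hosts for _ in range(t))
        return [tok for tok, _ in entries], [h for _, h in entries]

    def lookup(tokens, owners, key_hash, rf=3):
        """Return rf distinct hosts for key_hash, walking clockwise."""
        i = bisect.bisect_right(tokens, key_hash) % len(tokens)
        replicas = []
        while len(replicas) < rf:
            if owners[i] not in replicas:
                replicas.append(owners[i])
            i = (i + 1) % len(tokens)
        return replicas

    tokens, owners = build_ring(["h1", "h2", "h3", "h4"], t=256)
    print(lookup(tokens, owners, key_hash=123456789))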
• 37. Considerations
  1. The number of partitions
  2. The partition size
  3. How (1) changes with more nodes and data
  4. How (2) changes with more nodes and data
• 38. Evaluating

      Strategy        No. partitions    Partition size
      --------------  ----------------  --------------
      Random          O(N)              O(B/N)
      Fixed           O(1)              O(B)
      Auto-sharding   O(B)              O(1)

  (B ~ total data size, N ~ number of hosts)
• 39. Evaluating
  • Automatic sharding
    • Partition size is constant (great)
    • The number of partitions scales linearly with data size (bad)
  • Fixed partition assignment
  • Random token assignment
• 40. Evaluating
  • Automatic sharding
  • Fixed partition assignment
    • The number of partitions is constant (good)
    • Partition size scales linearly with data size (bad)
    • Higher operational complexity (bad)
  • Random token assignment
• 41. Evaluating
  • Automatic sharding
  • Fixed partition assignment
  • Random token assignment
    • The number of partitions scales linearly with the number of hosts (ok)
    • Partition size increases with more data and decreases with more hosts (good)
• 42. Evaluating
  • Automatic sharding
  • Fixed partition assignment
  • Random token assignment
• 44. Configuration: conf/cassandra.yaml

      # Comma separated list of tokens,
      # (new installs only).
      initial_token: <token>,<token>,<token>

  or

      # Number of tokens to generate.
      num_tokens: 256

  Two parameters control how tokens are assigned. The initial_token parameter now optionally accepts a comma-separated list, or (preferably) you can assign a numeric value to num_tokens.
• 45. Configuration: nodetool info

      Token            : (invoke with -T/--tokens to see all 256 tokens)
      ID               : 64090651-6034-41d5-bfc6-ddd24957f164
      Gossip active    : true
      Thrift active    : true
      Load             : 92.69 KB
      Generation No    : 1351030018
      Uptime (seconds) : 45
      Heap Memory (MB) : 95.16 / 1956.00
      Data Center      : datacenter1
      Rack             : rack1
      Exceptions       : 0
      Key Cache        : size 240 (bytes), capacity 101711872 (bytes ...
      Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, ...

  To keep the output readable, nodetool info no longer displays tokens (when there is more than one) unless the -T/--tokens argument is passed.
• 46. Configuration: nodetool ring

      Datacenter: datacenter1
      ==========
      Replicas: 2

      Address    Rack   Status  State   Load      Owns    Token
                                                          9022770486425350384
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -9182469192098976078
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -9054823614314102214
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8970752544645156769
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8927190060345427739
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8880475677109843259
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8817876497520861779
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8810512134942064901
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8661764562509480261
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8641550925069186492
      127.0.0.1  rack1  Up      Normal  97.24 KB  66.03%  -8636224350654790732
      ...

  nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command.
• 47. Configuration: nodetool status

      Datacenter: datacenter1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address   Load     Tokens  Owns   Host ID                               Rack
      UN  10.0.0.1  97.2 KB  256     66.0%  64090651-6034-41d5-bfc6-ddd24957f164  rack1
      UN  10.0.0.2  92.7 KB  256     66.2%  b3c3b03c-9202-4e7b-811a-9de89656ec4c  rack1
      UN  10.0.0.3  92.6 KB  256     67.7%  e4eef159-cb77-4627-84c4-14efbc868082  rack1

  The new go-to command is nodetool status.
• 48. Configuration: nodetool status (same output as slide 47). Of note: since it is no longer practical to name a host by its token (because it can have many), each host now has a unique ID.
• 49. Configuration: nodetool status (same output as slide 47). Note the per-host token count in the Tokens column.
• 50. Migration. [ring diagram: nodes A, B, C]
• 51. Migration: edit conf/cassandra.yaml and restart

      # Number of tokens to generate.
      num_tokens: 256

  Step 1: set num_tokens in cassandra.yaml, and restart the node.
• 52. Migration: convert to T contiguous tokens in existing ranges. [ring diagram: each node's range subdivided into many contiguous tokens] This causes each existing range to be split into T contiguous tokens, which results in no change to placement.
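Step 1 in code form (token values and ring size are toy numbers): replace a node's single token with T contiguous tokens that subdivide the range it already owned, so placement is unchanged.

    def split_range(prev_token, my_token, t, ring=2**32):
        """Return t evenly spaced tokens covering (prev_token, my_token]."""
        width = (my_token - prev_token) % ring
        return [(prev_token + (width * (i + 1)) // t) % ring for i in range(t)]

    # A node with token 300 previously owned (100, 300]; after the split it
    # owns the same range via 4 contiguous tokens.
    print(split_range(100, 300, t=4))  # [150, 200, 250, 300]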
• 53. Migration: shuffle. [ring diagram: tokens randomly exchanged among the nodes] Step 2: initialize a shuffle operation; nodes randomly exchange ranges.
• 54. Shuffle
  • Range transfers are queued on each host
  • Hosts initiate transfers of ranges to themselves
  • Pay attention to the logs!
• 55. Shuffle: bin/shuffle

      Usage: shuffle [options] <sub-command>

      Sub-commands:
          create      Initialize a new shuffle operation
          ls          List pending relocations
          clear       Clear pending relocations
          en[able]    Enable shuffling
          dis[able]   Disable shuffling

      Options:
          -dc,  --only-dc        Apply only to named DC (create only)
          -tp,  --thrift-port    Thrift port number (Default: 9160)
          -p,   --port           JMX port number (Default: 7199)
          -tf,  --thrift-framed  Enable framed transport for Thrift (Default: false)
          -en,  --and-enable     Immediately enable shuffling (create only)
          -H,   --help           Print help information
          -h,   --host           JMX hostname or IP address (Default: localhost)
          -th,  --thrift-host    Thrift hostname or IP address (Default: JMX host)
• 57. removenode. [bar chart comparing Acunu Reflex / Cassandra 1.2 with Cassandra 1.1; y-axis 0 to 400] 17-node cluster of EC2 m1.large instances, 460M rows.
• 58. bootstrap. [bar chart comparing Acunu Reflex / Cassandra 1.2 with Cassandra 1.1; y-axis 0 to 500] 17-node cluster of EC2 m1.large instances, 460M rows.
• 59. The End: references
  • DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
  • Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
  • Overton, Sam. “Virtual Nodes Strategies.” Web.
  • Overton, Sam. “Virtual Nodes: Performance Results.” Web.
  • Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.