Cassandra
Structured Storage System over a P2P Network




          Avinash Lakshman, Prashant Malik
Why Cassandra?
• Lots of data
  – Copies of messages, reverse indices of
    messages, per user data.
• Many incoming requests resulting in a lot
  of random reads and random writes.
• No existing production ready solutions in
  the market meet these requirements.
Design Goals
• High availability
• Eventual consistency
  – trade off strong consistency in favor of high
    availability
• Incremental scalability
• Optimistic Replication
• “Knobs” to tune tradeoffs between consistency,
  durability and latency
• Low total cost of ownership
• Minimal administration
Data Model
• Each KEY maps to columns grouped into column families.
• Column Families are declared upfront.
• SuperColumns are added and modified dynamically.
• Columns are added and modified dynamically.
• ColumnFamily1   Name : MailList   Type : Simple   Sort : Name
  – Columns tid1, tid2, tid3, tid4, each with Name, Value : <Binary>
    and TimeStamp (t1 to t4).
• ColumnFamily2   Name : WordList   Type : Super   Sort : Time
  – SuperColumn aloha: columns C1 to C4, values V1 to V4, timestamps T1 to T4.
  – SuperColumn dude: columns C2 and C6, values V2 and V6, timestamps T2 and T6.
• ColumnFamily3   Name : System   Type : Super   Sort : Name
  – SuperColumns hint1, hint2, hint3, hint4, each holding a <Column List>.
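The data model above can be pictured as nested maps. The following is only an illustrative sketch: the `Column` class, the `sort_columns` helper and the sample values are hypothetical names chosen for this example, not part of Cassandra's API.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: bytes
    timestamp: int


# One row (one KEY). Column family names are fixed upfront;
# columns and super columns can be added at any time.
row = {
    # Simple column family: column name -> Column
    "MailList": {
        "tid1": Column("tid1", b"<binary>", 1),
        "tid2": Column("tid2", b"<binary>", 2),
    },
    # Super column family: super column name -> {column name -> Column}
    "WordList": {
        "aloha": {"C1": Column("C1", b"V1", 1), "C2": Column("C2", b"V2", 2)},
        "dude": {"C2": Column("C2", b"V2", 2), "C6": Column("C6", b"V6", 6)},
    },
}


def sort_columns(columns, sort="Name"):
    """Order columns by name or by timestamp, mirroring the Sort setting."""
    key = (lambda c: c.name) if sort == "Name" else (lambda c: c.timestamp)
    return sorted(columns, key=key)
```

A column family declared with Sort : Name would return its columns via `sort_columns(cols, "Name")`, one declared with Sort : Time via `sort_columns(cols, "Time")`.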
Write Operations
• A client issues a write request to a random
  node in the Cassandra cluster.
• The “Partitioner” determines the nodes
  responsible for the data.
• Locally, write operations are logged and
  then applied to an in-memory version.
• Commit log is stored on a dedicated disk
  local to the machine.
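The local part of the write path (log first, then update the in-memory version) can be sketched as below. This is a toy model under stated assumptions: the `Node` class is hypothetical, a Python list stands in for the commit log on its dedicated disk, and resolving conflicting writes by keeping the higher timestamp is an assumption of this sketch rather than something the slide states.

```python
import json
from collections import defaultdict


class Node:
    """Toy single-node write path: append to the log, then update the memtable."""

    def __init__(self):
        self.commit_log = []                 # stand-in for the append-only commit log
        self.memtables = defaultdict(dict)   # column family -> {key -> {column -> (value, ts)}}

    def write(self, key, column_family, column, value, timestamp):
        # 1. Log the operation before touching in-memory state.
        record = {"key": key, "cf": column_family, "col": column,
                  "value": value, "ts": timestamp}
        self.commit_log.append(json.dumps(record))

        # 2. Apply it to the in-memory version for that column family.
        row = self.memtables[column_family].setdefault(key, {})
        current = row.get(column)
        # Assumption: the write with the higher timestamp wins.
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
```

Because the log is only ever appended to, the disk sees sequential writes, which is the reason the slide gives it a dedicated disk.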
Write cont’d
• A write for Key (CF1, CF2, CF3) is binary serialized and appended to the
  Commit Log, which lives on a Dedicated Disk.
• The write is then applied to the Memtable of each column family
  (Memtable (CF1), Memtable (CF2), and so on).
• Memtables are written to a data file on disk based on:
  – Data size
  – Number of Objects
  – Lifetime
• Data file on disk: a sequence of records of the form
  <Key name><Size of key Data><Index of columns/supercolumns><Serialized column family>
  interleaved with BLOCK Index entries: <Key Name> Offset, <Key Name> Offset
• Index in memory: K128 Offset, K256 Offset, K384 Offset, plus a Bloom Filter.
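The in-memory index in the figure lists only some keys (K128, K256, K384), which suggests a sparse index over the sorted data file. A minimal sketch of that idea follows; the spacing of 128 keys, the function names, and the use of list positions in place of byte offsets are all assumptions made for illustration.

```python
import bisect


def build_sparse_index(sorted_keys, every=128):
    """Keep one (key, position) entry for every `every`-th key of the data file."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), every)]


def lookup(sorted_keys, index, key, every=128):
    """Find the position of `key`, scanning at most one block of the data file."""
    index_keys = [k for k, _ in index]
    slot = bisect.bisect_right(index_keys, key) - 1   # last indexed key <= key
    if slot < 0:
        return None                                   # key sorts before the first entry
    start = index[slot][1]
    for pos in range(start, min(start + every, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return None
```

The Bloom Filter shown next to the index would let a reader skip this lookup entirely for files that cannot contain the key; it is left out of the sketch.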
Compactions
• Input data files, each sorted by key and holding < Serialized data > per key
  (one entry is marked DELETED in the figure):
  – K1, K2, K3, ...
  – K2, K10, K30, ...
  – K4, K5, K10, ...
• MERGE SORT combines them into a single sorted Data File:
  K1, K2, K3, K4, K5, K10, K30
• Index File, loaded in memory: K1 Offset, K5 Offset, K30 Offset, Bloom Filter
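The merge step of a compaction can be sketched with a k-way merge over the sorted input files. This is an illustration only: records are modelled as `(key, timestamp, value)` tuples, a value of `None` stands for a deleted entry, and keeping only the newest version of each key is an assumption of the sketch.

```python
import heapq
import itertools

TOMBSTONE = None  # hypothetical marker for a deleted entry


def compact(sorted_files):
    """Merge files of (key, timestamp, value) records, each sorted by key."""
    merged = heapq.merge(*sorted_files, key=lambda rec: (rec[0], rec[1]))
    output = []
    for _, versions in itertools.groupby(merged, key=lambda rec: rec[0]):
        newest = max(versions, key=lambda rec: rec[1])
        if newest[2] is not TOMBSTONE:      # drop keys whose newest version is a delete
            output.append(newest)
    return output


files = [
    [(1, 1, "a"), (2, 1, "b"), (3, 1, "c")],
    [(2, 2, "b2"), (10, 1, "j"), (30, 1, "z")],
    [(3, 2, TOMBSTONE), (4, 1, "d"), (5, 1, "e"), (10, 2, "j2")],
]
print(compact(files))
```

Since every input is already sorted, the merge reads each file sequentially and writes the output sequentially, so no random disk access is needed.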
Write Properties
•   No locks in the critical path
•   Sequential disk access
•   Behaves like a write back Cache
•   Append support without read ahead
•   Atomicity guarantee for a key
• “Always Writable”
    – accept writes during failure scenarios
Read
• The Client sends a Query to the Cassandra Cluster and receives the Result.
• The cluster asks the closest replica (Replica A) for the Result.
• A Digest Query is sent to Replica B and Replica C; each returns a
  Digest Response.
• Read repair if digests differ.
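A rough sketch of the read flow with digests and read repair. Everything here is a simplification made for illustration: replicas are plain dicts mapping a key to `(value, timestamp)`, MD5 is an arbitrary choice of digest, repair is done inline, and returning the repaired value to the client is an assumption.

```python
import hashlib


def digest(entry):
    """Short fingerprint of a (value, timestamp) entry, or of a missing entry."""
    return hashlib.md5(repr(entry).encode()).hexdigest()


def read(key, replicas):
    """replicas[0] is treated as the closest replica; the rest answer with digests."""
    result = replicas[0].get(key)
    digests = {digest(replica.get(key)) for replica in replicas}
    if len(digests) > 1:
        # Digests differ: pick the newest version and push it to every replica.
        present = [replica[key] for replica in replicas if key in replica]
        newest = max(present, key=lambda entry: entry[1])
        for replica in replicas:
            replica[key] = newest
        result = newest
    return result
```

Sending digests instead of full values keeps the extra replicas' responses small in the common case where all copies agree.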
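Partitioning And Replication
• Nodes A, B, C, D, E and F are placed on a ring whose positions run from
  0 to 1 (1/2 lies opposite the 0/1 point).
• Keys are hashed onto the same ring: h(key1), h(key2).
• N=3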
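The ring in the figure can be sketched as consistent hashing. The code below assumes that N is the number of replicas per key, that a key is stored on the first N nodes found walking the ring from h(key), and that MD5 maps names onto the unit interval; the `Ring` class and `ring_position` are hypothetical names for this example.

```python
import bisect
import hashlib


def ring_position(name):
    """Hash a string to a position on the unit ring."""
    h = int.from_bytes(hashlib.md5(name.encode()).digest(), "big")
    return h / 2**128


class Ring:
    def __init__(self, nodes, n=3):
        self.n = n
        # (position, node) pairs sorted by position around the ring
        self.tokens = sorted((ring_position(node), node) for node in nodes)

    def replicas(self, key):
        """Return the N nodes responsible for `key`, walking the ring from h(key)."""
        positions = [pos for pos, _ in self.tokens]
        start = bisect.bisect_right(positions, ring_position(key)) % len(self.tokens)
        count = min(self.n, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring("ABCDEF", n=3)
print(ring.replicas("key1"), ring.replicas("key2"))
```

Adding or removing a node only changes ownership for keys adjacent to it on the ring, which is what makes this layout suit the incremental scalability goal.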
Cluster Membership and Failure Detection
•   Gossip protocol is used for cluster membership.
•   Super lightweight with mathematically provable properties.
•   State disseminated in O(logN) rounds where N is the number of
    nodes in the cluster.
•   Every T seconds each member increments its heartbeat counter
    and selects one other member to send its list to.
•   A member that receives a list merges it with its own list.
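One gossip round as described above might look like the sketch below. The `Member` class and its method names are hypothetical; the merge rule (keep the higher heartbeat counter per member) is the usual one for heartbeat gossip and is assumed here.

```python
import random


class Member:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}     # member name -> highest heartbeat seen

    def tick(self, peers, rng=random):
        """Run every T seconds: bump own heartbeat, send the list to one peer."""
        self.heartbeats[self.name] += 1
        if peers:
            rng.choice(peers).merge(self.heartbeats)

    def merge(self, remote):
        """Merge a received list, keeping the freshest counter for each member."""
        for name, beat in remote.items():
            if beat > self.heartbeats.get(name, -1):
                self.heartbeats[name] = beat
```

Because each informed member forwards what it knows, the number of members holding a given update roughly doubles per round, which is where the O(logN) dissemination bound comes from.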
Accrual Failure Detector
•   Valuable for system management, replication, load balancing etc.
•   Defined as a failure detector that outputs a value, PHI, associated
    with each process.
•   Also known as Adaptive Failure detectors - designed to adapt to
    changing network conditions.
•   The value output, PHI, represents a suspicion level.
•   Applications set an appropriate threshold, trigger suspicions and
    perform appropriate actions.
•   In Cassandra the average time taken to detect a failure is 10-15
    seconds with the PHI threshold set at 5.
Properties of the Failure Detector
•   If a process p is faulty, the suspicion level
                  Φ(t) → ∞ as t → ∞.
•   If a process p is faulty, there is a time after which Φ(t) is monotonically
    increasing.
•   A process p is correct ⇔ Φ(t) has an upper bound over an infinite execution.
•   If process p is correct, then for any time T,
                  Φ(t) = 0 for some t >= T.
Implementation
•   PHI estimation is done in three phases
     – Inter arrival times for each member are stored in a sampling
       window.
     – Estimate the distribution of the above inter arrival times
       (gossip inter arrival times follow an exponential distribution).
     – The value of PHI is then computed as follows:
         • Φ(t) = −log10( P(t_now − t_last) )
                   where P(t) denotes the probability that a heartbeat will arrive
                   more than t units after the previous one. For an exponential
                   distribution with rate λ the CDF is F(t) = 1 − e^(−λt), so
                   P(t) = 1 − F(t) = e^(−λt).
The overall mechanism is described in the figure below.
Information Flow in the Implementation
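The three phases of PHI estimation can be sketched as follows. The sketch takes P(t) to be the tail probability e^(−λt) of an exponential distribution, so that Φ grows the longer a member stays silent, and estimates λ as the reciprocal of the mean inter arrival time in the window. The class name, method names and window size are assumptions for illustration.

```python
import math
from collections import deque


class AccrualFailureDetector:
    def __init__(self, window_size=1000):
        self.intervals = deque(maxlen=window_size)   # sampling window
        self.last_arrival = None

    def heartbeat(self, now):
        """Phase 1: record the inter arrival time of a heartbeat."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Phases 2 and 3: fit an exponential distribution and compute PHI."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)   # 1 / lambda
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

With heartbeats arriving once per second on average, Φ reaches 5 after about 5 · ln 10 ≈ 11.5 seconds of silence, which is in line with the 10-15 second detection time quoted for a threshold of 5.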
Performance Benchmark
• Loading of data - limited by network
  bandwidth.
• Read performance for Inbox Search in
  production:

             Search Interactions    Term Search
   Min       7.69 ms                7.78 ms
   Median    15.69 ms               18.27 ms
   Average   26.13 ms               44.41 ms
MySQL Comparison
• MySQL > 50 GB Data
  Writes Average : ~300 ms
  Reads Average : ~350 ms
• Cassandra > 50 GB Data
  Writes Average : 0.12 ms
  Reads Average : 15 ms
Lessons Learnt
• Add fancy features only when absolutely
  required.
• Many types of failures are possible.
• Big systems need proper systems-level
  monitoring.
• Value simple designs
Future work
•   Atomicity guarantees across multiple keys
•   Analysis support via Map/Reduce
•   Distributed transactions
•   Compression support
•   Granular security via ACLs
Questions?
