Fault tolerance in Cassandra
Richard Low

richard@acunu.com
@acunu
@richardalow


Cassandra London Meetup, 5 Sept 2011

Tuesday, 6 September 2011
Menu

                   • Failure modes
                   • Maintaining availability
                   • Recovery


Failure modes




Failures are the norm


                   • With more than a few nodes, something goes wrong all the time
                   • Don’t want to be down all the time


Failure causes

                   • Hardware failure
                   • Bug
                   • Power
                   • Natural disaster

Failure modes
                   • Data centre failure
                   • Node failure
                   • Disk failure
                   • Each of these can be temporary or permanent

Failure modes

                   • Network failure
                    • One node
                    • Network partition
                    • Whole data centre

Failure modes

       • Operator failure
        • Delete files
        • Delete entire database
        • Incorrect configuration

Failure modes

                   • Want a system that can tolerate all the above failures
                   • Make assumptions about probabilities of multiple events
                    • Be careful when assuming independence
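
To illustrate why the independence assumption deserves care, here is a minimal sketch. The failure probabilities and the shared-rack scenario are made-up illustrative assumptions, not measurements from any real cluster.

```python
from math import comb


def prob_at_least_k_fail(n: int, k: int, p: float) -> float:
    """Probability that at least k of n nodes are down at once, assuming
    each node fails independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers: 3 replicas, each down 1% of the time.
p_node = 0.01
independent = prob_at_least_k_fail(3, 2, p_node)

# Now assume all 3 replicas share a rack whose power fails 0.5% of the time.
# A rack failure takes every replica down together, so the failures are
# no longer independent.
p_rack = 0.005
correlated = p_rack + (1 - p_rack) * independent

print(f"two or more replicas down, independent failures: {independent:.6f}")
print(f"two or more replicas down, shared rack:          {correlated:.6f}")
```

Under these assumed numbers the shared rack dominates the result, which is the kind of correlated failure the independence assumption hides.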

Solutions

                   • Do nothing
                   • Make boxes bulletproof
                   • Replication


Availability

How do we maintain availability in the presence of failure?


Replication

                   • Buy cheap nodes and cheap disks
                   • Store multiple copies of the data
                   • Don’t care if some disappear


Replication

                   • What about consistency?
                   • What if I can’t tolerate out-of-date reads?
                   • How do we restore a replica?


RF and CL
                   • Replication factor
                    • How many copies
                    • How much failure we can tolerate
                   • Consistency Level
                    • How many nodes must be contactable for an operation to succeed


Simple example
                   • Replication factor 3
                   • Uniform network topology
                   • Read and write at CL.QUORUM
                    • Strong consistency
                    • Available if any one node is down
                    • Can recover if any two nodes fail

In general

                   • Replication factor N, reads and writes at CL.QUORUM
                   • Available if up to ceil(N/2)-1 nodes fail
                   • Can recover if up to N-1 nodes fail
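
The arithmetic behind these bullets can be checked with a short sketch. It assumes a quorum is floor(RF/2) + 1 replicas; the function names are illustrative and not part of any Cassandra API.

```python
from math import ceil


def quorum(rf: int) -> int:
    """Replicas that must respond for a CL.QUORUM operation to succeed."""
    return rf // 2 + 1


def tolerated_for_availability(rf: int) -> int:
    """Replicas that can be down while QUORUM reads and writes still succeed."""
    return rf - quorum(rf)


def tolerated_for_recovery(rf: int) -> int:
    """Replicas that can be lost while at least one copy of the data remains."""
    return rf - 1


for rf in range(1, 8):
    q = quorum(rf)
    # Same value as the ceil(N/2)-1 formula on the slide.
    assert tolerated_for_availability(rf) == ceil(rf / 2) - 1
    # A read quorum and a write quorum always overlap in at least one
    # replica, which is what gives strong consistency.
    assert q + q > rf
    print(rf, q, tolerated_for_availability(rf), tolerated_for_recovery(rf))
```

For replication factor 3 this gives a quorum of 2, availability with one node down, and recovery with two nodes lost, matching the simple example.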


Multi data centre

                   • Cassandra knows location of hosts
                    • Through the snitch
                   • Can ensure replicas in each DC
                    • NetworkTopologyStrategy
                   • => can cope with whole DC failure
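
As a rough sketch of what losing a whole data centre means, the toy model below counts surviving replicas. The data centre names and replica counts are hypothetical, and the model assumes a cluster-wide QUORUM computed over the total number of replicas; real placement is done by NetworkTopologyStrategy using the snitch.

```python
def quorum(total_replicas: int) -> int:
    """Replicas needed for a cluster-wide QUORUM."""
    return total_replicas // 2 + 1


def surviving_replicas(replicas_per_dc: dict, failed_dc: str) -> int:
    """Replicas still reachable after every node in failed_dc is lost."""
    return sum(n for dc, n in replicas_per_dc.items() if dc != failed_dc)


def report(replicas_per_dc: dict, failed_dc: str) -> tuple:
    """Return (data_survives, quorum_still_reachable)."""
    total = sum(replicas_per_dc.values())
    alive = surviving_replicas(replicas_per_dc, failed_dc)
    return alive >= 1, alive >= quorum(total)


# Hypothetical layouts: replicas per data centre.
two_dcs = {"DC1": 3, "DC2": 3}
three_dcs = {"DC1": 2, "DC2": 2, "DC3": 2}

print(report(two_dcs, "DC1"))
print(report(three_dcs, "DC1"))
```

In both layouts the data survives the loss of one data centre. Whether QUORUM operations stay available depends on how the replicas are split, so the consistency level has to be chosen with the topology in mind.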

Recovery

Recovery
                   • Want to maintain replication factor
                   • Ensures recovery guarantees

                   • Methods:
                    • Automatic
                    • Manual

Automatic

Automatic processes


                   • Eventually move replicas towards consistency
                   • The ‘eventual’ in ‘eventual consistency’


Hinted Handoff
                   • Hints
                    • Stored on any node
                    • When a node is temporarily unavailable
                    • Delivered when the node comes back
                   • Can use CL.ANY
                    • Writes not immediately readable
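
The hint mechanism can be sketched with a toy in-memory model. The `Node` class and both functions are illustrative assumptions, not Cassandra's implementation.

```python
class Node:
    """Toy node: holds its own data plus hints kept on behalf of others."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value
        self.hints = []  # (target_node, key, value) for replicas that were down


def write(coordinator, replicas, key, value):
    """Write to every live replica and store a hint for each one that is down.
    Returns the number of replicas that acknowledged the write."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica, key, value))
    return acks


def deliver_hints(holder):
    """Replay hints whose target is back up; keep the rest for later."""
    remaining = []
    for target, key, value in holder.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    holder.hints = remaining


a, b, c, d = (Node(n) for n in "abcd")
c.up = False
acks = write(d, [a, b, c], "k", "v")  # d is not a replica but holds the hint
c.up = True
deliver_hints(d)                      # c now receives the write it missed
```

If every replica were down, `write` would return zero acknowledgements but the hints would still be stored. That is the CL.ANY case: the write is accepted, yet no replica can serve it until the hints are delivered.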
Read Repair


                   • Since we have done a read, we might as well repair any old copies
                   • Compare values, update any out of sync
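
A minimal sketch of the idea, assuming each replica is modelled as a dict of key to (timestamp, value) and that the highest timestamp wins. The function name and data layout are illustrative only.

```python
def read_with_repair(replicas, key):
    """Read key from every replica, return the newest value, and write that
    value back to any replica holding an older copy or no copy at all."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    present = [cell for _, cell in responses if cell is not None]
    if not present:
        return None
    newest = max(present, key=lambda cell: cell[0])  # highest timestamp wins
    for replica, seen in responses:
        if seen != newest:
            replica[key] = newest                    # repair the stale copy
    return newest[1]


r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}
r3 = {}                      # this replica missed the write entirely
value = read_with_repair([r1, r2, r3], "k")
```

After the read, all three replicas hold the same cell for "k", so later reads see consistent data regardless of which replica answers.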


Manual


Repair: method
                   • Ensures a node is up to date
                   • Run ‘nodetool -h <node> repair’
                   • Reads through entire data on the node
                   • Builds a Merkle tree
                   • Compares with replicas
                   • Streams differences
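
The tree comparison step can be sketched as follows. This is a toy model: it hashes a small list of byte strings with SHA-256 and assumes a power-of-two number of leaves, whereas a real repair hashes ranges of data on the node. All names are illustrative.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels, from leaf hashes up to the root.
    Assumes len(leaves) is a power of two, to keep the sketch short."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Walk down from the root, descending only into subtrees whose hashes
    differ, and return the indices of the leaves that are out of sync."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []                      # roots match: nothing to stream
    suspects = [0]
    for depth in range(top - 1, -1, -1):
        next_suspects = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if a[depth][child] != b[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


node_a = build_tree([b"r0", b"r1", b"r2", b"r3"])
node_b = build_tree([b"r0", b"r1", b"r2-stale", b"r3"])
to_stream = differing_leaves(node_a, node_b)
```

Only the leaves listed in `to_stream` need to be exchanged, which is why comparing trees is cheaper than comparing all the data directly.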
Repair: when

                   • After node has been down a long time
                   • After increasing replication factor
                   • At least every 10 days (the default gc_grace_seconds) to ensure tombstones are propagated
                   • Can be used to restore a failed node

Replace a node: method

                   • Bootstrap new node with <old_token>-1
                   • Tell existing nodes old node is dead
                    • nodetool removetoken


Replace a node: when

                   • Complete node failure
                   • Cannot replace failed disk
                   • Corruption


Restore from backup: method
                   • Stop Cassandra on the node
                   • Copy SSTables from backup
                   • Restart Cassandra
                    • May take a while reading indexes

Restore from backup: when
                   • Disk failure
                    • with no RAID rebuild available
                   • Operator error
                   • Corruption
                   • Hacker

Thanks :)
                              www.acunu.com
                            richard@acunu.com
                                  @acunu
                               @richardalow



