Managing a Cassandra Cluster:
Lessons Learned from 3M+ Node-Hours of Experience
19 April 2016
Agenda
• About Instaclustr & presenters
• Foundation practices for a happy cluster
• The most common Cassandra issues and how to avoid them
• Important monitoring and health-check procedures
• Q & A
About Instaclustr
• Cassandra, Spark and Zeppelin Managed Service on AWS, Azure and
SoftLayer
• 500+ nodes
• >80 clusters from 3 nodes to 50+
• >3M node-hours of C* experience
• Cassandra and Spark Consulting & Support
• Ben Slater – Chief Product Officer
• Aleks Lubiejewski – VP Consulting & Security
(prev VP Support)
Foundation Practices
What do I need to get right at the beginning to have a
happy cluster?
Pick the right Cassandra version
• Most stable -> Cassandra 2.1 (or, better yet, DSE)
• Want the latest features and can live with “cutting edge”, or have a few months until production -> Cassandra 3.x
• Odd-numbered releases (e.g. 3.5) should be more stable, as these are defect-fix releases
• Cassandra 2.2: same end of life as Cassandra 2.1, no DSE version (yet?). Only recommended if you really want the new features but don’t want to jump to 3.x.
Appropriate Hardware Configuration
• Solid State Disks
• Lots of memory
• Can use EBS on AWS (and equivalent on other platforms) but needs
very careful attention to configuration
• For the cloud, we prefer more, smaller nodes:
• In AWS, m4.xlarge instances with 800 GB or 1.6 TB of storage are our standard building blocks
• Smaller proportionate impact of failure of a node
• Reasonable time to replace a failed node or add new ones
Estimating Costs in the Cloud
Cost | Description | Driver
Instances | Cost of the base compute instances (e.g. m4.xl) | Number and size of nodes in the cluster
EBS Volume | Cost of attached EBS volumes (where applicable) | Size of the EBS volume (e.g. 400 GB)
Network – Public IP In/Out | Loading/retrieving data via public IP | Only applicable if accessing via public IP: dependent on the number of Cassandra reads/writes in a month and transaction size
Network – Interzone In/Out | Cross-availability-zone communication within the cluster | Transaction volume and size, consistency level used for reads
Network – VPC In/Out | Loading/retrieving data via a peered VPC | Only applicable if accessing via a peered VPC: dependent on the number of Cassandra reads/writes in a month and transaction size
S3 Storage | S3 space for storing backups | Volume of data, length of backup retention, deduplication of backup files/data
S3 Operations | S3 calls for storing backups | Number of sstables (volume of data + compaction strategy), backup strategy
S3 Data Transfer Out | S3 retrieval data transfer cost | Only applicable if you need to copy data from S3 to a region other than US East to restore a backup
• EBS and network costs can exceed instance cost in some circumstances
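The cost drivers in this table can be combined into a rough monthly estimate. The following is a minimal sketch, not a pricing tool: the class and function names are hypothetical, only four of the drivers are modelled, and every price in the example is a made-up placeholder rather than a real AWS rate.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumed average month length in hours


@dataclass
class UnitPrices:
    """Placeholder unit prices; replace with your provider's current rates."""
    instance_per_hour: float
    ebs_per_gb_month: float
    interzone_per_gb: float
    s3_per_gb_month: float


def monthly_cost(nodes, ebs_gb_per_node, interzone_gb, backup_gb, prices):
    """Return a per-driver breakdown of the estimated monthly cost."""
    breakdown = {
        "instances": nodes * HOURS_PER_MONTH * prices.instance_per_hour,
        "ebs": nodes * ebs_gb_per_node * prices.ebs_per_gb_month,
        "interzone": interzone_gb * prices.interzone_per_gb,
        "s3_storage": backup_gb * prices.s3_per_gb_month,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Made-up prices purely for illustration, not real AWS rates.
example_prices = UnitPrices(
    instance_per_hour=0.25,
    ebs_per_gb_month=0.10,
    interzone_per_gb=0.01,
    s3_per_gb_month=0.03,
)
estimate = monthly_cost(nodes=6, ebs_gb_per_node=400, interzone_gb=20000,
                        backup_gb=1000, prices=example_prices)
```

Keeping the breakdown per driver makes it easy to see when EBS or network charges start to rival the instance charge.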
Cassandra-Specific Load Testing
• Short load tests from the application side can be misleading.
• Need to consider:
• Is there enough data on disk to overflow the operating system file cache?
• Does your data reflect production distributions, particularly for primary key fields?
• Are you performing deletes/updates to test for the impact of tombstones (virtual deletes)?
• If you are using cassandra-stress, do you understand the available options and their impacts?
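The first question above (is there enough data to overflow the file cache?) can be checked with simple arithmetic before a test run. This is a rough sketch with a hypothetical function name; it assumes, as a simplification, that all RAM not taken by the JVM heap is available to the operating system file cache.

```python
def overflows_file_cache(data_on_disk_gb: float, ram_gb: float,
                         jvm_heap_gb: float) -> bool:
    """Return True if the on-disk data per node exceeds the memory
    left over for the OS file cache.

    Simplifying assumption: everything outside the JVM heap is file cache.
    """
    cache_gb = max(ram_gb - jvm_heap_gb, 0.0)
    return data_on_disk_gb > cache_gb
```

If this returns False for the test data set, reads are probably being served from memory and the measured latencies are unlikely to reflect production behaviour.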
NetworkTopologyStrategy, RF=3, CL=QUORUM
• NetworkTopologyStrategy
• Most “future proof” strategy
• Allows use of multi-DC
• Can be very useful for cluster operations such as upgrades and splitting out tables
• Replication Factor = 3, Consistency Level = Quorum
• Provides expected behaviour for most people:
• Strong consistency
• Availability through failure of a replica
• Other settings are valid for many use cases but need careful consideration of the impacts
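The reasoning behind RF=3 with quorum reads and writes can be worked through numerically. The sketch below uses hypothetical helper names; the arithmetic is the standard one: a quorum is a majority of replicas, and reads see the latest write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int, read_replicas: int,
                           write_replicas: int) -> bool:
    """Reads overlap the latest write when R + W > RF."""
    return read_replicas + write_replicas > replication_factor


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


rf = 3
q = quorum(rf)                                   # 2
consistent = is_strongly_consistent(rf, q, q)    # 2 + 2 > 3
failures_ok = tolerated_replica_failures(rf)     # 1
```

With RF=3, quorum is 2, so one replica can fail without losing either consistency or availability, which matches the two properties listed on the slide.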
Most Common Issues
What are the most common causes when things go
wrong in production?
Data Modelling Issues
• Partition Keys
• Cassandra primary keys consist of a partition key and a clustering key.
• Eg PRIMARY KEY ((c1, c2), c3, c4)
• Partition keys determine how data is distributed around the nodes
• Each partition lives on one node (plus its replicas); each node holds many partitions
• Need to ensure there are a large number of partitions with a reasonable number of rows per partition
• A small number of partitions, or very uneven partitions, defeats the basic concepts of Cassandra scaling
• Very large partitions can cause issues with garbage collection and excess disk usage
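One way to sanity-check a data model against these points is to count rows per partition key in a sample of representative data. The sketch below is illustrative only: the function name, the sample keys and the idea of using max/mean as a skew indicator are assumptions, not something prescribed by Cassandra.

```python
from collections import Counter


def partition_skew(partition_keys):
    """Summarise how rows are spread across partitions.

    partition_keys: one hashable partition key value per row,
    e.g. a tuple (c1, c2) for PRIMARY KEY ((c1, c2), c3, c4).
    """
    counts = Counter(partition_keys)
    if not counts:
        raise ValueError("no rows supplied")
    largest = max(counts.values())
    mean = sum(counts.values()) / len(counts)
    return {
        "partitions": len(counts),
        "largest": largest,
        "mean": mean,
        "skew": largest / mean,  # 1.0 means perfectly even
    }


# Hypothetical sample: one hot partition and two small ones.
rows = ([("sensor-1", "2016-04-18")] * 8
        + [("sensor-2", "2016-04-18")]
        + [("sensor-3", "2016-04-18")])
summary = partition_skew(rows)
```

A small partition count or a skew far above 1.0 in production-like data suggests the partition key needs another component.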
Data Modelling Issues (2)
• Tombstones
• Tombstones are entries created to mark the fact that a record has been deleted
(updates to primary key will also cause tombstones)
• By default, tombstones are retained for 10 days (the gc_grace_seconds table setting) before being removed by compactions
• High ratios of tombstones to live data can cause significant performance issues
• Secondary Indexes
• Secondary Indexes are useful for a limited set of cases
• only index low (but not too low) cardinality columns;
• don’t index columns that are frequently updated or deleted
• Poor use of secondary indexes can result in poor read performance and defeat
scalability
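The tombstone retention period and the tombstone-to-live ratio are both simple to compute. This is a small sketch with hypothetical helper names; the 864000-second value is the default gc_grace_seconds (10 days), and a tombstone only actually disappears when a later compaction processes it.

```python
from datetime import datetime, timedelta

DEFAULT_GC_GRACE_SECONDS = 864000  # 10 days


def purge_eligible_at(deleted_at: datetime,
                      gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Earliest time a compaction may drop a tombstone written at deleted_at."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def tombstone_ratio(live_cells: int, tombstone_cells: int) -> float:
    """Fraction of cells read that were tombstones."""
    total = live_cells + tombstone_cells
    return tombstone_cells / total if total else 0.0


eligible = purge_eligible_at(datetime(2016, 4, 1))
ratio = tombstone_ratio(live_cells=3563, tombstone_cells=3520)
```

A read where roughly half of the scanned cells are tombstones, as in the second call, is doing about twice the work needed to return the live data.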
Other Issues
• Write Overload
• Cluster may be able to handle a write workload initially but then fail to keep up as compactions kick in
• The effects can go beyond high latency and cause crashes
• Garbage Collection (GC)
• Long garbage collection pauses can often cause issues
• Typically a symptom of partitions being too big or general overload of cluster
• Tuning GC settings and heap allocations can help but need to address root
cause also
Running out of capacity
• Cassandra scales out readily, but adding new nodes to a cluster itself uses processing capacity.
• Therefore, you need to add capacity well before existing capacity is
exhausted.
• This applies to both disk and processor/memory.
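The advice to add capacity early can be turned into a simple forecast. The sketch below is an assumption-laden illustration: the function names are hypothetical, growth is assumed to be linear, and the 70% default threshold is borrowed from the disk usage guideline later in this deck.

```python
def days_until_threshold(used_gb: float, capacity_gb: float,
                         growth_gb_per_day: float,
                         threshold: float = 0.70) -> float:
    """Days until disk usage reaches threshold * capacity, assuming linear growth."""
    limit_gb = capacity_gb * threshold
    if used_gb >= limit_gb:
        return 0.0
    if growth_gb_per_day <= 0:
        return float("inf")
    return (limit_gb - used_gb) / growth_gb_per_day


def should_add_nodes_now(days_left: float, lead_time_days: float) -> bool:
    """True if the time left is no longer than the time needed to add nodes."""
    return days_left <= lead_time_days


days_left = days_until_threshold(used_gb=400, capacity_gb=800, growth_gb_per_day=4)
act_now = should_add_nodes_now(days_left, lead_time_days=14)
```

The lead time should include provisioning plus the time the cluster needs to stream data to the new nodes while still serving normal traffic.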
Fundamental Monitoring & Health Check Procedures
How do I get advance warning if my cluster is going to hit issues?
Monitoring – Basic Metrics (OS)
• Disk usage
• less than 70% under normal running is a good starting guide
• this allows for bursts of usage by compactions and repairs
• extreme cases may require 50% free space
• Leveled compaction strategy and data spread across multiple column families can allow higher disk usage
• CPU Usage
• again, 70% utilization is a reasonable target
• Keep a look out for high iowait – indicates storage bottleneck
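The 70% disk guideline is easy to check from a script. This minimal sketch uses the standard library's shutil.disk_usage; the function names are hypothetical, and the path "." is only a stand-in for wherever your Cassandra data directory is mounted.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Fraction of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(fraction_used: float, threshold: float = 0.70) -> bool:
    """True if usage is above the guideline threshold."""
    return fraction_used > threshold


# "." is a placeholder; point this at the data directory's mount.
current = disk_usage_fraction(".")
needs_attention = disk_alert(current)
```

For clusters that need more headroom, passing threshold=0.50 applies the stricter 50% rule mentioned above.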
Monitoring – Basic Metrics (C*)
• Read/Write Latency
• closely tied to user experience for most use cases
• monitor for significant changes
• be aware that read latency can vary greatly depending on the number of rows returned
• distinguish between changes impacting a specific column family (likely data
modelling issues) and changes impacting a specific node (hardware or
capacity issues)
• Pending Compactions
• increasing numbers of pending compactions indicate that a node is not keeping up with workload
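A steadily growing pending-compactions count is the signal to watch for, as opposed to a short spike. The sketch below shows one simple way to tell the two apart; the function name, the window size and the "never decreases" rule are illustrative assumptions, and the samples would come from whatever monitoring system collects the metric.

```python
def is_falling_behind(samples, min_samples: int = 5) -> bool:
    """Return True if pending compactions never decreased across the
    most recent min_samples readings and ended higher than they started."""
    if len(samples) < min_samples:
        return False  # not enough history to judge a trend
    window = samples[-min_samples:]
    non_decreasing = all(later >= earlier
                         for earlier, later in zip(window, window[1:]))
    return non_decreasing and window[-1] > window[0]


growing = is_falling_behind([2, 5, 9, 14, 20])
spiky = is_falling_behind([12, 8, 15, 3, 0])
```

A production check would likely tolerate small dips, but the idea is the same: alert on the trend, not on a single high reading.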
Cassandra Logs
• Regularly inspecting Cassandra logs for warnings and errors is
important for picking up issues. The issues you will find include:
• large batch warnings
• compacting large partitions warnings
• reading excess tombstones warnings
• plenty more!
Apr 18 08:00:27 ip-10-224-111-138.ec2.internal docker[15521]: [Native-Transport-Requests:22756] WARN org.apache.cassandra.cql3.statements.BatchStatement Batch of prepared statements for [ks.col_family] is of size 59000, exceeding specified threshold of 5120 by 53880.
ip-10-224-169-153.eu-west-1.compute.internal docker[25837]: WARN o.a.c.io.sstable.SSTableWriter Compacting large partition ks/col_family:c7d65814-1a58-4675-ad54-d6c92e10d1d7 (405357404 bytes)
Mar 29 11:55:26 ip-172-16-151-148.ec2.internal docker[30099]: WARN o.a.c.db.filter.SliceQueryFilter Read 3563 live and 3520 tombstone cells in ks.col_family for key: 4364012 (see tombstone_warn_threshold). 5000 columns were requested, slices=[2016-03-28 11\:55Z:!-]
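Warnings like the three samples above can be picked out of a log stream with a few regular expressions. This is a sketch only: the patterns are derived from these sample lines, the category names are hypothetical, each log entry is assumed to sit on a single line, and message wording may differ between Cassandra versions.

```python
import re

# Patterns derived from the sample warnings on this slide (assumption:
# other Cassandra versions may word these messages differently).
PATTERNS = {
    "large_batch": re.compile(
        r"Batch of prepared statements for \[(?P<table>[^\]]+)\] "
        r"is of size (?P<size>\d+)"),
    "large_partition": re.compile(
        r"Compacting large partition (?P<table>[^:\s]+):\S+ "
        r"\((?P<size>\d+) bytes\)"),
    "tombstones": re.compile(
        r"Read (?P<live>\d+) live and (?P<tombstones>\d+) "
        r"tombstone cells in (?P<table>\S+)"),
}


def classify(line: str):
    """Return (category, captured fields) for a known warning, else None."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return name, match.groupdict()
    return None
```

Counting the categories per table over a day gives a quick picture of which column families have batch, partition size or tombstone problems.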
cfstats and cfhistograms
• The Cassandra nodetool utility has many commands that help diagnose issues.
• nodetool cfstats and nodetool cfhistograms are two of the most important
• These tools can help you see:
• Large and uneven partitions
• Excess tombstones
• Too many sstables per read
• read and write latency by keyspace and column family
• many more
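Output captured from nodetool cfstats can be scanned for the problems listed above. The sketch below is heavily hedged: it is a generic "label: value" parser for a single table's section, the helper names are hypothetical, and the sample text is hand-written for illustration, so the field labels should be checked against the output of your own Cassandra version.

```python
def parse_stats(text: str) -> dict:
    """Parse 'label: value' lines from one table's section into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            stats[key.strip()] = value.strip()
    return stats


def partition_size_ratio(stats: dict) -> float:
    """Largest partition size relative to the mean; large values suggest
    uneven partitions. Field labels are assumed, see the note above."""
    maximum = int(stats["Compacted partition maximum bytes"])
    mean = int(stats["Compacted partition mean bytes"])
    return maximum / mean


# Hand-written illustrative text, not real nodetool output.
SAMPLE = """\
Table: col_family
SSTable count: 12
Compacted partition maximum bytes: 405357404
Compacted partition mean bytes: 52066
"""
table_stats = parse_stats(SAMPLE)
ratio_max_to_mean = partition_size_ratio(table_stats)
```

A maximum that is thousands of times the mean, as in this made-up sample, is the kind of uneven partitioning the slide warns about.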
Summary
Cassandra is an incredibly reliable and scalable technology if you design and build correctly from the start and follow basic management procedures.
Thank you for listening.
QUESTIONS?
Contact:
• www.instaclustr.com
• Hiro Komatsu – hiro.komatsu@instaclustr.com
• Ben Slater – ben.slater@instaclustr.com
• Aleks Lubiejewski – aleks@instaclustr.com

Cassandra CLuster Management by Japan Cassandra Community
