I’ll give a general lay of the land for troubleshooting Cassandra. Then I’ll take you on a deep dive through nodetool and system.log and give you a guided tour of the useful information they provide for troubleshooting. I’ll devote special attention to monitoring the various processes that Cassandra uses to do its work and how to effectively search for information about specific error messages online.
This is the old version of this presentation for Cassandra 2.0 and earlier. Check out the updated slide deck for Cassandra 2.1.
Role at DataStax
Previous Experience
Something about Atlanta
Overall troubleshooting process from a support engineer’s perspective
I’ll focus on the tools Cassandra gives you to do steps 1 to 4
Steps 5 and 6 are the hard parts; but luckily that’s what DataStax support, or the mailing list, or StackOverflow can help you with
Doing the legwork on the first part will make the second part happen much faster
From a user’s perspective, the troubleshooting process could be revised like this
Definitely better to over-share than under-share information
Make sure you share all the facts, not just those that support your current theory
When troubleshooting, it’s helpful to keep in mind the various processes Cassandra runs to do its work
These can be roughly divided into:
Startup processes
Foreground processes (serving reads and writes); your application has control over these!
Background processes (happen periodically); Cassandra decides when to do them, but they can be tuned in many cases
It’s also helpful to keep in mind the various system resources that Cassandra consumes
CPU
keep in mind both the speed of a single core and of multiple cores
necessary because some processes are single-threaded and bottlenecked by a single core
Memory
Heap space, typically limited to a subset of your total physical memory
Cassandra stores many objects off heap to avoid Garbage Collection issues
The OS will use whatever memory is left over for page cache; make sure you leave enough free memory for this
Garbage collection
Driven by memory utilization
Drives CPU utilization
Disk
Available disk space
I/O bandwidth utilization
Network
Primarily concerned with bandwidth
Keep in mind firewalls; the path needs to be open
OS Resources/Limits
File handles, processes, etc.
Make sure you set high enough limits in limits.conf or ulimits
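As an illustration, a limits.conf fragment along these lines is commonly recommended (the exact values and the username are assumptions — check the DataStax documentation for your version):

```
# /etc/security/limits.conf — for the user Cassandra runs as
cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited
```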
Overview of some of the most useful nodetool commands and what they do
Nodetool status gives an overview of the entire cluster’s status
The cluster is divided into datacenters
The first column shows whether a node is up or down
The second column shows the node’s state, which is one of:
normal
leaving the cluster
joining the cluster
moving its token
The IP address of each node.
This will be the broadcast_address if specified; otherwise, the listen_address
The amount of data on each node
It is normal for this to differ slightly
Large discrepancies (in percentage terms) could indicate a problem
Wide rows/partitions
Uneven racks
Compaction issues
Number of tokens
Normally 256 for vnodes, 1 for non-vnodes
Percentage of ring each node owns
Note message at the top
Without a keyspace, assumes SimpleStrategy with RF=1
Can cause strange readout when using multiple DCs with a small offset between tokens (nothing to worry about)
With keyspace, shows ownership according to RF in that keyspace
With keyspace, should add up to RF times 100%
May differ slightly from node to node with vnodes due to random token distribution
UUID of the node. Used to uniquely identify the node when running removenode command.
The rack the node is on
Used to avoid single point of failure
Avoid if not using vnodes
If using vnodes, ensure the same number of nodes are in each rack
In this example, we have one node that is down, so that is where we’d focus our investigation
Nodetool ring is an old way of checking status
Shows the same information as nodetool status, except for the token
When using vnodes, it will show every token on each node and becomes difficult to read
Nodetool info shows information specific to the node where it is run
Some of the information is also shown by nodetool ring/status
Same information shown by nodetool status and ring
Not going to rehash this
Whether gossip, thrift, and native transport are enabled
Can be individually enabled/disabled with nodetool commands
Gossip generation; increases each time the node is restarted
How many seconds the node has been running
Heap and off heap memory usage
Heap memory shows amount currently in use and maximum
Amount currently in use may include garbage
The amount of garbage included varies depending on when GC was last run
Off heap memory is stored outside the heap but adds to the process’s overall resident size
If a process gets killed by the kernel OOM killer, make sure it isn’t using too much off heap memory
Number of exceptions that occurred since last restart
Not every error involves an exception, but it’s still a good indicator
Key Cache size and capacity
Size is amount actually in use
Capacity is amount specified by key_cache_size_in_mb in cassandra.yaml
Capacity defaults to 100MB or 5% of the heap, whichever is less
If size is consistently much less than capacity, you have this set too high
Cache hits, total requests, and hit rate
If hit rate is low, try increasing key cache capacity slowly until you see diminishing returns
Key cache is periodically saved to disk and reloaded when a node is restarted
This is to avoid a cold key cache after restart
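The hit rate itself is just hits divided by total requests. A minimal sketch (the 0.85 threshold is a rule of thumb for read-heavy workloads, not a Cassandra default):

```python
def hit_rate(hits, requests):
    """Key cache hit rate as reported by nodetool info: hits / total requests."""
    return hits / float(requests) if requests else 0.0

# e.g. 8,000 hits out of 10,000 requests
rate = hit_rate(8000, 10000)
print(rate)  # 0.8 — below ~0.85, so a larger key cache may be worth trying
```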
Same information shown for row cache
Row cache should be disabled except for extremely narrow set of use cases
Small, very hot data set that will fit in memory
Data should be read much more than written because writes invalidate row in cache
But really, don’t use row cache
tpstats shows thread pool statistics
Cassandra has various thread pools that handle important foreground and background tasks
This information is also logged to system.log by StatusLogger.java periodically or when a message is dropped
Name of the thread pool
Number of tasks being actively serviced by a thread
For several stages, this can be configured in cassandra.yaml
For others, it is equal to the number of CPU cores
For a few others, it is a hardcoded limit, usually 1
Number of tasks waiting to be serviced
For most stages, limited to ~2 billion
You’ll run out of memory long before you ever hit this limit
High number of pending tasks indicates the stage is overloaded
Number of tasks completed since last restart
Number of tasks currently blocked for I/O
Should almost always be 0
Total number of tasks blocked since last restart.
Usually zero except for FlushWriter
At the bottom is a list of message types and the number of messages dropped
Load shedding drops requests that have been pending beyond timeout specified in cassandra.yaml
Dropped messages usually indicate overloaded cluster; check for other causes, then add more nodes
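To make "overloaded stage" concrete, here is a hedged sketch that flags pools with a high Pending count from tpstats-style output (the sample text and column layout are illustrative — verify against your Cassandra version's actual format):

```python
# Illustrative parser for `nodetool tpstats`-style output.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         4       120        9231841         0                 0
MutationStage                     8         0       27194801         0                 0
FlushWriter                       1         3           2153         0                12
"""

def overloaded_stages(tpstats_text, pending_threshold=100):
    """Return names of thread pools whose Pending count exceeds the threshold."""
    stages = []
    for line in tpstats_text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 6 and parts[1].isdigit():
            name, pending = parts[0], int(parts[2])
            if pending > pending_threshold:
                stages.append(name)
    return stages

print(overloaded_stages(SAMPLE))  # ['ReadStage']
```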
Handles local reads for which this node is a replica
Number of threads is controlled by concurrent_reads in cassandra.yaml
Various reads, all handled by ReadStage
READ - normal read on a single partition
RANGE_SLICE - A sequential or secondary index scan over multiple partitions
PAGED_RANGE - Used for automatic paging when result size exceeds row limit
Handles local writes for which this node is a replica
Number of threads is controlled by concurrent_writes in cassandra.yaml
Various writes, handled by MutationStage
READ_REPAIR - write due to a read repair
MUTATION - normal write
COUNTER_MUTATION - incrementing a counter
Coordinator uses this to process responses from other nodes
Roughly indicates how often this node has been a coordinator
Roughly, because unless you’re using CL ONE, the coordinator must handle responses from multiple nodes per request
Request completed but timed out before coordinator could respond to it
Timeout controlled by request_timeout_in_ms in cassandra.yaml
May indicate that coordinator is overloaded
Ensure that client load balancing is set up correctly
Make sure use of batches is appropriate
Logged batches only when atomicity is required
Unlogged batches only when updating multiple rows with the same partition key
Otherwise use asynchronous execution to pipeline requests without overloading a single coordinator
Writes memtables to disk
Number of threads controlled by memtable_flush_writers, should equal number of data drives
Maximum pending tasks controlled by memtable_flush_queue_size in cassandra.yaml
Once the queue is full, writes are blocked until a flush writer becomes available
Large number of all-time-blocked tasks indicates a disk bottleneck; add more/faster disks or more nodes
Handles compactions
Number of threads controlled by concurrent_compactors in cassandra.yaml
Constantly pending compactions means compactions can’t keep up with writes; mitigation strategies:
Switch from LeveledCompactionStrategy to SizeTieredCompactionStrategy for write-heavy tables
Get faster disks (SSD) if needed
Increase compaction throughput
Get faster CPU cores
Asynchronous read repairs
Occur for a certain percentage of reads, configurable per table
Pending tasks indicate that you may have read repair chance set too high
Handles hint delivery from the coordinator to a node that’s recently come back up
Large number of hints usually means an unhealthy cluster
Nodes gossip with each other once a second
Prior to 2.0, when using vnodes with a large cluster, gossip could become CPU-bound and get behind
Migrates schema changes to other nodes
If you see pending tasks here, you’re making too many schema changes
Repair related stages
AntiEntropyStage coordinates repairs
AntiEntropySessions are active repairs in progress
ValidationExecutor builds merkle trees for repair
cfstats shows statistics for individual tables
Tables are grouped by keyspace
Values apply only to the node where cfstats is run, not cluster-wide
Keyspace and table names
Multiple tables grouped under each keyspace
Read and write counts
Per table and at keyspace level
Read and write latency
Per table and averaged at keyspace level
Number of thread pool tasks pending against a table
Per table and summed at keyspace level
Number of sstables comprising a table
Broken down by level when using LeveledCompactionStrategy
Space used by the table on this node
Must sum across different nodes to get total space
May include deleted or updated data that hasn’t been compacted yet
Number of partition keys in the table on this node
Estimated to the nearest index_interval (128 by default)
Rows spread across multiple sstables will inflate this number
Space consumed by off-heap data structures
Total off heap memory
Broken down by data structure: bloom filter, index summary, compression metadata
Memtable information
number of entries in the memtable
bytes of data in the memtable
the number of times the memtable has been switched (flushed to disk)
Bloom filter statistics
False positives
If too high, performance will suffer
Reduce false positive chance for table
Space used
Lower false positive chance requires more space
Bloom filter grows linearly with number of partitions
Increase the false positive chance if the bloom filter is too large
Statistics on partition size, calculated during compaction
Maximum, minimum, and average size
Helps identify tables containing large partitions
Tombstone statistics
Number of live cells versus tombstones encountered when scanning a partition
Rolling average for the last 5 minutes
cfhistograms provides deeper insight into a specific table
Must specify keyspace and table name when calling nodetool cfhistograms
Information shown by cfhistograms is local to the node where it was run
Keyspace and table name
Number of sstables each read hit
More sstables means slower reads, higher I/O utilization
If reads are hitting too many sstables, consider LeveledCompactionStrategy
LCS incurs more I/O due to compaction on write-heavy tables
The size of each partition in bytes
Scanning through large partitions generates garbage and hinders performance
Large partitions cause uneven data distribution across nodes
Number of cells in each partition
Total number of columns belonging to all logical rows within the partition
Large number of cells may cause increased garbage generation
How to read cfhistograms
Left side shows list of buckets
Bucket ranges from previous line (exclusive) to current line (inclusive)
Buckets get progressively larger as numbers increase
Count of items that fall in this bucket
What’s being counted depends on the statistic
Heading tells you what’s being counted
For “SSTables per Read”, we’re counting number of reads
For the other two, we’re counting the number of partitions
Also pay attention to the units used for the buckets
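The way the bucket boundaries grow can be sketched as follows — each boundary is roughly 20% larger than the previous one. This is an approximation of Cassandra's EstimatedHistogram; the real offsets may differ slightly between versions:

```python
def bucket_offsets(count):
    """Generate histogram bucket boundaries that grow ~20% per bucket."""
    offsets = [1]
    while len(offsets) < count:
        last = offsets[-1]
        offsets.append(max(last + 1, int(last * 1.2)))
    return offsets

print(bucket_offsets(15))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 19, 22]
```

So a read that hit 11 sstables falls in the "12" bucket — the range above 10, up to and including 12.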
Example
126,087 reads hit 4 sstables
37 reads hit between 9 and 10 sstables
Example
73 partitions between approx 195K and 234K bytes
Example
14,532 partitions with between 771 and 924 cells in them
Read latency for local reads
Write latency for local writes
Unit is microseconds, not milliseconds
Example
1379 reads served locally by this node took between 925 and 1109 us
Example
1839 writes served locally by this node took between 216 and 258 us
Shows the latency for reads and writes that this node coordinated
Latency for the entire request, including network latency between coordinator and replicas
Not broken down by table
Status of current and pending compactions
Number of pending compactions
Compactions in progress
Compaction Type
Compaction - normal compaction
Validation - building merkle trees for repair
Keyspace and table
Bytes complete and total bytes for each compaction
Percent done
Estimated time remaining for the active tasks
Not useful, because:
estimates can be wrong
doesn’t account for pending tasks
History of recent compactions
Shows how much space a compaction reclaimed
Same information available in system.log
Unique ID
Keyspace and column family
Unix timestamp when compaction occurred
Seconds since Jan 1, 1970
Total bytes before compaction
Total bytes after compaction
Row merge counts
Actually the last column on each row
Moved to make output fit on slide
Count of sstables
Number of rows spread across that many sstables prior to compaction
Shows network activity
Mode, same as on status: NORMAL, JOINING, LEAVING, MOVING
Active streams
Repair session IDs
Nodes exchanging data
Data is streaming over listen address instead of broadcast address
Useful on EC2 because Amazon doesn’t charge for internal traffic
Total number of files and bytes of data to be received from specified node
Total number of files and bytes of data to be sent to specified node
Specific files currently being transferred
Progress for each file
Read repair statistics
Number of read repairs attempted
Number of mismatches resolved in the foreground
Number of mismatches resolved in the background
Commands sent and responses received while acting as coordinator
Number of pending and completed commands sent to other nodes
Number of pending and completed responses received from other nodes
Used to check for schema disagreements
This is not the information we’re interested in
We’re looking to see how many schema versions there are in the cluster.
No disagreement; only one version of the schema shared by all nodes
Schema disagreement; one node has a different version from all the others
Schema disagreements must be manually resolved
If only one node disagrees, run nodetool resetlocalschema on that node
If multiple nodes disagree
shut down nodes in the minority and delete system/schema_* sstables
start nodes back up one by one
system.log is the most important tool for troubleshooting Cassandra
Where it is
How to configure it
logging level
location
override logging level for a specific class
Basic format
Level: INFO, WARN, ERROR by default; DEBUG only if configured
Thread: use the ID to correlate messages from the same thread
Date & Time: use to correlate messages across multiple nodes, time duration of events
Source File/Line No: code that logged the message, not necessarily where an error occurred; talk about stack traces later
Exception - What kind of error occurred?
Stack trace – where did the error happen?
Most local at top to most global at bottom
Wall of text — we’ll dissect it in the next slides
Organization names – whose fault is it?
Sub-packages usually group major application subsystems
Class Name – specific object
Will usually, but not always match the filename
Nested Classes - $ separates outer class from inner class(es)
Method belongs to the inner class
Method name – what was the class doing?
<init> indicates that the error occurred in a constructor (or instance initializer); <clinit> indicates a static initialization block
File name – where the source code is, should you want to look at it
Also pay attention to the package name so you can find the file within the nested directory structure
When source-diving, start with the most local method and work your way out
Line number – where to look in the code (available on github.com/apache/cassandra)
Be careful! Line numbers change between versions, so make sure you select the right version in github
Pay attention to nested exceptions
Each exception has its own stack trace which may be completely different
The outer exception may be too general because it’s been rethrown from unrelated code
The innermost exception will be the actual root cause of the error
Best to use a combination of outer and inner exception as search terms
Exception will usually have an error message
Provides additional information about the circumstances of the exception
Usually good to search for the exception and message together
Look out for embedded numbers or strings; these may change from one message to the next, and including them will undesirably narrow your search
Some additional examples of organizations and subsystems
Use exception and several package+class+method names
Exception alone often isn’t sufficient because the same error can occur many different places
You’ll find the same exception in unrelated software if it’s a standard java exception.
Add several package/class/method combinations to narrow down the exception
Use at least the topmost method and the first org.apache.cassandra method
Use quotes around individual elements (especially if they contain spaces)
Line numbers shouldn’t be part of your search criteria because you may not find the same error in a different version
Likewise, exclude specific numbers and strings like names and counts from your search
Use Google’s site: feature to narrow search terms to apache JIRA, cassandra mailing list, or stackoverflow
Add or remove additional methods as needed to narrow or broaden search
These might be a good set of search terms for this exception
Include both exceptions and methods from both stack traces
Include exception and error message grouped together inside quotes
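The heuristics above can be sketched in a few lines. The stack trace here is hypothetical (class names and line numbers are made up for illustration); the idea is to quote the exception class plus the topmost frame and the first org.apache.cassandra frame, while stripping line numbers and message specifics:

```python
import re

# Hypothetical stack trace for illustration only.
TRACE = """\
java.lang.RuntimeException: java.util.concurrent.ExecutionException
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:392)
"""

def search_terms(trace):
    """Build quoted search terms from a Java stack trace, dropping line
    numbers and the exception's message specifics."""
    exception = trace.splitlines()[0].split(':')[0]
    frames = re.findall(r'at\s+([\w$.]+)\(', trace)
    picked = [frames[0]]
    cassandra = next(f for f in frames if f.startswith('org.apache.cassandra'))
    if cassandra not in picked:
        picked.append(cassandra)
    return ['"%s"' % t for t in [exception] + picked]

print(search_terms(TRACE))
```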
Know how to recognize a restart
Check versions of major components and JVM
Confirm settings are what you think they are
Make sure you have JNA installed
Know when node is ready to serve requests
Cassandra writes updates to the commit log on disk and the memtables in memory
When memtables get full, they’re flushed to disk and the associated commit log is reclaimed
Flushes can also be triggered when the node is running low on memory
First the flush is enqueued
There are a limited number of FlushWriter threads
Flushes wait in the queue until a FlushWriter is available
When a FlushWriter becomes available, the flush begins
Eventually the flush completes.
This is not an instantaneous process because disks are slow.
This is the name of the table and the unique identifier of the memtable
You can use this to link the enqueuing and writing of the flush
Make note of the FlushWriter thread doing the flush
You can use this to link the beginning and end of the flush
This shows the number of serialized and live bytes
Serialized vs live is the size of the data on disk vs the size in heap
It also shows the number of individual write operations stored in the memtable
This shows the name of the sstable that the memtable was written to on disk
The size of the sstable
This is typically smaller than the size in memory because of compression
This shows the segment ID for the commit log and the position within the segment
The commit log is broken up into small files called segments
When all the commits in a segment are flushed, the segment is reclaimed
Note the times on the messages
Time between first and second messages is how long the flush waited in the queue
Time between second and third messages is how long the flush took to complete
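Computing those durations is just timestamp arithmetic on system.log's date format. The timestamps below are invented for illustration; in a real log they come from the three flush messages on the same FlushWriter thread:

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # system.log timestamp format

enqueued  = datetime.strptime("2014-06-01 12:00:01,100", FORMAT)
started   = datetime.strptime("2014-06-01 12:00:04,600", FORMAT)
completed = datetime.strptime("2014-06-01 12:00:09,850", FORMAT)

queue_wait = (started - enqueued).total_seconds()    # waiting for a FlushWriter
flush_time = (completed - started).total_seconds()   # writing the sstable
print(queue_wait, flush_time)  # 3.5 5.25
```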
Sstables are immutable; updates go into new sstables
Reads scan over multiple sstables to stitch together a row
SSTables must eventually be compacted together to keep reads fast
Size-tiered compactions occur when a sufficient number of similarly sized sstables exist
In leveled compaction, sstables are written to Level 0 and moved to higher levels as they are compacted
Leveled compactions occur continuously as long as sstables exist in Level 0
Note the CompactionExecutor thread doing the compaction
The thread ID can be used to link together the messages
The compaction is beginning
The sstables that are going to be compacted
The compaction is complete
How many tables were compacted
The name of the new sstable created by the compaction
The number of bytes in the original files
The number of bytes in the new file
The percentage of the original size after:
Updates were merged
Tombstones and expired TTLs were removed
The time the compaction took and the rate in MB/sec
The sum of the number of rows in each sstable
The number of unique rows across all compacted sstables
X:Y where Y rows were split across X sstables
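The merge-count notation can be decoded as follows — a sketch assuming a string like `{1:216, 2:35, 4:1}`, meaning 216 rows lived in a single sstable, 35 were spread across two, and 1 across four:

```python
import re

def merge_stats(merge_counts):
    """Parse '{X:Y, ...}' merge counts: Y rows were spread across X sstables.
    Returns (rows summed over all input sstables, unique rows after merge)."""
    pairs = [(int(x), int(y)) for x, y in re.findall(r'(\d+):(\d+)', merge_counts)]
    rows_before = sum(x * y for x, y in pairs)
    unique_rows = sum(y for _, y in pairs)
    return rows_before, unique_rows

print(merge_stats('{1:216, 2:35, 4:1}'))  # (290, 252)
```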
During compaction, you may see one or more messages about partitions being compacted incrementally
Logical CQL 3 rows sharing the same partition key form a physical row when stored in the cluster
Newer versions of Cassandra say partition instead of row
Large partitions can cause a number of problems for Cassandra
Uneven distribution of data between nodes
Large memory usage when large partitions are read all at once
Generating lots of garbage compacting a wide row
Slower compactions because they’re done incrementally on disk
These messages can help identify large rows
The keyspace, table, and partition key
The size of the partition
Garbage collections are a necessary evil
Some garbage collections run concurrently with Cassandra, but others stop the world
Cassandra logs any stop-the-world collections that last longer than 200ms
Stop the world collections cause nodes to stop responding to gossip and client requests
This shows the type of collection
Java allocates new objects into the young generation
Objects that survive a specified number of collections get promoted to the tenured generation
ParNew collections occur when the young gen is collected. These are stop-the-world.
The young gen is usually small so ParNew is usually fast
If there’s not enough contiguous space in the old gen to promote an object, the old gen must be compacted, which takes a long time
ConcurrentMarkSweep normally runs concurrently with the application and does not stop the world
If the concurrent collection can’t keep up with the rate at which garbage is generated, a stop-the-world collection occurs
Stop-the-world CMS can take a very long time because the old gen is usually big
This shows the number of ms elapsed for the collection
Long collections increase latency of read and write requests
Very long collections will prevent the node from gossiping and other nodes will think it’s down
Note the time on each GCInspector message.
Even if the individual collections are fast, too many collections within a short timespan can hurt throughput
This shows the number of individual collections reported included in the duration
This shows the amount of heap in use before the collection occurred.
This shows the maximum amount of heap available
After a collection completes, warning is logged if the heap is still greater than 75% full
If enough space can’t be reclaimed repeatedly, the node may enter a GC death spiral
When the threshold is exceeded, it triggers a memtable flush to free up memory
This threshold can be configured using flush_largest_memtables_at in cassandra.yaml
Usually, you shouldn’t change this setting
If you see this message, you should try to reduce heap usage or increase heap size
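A GCInspector line can be picked apart to check the duration, collection count, and heap pressure in one pass. The message format below is modeled on 2.0-era logs and is an assumption — verify it against your own system.log:

```python
import re

# Illustrative GCInspector message; numbers are invented.
LINE = ("GC for ConcurrentMarkSweep: 12413 ms for 2 collections, "
        "6451234816 used; max is 8506048512")

m = re.search(r'GC for (\w+): (\d+) ms for (\d+) collections, (\d+) used; max is (\d+)', LINE)
collector = m.group(1)
duration_ms, count, used, heap_max = (int(g) for g in m.groups()[1:])

pct = 100.0 * used / heap_max
print(collector, duration_ms, round(pct))  # flag anything above 75% — the warning threshold
```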
Increasing heap size above 8GB is not recommended because GC will take longer
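Heap size is set in cassandra-env.sh; a fragment like this caps it at 8 GB (the HEAP_NEWSIZE value is illustrative — the usual guidance is roughly 100 MB per CPU core):

```
# conf/cassandra-env.sh
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```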
Flapping is often caused by garbage collections
The nodes go up-down-up-down, repeatedly
That’s why it’s called flapping
Notice the nodes that are flapping
If a single node is flapping, check the logs on that node during the same timeframe and see if GC is occurring
If multiple nodes are reported up and down, the local node may be the problem
If a node is doing GC it won’t be able to receive gossip messages from another node and may think they’re down
Check for GC messages in the local log around the time that flapping occurs
Note the time that flapping occurs
If it happens infrequently, it may not be a problem
If it happens multiple times a minute, it is a problem
Check other node’s logs for GC events that occurred at the same time
If you don’t see anything in the other node’s log, it may be a network issue
You can reduce the failure detector’s sensitivity by increasing phi_convict_threshold in cassandra.yaml
Default value is 8; maximum recommended value is 12 (useful on high latency networks such as AWS)
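For example, in cassandra.yaml:

```yaml
# Default is 8; raise toward 12 only on high-latency networks such as AWS
phi_convict_threshold: 12
```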
If a write comes in for a down node, the coordinator will store hints for it
When the node comes back up, the nodes that have hints for it will send them
Hints are no longer stored after the node has been down for longer than the period of time specified in cassandra.yaml
This is to prevent a node that comes back up from being inundated with more hints than it can handle
Any node that has been down longer than this period of time needs to run nodetool repair
Flapping can cause excessive hint buildup, which adds extra burden for both the coordinator and the node that is flapping
This can lead to cascading failures
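The hint window is controlled in cassandra.yaml; the value shown is the usual default of 3 hours:

```yaml
# Hints stop accumulating for a node that has been down longer than this
max_hint_window_in_ms: 10800000
```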
Repairs are initiated by running nodetool repair
They do a full comparison of all the data for a particular token range with the other replicas for that range, then exchange any data that is out of sync
system.log shows the process from start to finish
Note the UUID for the repair session. This is your key to correlating the various messages.
When a repair session begins, you will see a “new session” message
It will report the nodes it’s going to sync with
The token ranges it’s going to sync
And the keyspace and column families it’s going to sync
The first step is for the repair leader to request merkle trees from all the other replicas
A message is logged reporting the receipt of each requested merkle tree
Make sure you see a message that the merkle tree was received from each node that it was requested from
After comparing the merkle trees, if the nodes are in sync, you’ll see a message like this
If not, you’ll see a message like this, reporting how many ranges are out of sync
The node will then begin a streaming repair with the out-of-sync replica
Another message reports when the streaming task has succeeded
A message will report when each table has been fully synced. This means either it was in sync to begin with, or all the streaming tasks necessary to sync it completed.
Once all tables are synced, a message will report that the overall repair session completed successfully.
If you see a “new session” message for a particular ID but not a “session completed successfully”, the repair is still running.
If a repair doesn’t complete successfully after some time, you should look more closely at the other messages for that session to see where it might be stuck.
Sometimes network issues can disrupt the streaming of data or a merkle tree, causing repair to hang
Other times, there is simply a lot of data, and building merkle trees can take a long time, as can streaming data
Increasing compaction throughput and streaming throughput will help speed the process, at the cost of using extra I/O and network bandwidth.
Check the other nodes involved in the repair for messages using the same session ID
Check for any errors that would have disrupted the repair
Before we end, I just want to go back to the troubleshooting process I discussed at the beginning
Next time you have a problem, think about the tools at your disposal and how they can help you with these steps