Covering theory and operational aspects of bringing up Apache Cassandra clusters - this presentation can be used as a field reference. Presented by Alex Thompson at the Sydney Cassandra Meetup.
Building Apache Cassandra clusters for massive scale
1. Building Apache Cassandra
clusters for massive scale
Alex Thompson, Solution Architect APAC - DataStax Australia Pty Ltd
3. Build a best practice reproducible machine
image using automation:
Use one of the core tested Linux distributions and versions: RHEL, CentOS or Ubuntu Server.
Select a cloud server or on-premise hardware that at least meets the minimum specifications for Apache Cassandra; refer
to this guide for details: Planning Apache Cassandra Hardware
For production, load testing and production-like workloads do NOT use a SAN, NAS, Ceph or any other type of shared
storage; DO use directly attached SSDs.
More RAM is better and more CPU is better, but don’t get stuck in the RDBMS trap of vertical scaling. Apache Cassandra
works best with many medium-spec nodes rather than a small number of very large nodes: think horizontal scaling,
not vertical scaling.
4. Build a best practice reproducible machine
image using automation:
Use an automation tool like Ansible, Salt, Chef or Puppet to:
1. Apply Apache Cassandra OS specific settings for Linux
2. Install Java JDK 1.8.latest
3. Install, but do not start, Apache Cassandra via yum or apt (a tarball is also available)
4. Copy over this node’s cassandra.yaml and cassandra-env.sh
5. Lock down all ports except the required Apache Cassandra ports in iptables. You can see a list of the ports and
their usage here: Securing Firewall. As a simple list, you need access on 22 (SSH), 7000, 7001 (SSL), 9042
(CQL), 9160 (Thrift - optional) and 7199 (JMX - optional)
Refer to the presentation by Jon from Macquarie Bank (November 2016 meetup) on the use of Ansible and lessons
learned for an in-depth discussion on automation.
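As a rough illustration of step 5, here is a minimal Python sketch that turns the port list above into iptables ACCEPT rules that an automation script could apply. The function name and the port labels are my own, not from any tool, and a real rule set would also need loopback, established-connection and default-drop rules, which are left out here.

```python
# Ports taken from the list above; the labels are informal descriptions.
CASSANDRA_PORTS = {
    22: "SSH",
    7000: "internode",
    7001: "internode (SSL)",
    9042: "CQL",
    9160: "Thrift (optional)",
    7199: "JMX (optional)",
}


def iptables_accept_rules(ports, include_optional=True):
    """Return one iptables ACCEPT command string per port (hypothetical helper)."""
    rules = []
    for port in sorted(ports):
        if not include_optional and "optional" in ports[port]:
            continue  # skip Thrift and JMX when they are not needed
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    return rules


if __name__ == "__main__":
    for rule in iptables_accept_rules(CASSANDRA_PORTS, include_optional=False):
        print(rule)
```

Generating the rules from one dictionary keeps the port list in a single place, so the same data can feed Ansible, Salt, Chef or Puppet templates.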
5. Minimum node specific cassandra.yaml
fields for automation deployment scripts:
cluster_name - All nodes participating in a cluster must have the identical cluster name.
hints_directory - Where to store hints for other nodes that are down; small disk space requirement.
authenticator - Used to identify users; the default is wide open. Lock this down in combination with transport layer security and on-disk encryption if internet exposed.
authorizer - Used to limit access and provide permissions; the default is wide open. Lock this down in combination with transport layer security and on-disk encryption if internet exposed.
data_file_directories - Where you will store data for this node; this will be the largest consumer of disk space. Keep it on a different drive from commitlog_directory for performance.
commitlog_directory - Where the commit log is written. Keep it on a different drive from data_file_directories for performance.
saved_caches_directory - Where to store your “fast start-up” cache; small disk space requirement.
6. Minimum node specific cassandra.yaml
fields for automation deployment scripts:
seeds - When bootstrapping a new node into a cluster, the bootstrapping node refers to a seed node to learn the topology of the cluster; with this information it can take ownership of token ranges and begin data transfer.
listen_address - The IP address of the node, for a single-homed node with one NIC.
rpc_address - The IP address of the node, for a single-homed node with one NIC.
endpoint_snitch - GossipingPropertyFileSnitch
1. The parameter list above is for a basic C* cluster and leaves many unlisted parameters at their default settings. The
defaults are sane for most use cases but can be fine-tuned to maximize performance and hardware
utilisation; only tweak the unlisted parameters when you know what you are doing.
2. The parameters listed above are in top-down order as at 13/2/2017 for the github.com master Apache Cassandra
repository here: cassandra.yaml
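To show how an automation script might fill in the node-specific fields, here is a minimal Python sketch that renders a cassandra.yaml fragment for one node. The function name, cluster name and IP addresses are hypothetical; the directory and security fields are omitted for brevity. Note that in the shipped cassandra.yaml the seeds value sits under the seed_provider block rather than at the top level.

```python
def render_node_yaml(cluster_name, seeds, node_ip):
    """Render the node-specific part of cassandra.yaml (illustrative helper)."""
    seed_list = ",".join(seeds)
    lines = [
        f"cluster_name: '{cluster_name}'",
        "seed_provider:",
        "    - class_name: org.apache.cassandra.locator.SimpleSeedProvider",
        "      parameters:",
        f'          - seeds: "{seed_list}"',
        f"listen_address: {node_ip}",
        f"rpc_address: {node_ip}",
        "endpoint_snitch: GossipingPropertyFileSnitch",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values only: the second node points at the first node as its seed.
    print(render_node_yaml("MyCluster", ["10.10.3.62"], "10.10.3.63"))
```

In practice the same substitution is usually done with the templating feature of whichever automation tool you chose.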
7. Minimum node specific cassandra-env.sh
fields for automation deployment scripts:
If cassandra-env.sh is left in its default form it will allocate roughly ¼ of the node’s RAM to Apache Cassandra. This can be
problematic on very small nodes, as C* really needs a minimum 4GB heap allocation to function in development.
As a general rule, if heap <= 16GB use ParNew/CMS GC; if heap > 16GB use G1 GC.
You set the HEAP by uncommenting the following in the cassandra-env.sh:
#MAX_HEAP_SIZE="4G"
#HEAP_NEWSIZE="800M"
G1 requires that only MAX_HEAP_SIZE be set.
In production the heap settings on G1 GC are usually 16, 24 or 32GB.
ParNew/CMS requires that both are set; as a guide HEAP_NEWSIZE should be 20-25% of MAX_HEAP_SIZE.
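The rules of thumb above can be sketched as a small Python helper that an automation script could use to decide what to write into cassandra-env.sh. The function name is hypothetical, and the 20% figure is simply the low end of the 20-25% guide.

```python
def heap_settings(max_heap_gb):
    """Pick a collector and heap variables from the rules of thumb above."""
    if max_heap_gb <= 16:
        # ParNew/CMS: both variables are set, new size at 20% of the heap.
        new_size_mb = max_heap_gb * 1024 // 5
        return {
            "gc": "ParNew/CMS",
            "MAX_HEAP_SIZE": f"{max_heap_gb}G",
            "HEAP_NEWSIZE": f"{new_size_mb}M",
        }
    # G1: only MAX_HEAP_SIZE is set.
    return {"gc": "G1", "MAX_HEAP_SIZE": f"{max_heap_gb}G"}


if __name__ == "__main__":
    print(heap_settings(4))
    print(heap_settings(24))
```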
8. Summary so far
We now have a node that:
1. Is on the correct hardware
2. Has the correct OS with basic tuning in place
3. Has the correct Java JDK version
4. Has Apache Cassandra installed via yum or apt
5. Has customised cassandra.yaml and cassandra-env.sh files
6. Has been secured at the iptables level
7. Can now be started and bootstrapped against a seed in the cluster
10. Bringing up the first node...
This is a new cluster, so when bringing up the first node there is in effect nothing to bootstrap against. Cassandra
understands this and initialises the node without going through the bootstrapping phase.
>service cassandra start
Check /var/log/cassandra/system.log for startup process and monitor for any warnings or exceptions.
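If you want to automate that log check, a minimal Python sketch could look like the following. It assumes each log entry starts with its level (INFO, WARN, ERROR), and the sample lines are invented for illustration, not real Cassandra output.

```python
def find_problems(log_lines):
    """Return lines that look like warnings, errors or exception traces."""
    return [
        line
        for line in log_lines
        if line.startswith(("WARN", "ERROR")) or "Exception" in line
    ]


# Hypothetical sample lines, only to exercise the filter.
SAMPLE_LOG = [
    "INFO  [main] 2017-02-13 10:00:01,000 Example startup message",
    "WARN  [main] 2017-02-13 10:00:02,000 Example warning message",
    "ERROR [main] 2017-02-13 10:00:03,000 Example error message",
    "java.lang.RuntimeException: example exception line",
]

if __name__ == "__main__":
    # Against a real node you would read /var/log/cassandra/system.log instead.
    for line in find_problems(SAMPLE_LOG):
        print(line)
```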
You most likely want to bring up multiple nodes at once in the new cluster. For the sake of this presentation I am
looking at one at a time so that I can break down the bootstrapping phases; to skip that and bring multiple nodes up at
once, follow the documentation here:
Initializing a multiple node cluster (single datacenter)
11. Load some data
Load some data into the first node. Here I am going to use the cassandra-stress tool to load 100GB of sample data.
Cassandra-stress can be used for loading sample data and/or stress testing a Cassandra cluster with read / write workloads.
You can read more about cassandra-stress here.
Diagram: node 1, tokens 0-9, data on disk 100GB.
12. Bootstrapping the second node...
Put the IP address of the first node in the seed list of this node’s cassandra.yaml, then start the node:
>service cassandra start
Check /var/log/cassandra/system.log for bootstrapping progress.
13. Bootstrapping the second node...
Run the following on the first node and you will see your new node in UJ state - Up Joining:
>nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load    Tokens  Owns  Host ID                               Rack
UN  10.10.3.62  100 GB  256     ?     c934ced4-b1c9-4f0f-b278-83282cd7107f  RAC2
UJ  10.10.3.63  3 MB    256     ?     1a3df7fa-a1e7-464a-9495-c6a52d61eafa  RAC3
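A deployment script can wait for the new node to leave UJ before adding the next one. Here is a minimal Python sketch that pulls the state of each node out of nodetool status text like the output above; the function name is my own, and the parsing assumes the two-letter status/state code is the first column and the address the second.

```python
def nodes_by_state(status_output):
    """Map node address -> two-letter code (e.g. UN, UJ) from nodetool status text."""
    states = {}
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        # Status is U(p) or D(own); state is N(ormal), L(eaving), J(oining) or M(oving).
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            states[fields[1]] = code
    return states


SAMPLE = """Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.10.3.62 100 GB 256 ? c934ced4-b1c9-4f0f-b278-83282cd7107f RAC2
UJ 10.10.3.63 3 MB 256 ? 1a3df7fa-a1e7-464a-9495-c6a52d61eafa RAC3
"""

if __name__ == "__main__":
    joining = [ip for ip, code in nodes_by_state(SAMPLE).items() if code == "UJ"]
    print("Still joining:", joining)
```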
14. Bootstrapping...what happened?
So what is happening in this bootstrapping phase?
In Up Joining (UJ) state the node does not serve any queries, read or write, for either internode or
client-to-node traffic.
1. A calculation is done for this node’s share of the token space, in this case it takes half of the token space as it is
one of only two nodes in the ring and in taking half the token space it is taking responsibility for half the data in
the ring.
2. The node begins streaming in the data from the first node for its tokens.
3. The node completes streaming its data from the first node; this can take time for hundreds of GB of data
4. The node changes state to UN (Up Normal)
5. The node can now be discovered by drivers and their application servers and now start responding to read /
write requests.
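Step 1 can be pictured with a deliberately simplified Python sketch: one contiguous token range per node and ten tokens in total, as in the diagrams in this deck. A real cluster uses a much larger hashed token space and, with vnodes, many small ranges per node, so treat this as a toy model only.

```python
def split_ring(tokens, node_count):
    """Divide an ordered token list into contiguous ranges, one per node (toy model)."""
    size = len(tokens) // node_count
    ownership = {}
    for n in range(node_count):
        start = n * size
        # The last node also takes any remainder.
        end = (n + 1) * size if n < node_count - 1 else len(tokens)
        ownership[n + 1] = tokens[start:end]
    return ownership


if __name__ == "__main__":
    ring = list(range(10))          # tokens 0-9
    print(split_ring(ring, 1))      # one node owns everything
    print(split_ring(ring, 2))      # second node takes half the token space
```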
15. Data streaming
during bootstrap
Be aware of the costs of bootstrapping on small clusters: the data streaming phase can consume considerable resources
and takes increasingly long for very large amounts of data.
Diagram: node 1, tokens 0-4, data on disk 100GB; node 2, tokens 5-9, data on disk growing.
16. Second node
added
Notice that the second node now owns half of the tokens in the ring.
Notice that the data on node 1 is 100GB on disk and the data on the new node 2 is only 50GB on disk.
Diagram: node 1, tokens 0-4, data on disk 100GB; node 2, tokens 5-9, data on disk 50GB.
17. Bootstrapping data...WTF?
In bootstrapping the new node, I knew it took half the data off the first node, but the amount of disk space used on the
first node didn’t change; it didn’t go down. WTF is going on here? Something is broken!
Rule: Bootstrapping a new node into a cluster does NOT clean up after itself and delete the orphaned data on the
original nodes!
Don’t get me wrong, the data on the first node is not hurting anything. It’s not used anymore; it just sits there using up
precious space. Let’s get rid of it by running the following command on the first node:
>nodetool cleanup
Note that in a vnode cluster (most likely what you will be using) you have to run nodetool cleanup on all nodes in the
DC except, of course, the node you just added.
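To make the rule concrete, here is a toy Python model of what cleanup does on the first node. It assumes ten tokens with 10GB of data each and no replication, matching the diagrams in this deck; the function is an illustration of the idea, not how nodetool is implemented.

```python
def cleanup(stored_gb_by_token, owned_tokens):
    """Keep only data for tokens this node still owns (toy model of nodetool cleanup)."""
    return {t: gb for t, gb in stored_gb_by_token.items() if t in owned_tokens}


if __name__ == "__main__":
    # Node 1 loaded 100GB while it owned tokens 0-9.
    node1 = {token: 10 for token in range(10)}
    # After node 2 bootstraps, node 1 owns tokens 0-4 but still stores everything.
    print("before cleanup:", sum(node1.values()), "GB")
    node1 = cleanup(node1, owned_tokens=set(range(5)))
    print("after cleanup:", sum(node1.values()), "GB")
```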
18. After cleanup
After [nodetool cleanup] has run, data is once again evenly distributed over the nodes.
Diagram: node 1, tokens 0-4, data on disk 50GB; node 2, tokens 5-9, data on disk 50GB.
19. Powerful
implications
We just doubled the raw compute
capacity of our database tier in the
following ways:
1. Doubled IO throughput
2. Doubled the amount of RAM
3. Doubled the amount of disk
4. Doubled the number of CPUs
Diagram: node 1, tokens 0-4, data on disk 50GB; node 2, tokens 5-9, data on disk 50GB.
20. Powerful
implications
The effect at the application tier is arguably more profound: we have doubled the workload capacity of the
underlying database tier to handle increases in application tier traffic. So as our workload increases at the
application tier, we simply add nodes at the Cassandra cluster level to soak up the workload increase.
*The tps figures in this series are not real; your tps limits will depend on your hardware, data model,
replication_factor and how you read / write data. Use cassandra-stress to emulate your real-world traffic patterns
and record performance behaviour.
Diagram: application server (max 5000 tps); node 1 (1000 tps).
21. Powerful
implications
Diagram: application server (max 5000 tps); node 1 (1000 tps) and node 2 (1000 tps).
22. Practical
considerations
There is not much use in having a two-node cluster; you really want a minimum of 3 nodes and a
replication_factor of 3, and then scale out your cluster from there.
Diagram: ring of nodes 1, 2 and 3.
23. Practical
considerations
Here we have stayed with a single application server, which is not really a good idea from a redundancy
perspective, but there is another problem.
The tps capacity of the database tier
has scaled past the tps capacity of the
application tier, leaving the database
tier under-utilized.
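The mismatch is easy to see with a little arithmetic. The Python sketch below uses the illustrative figures from the diagrams (1000 tps per node, 5000 tps per application server); they are not real measurements, and the assumption that capacity simply adds up per node and per server is a simplification.

```python
def tier_capacity(node_count, tps_per_node, app_servers, tps_per_app_server):
    """Return (usable tps, database tier tps, application tier tps)."""
    db_tps = node_count * tps_per_node
    app_tps = app_servers * tps_per_app_server
    # End-to-end throughput is limited by the smaller tier.
    return min(db_tps, app_tps), db_tps, app_tps


if __name__ == "__main__":
    # Nine nodes behind one application server: the app tier is the bottleneck.
    print(tier_capacity(9, 1000, 1, 5000))
    # Adding a second application server lets the database tier be fully used.
    print(tier_capacity(9, 1000, 2, 5000))
```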
Diagram: ring of nodes 1-9 (9000 tps); application server max tps 5000 tps.
24. Practical
considerations
Time to start scaling out the
application tier to fully utilize the
capacity of the database tier.
Diagram: ring of nodes 1-9 (9000 tps); application server max tps 10000 tps.
25. Triggers for adding more nodes and
capacity planning
Too much data per node - You want to aim for 500GB-1TB of data per node; the more data per node, the longer repairs, bootstrapping and compactions take.
Insufficient free space on drives - For SizeTieredCompactionStrategy (the default) you need 50% of the disk free at all times in the worst case.
Poor IO performance - If you have done everything right with regard to the amount of data per node, have directly attached SSDs and have tuned both your hardware and Cassandra to maximize IO performance, and you still have poor IO performance, then you need to scale out of the problem.
Bottlenecked CPUs - Same as above: if you have done everything right and tuned both your hardware and Cassandra to maximize CPU performance, and you still have poor CPU performance, then you need to scale out of the problem.
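The first two triggers lend themselves to a quick back-of-the-envelope calculation. The Python sketch below is one way to do it, assuming data is spread evenly and that total data on disk is the unique data multiplied by the replication factor; the function names and the example figures are hypothetical.

```python
import math


def nodes_for_data(unique_data_gb, replication_factor, target_gb_per_node=500):
    """Nodes needed to stay at or below the per-node data target."""
    total_gb = unique_data_gb * replication_factor
    return math.ceil(total_gb / target_gb_per_node)


def min_disk_gb(data_gb_per_node, free_fraction=0.5):
    """Disk size needed so that free_fraction of the drive stays empty."""
    return data_gb_per_node / (1 - free_fraction)


if __name__ == "__main__":
    # Example: 3TB of unique data at replication_factor 3, 500GB per node.
    print(nodes_for_data(3000, 3))
    # 500GB of data with 50% free space means a drive of at least 1TB.
    print(min_disk_gb(500))
```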
26. Triggers for adding more nodes and
capacity planning
Poor JVM GC behaviour - This can be tricky to troubleshoot. More than likely it’s just a scale-out fix, as you are overloading the nodes with read / write traffic, but there are cases where a poor access pattern or problematic use case can be the cause of GC churning.
Adding additional keyspaces and application workloads to the cluster - Workloads are cumulative in resource demand.
Increases in application tier traffic - The relationship with Cassandra is linear: if you double the number of requests against your application tier, you will need to double the number of nodes in your cluster to maintain the same performance. It’s simple maths.
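That linear relationship can be written down directly. This is a minimal Python sketch under the assumption stated above, that throughput scales in proportion to node count; the function name and figures are only for illustration.

```python
import math


def nodes_for_traffic(current_nodes, current_tps, target_tps):
    """Scale node count in proportion to request rate, rounding up."""
    return math.ceil(current_nodes * target_tps / current_tps)


if __name__ == "__main__":
    # Doubling traffic from 9000 tps to 18000 tps doubles a 9 node cluster.
    print(nodes_for_traffic(9, 9000, 18000))
```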
27. Summary so far
Now we have a basic cluster of 9 nodes that we can continue to scale out.
What we do not have is any form of redundancy:
1. What if a shared switch goes down?
2. What if a common rack chassis power supply goes down?
3. What if we lose the network to this physical data center?
Cassandra has probably the best answer to this of any DB solution available: the logical data center.
29. Data centers
Cassandra data centers (DCs) are a
logical not physical concept.
A Cassandra cluster is made up of
data centers and each data center
holds a complete token range.
You write your data to one data center and it is replicated to another data center; that other data center
could be in the same rack or across the world.
A cluster can have many data centers but practical limits do apply.
Diagram: one cluster with two data centers, DC1 and DC2, each a ring of nodes 1-9.
30. Data centers
Data centers are a versatile concept
and can be used for many differing
purposes, here are some examples:
1. Simple redundancy
2. Active failover from app tier
3. Geo edge serving
4. Workload isolation
As mentioned before, each DC holds a complete token range for the keyspaces that are replicated to it; you
decide which keyspaces are replicated.
Diagram: one cluster with two data centers, DC1 and DC2, each a ring of nodes 1-9.
CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}
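Quoting mistakes are easy to make when typing the replication map by hand, so a deployment script may prefer to generate the statement. The Python sketch below builds the same kind of CREATE KEYSPACE statement from a dictionary of data center names and replica counts; the function name is hypothetical and the keyspace name is passed through unchecked.

```python
def create_keyspace_cql(keyspace, replicas_by_dc):
    """Build a NetworkTopologyStrategy CREATE KEYSPACE statement."""
    options = ", ".join(f"'{dc}': '{count}'" for dc, count in replicas_by_dc.items())
    return (
        f"CREATE KEYSPACE {keyspace}\n"
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


if __name__ == "__main__":
    print(create_keyspace_cql("myKeyspace", {"DC1": 3, "DC2": 3}))
```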
31. Simple redundancy
This multi-DC cluster is a simple redundancy setup: if we lose us-east-1 due to an outage we can access
us-west-1 for the data for business continuity.
Diagram: two data centers, us-east-1 and us-west-1, each a ring of nodes 1-9; label: read/write DC.
CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': '3', 'us-west-1': '3'}
32. Active failover
This multi-DC cluster is an active failover setup: if we lose us-east-1 due to an outage we can fail over the
application servers to us-west-1. This can be configured at the Cassandra driver level*, in custom code, at the
network layer or at the DNS level.
* See the April 2016 Sydney Cassandra Users Meetup talk that covers most aspects of driver
configuration and strategies.
Diagram: two data centers, us-east-1 and us-west-1, each a ring of nodes 1-9; label: read/write DC actively fails over to the us-west-1 DC.
CREATE KEYSPACE myKeyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': '3', 'us-west-1': '3'}
33. Geo edge serving
All DCs are close to their own in-country app servers.
Writes can be handled in any number of ways; reads are always from the closest DC.
Any write to any DC replicates to the other 3 geographic locations.
Diagram: four data centers, US-DC, EU-DC, ME-DC and AP-DC, each a ring of nodes 1-9.
CREATE KEYSPACE myKeyspace
WITH replication =
{'class': 'NetworkTopologyStrategy', 'US-DC': '3', 'EU-DC': '3', 'ME-DC': '3', 'AP-DC': '3'}
35. Workload isolation
Apart from simple redundancy, this is the most important use of logical data centers in Cassandra.
Different workloads are pointed to different data centers to allow us to isolate, say, a spiky web workload from an
analytic Spark workload. We can then independently scale each DC to its own workload, making the most
efficient use of resources.
In this example we replicate cass-DC tables to spark-DC, perform analytics on them and write to recommendation
tables in the spark-DC, which replicate back to the cass-DC.
Diagram: two data centers, cass-DC and spark-DC, each a ring of nodes 1-9; labels: app server, spark.
CREATE KEYSPACE web_tables
WITH replication = {'class': 'NetworkTopologyStrategy', 'cass-DC': '3', 'spark-DC': '2'}
CREATE KEYSPACE recommendation_tables
WITH replication = {'class': 'NetworkTopologyStrategy', 'spark-DC': '2', 'cass-DC': '3'}
36. C* Learning resources
The DataStax documentation has more extensive descriptions of all the concepts listed here. Please
refer to it if you need more in-depth knowledge, and don’t forget academy.datastax.com for full
courses and a multitude of Apache Cassandra learning resources.