In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage) and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking; and scale and tune iteratively to take performance insights and optimizations into account. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
Invited keynote for 5th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2022) https://hotcloudperf.spec.org/ at ICPE 2022 https://icpe2022.spec.org/
2. Who am I?
• 1999-2007 CSIRO/UCL
• Enterprise Java evaluation, SPECJAppServer200X benchmarks
• OGSA (Grid) evaluation (UCL)
• 2007-2017 NICTA/Startup CTO
• R&D and consulting – performance modelling of large-scale distributed systems for government and large enterprises
• 2017-present Instaclustr (soon NetApp)
• Technology Evangelist for Instaclustr
• 100+ Blogs, Talks
3. Performance modelling from APM data (2007-2017, NICTA/Startup)
Automated pipeline: APM data (Dynatrace’s PurePath®) → (Transformation) → Performance Model → (Execution) → Simulation Tool + Model → Visualisation and Graphs
Distributed traces with breakdown data (resource types and time)
Revisit with OpenTelemetry?
5. Scaling is Easy! Cassandra and Kafka
Homogeneous distributed clusters → horizontally scalable
www.cassandra.apache.org/_/cassandra-basics.html
6. But actually lots of moving parts
(source: http://trumpetb.net/loco/rodsf.html)
7. Complications – DCs, Racks, Nodes, Partitions, Replication Factor, Time (for auto-scaling)
Rows have a partition key and are stored in different partitions
9. Two Ways of Resizing Clusters
1 - Horizontal Scaling
• Add nodes, no interruption
• But scale up only (not down)
• Takes time, puts extra load on cluster as data streams to extra nodes
2 - Vertical Scaling
• Replace nodes with bigger (or smaller) node types (more/less cores)
• Scale up and down
• Takes time, temporary reduction in capacity
• Choice of how many nodes are replaced concurrently – by “node” (1 node at a time), by “rack” (all nodes in a rack), or in-between
10. Cluster resizing time – by node vs. by rack – by rack is faster but …?
Cluster = 6 nodes, 3 racks, 2 nodes per rack
By node (concurrency 1)
By rack (concurrency 2)
11. Resizing by node – capacity reduced by 1/6 of total nodes during each resize operation (simplified model)
12. Resizing by rack – capacity reduced by 2/6 of total nodes during each resize operation
13. Comparison – resize by rack is faster but has a bigger capacity hit during the resize
14. Observations
• In both cases:
• The eventual capacity is double the original
• The cluster capacity is reduced during resizing
• By rack is faster, but has a worse capacity reduction during resizing
• By node is slower, but has less capacity reduction during resizing
• If the capacity during the resize is exceeded, latencies will increase
• Made worse by Cassandra load balancing, which assumes equal-sized nodes
• By node, more nodes in the cluster reduces the impact of the reduced cluster capacity during resizing (some clusters have 100s of nodes)
• But the majority of our clusters have <= 6 nodes
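The simplified capacity model in the observations above can be sketched in a few lines of Python; the 6-node, 3-rack cluster is the example from the earlier slides, and the function name is just for illustration:

```python
def capacity_during_resize(total_nodes: int, concurrent_replacements: int) -> float:
    """Fraction of original cluster capacity available while
    `concurrent_replacements` nodes are out of service (simplified model:
    equal-sized nodes, perfectly balanced load)."""
    return (total_nodes - concurrent_replacements) / total_nodes

# 6-node cluster, 3 racks of 2 nodes each
by_node = capacity_during_resize(6, 1)  # replace 1 node at a time -> ~83%
by_rack = capacity_during_resize(6, 2)  # replace a whole rack (2 nodes) -> ~67%
print(f"by node: {by_node:.0%}, by rack: {by_rack:.0%}")
```

These are the 83% and 67% figures used on the later auto-scaling slides.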
15. Auto-scaling model – increasing load → linear regression over 1 hour extrapolated to the future
We predict the cluster will reach 100% capacity around the 280 minute mark (220 minutes in the future)
[Chart: measured load with the extrapolated (linear regression) trend line]
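The extrapolation step can be sketched as an ordinary least-squares fit in pure Python; the load samples below are hypothetical, chosen so the fitted line reproduces the prediction on the slide (100% at ~280 minutes, 220 minutes past the last sample):

```python
# Hypothetical load samples over the last hour: (minute, % of cluster capacity)
samples = [(0, 30.0), (20, 35.0), (40, 40.0), (60, 45.0)]

# Ordinary least-squares fit: load = slope * minute + intercept
n = len(samples)
sx = sum(t for t, _ in samples)
sy = sum(y for _, y in samples)
sxx = sum(t * t for t, _ in samples)
sxy = sum(t * y for t, y in samples)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Extrapolate: when does the fitted line cross 100% capacity?
minute_at_full = (100.0 - intercept) / slope
print(f"100% capacity at ~{minute_at_full:.0f} min "
      f"({minute_at_full - samples[-1][0]:.0f} min in the future)")
```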
16. Resize by Rack vs. Node - initiated in time to
prevent overloading during resize operation
Resize by rack must be initiated sooner than resize by node, even though it is faster to resize, because it has less capacity during the resize (67% vs. 83% of initial capacity)
17. Auto-scaling POC – worked!
Monitoring API → Linear Regression + Rules → Provisioning API
Rules generalized to allow for:
• scaling up and down
• resizing by any number of nodes concurrently, up to rack size
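One such rule can be sketched as follows; the function name, thresholds, and durations are hypothetical illustrations, not the production rules. It triggers a resize once the extrapolated load would exceed the reduced capacity available during the resize before the resize could finish:

```python
def should_initiate_resize(current_load_pct: float, growth_pct_per_min: float,
                           resize_duration_min: float,
                           nodes_replaced_concurrently: int, total_nodes: int) -> bool:
    """Trigger a resize if the extrapolated load would exceed the *reduced*
    capacity available during the resize before the resize can finish."""
    reduced_capacity_pct = 100.0 * (total_nodes - nodes_replaced_concurrently) / total_nodes
    if growth_pct_per_min <= 0:
        return False  # load flat or falling: no need to scale up yet
    minutes_until_overload = (reduced_capacity_pct - current_load_pct) / growth_pct_per_min
    return minutes_until_overload <= resize_duration_min

# 6-node cluster at 45% load, growing 0.25%/min (illustrative durations):
print(should_initiate_resize(45, 0.25, 90, 2, 6))   # by rack: True (start now)
print(should_initiate_resize(45, 0.25, 120, 1, 6))  # by node: False (can wait)
```

This captures the observation from the previous slide: by rack must start sooner even though the resize itself is faster, because its capacity floor (67%) is lower.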
20. Design choices: Many vs. One Topic?
1 - Many (100s) of topics
• 100s of locations (Warehouses, Trucks)
• Each location has a topic and multiple consumers
26. But scalability not great
[Chart: Total Cores (0–700) vs. Billions of checks/day (0–8), pre-tuning]
27. Tuning required! Scalability Post-tuning
[Chart: Total Cores (0–700) vs. Billions of checks/day (0–20), pre-tuning vs. post-tuning]
30. Kafka topic partitions enable consumer concurrency (partitions >= consumers)
[Diagram: a Producer writes to Topic “Parties” (Partition 1, Partition 2, … Partition n); the Consumers in a Consumer Group share the work within the group]
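The work-sharing within a consumer group can be sketched as a simple round-robin assignment of partitions to consumers; this is a simplification of Kafka's actual assignors, but it shows why partitions must be >= consumers:

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin the topic's partitions over the consumers in a group.
    Any consumer beyond the partition count gets nothing assigned --
    hence partitions >= consumers for full concurrency."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}  -- c3 is idle
```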
31. Fan out requires many consumers and partitions
Can be caused by:
1. Design – many topics and/or many consumers (Kongo example)
2. Little’s Law (Anomaly Detection example): Concurrency (Consumers) = Time x Throughput
Slow consumers require more of them to keep up with the target throughput – having 2 thread pools helped in the Anomaly Detection example to reduce both the consumer time and count
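Little's Law makes the required consumer count easy to estimate; the numbers below are illustrative and the helper is just a sketch:

```python
import math

def consumers_needed(avg_consumer_time_s: float, target_throughput_per_s: float) -> int:
    """Little's Law: concurrency = time x throughput (rounded up to whole consumers)."""
    return math.ceil(avg_consumer_time_s * target_throughput_per_s)

# e.g. 20 ms of consumer processing per message at 2,000 msgs/s:
print(consumers_needed(0.020, 2000))  # 40 consumers (and therefore >= 40 partitions)
# Halving consumer time (e.g. by offloading slow work to a second thread pool)
# halves the consumers and partitions required:
print(consumers_needed(0.010, 2000))  # 20
```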
33. Benchmarking revealed that partitions and replication factor are the culprits
[Chart: Kafka Partitions (1–10,000) vs. Throughput (TPS, up to ~900,000); Cluster: 3 nodes x 4 cores = 12 cores total; series: Replication Factor 3 (TPS) and Replication Factor 1 (TPS)]
34. Implications?
• Bigger Cluster (more nodes, bigger nodes)
• Design to minimize topics and consumers
• Optimize consumers for minimum time
• Always benchmark with many partitions
• Blame Apache ZooKeeper?
• Responsible for Kafka control (cluster coordination and metadata)
• From version 3.0 it’s being replaced by the native KRaft protocol
• May enable more partitions