Scaling Open Source Big Data Cloud Applications is Easy/Hard
Paul Brebner
Instaclustr—Technology Evangelist
©Instaclustr Pty Limited, 2022
HotCloudPerf - 9 April 2022
Who am I?
• 1999-2007 CSIRO/UCL
• Enterprise Java evaluation, SPECjAppServer200X benchmarks
• OGSA (Grid) evaluation (UCL)
• 2007-2017 NICTA/Startup CTO
• R&D and consulting - performance modelling large scale
distributed systems, for government and large enterprises
• 2017-present Instaclustr (soon NetApp)
• Technology Evangelist for Instaclustr
• 100+ Blogs, Talks
Performance modelling from
APM data (2007-2017, NICTA/Startup)
APM data (Dynatrace’s PurePath®) → Transformation → Performance Model → Execution → Simulation Tool + Model Visualisation and Graphs
Automated pipeline, worked with Dynatrace PurePath® data
Distributed traces with breakdown data (resource types and time)
Revisit with OpenTelemetry?
Cloud Platform for Big Data
Open Source Technologies
New! And Workflows
Instaclustr Managed Platform
Scaling is Easy! Cassandra and Kafka
Homogeneous distributed clusters → horizontally scalable
www.cassandra.apache.org/_/cassandra-basics.html
But actually lots of moving parts
(source: http://trumpetb.net/loco/rodsf.html)
Complications – DCs, Racks, Nodes, Partitions,
Replication Factor, Time (for auto-scaling)
Rows have a partition key and are stored in different partitions
Example 1 – Cassandra Auto-Scaling
Two Ways of Resizing Clusters
1 - Horizontal Scaling
• Add nodes, no interruption
• But scale up only (not down)
• Takes time, puts extra load on cluster as data streams to extra nodes
2 - Vertical Scaling
• Replace nodes with bigger (or smaller) node types (more/less cores)
• Scale up and down
• Takes time, temporary reduction in capacity
• Choice of how many nodes are replaced concurrently – by “node” (1 node at a
time), by “rack” (all nodes in a rack), or in between
Cluster resizing time – by node vs. by
rack – by rack is faster but …?
Cluster = 6 nodes, 3 racks, 2 nodes per rack
By node (concurrency 1)
By rack (concurrency 2)
Resizing by node – capacity reduced by 1/6 total
nodes each resize operation (simplified model)
Resizing by rack – capacity reduced by 2/6
nodes each resize operation
Comparison – resize by rack faster but has
bigger capacity hit during resize
Observations
• In both cases
• The eventual capacity is double the original
• The cluster capacity is reduced during resizing
• By rack is faster, but has worse capacity reduction during resizing
• By node is slower, but has less capacity reduction during resizing
• If the capacity during resize is exceeded, latencies will increase
• Made worse by Cassandra load balancing which assumes equal sized
nodes
• By node, more nodes in the cluster reduce the impact of reduced cluster
capacity during resizing (some clusters have 100s of nodes)
• But the majority of our clusters have <= 6 nodes
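The simplified capacity model above can be sketched in a few lines. This is only an illustration under the slide's stated simplification (nodes being replaced contribute no capacity and all other nodes contribute equally); the function name `resize_profile` is hypothetical.

```python
from math import ceil


def resize_profile(total_nodes: int, concurrency: int) -> tuple[int, float]:
    """Return (number of resize operations, fraction of capacity left during each).

    Simplified model: nodes being replaced contribute no capacity, and
    every other node contributes an equal share.
    """
    if not 1 <= concurrency <= total_nodes:
        raise ValueError("concurrency must be between 1 and total_nodes")
    operations = ceil(total_nodes / concurrency)
    capacity_during_resize = (total_nodes - concurrency) / total_nodes
    return operations, capacity_during_resize


# Cluster from the slides: 6 nodes, 3 racks, 2 nodes per rack
by_node = resize_profile(total_nodes=6, concurrency=1)
by_rack = resize_profile(total_nodes=6, concurrency=2)
print(f"by node: {by_node[0]} operations at {by_node[1]:.0%} capacity")
print(f"by rack: {by_rack[0]} operations at {by_rack[1]:.0%} capacity")
```

With the 6-node cluster this gives 6 operations at 83% capacity by node and 3 operations at 67% capacity by rack. Trying a larger `total_nodes` with `concurrency=1` shows why bigger clusters feel the reduction less.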
Auto-scaling model - increasing load → linear
regression over 1 hour extrapolated to future
We predict the cluster will reach
100% capacity around the 280
minute mark (220 minutes in the
future)
[Chart legend: Measured, Extrapolated]
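A minimal sketch of the extrapolation step, assuming an ordinary least-squares fit over one hour of utilisation samples. The sample values are hypothetical, chosen so that the fitted line reaches 100% at the 280 minute mark mentioned on the slide; the function names are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_to_reach(target, slope, intercept):
    """Time at which the fitted line reaches `target`, or None if load is not rising."""
    if slope <= 0:
        return None
    return (target - intercept) / slope


# Hypothetical samples: utilisation (%) measured every 10 minutes for 1 hour
minutes = [0, 10, 20, 30, 40, 50, 60]
load_pct = [30.0, 32.5, 35.0, 37.5, 40.0, 42.5, 45.0]

slope, intercept = fit_line(minutes, load_pct)
t_full = time_to_reach(100.0, slope, intercept)
print(f"100% capacity predicted at minute {t_full:.0f}, "
      f"{t_full - minutes[-1]:.0f} minutes after the last sample")
```

Real utilisation data is noisy, so a production rule would likely refit on a sliding window and treat the prediction as an estimate rather than a deadline.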
Resize by Rack vs. Node - initiated in time to
prevent overloading during resize operation
Resize by rack must be initiated sooner than resize by node, even though it is faster to resize, because it has less capacity
during the resize (67% vs. 83% of initial capacity)
[Chart labels: By Rack, By Node]
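The timing argument can be illustrated with a small calculation. The sketch assumes load grows linearly and that the cluster stays at its reduced capacity until the whole resize completes. The load model and both resize durations are hypothetical; the slides only state that resizing by rack is faster.

```python
def latest_start(slope, intercept, capacity_fraction, resize_minutes,
                 full_capacity=100.0):
    """Latest minute at which a resize can start without overloading the cluster.

    Assumes load(t) = intercept + slope * t with slope > 0, and that the
    cluster runs at `capacity_fraction` of its original capacity until the
    resize has finished.
    """
    reduced_capacity = full_capacity * capacity_fraction
    # Minute at which the rising load would reach the reduced capacity
    t_overload = (reduced_capacity - intercept) / slope
    return t_overload - resize_minutes


# Hypothetical load model: 30% at minute 0, rising 0.25 points per minute.
# Hypothetical durations: 60 minutes by node, 30 minutes by rack.
by_node = latest_start(0.25, 30.0, capacity_fraction=5 / 6, resize_minutes=60)
by_rack = latest_start(0.25, 30.0, capacity_fraction=4 / 6, resize_minutes=30)
print(f"by node: start no later than minute {by_node:.0f}")
print(f"by rack: start no later than minute {by_rack:.0f}")
```

Under these assumptions the by-rack resize has to start earlier despite being twice as fast, because the load reaches 67% of capacity well before it reaches 83%.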
Auto-scaling POC – worked!
Monitoring API → Linear Regression + Rules → Provisioning API
Rules generalized to allow for
• scaling up and down
• resizing by any number of nodes concurrently, up to rack size
Example 2 – Kafka Topics
“Kongo” Logistics IoT Application
Kafka is a topic-based pub-sub messaging system:
- Producers send messages to topics.
- Consumers subscribe to topics of interest, e.g. parties.
- When they poll they only receive messages sent to those topics.
[Diagram: a Producer sends messages to Topic “Parties” and Topic “Work”. Consumers subscribed to Topic “Parties” poll to receive messages from “Parties”; consumers not subscribed to “Work” do not receive “Work” messages.]
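The topic semantics can be mimicked with a toy in-memory model. This is not the Kafka client API; `ToyBroker` and its methods are hypothetical names used only to show that a poll returns messages from subscribed topics and nothing else. Each consumer here tracks its own position, which corresponds to every consumer being in its own group.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a topic-based broker (illustration only)."""

    def __init__(self):
        self.messages = defaultdict(list)      # topic -> list of messages
        self.subscriptions = defaultdict(set)  # consumer -> subscribed topics
        self.next_index = defaultdict(int)     # (consumer, topic) -> next unread index

    def send(self, topic, message):
        self.messages[topic].append(message)

    def subscribe(self, consumer, topic):
        self.subscriptions[consumer].add(topic)

    def poll(self, consumer):
        """Return unread messages from the consumer's subscribed topics only."""
        received = []
        for topic in sorted(self.subscriptions[consumer]):
            start = self.next_index[(consumer, topic)]
            received.extend(self.messages[topic][start:])
            self.next_index[(consumer, topic)] = len(self.messages[topic])
        return received


broker = ToyBroker()
broker.subscribe("consumer-1", "Parties")
broker.subscribe("consumer-3", "Work")
broker.send("Parties", "party at 8pm")
broker.send("Work", "meeting at 9am")

first_poll = broker.poll("consumer-1")   # only the "Parties" message
second_poll = broker.poll("consumer-1")  # nothing new since the last poll
work_poll = broker.poll("consumer-3")    # only the "Work" message
print(first_poll, second_poll, work_poll)
```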
Design choices: Many vs. One Topic?
1 - Many (100s) of topics
100s of locations (Warehouses, Trucks)
Each location has a topic and multiple
consumers
2 - One topic
One topic for all locations
Single topic wins! But why?
[Bar chart, throughput in TPS: many topics 7,200; single topic 1,120,000]
Example 3 – Anomaly Detection
JoAnn Morgan Apollo 11 Mission Control
Multiple technologies: Kafka,
Cassandra, Kubernetes
Massively Scalable Anomaly Detection
– Tuning knobs (Orange h/w, yellow s/w)
Scaling is (too) Easy!
Initially just increased h/w resources
But scalability not great
[Chart: Total Cores vs. Billions of checks/day (pre-tuning)]
Tuning required! Scalability Post-tuning
[Chart: Total Cores vs. Billions of checks/day, with series for pre-tuning and post-tuning]
Tuning – Optimize s/w resources
(red arrows)
1. Minimize Kafka Consumers (thread pool 1)
2. Minimize Cassandra Connections
3. Maximize Cassandra client concurrency (thread pool 2)
Example 4 – What’s really going on -
behind the Kafka partitions?
Kafka topic partitions enable
consumer concurrency
partitions >= consumers
[Diagram: a Producer writes to Partitions 1 to n of Topic “Parties”; the Consumers in a Consumer Group share work within the group.]
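Why the partition count must be at least the consumer count can be shown with a simplified assignment sketch. Within a consumer group each partition is read by only one consumer, so any consumers beyond the partition count sit idle. The round-robin function below is an illustration with hypothetical names; Kafka's actual assignors are more involved.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one group.

    Each partition is given to exactly one consumer in the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["consumer-1", "consumer-2", "consumer-3"]

enough = assign_partitions(4, group)   # every consumer gets at least one partition
too_few = assign_partitions(2, group)  # one consumer gets nothing to read
idle = [consumer for consumer, parts in too_few.items() if not parts]
print(enough)
print(too_few, "idle:", idle)
```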
Fan out requires many consumers and
partitions
Can be caused by:
1 Design – many topics and/or many consumers (Kongo
Example)
2 Little’s Law (Anomaly Detection Example)
Concurrency (Consumers) = Time x Throughput
Slow consumers require more of them to keep up with the target
throughput – having 2 thread pools helped in the Anomaly
Detection example to reduce the consumer time and count
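Little's Law as stated above translates directly into a sizing estimate. The throughput and per-message times below are hypothetical, and the function name is illustrative; times are given in whole milliseconds to keep the arithmetic exact.

```python
from math import ceil


def consumers_needed(target_throughput_per_s, time_per_message_ms):
    """Little's Law: concurrency = time x throughput, rounded up to whole consumers."""
    return ceil(target_throughput_per_s * time_per_message_ms / 1000)


# Hypothetical target of 10,000 messages/s
slow = consumers_needed(10_000, time_per_message_ms=50)  # 500 consumers
fast = consumers_needed(10_000, time_per_message_ms=5)   # 50 consumers
print(f"slow consumers: {slow}, fast consumers: {fast}")
```

Cutting the time each consumer spends per message by 10x cuts the required consumers, and therefore the required partitions, by the same factor.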
Kafka write architecture – partition
replication
Benchmarking revealed that partitions
and replication factor are the culprit
[Chart: Kafka Partitions vs. Throughput. X axis: Partitions, 1 to 10,000 (log scale); Y axis: TPS; series: Replication Factor 3 and Replication Factor 1. Cluster: 3 nodes x 4 cores = 12 cores total]
Implications?
• Bigger Cluster (more nodes, bigger nodes)
• Design to minimize topics and consumers
• Optimize consumers for minimum time
• Always benchmark with many partitions
• Blame Apache ZooKeeper?
• Responsible for Kafka control
• From version 3.0 it’s being replaced by the native KRaft protocol
• May enable more partitions
Example 5 – Data Pipeline – multiple technologies
Kafka Connect, Elasticsearch, PostgreSQL etc
(Source: Paul Brebner)
See Blogs
→ ApacheCon Open Source Performance
Engineering Track
→ www.linkedin.com/pulse/call-papers-performance-engineering-track-apachecon-paul-brebner/
+
Bullet Train → Speed and Scale
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
For further Information see blogs www.instaclustr.com/paul-brebner/

Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Scaling Open Source Big Data Cloud Applications is Easy/Hard

  • 1. Scaling Open Source Big Data Cloud Applications is Easy/Hard Paul Brebner Instaclustr—Technology Evangelist ©Instaclustr Pty Limited, 2022 HotCloudPerf - 9 April 2022
  • 2. Who am I? • 1999-2007 CSIRO/UCL • Enterprise Java evaluation, SPECJAppServer200X benchmarks • OGSA (Grid) evaluation (UCL) • 2007-2017 NICTA/Startup CTO • R&D and consulting - performance modelling large scale distributed systems, for government and large enterprises • 2017-present Instaclustr (soon NetApp) • Technology Evangelist for Instaclustr • 100+ Blogs, Talks
  • 3. Performance modelling from APM data (2007-2017, NICTA/Startup). Pipeline: APM data (Dynatrace’s PurePath®) → Transformation → Performance Model → Execution → Simulation Tool + Model Visualisation and Graphs. Automated pipeline, worked with Dynatrace PurePath® data. Distributed traces with breakdown data (resource types and time). Revisit with OpenTelemetry?
  • 4. Cloud Platform for Big Data Open Source Technologies. New! And Workflows. Instaclustr Managed Platform
  • 5. Scaling is Easy! Cassandra and Kafka. Homogeneous distributed clusters → horizontally scalable. www.cassandra.apache.org/_/cassandra-basics.html
  • 6. But actually lots of moving parts (source: http://trumpetb.net/loco/rodsf.html)
  • 7. Complications – DCs, Racks, Nodes, Partitions, Replication Factor, Time (for auto-scaling) Rows have a partition key and are stored in different partitions
  • 8. Example 1 – Cassandra Auto-Scaling
  • 9. Two Ways of Resizing Clusters. 1 - Horizontal Scaling: • Add nodes, no interruption • But scale up only (not down) • Takes time, puts extra load on cluster as data streams to extra nodes. 2 - Vertical Scaling: • Replace nodes with bigger (or smaller) node types (more/fewer cores) • Scale up and down • Takes time, temporary reduction in capacity • Choice of how many nodes are replaced concurrently: by “node” (1 node at a time), by “rack” (all nodes in a rack), or in between
  • 10. Cluster resizing time – by node vs. by rack – by rack is faster but …? Cluster = 6 nodes, 3 racks, 2 nodes per rack. Chart series: By node (concurrency 1), By rack (concurrency 2)
  • 11. Resizing by node – capacity is reduced by 1/6 of the total nodes during each resize operation (simplified model)
  • 12. Resizing by rack – capacity is reduced by 2/6 of the total nodes during each resize operation
  • 13. Comparison – resizing by rack is faster but takes a bigger capacity hit during the resize
  • 14. Observations • In both cases • The eventual capacity is double the original • The cluster capacity is reduced during resizing • By rack is faster, but has a worse capacity reduction during resizing • By node is slower, but has less capacity reduction during resizing • If the capacity during resize is exceeded, latencies will increase • Made worse by Cassandra load balancing, which assumes equal-sized nodes • By node, more nodes in the cluster reduce the impact of reduced cluster capacity during resizing (some clusters have 100s of nodes) • But the majority of our clusters have <= 6 nodes
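
The simplified capacity model on slides 11 to 14 can be sketched in a few lines. This is an illustrative reconstruction, not the author's tool: the function name `capacity_profile` and the `scale_factor` parameter are hypothetical, and it assumes that nodes being replaced contribute no capacity, resized nodes contribute double, and untouched nodes contribute one unit. It ignores the load-balancing effect mentioned on slide 14.

```python
from fractions import Fraction


def capacity_profile(total_nodes, concurrency, scale_factor=2):
    """Relative cluster capacity (1 == original) during each resize operation.

    Assumptions of this sketch: nodes being replaced contribute nothing,
    already-resized nodes contribute `scale_factor`, untouched nodes contribute 1.
    """
    if total_nodes <= 0 or concurrency <= 0:
        raise ValueError("total_nodes and concurrency must be positive")
    profile = []
    resized = 0
    while resized < total_nodes:
        # The last batch may be smaller than the requested concurrency.
        batch = min(concurrency, total_nodes - resized)
        untouched = total_nodes - resized - batch
        profile.append(Fraction(resized * scale_factor + untouched, total_nodes))
        resized += batch
    return profile


# The 6-node, 3-rack cluster from slide 10.
by_node = capacity_profile(6, 1)  # one node at a time
by_rack = capacity_profile(6, 2)  # one rack (2 nodes) at a time
print(len(by_node), "operations, minimum capacity", min(by_node))
print(len(by_rack), "operations, minimum capacity", min(by_rack))
```

Under these assumptions, resizing by node takes six operations with a minimum capacity of 5/6 of the original, while resizing by rack takes three operations with a minimum of 2/3, which lines up with the 83% and 67% figures quoted on slide 16.
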
  • 15. Auto-scaling model - increasing load → linear regression over 1 hour, extrapolated to the future. We predict the cluster will reach 100% capacity around the 280-minute mark (220 minutes in the future). Chart series: Extrapolated, Measured
  • 16. Resize by Rack vs. Node - initiated in time to prevent overloading during the resize operation. Resize by rack must be initiated sooner than resize by node, even though it’s faster to resize, as it has less capacity during the resize (67% compared with 83% of initial capacity). Chart series: By Rack, By Node
  • 17. Auto-scaling POC – worked! Monitoring API → Linear Regression + Rules → Provisioning API. Rules generalized to allow for • scaling up and down • resizing by any number of nodes concurrently, up to rack size
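
The prediction step on slides 15 and 16 can be sketched as an ordinary least-squares fit followed by an extrapolation to a capacity threshold. This is a minimal sketch, not the POC's code: `fit_line` and `minutes_until` are hypothetical helper names, and the load series is synthetic, chosen only so that the full-capacity crossing lands near the 280-minute mark quoted on slide 15. It is not the measured data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until(threshold, slope, intercept, now):
    """Minutes from `now` until the fitted load reaches `threshold`.

    Returns None when the load is flat or falling (no crossing predicted).
    """
    if slope <= 0:
        return None
    return max(0.0, (threshold - intercept) / slope - now)


# Synthetic one-hour window (assumption): load as a fraction of initial capacity.
minutes = list(range(0, 61, 5))
load = [0.30 + 0.0025 * t for t in minutes]
slope, intercept = fit_line(minutes, load)
now = minutes[-1]

# Thresholds: full capacity, and the reduced capacity during a resize
# by node (5/6) or by rack (2/3) of a 6-node cluster.
for label, threshold in [("full", 1.0), ("by node", 5 / 6), ("by rack", 2 / 3)]:
    print(label, round(minutes_until(threshold, slope, intercept, now), 1))
```

Because the by-rack threshold is lower, its crossing arrives earlier than the by-node one, which is why a by-rack resize has to be initiated sooner. A real rule would also subtract the expected duration of the resize operation itself, which this sketch leaves out.
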
  • 18. Example 2 – Kafka Topics: “Kongo” Logistics IoT Application
  • 19. Kafka is a topic-based pub-sub messaging system: - Producers send messages to topics. - Consumers subscribe to topics of interest, e.g. parties. - When they poll, they only receive messages sent to those topics. Diagram: a Producer, Topic “Parties”, Topic “Work” and four Consumers; consumers subscribed to Topic “Parties” poll to receive messages from “Parties”; consumers not subscribed to “Work” do not receive “Work” messages
  • 20. Design choices: Many vs. One Topic? 1 - Many (100s) of topics: 100s of locations (Warehouses, Trucks). Each location has a topic and multiple consumers
  • 21. 2 - One topic: one topic for all locations
  • 22. Single topic wins! But why? Chart (TPS): many topics, 7,200; single topic, 1,120,000
  • 23. Example 3 – Anomaly Detection (image: JoAnn Morgan, Apollo 11 Mission Control)
  • 25. Massively Scalable Anomaly Detection – Tuning knobs (orange h/w, yellow s/w). Scaling is (too) Easy! Initially just increased h/w resources
  • 26. But scalability not great. Chart: Total Cores vs. Billions of checks/day (pre-tuning); x-axis: Total Cores (0 to 700); y-axis: Billions checks/day (0 to 8)
  • 27. Tuning required! Scalability post-tuning. Chart: Total Cores vs. Billions of checks/day; x-axis: Total Cores (0 to 700); y-axis: Billions checks/day (0 to 20); series: Billions of checks/day (pre-tuning), Billions of checks/day (post-tuning)
  • 28. Tuning – Optimize s/w resources (red arrows): 1. Minimize Kafka Consumers (thread pool 1) 2. Minimize Cassandra Connections 3. Maximize Cassandra client concurrency (thread pool 2)
  • 29. Example 4 – What’s really going on - behind the Kafka partitions?
  • 30. Kafka topic partitions enable consumer concurrency: partitions >= consumers. Consumers share work within groups. Diagram: a Producer writes to Topic “Parties” (Partition 1, Partition 2, … Partition n), read by the Consumers in a Consumer Group
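
To illustrate why slide 30 asks for partitions >= consumers, here is a toy round-robin assignment of partitions to the consumers in one group. It is a sketch for intuition only and not Kafka's actual assignment code; the function name `assign_partitions` and the consumer names are hypothetical. The point it shows is that each partition is read by exactly one consumer in the group, so any consumers beyond the partition count receive nothing.

```python
def assign_partitions(num_partitions, consumers):
    """Toy round-robin assignment of partition ids to consumers in one group."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        # Each partition goes to exactly one consumer in the group.
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["c1", "c2", "c3", "c4"]
assignment = assign_partitions(3, group)
idle = [consumer for consumer, parts in assignment.items() if not parts]
print(assignment)
print("idle consumers:", idle)
```

With three partitions and four consumers, one consumer is left idle, so adding consumers beyond the partition count adds no concurrency.
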
  • 31. Fan-out requires many consumers and partitions. It can be caused by: 1. Design: many topics and/or many consumers (Kongo example). 2. Little’s Law (Anomaly Detection example): Concurrency (Consumers) = Time × Throughput. Slow consumers require more of them to sustain the target throughput; having 2 thread pools helped in the Anomaly Detection example by reducing the consumer time and count
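
Little’s Law as stated on slide 31 (Concurrency = Time × Throughput) reduces to a one-line calculation. The sketch below is illustrative: the function name `consumers_needed` and the example figures are assumptions, not numbers from the talk. Time is taken in whole milliseconds to keep the arithmetic exact.

```python
import math


def consumers_needed(target_tps, processing_ms):
    """Little's Law: concurrency = time per message x throughput.

    `target_tps` is messages per second; `processing_ms` is the time one
    consumer spends on a message, in whole milliseconds.
    """
    if target_tps < 0 or processing_ms < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil(target_tps * processing_ms / 1000)


# Hypothetical figures: the same target throughput with slow and fast consumers.
slow = consumers_needed(10_000, 50)  # 50 ms per message
fast = consumers_needed(10_000, 5)   # 5 ms per message
print(slow, fast)
```

Cutting the per-message time by a factor of ten cuts the required consumers, and therefore the minimum partition count, by the same factor, which is the effect the second thread pool had in the Anomaly Detection example.
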
  • 32. Kafka write architecture – partition replication
  • 33. Benchmarking revealed that partitions and replication factor are the culprits. Chart: Kafka Partitions vs. Throughput; x-axis: Partitions (1 to 10000); y-axis: TPS (0 to 900000); series: Replication Factor 3 (TPS), Replication Factor 1 (TPS). Cluster: 3 nodes x 4 cores = 12 cores total
  • 34. Implications? • Bigger cluster (more nodes, bigger nodes) • Design to minimize topics and consumers • Optimize consumers for minimum time • Always benchmark with many partitions • Blame Apache ZooKeeper? • Responsible for Kafka control • From version 3.0 it’s being replaced by the native KRaft protocol • May enable more partitions
  • 35. Example 5 – Data Pipeline – multiple technologies: Kafka Connect, Elasticsearch, PostgreSQL etc. (Source: Paul Brebner) See Blogs
  • 36. → ApacheCon Open Source Performance Engineering Track → www.linkedin.com/pulse/call-papers-performance-engineering-track-apachecon-paul-brebner/ + Bullet Train → Speed and Scale