SlideShare uma empresa Scribd logo
1 de 31
Baixar para ler offline
Scaling Open Source Big
Data Cloud Applications is
Easy/Hard
Paul Brebner
Instaclustr—Technology Evangelist
©Instaclustr Pty Limited, 2022
DeveloperWeek 10 May 2022
Who am I?
• Previously
• R&D in distributed systems and performance engineering.
• Last 5 years
• Technology Evangelist for Instaclustr (soon NetApp)
• 100+ Blogs, demo applications, talks
• Open Source technologies including
• Apache Cassandra, Spark, Kafka, Zookeeper
• and Redis, OpenSearch, PostgreSQL, Kubernetes,
Prometheus, OpenTracing, etc
Cloud Platform for Big Data
Open Source Technologies
Latest addition is
Workflows with Uber’s
Instaclustr Managed Platform
©Instaclustr Pty Limited, 2021
Cloud Platform for Big Data
Open Source Technologies
Latest addition is
Workflows with Uber’s
This talk focuses on Cassandra and Kafka
©Instaclustr Pty Limited, 2021
Scaling is Easy! Cassandra and Kafka
Homogeneous distributed clusters à horizontally scalable
www.cassandra.apache.org/_/cassandra-basics.html
But actually lots of moving parts
(source: http://trumpetb.net/loco/rodsf.html)
Complications – DCs, Racks, Nodes, Partitions,
Replication Factor, Time (for auto-scaling)
Rows have a
partition key
and are
stored in
different
partitions
Example 1 – Cassandra Auto-Scaling
©Instaclustr Pty Limited, 2021
Two Ways of Resizing Clusters
1 - Horizontal Scaling
• Add nodes, no interruption
• But scale up only (not down)
• Takes time, puts extra load on cluster as data streams to extra nodes
2 - Vertical Scaling
• Replace nodes with bigger (or smaller) node types (more/less cores)
• Scale up and down
• Takes time, temporary reduction in capacity
• Choice of how many nodes are replaced concurrently – by “node” (1 node at a
time) or by “rack” (all nodes in a rack) , or in-between
Cluster resizing time – by node vs. by
rack – by rack is faster but …?
Cluster = 6 nodes, 3 racks, 2 nodes per rack
By node (concurrency 1)
By rack (concurrency 2)
Resizing by node – capacity reduced by 1/6 total
nodes each resize operation (simplified model)
Resizing by rack – capacity reduced by 2/6
nodes each resize operation
Comparison – resize by rack faster but has
bigger capacity hit during resize
Observations
• If the capacity during resize is exceeded latencies will increase
• Made worse by Cassandra load balancing which assumes equal sized
nodes
• By node, more nodes in the Cluster reduces the impact of reduced cluster
capacity during resizing (some clusters have 100s of nodes) – but will take
longer
• Many of our clusters have <= 6 nodes
Auto-scaling model - increasing load à linear
regression over 1 hour extrapolated to future
We predict the cluster will reach
100% capacity around the 280
minute mark (220 minutes in the
future)
Extrapolated
Measured
Resize by Rack vs. Node - initiated in time to
prevent overloading during resize operation
Resize by rack must be initiated sooner c.f. resize by node, even thought it’s faster to resize, as it has less capacity
during resize (67% c.f. 83% of initial capacity)
By
Rack
By Node
Auto-scaling POC – worked!
Monitoring API
Linear Regression +
Rules
Provisioning API
Rules generalized to allow for
• scaling up and down
• resizing by any number of nodes concurrently, up to rack size
Example 2 – Anomaly Detection
©Instaclustr Pty Limited, 2021
JoAnn Morgan Apollo 11 Mission Control
Multiple technologies: Kafka,
Cassandra, Kubernetes
Massively Scalable Anomaly Detection
– Tuning knobs (Orange h/w, yellow s/w)
Scaling is (too) Easy!
Initially just increased h/w resources
But scalability not great
0
1
2
3
4
5
6
7
8
0 100 200 300 400 500 600 700
Billions
checks/day
Total Cores
Total Cores vs. Billions of checks/day (pre-tuning)
Tuning required! Scalability Post-tuning
0
2
4
6
8
10
12
14
16
18
20
0 100 200 300 400 500 600 700
Billions
checks/day
Total Cores
Total Cores vs. Billions of checks/day (pre-tuning)
Billions of checks/day (pre-tuning) Billions of checks/day (post-tuning)
Tuning – Optimize s/w resources
(red arrows)
1
2
3
1. Minimize Kafka Consumers (thread pool 1)
2. Minimize Cassandra Connections
3. Maximize Cassandra client concurrency (thread
pool 2)
Example 3 – What’s really going on -
behind the Kafka partitions?
©Instaclustr Pty Limited, 2021
©Instaclustr Pty
Limited 2019,
2021, 2022
Kafka topic partitions enable
consumer concurrency
partitions >= consumers
Partition n
Topic “Parties”
Partition 1
Producer
Partition 2
Consumer Group
Consumer
Consumer
Consumers share
work within groups
Consumer
High consumer/partition fan out
Can be caused by:
1 Design – many topics and/or many consumers
2 Slow consumers à need more consumers to increase
throughput
Kafka write architecture – partition
replication
Benchmarking revealed that partitions
and replication factor are the culprit
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
1 10 100 1000 10000
TPS
Partitions
Kafka Partitions vs. Throughput
Cluster: 3 nodes x 4 cores = 12 cores total
Replication Factor 3 (TPS) Replication Factor 1 (TPS)
Implications?
• Bigger Cluster (more nodes, bigger nodes)
• Design to minimize topics and consumers
• Optimize consumers for minimum time
• Always benchmark with many partitions
• Blame the Apache Zookeeper?
• Responsible for Kafka control
• From version 3.0 it’s being replaced by native KRaft protocol
• Not yet production ready
• May enable more partitions (but may not impact throughput)
Scaling is Mostly Easy!
§ Using Scalable Open Source Big Data Technologies
§ Hosted by suitable Cloud providers
§ With suitable monitoring, understanding of autoscaling
and how different software “knobs” interact, and by
scaling incrementally
© Instaclustr Pty Limited, 2022
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
For further Information see blogs www.instaclustr.com/paul-brebner/

Mais conteúdo relacionado

Semelhante a OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard

C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...DataStax Academy
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData Inc
 
Spinnaker VLDB 2011
Spinnaker VLDB 2011Spinnaker VLDB 2011
Spinnaker VLDB 2011sandeep_tata
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelinesSumant Tambe
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafkaconfluent
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsPanagiotis Papadopoulos
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...In-Memory Computing Summit
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingDibyendu Bhattacharya
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Dave Holland
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure bloomreacheng
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operationniallmilton
 
Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Monal Daxini
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
 
Timothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLTimothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLEdunomica
 

Semelhante a OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard (20)

C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...
 
Kafka vs kinesis
Kafka vs kinesisKafka vs kinesis
Kafka vs kinesis
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
 
Spinnaker VLDB 2011
Spinnaker VLDB 2011Spinnaker VLDB 2011
Spinnaker VLDB 2011
 
Tuning kafka pipelines
Tuning kafka pipelinesTuning kafka pipelines
Tuning kafka pipelines
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and Limitations
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Timothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLTimothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for ML
 

Mais de Paul Brebner

Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining PhilosophersApache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining PhilosophersPaul Brebner
 
Spinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache KafkaSpinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache KafkaPaul Brebner
 
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Paul Brebner
 
A Visual Introduction to Apache Kafka
A Visual Introduction to Apache KafkaA Visual Introduction to Apache Kafka
A Visual Introduction to Apache KafkaPaul Brebner
 
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...Paul Brebner
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Paul Brebner
 
Grid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and PotentialGrid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and PotentialPaul Brebner
 
Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Paul Brebner
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Paul Brebner
 
0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980's0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980'sPaul Brebner
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...Paul Brebner
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...Paul Brebner
 
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...Paul Brebner
 
How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...Paul Brebner
 
Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Paul Brebner
 

Mais de Paul Brebner (20)

Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining PhilosophersApache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
 
Spinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache KafkaSpinning your Drones with Cadence Workflows and Apache Kafka
Spinning your Drones with Cadence Workflows and Apache Kafka
 
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
 
A Visual Introduction to Apache Kafka
A Visual Introduction to Apache KafkaA Visual Introduction to Apache Kafka
A Visual Introduction to Apache Kafka
 
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
 
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
 
Grid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and PotentialGrid Middleware – Principles, Practice and Potential
Grid Middleware – Principles, Practice and Potential
 
Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...Grid middleware is easy to install, configure, secure, debug and manage acros...
Grid middleware is easy to install, configure, secure, debug and manage acros...
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
 
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
 
0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980's0b101000 years of computing: a personal timeline - decade "0", the 1980's
0b101000 years of computing: a personal timeline - decade "0", the 1980's
 
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
 
How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...How to Improve the Observability of Apache Cassandra and Kafka applications...
How to Improve the Observability of Apache Cassandra and Kafka applications...
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...Automatic Performance Modelling from Application Performance Management (APM)...
Automatic Performance Modelling from Application Performance Management (APM)...
 
Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...
 

Último

Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingWSO2
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهMohamed Sweelam
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 

Último (20)

Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard

  • 1. Scaling Open Source Big Data Cloud Applications is Easy/Hard Paul Brebner Instaclustr—Technology Evangelist ©Instaclustr Pty Limited, 2022 DeveloperWeek 10 May 2022
  • 2. Who am I? • Previously • R&D in distributed systems and performance engineering. • Last 5 years • Technology Evangelist for Instaclustr (soon NetApp) • 100+ Blogs, demo applications, talks • Open Source technologies including • Apache Cassandra, Spark, Kafka, Zookeeper • and Redis, OpenSearch, PostgreSQL, Kubernetes, Prometheus, OpenTracing, etc
  • 3. Cloud Platform for Big Data Open Source Technologies Latest addition is Workflows with Uber’s Instaclustr Managed Platform ©Instaclustr Pty Limited, 2021
  • 4. Cloud Platform for Big Data Open Source Technologies Latest addition is Workflows with Uber’s This talk focuses on Cassandra and Kafka ©Instaclustr Pty Limited, 2021
  • 5. Scaling is Easy! Cassandra and Kafka Homogeneous distributed clusters à horizontally scalable www.cassandra.apache.org/_/cassandra-basics.html
  • 6. But actually lots of moving parts (source: http://trumpetb.net/loco/rodsf.html)
  • 7. Complications – DCs, Racks, Nodes, Partitions, Replication Factor, Time (for auto-scaling) Rows have a partition key and are stored in different partitions
  • 8. Example 1 – Cassandra Auto-Scaling ©Instaclustr Pty Limited, 2021
  • 9. Two Ways of Resizing Clusters 1 - Horizontal Scaling • Add nodes, no interruption • But scale up only (not down) • Takes time, puts extra load on cluster as data streams to extra nodes 2 - Vertical Scaling • Replace nodes with bigger (or smaller) node types (more/less cores) • Scale up and down • Takes time, temporary reduction in capacity • Choice of how many nodes are replaced concurrently – by “node” (1 node at a time) or by “rack” (all nodes in a rack) , or in-between
  • 10. Cluster resizing time – by node vs. by rack – by rack is faster but …? Cluster = 6 nodes, 3 racks, 2 nodes per rack By node (concurrency 1) By rack (concurrency 2)
  • 11. Resizing by node – capacity reduced by 1/6 total nodes each resize operation (simplified model)
  • 12. Resizing by rack – capacity reduced by 2/6 nodes each resize operation
  • 13. Comparison – resize by rack faster but has bigger capacity hit during resize
  • 14. Observations • If the capacity during resize is exceeded latencies will increase • Made worse by Cassandra load balancing which assumes equal sized nodes • By node, more nodes in the Cluster reduces the impact of reduced cluster capacity during resizing (some clusters have 100s of nodes) – but will take longer • Many of our clusters have <= 6 nodes
  • 15. Auto-scaling model - increasing load à linear regression over 1 hour extrapolated to future We predict the cluster will reach 100% capacity around the 280 minute mark (220 minutes in the future) Extrapolated Measured
  • 16. Resize by Rack vs. Node - initiated in time to prevent overloading during resize operation Resize by rack must be initiated sooner c.f. resize by node, even thought it’s faster to resize, as it has less capacity during resize (67% c.f. 83% of initial capacity) By Rack By Node
  • 17. Auto-scaling POC – worked! Monitoring API Linear Regression + Rules Provisioning API Rules generalized to allow for • scaling up and down • resizing by any number of nodes concurrently, up to rack size
  • 18. Example 2 – Anomaly Detection ©Instaclustr Pty Limited, 2021 JoAnn Morgan Apollo 11 Mission Control
  • 20. Massively Scalable Anomaly Detection – Tuning knobs (Orange h/w, yellow s/w) Scaling is (too) Easy! Initially just increased h/w resources
  • 21. But scalability not great 0 1 2 3 4 5 6 7 8 0 100 200 300 400 500 600 700 Billions checks/day Total Cores Total Cores vs. Billions of checks/day (pre-tuning)
  • 22. Tuning required! Scalability Post-tuning 0 2 4 6 8 10 12 14 16 18 20 0 100 200 300 400 500 600 700 Billions checks/day Total Cores Total Cores vs. Billions of checks/day (pre-tuning) Billions of checks/day (pre-tuning) Billions of checks/day (post-tuning)
  • 23. Tuning – Optimize s/w resources (red arrows) 1 2 3 1. Minimize Kafka Consumers (thread pool 1) 2. Minimize Cassandra Connections 3. Maximize Cassandra client concurrency (thread pool 2)
  • 24. Example 3 – What’s really going on - behind the Kafka partitions? ©Instaclustr Pty Limited, 2021 ©Instaclustr Pty Limited 2019, 2021, 2022
  • 25. Kafka topic partitions enable consumer concurrency partitions >= consumers Partition n Topic “Parties” Partition 1 Producer Partition 2 Consumer Group Consumer Consumer Consumers share work within groups Consumer
  • 26. High consumer/partition fan out Can be caused by: 1 Design – many topics and/or many consumers 2 Slow consumers à need more consumers to increase throughput
  • 27. Kafka write architecture – partition replication
  • 28. Benchmarking revealed that partitions and replication factor are the culprit 0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1 10 100 1000 10000 TPS Partitions Kafka Partitions vs. Throughput Cluster: 3 nodes x 4 cores = 12 cores total Replication Factor 3 (TPS) Replication Factor 1 (TPS)
  • 29. Implications? • Bigger Cluster (more nodes, bigger nodes) • Design to minimize topics and consumers • Optimize consumers for minimum time • Always benchmark with many partitions • Blame the Apache Zookeeper? • Responsible for Kafka control • From version 3.0 it’s being replaced by native KRaft protocol • Not yet production ready • May enable more partitions (but may not impact throughput)
  • 30. Scaling is Mostly Easy! § Using Scalable Open Source Big Data Technologies § Hosted by suitable Cloud providers § With suitable monitoring, understanding of autoscaling and how different software “knobs” interact, and by scaling incrementally © Instaclustr Pty Limited, 2022