Kafka Streams Rebalances and Assignments:
The Whole Story
John Roesler
john@confluent.io
Alieh Saeedi
asaeedi@confluent.io
Agenda
1. Intro and setup
2. Background
3. Walkthrough
4. Takeaways
Ice Breaker
How many people here use Kafka Streams?
How many of you are scared of rebalances?
Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING)
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
Wait a Minute…
What is a Rebalance?!
Kafka Streams
1. JoinGroup (Subscription)
2. SyncGroup (Assignment)
Task Assignment
Rebalancing: moving task ownership from one instance to another (because of changes in consumer group or topic subscriptions)
Why are Rebalances so important?
❌ In the normal course of events, rebalances are fairly undesirable.
○ When partitions are moved from one consumer to another, the consumer loses its current state:
■ if it was caching any data, it will need to refresh its caches
■ this slows down the application until the consumer sets up its state again
✅ Rebalances provide the consumer group with high availability and elastic scalability
○ Easily and safely add and remove consumers
○ Survive consumer crashes
Throughout this talk we will discuss how to safely handle rebalances and how to avoid unnecessary ones.
How does Kafka make the rebalancing process less painful?
● 2.4: Optimistically continue processing during rebalances
○ Incremental Cooperative Protocol
● 2.6: Continue processing on stateful tasks while redistributing state in the background
○ Smooth Scaling Protocol
Old Protocol (before 2.4, 2019)
Incremental Cooperative Protocol (2.4 in 2019)
Smooth Scaling Protocol (2.6 in 2020)
Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
How to dig in?
● First challenge: logs are spread across all 10 instances
● Collect all logs into a common logging platform
● Collate by timestamp, but it is not always easy to see the right causal ordering
○ Solution: line them up by Generation ID
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
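Lining logs up by Generation ID can be sketched in a few lines of Python. This is only an illustration: it assumes the logs have already been collected per instance, that each line starts with a 23-character timestamp as in the sample above, and the function name `joins_by_generation` and the instance labels are made up for the example.

```python
import re
from collections import defaultdict

# Matches the generation id in lines such as
# "... Successfully joined group with generation Generation{generationId=3188, ... }"
GENERATION_RE = re.compile(r"generationId=(\d+)")


def joins_by_generation(log_lines_by_instance):
    """Return {generation id: [(instance, timestamp), ...]} for every join event.

    `log_lines_by_instance` maps an instance label (hypothetical, e.g. "A")
    to the list of raw log lines collected from that instance.
    """
    joins = defaultdict(list)
    for instance, lines in log_lines_by_instance.items():
        for line in lines:
            if "Successfully joined group" not in line:
                continue
            match = GENERATION_RE.search(line)
            if match:
                timestamp = line[:23]  # e.g. "2022-09-14 14:33:44,991"
                joins[int(match.group(1))].append((instance, timestamp))
    # Sort by generation so the causal order is visible even if clocks drift.
    return {gen: sorted(members) for gen, members in sorted(joins.items())}


if __name__ == "__main__":
    sample = {
        "A": ["2022-09-14 14:33:44,991 INFO … Successfully joined group "
              "with generation Generation{generationId=3188, … }"],
    }
    for generation, members in joins_by_generation(sample).items():
        print(generation, members)
```

Grouping by generation rather than by wall-clock time means that two instances with slightly different clocks still end up in the same bucket.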
Cluster state in each generation
● Second challenge: keeping track of which instances are in the cluster and the tasks that they're assigned
○ Solution: find this log line: Assigned tasks […] including stateful [...] to clients as:...
1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]
f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]
0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]
For readability, the rest of the walkthrough labels these instances A, B, and C:
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
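To track cluster state per generation, the assignment lines can be turned into a data structure. The sketch below assumes the lines look exactly like the excerpts above (the real log format may differ between Kafka Streams versions); `parse_assignment` and the regular expression are illustrative names, not part of any Kafka API.

```python
import re

# Pattern for lines such as
# "A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]"
ASSIGNMENT_RE = re.compile(
    r"^\s*(?P<client>[^=\s]+)\s*=\[activeTasks: \(\[(?P<active>[^\]]*)\]\) "
    r"standbyTasks: \(\[(?P<standby>[^\]]*)\]\)\]\s*$"
)


def _parse_tasks(raw):
    """Split "0_0, 0_1" into ["0_0", "0_1"]; an empty string yields []."""
    return [task.strip() for task in raw.split(",") if task.strip()]


def parse_assignment(lines):
    """Return {client: {"active": [...], "standby": [...]}} for matching lines."""
    assignment = {}
    for line in lines:
        match = ASSIGNMENT_RE.match(line)
        if match:
            assignment[match.group("client")] = {
                "active": _parse_tasks(match.group("active")),
                "standby": _parse_tasks(match.group("standby")),
            }
    return assignment


if __name__ == "__main__":
    excerpt = [
        "A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]",
        "B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]",
        "C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]",
    ]
    print(parse_assignment(excerpt))
```

Lines that do not match the pattern are skipped, so the whole log excerpt for a generation can be passed in unfiltered.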
Start at the beginning
● As with any troubleshooting, try to find the first event.
● In our case, the cluster went a while before rebalancing and then had a bunch of them, so we'll figure out the generation and cause of that first rebalance
Generation 3188:
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
What happened?
● It looks like one of the instances (B) dropped out of the group
Before:
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
…
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
After (generation 3188):
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
How do we react?
● B was active for 0_1 and standby for 0_2
● A had a standby for 0_1, so it takes over as active
● Need a new standby for 0_1, so C picks it up
● Need a new standby for 0_2, so A picks it up
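The reasoning above (who lost which task, and who picked it up) can be reproduced mechanically by diffing two assignments. A minimal sketch, using the before and after assignments from the slides as plain dictionaries; the helper names `owners` and `diff_assignments` are hypothetical.

```python
def owners(assignment, role):
    """Map each task to the set of clients holding it in the given role."""
    result = {}
    for client, tasks in assignment.items():
        for task in tasks[role]:
            result.setdefault(task, set()).add(client)
    return result


def diff_assignments(before, after):
    """List (task, role, lost_by, gained_by) for every task whose owners changed."""
    changes = []
    for role in ("active", "standby"):
        old, new = owners(before, role), owners(after, role)
        for task in sorted(set(old) | set(new)):
            lost = old.get(task, set()) - new.get(task, set())
            gained = new.get(task, set()) - old.get(task, set())
            if lost or gained:
                changes.append((task, role, sorted(lost), sorted(gained)))
    return changes


# Assignments as shown in the walkthrough, before and after B dropped out.
BEFORE = {
    "A": {"active": ["0_0"], "standby": ["0_1"]},
    "B": {"active": ["0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0"]},
}
AFTER = {
    "A": {"active": ["0_0", "0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0", "0_1"]},
}

if __name__ == "__main__":
    for change in diff_assignments(BEFORE, AFTER):
        print(change)
```

For this input the diff reports three changes: active 0_1 moves from B to A, standby 0_1 moves from A to C, and standby 0_2 moves from B to A, which matches the bullets.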
B is back!
● Generation 3189: B is back in the cluster
● Leave everything where it is while B warms up on 0_1 and 0_2
● Schedule a probing rebalance to check on B in 10 minutes
2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
Checking in on B
● Generation 3190: probing rebalance to check on the progress of the warmup
● Not ready yet
2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
2022-09-14 15:00:44,552 INFO … Decided on assignment … with followup probing rebalance
2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled
B is ready for action!
● Generation 3191: B is warmed up!
● Make B active on 0_1 (and swap A back to standby)
● Drop extra standbys (0_2 on A and 0_1 on C)
Before:
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
After:
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
Bonus round
● Surprise rebalance a few ms later!
● Can't swap ownership of 0_1 in one rebalance
● As soon as A revokes 0_1, it triggers another rebalance so B can start processing
Generation 3192:
2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_3])]
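Why the swap of 0_1 needs two rebalances can be shown with a toy model of the cooperative idea: a task that changes owner is first revoked from its old owner, and only handed to the new owner in the follow-up rebalance. This is a simplified sketch of that idea for active tasks only, not Kafka's actual assignor code; `cooperative_steps` is a hypothetical helper.

```python
def cooperative_steps(current, target):
    """Split a change of active-task ownership into two assignments.

    First rebalance: each member only receives target tasks that it already
    owns or that nobody owns; tasks owned by someone else are withheld so the
    old owner can revoke them first. Second rebalance: the full target.
    """
    owner = {task: member for member, tasks in current.items() for task in tasks}
    first = {
        member: sorted(t for t in tasks if owner.get(t) in (None, member))
        for member, tasks in target.items()
    }
    second = {member: sorted(tasks) for member, tasks in target.items()}
    return first, second


if __name__ == "__main__":
    # Active tasks before and after B is warmed up, as in the walkthrough.
    current = {"A": ["0_0", "0_1"], "B": [], "C": ["0_2"]}
    target = {"A": ["0_0"], "B": ["0_1"], "C": ["0_2"]}
    first, second = cooperative_steps(current, target)
    print("first rebalance: ", first)
    print("second rebalance:", second)
```

In the first step nobody is active for 0_1 (A has given it up, B has not received it yet), which is why the follow-up rebalance arrives only a few milliseconds later.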
Recap
● Although we saw five rebalances, there was really just one bad event.
● All the others were just the cluster healing itself.
● The longest outage was detecting the failure.
○ Task 0_1 was down for session.timeout.ms when instance B timed out
● It's not always easy to identify the first failure based on timing, but you can eliminate probing and cooperative rebalances
● Sometimes there's more than one failure going on at the same time
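Eliminating probing and cooperative rebalances can be approximated by filtering on the reason text in the "Request joining group due to" lines. The sketch below uses only the reason strings that appear in this walkthrough; other Kafka versions may word them differently, and `candidate_root_causes` is a hypothetical helper name.

```python
# Reason fragments seen in the walkthrough for self-healing rebalances.
FOLLOW_UP_MARKERS = (
    "Scheduled probing rebalance",  # probing rebalance checking on a warmup
    "tasks changing ownership",     # second step of a cooperative hand-over
)


def is_follow_up(line):
    """True if the rebalance reason marks a probing or cooperative follow-up."""
    return any(marker in line for marker in FOLLOW_UP_MARKERS)


def candidate_root_causes(lines):
    """Keep only rebalance requests that are not explained by a follow-up."""
    return [
        line
        for line in lines
        if "Request joining group due to" in line and not is_follow_up(line)
    ]


if __name__ == "__main__":
    walkthrough = [
        "2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing",
        "2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing",
        "2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance",
        "2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance",
        "2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership",
    ]
    for line in candidate_root_causes(walkthrough):
        print(line)
```

Applied to the five rebalances above, this leaves the two membership changes (B dropping out and B returning), which is a much shorter list to investigate.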
Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. (Mostly) Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
(Alieh) Tuning the system (5-10 minutes) (old slide)
https://docs.google.com/document/d/1XMVUlRJUHTuas2aUqP1nY_nOuorMM2w8JPhhzdBoszc/edit#heading=h.2j5e73a1tone
Tuning in general is great, but ideally we'd tie it to the scenario somehow.
● More concurrent warmups -> reduce the total recovery time (e.g., for indeed it was 20 hours) -> didn't like moderate load for a long time; prefer higher load for a shorter time
● Reduce task warmup time (by allowing higher throughput from the broker)
● Improve HA -> more standbys
● Consumer group protocol settings to avoid missed polls, heartbeats, etc. -> make it less sensitive (and less responsive)
What to monitor
● Poll, heartbeat
● State
● When to alert (SLA violation) vs. warn (frequent rebalances, etc.)
Tuning: Sensitivity Tradeoff
● Configs
○ max.poll.interval.ms
○ heartbeat.interval.ms, session.timeout.ms
● Less Sensitive: Prevent unnecessary rebalances
○ Occasional long I/O
○ Long GC pauses
○ Flaky networks
● More Sensitive: Detect failures faster
○ Might get unnecessary rebalances
○ Failover faster to meet uptime SLAs
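One way to reason about the tradeoff is to read the three configs as upper bounds on how long a failure can go unnoticed. The sketch below is an illustration only: the values are examples rather than recommendations, the function names are hypothetical, and the one-third rule is the general guidance from the Kafka consumer configuration docs for heartbeat.interval.ms relative to session.timeout.ms.

```python
def detection_bounds_seconds(config):
    """Rough upper bounds on how long each kind of failure can go undetected."""
    return {
        # A crashed or unreachable instance is noticed after the session times out.
        "crashed_instance": config["session.timeout.ms"] / 1000,
        # An instance that is alive but stuck between polls is noticed later.
        "stuck_processing": config["max.poll.interval.ms"] / 1000,
    }


def sensitivity_warnings(config):
    """Flag combinations that are likely to cause unnecessary rebalances."""
    warnings = []
    if config["heartbeat.interval.ms"] * 3 > config["session.timeout.ms"]:
        warnings.append(
            "heartbeat.interval.ms is more than 1/3 of session.timeout.ms; "
            "a couple of missed heartbeats may expire the session"
        )
    return warnings


if __name__ == "__main__":
    # Example values only; pick yours from your own uptime SLA and workload.
    example = {
        "session.timeout.ms": 45_000,
        "heartbeat.interval.ms": 3_000,
        "max.poll.interval.ms": 300_000,
    }
    print(detection_bounds_seconds(example))
    print(sensitivity_warnings(example))
```

Raising the timeouts makes both bounds larger (less sensitive, slower failover); lowering them does the opposite.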
Tuning: Speed up probing rebalance phase
● Real Scenario!
○ App had 60 tasks
○ Warming up a task took 30-40 min
○ Migrating two tasks at a time
■ 60 tasks * 40 minutes / 2 warmups = 20 hours of probing rebalances
● Solution:
○ More concurrent warmups: max.warmup.replicas
■ Reduce the total recovery time (instead of having moderate load for a long time, have a higher load for a shorter time)
○ Reduce task warmup time (by allowing higher throughput from the broker)
■ max.poll.records, max.partition.fetch.bytes, fetch.max.bytes
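The 20-hour figure follows directly from the numbers in the scenario, and the same arithmetic shows what raising max.warmup.replicas would buy. A back-of-the-envelope sketch, assuming every task has to be warmed up, every warmup takes the full 40 minutes, and warmups run in fixed-size batches; `probing_phase_hours` is a hypothetical helper.

```python
import math


def probing_phase_hours(tasks, warmup_minutes, max_warmup_replicas):
    """Estimate the total warmup time when tasks migrate in fixed-size batches."""
    batches = math.ceil(tasks / max_warmup_replicas)
    return batches * warmup_minutes / 60


if __name__ == "__main__":
    # Scenario from the slide: 60 tasks, 40 minutes each, 2 warmups at a time.
    print(probing_phase_hours(60, 40, 2))
    # Same workload with more concurrent warmups.
    for replicas in (4, 6, 10):
        print(replicas, probing_phase_hours(60, 40, replicas))
```

With two concurrent warmups the estimate is 20 hours; doubling the concurrency to four halves it to 10 hours, at the cost of a higher restore load while it lasts.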
What to monitor?
[Figure: Probing Rebalance; last-rebalance-seconds-ago; last poll seconds ago vs. default max.poll.interval.ms; last heartbeat seconds ago vs. default session.timeout.ms; heartbeat interval]
last-poll-seconds-ago: consumer-metrics
max.poll.interval.ms: Consumer config
last-heartbeat-seconds-ago: consumer-coordinator-metrics
session.timeout.ms: Consumer config
heartbeat.interval.ms: Consumer config
last-rebalance-seconds-ago: consumer-coordinator-metrics
probing.rebalance.interval.ms: Streams config
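Each metric is meaningful only next to the config that bounds it, so a monitor can compare the two directly. The sketch below assumes the metric values have already been scraped into a dictionary (for example from JMX); the 80% warning threshold and the function name `poll_and_heartbeat_status` are arbitrary choices for illustration.

```python
def poll_and_heartbeat_status(metrics, config, warn_ratio=0.8):
    """Classify poll and heartbeat freshness as "ok", "warn", or "alert".

    `metrics` holds values in seconds, `config` holds values in milliseconds.
    `warn_ratio` is a hypothetical threshold: warn once the elapsed time
    reaches this fraction of the configured limit.
    """
    checks = {
        "poll": (metrics["last-poll-seconds-ago"], config["max.poll.interval.ms"]),
        "heartbeat": (metrics["last-heartbeat-seconds-ago"], config["session.timeout.ms"]),
    }
    status = {}
    for name, (seconds_ago, limit_ms) in checks.items():
        elapsed_ms = seconds_ago * 1000
        if elapsed_ms >= limit_ms:
            status[name] = "alert"  # the instance will be kicked out of the group
        elif elapsed_ms >= warn_ratio * limit_ms:
            status[name] = "warn"   # getting close to the limit
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    example_metrics = {"last-poll-seconds-ago": 250, "last-heartbeat-seconds-ago": 2}
    example_config = {"max.poll.interval.ms": 300_000, "session.timeout.ms": 45_000}
    print(poll_and_heartbeat_status(example_metrics, example_config))
```

A "warn" on the poll check is an early sign that processing between polls is slow enough to risk a rebalance, before one actually happens.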
Prediction: What will you see at the next Kafka Summit?
● KIP-848: The Next Generation of the Consumer Rebalance Protocol
○ Adds ability to compute assignments in the broker
● Add generation id to all rebalance logs
Go forth and rebalance!
John Roesler
john@confluent.io
Alieh Saeedi
asaeedi@confluent.io

Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Último (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Kafka Streams Rebalances and Assignments: The Whole Story with Alieh Saeedi & John Roesler

  • 1. Kafka Streams Rebalances and Assignments: The Whole Story John Roesler john@confluent.io Alieh Saeedi asaeedi@confluent.io
  • 4. Ice Breaker
      How many people here use Kafka Streams?
      How many of you are scared of rebalances?
  • 7. Pop quiz!
      Scenario:
      ● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
      ● Rebalance completes (State transition from REBALANCING to RUNNING)
      ● Another rebalance 10 minutes later
      ● Cycle continues all day
      Is this:
      1. Bad
      2. Good
      3. Not enough information
      What will you do?
      1. Wait for another day
      2. Try restarting some/all instances
      3. Nothing: my on-call shift is over!
  • 9. Wait a Minute… What is a Rebalance?!
  • 13. Rebalancing: Moving task ownership from one instance to another (because of changes in consumer group membership or topic subscriptions)
  • 14. Why are Rebalances so important?
      ❌ In the normal course of events, rebalances are fairly undesirable.
        ○ When partitions are moved from one consumer to another, the consumer loses its current state;
          ■ if it was caching any data, it will need to refresh its caches,
          ■ slowing down the application until the consumer sets up its state again.
      ✅ Rebalances provide the consumer group with high availability and elastic scalability
        ○ Easily and safely add and remove consumers
        ○ Survive consumer crashes
      Throughout this talk we will discuss how to safely handle rebalances and how to avoid unnecessary ones.
  • 15. How does Kafka make the rebalancing process less painful?
      ● 2.4: Optimistically continue processing during rebalances
        ○ Incremental Cooperative Protocol
      ● 2.6: Continue processing on stateful tasks while re-distributing state in the background
        ○ Smooth Scaling Protocol
  • 16. Old Protocol (before 2.4, 2019)
  • 19. Smooth Scaling Protocol (2.6 in 2020)
  • 21. Pop quiz!
      Scenario:
      ● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
      ● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
      ● Another rebalance 10 minutes later
      ● Cycle continues all day
      Is this:
      1. Bad
      2. Good
      3. Not enough information
      What will you do?
      1. Wait for another day
      2. Try restarting some/all instances
      3. Nothing: my on-call shift is over!
  • 22. How to dig in?
      ● First challenge: logs are spread over all 10 different instances
      ● Collect all logs into a common logging platform
      ● Collate by timestamp, but it is not always easy to see the right causal ordering
        ○ Solution: line them up by Generation ID
      2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
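
A minimal sketch of lining logs up by Generation ID, assuming each instance's log is already available as a list of lines. The regular expression targets only the "Successfully joined group" line quoted on the slide; the helper name `joins_by_generation`, the instance labels, and the two sample lines for instance C are hypothetical.

```python
import re
from collections import defaultdict

# Matches the timestamp and generation ID in the log line quoted on the slide.
JOINED_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*"
    r"Successfully joined group with generation Generation\{generationId=(?P<gen>\d+)"
)

def joins_by_generation(logs):
    """logs maps an instance label to its log lines.

    Returns {generation_id: [(instance, timestamp), ...]} so that events from
    different instances can be compared generation by generation.
    """
    timeline = defaultdict(list)
    for instance, lines in logs.items():
        for line in lines:
            match = JOINED_RE.search(line)
            if match:
                timeline[int(match.group("gen"))].append((instance, match.group("ts")))
    return dict(timeline)

# Sample input: the first line is from the slide, the lines for C are made up.
logs = {
    "A": [
        "2022-09-14 14:33:44,991 INFO ... Successfully joined group with generation Generation{generationId=3188, ... }",
    ],
    "C": [
        "2022-09-14 14:33:44,993 INFO ... Successfully joined group with generation Generation{generationId=3188, ... }",
        "2022-09-14 14:50:45,010 INFO ... Successfully joined group with generation Generation{generationId=3189, ... }",
    ],
}
timeline = joins_by_generation(logs)
```

Grouping by generation rather than by wall-clock time sidesteps clock skew between instances, which is the causal-ordering problem the slide mentions.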
  • 23. Cluster state in each generation
      ● Second challenge: Keeping track of which instances are in the cluster and the tasks that they're assigned
        ○ Solution: find this log line:
      Assigned tasks […] including stateful [...] to clients as:...
      1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]
      f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]
      0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]
  • 24. Cluster state in each generation (the same log line, with the instance IDs abbreviated to A, B, and C)
      Assigned tasks […] including stateful [...] to clients as:...
      A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
      B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
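
The assignment entries can be turned into a structure that is easier to compare across generations. This is a sketch that assumes the abbreviated `client=[activeTasks: ([...]) standbyTasks: ([...])]` shape shown on the slide; real log lines may carry more fields, and the names `parse_assignment` and `split_tasks` are hypothetical.

```python
import re

# One entry per client, in the abbreviated form shown on the slide.
ASSIGNMENT_RE = re.compile(
    r"(?P<client>\S+)\s*=\[activeTasks: \(\[(?P<active>[^\]]*)\]\) "
    r"standbyTasks: \(\[(?P<standby>[^\]]*)\]\)\]"
)

def split_tasks(text):
    """Turn '0_0, 0_1' into ['0_0', '0_1']; an empty string gives []."""
    return [task.strip() for task in text.split(",") if task.strip()]

def parse_assignment(text):
    """Return {client: {'active': [...], 'standby': [...]}} for every entry found."""
    return {
        match.group("client"): {
            "active": split_tasks(match.group("active")),
            "standby": split_tasks(match.group("standby")),
        }
        for match in ASSIGNMENT_RE.finditer(text)
    }

snippet = (
    "A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])] "
    "B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])] "
    "C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]"
)
assignment = parse_assignment(snippet)
```

The same pattern also accepts the UUID form, since the client ID is matched as any run of non-space characters before `=[`.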
  • 25. Start at the beginning
      ● As with any troubleshooting, try to find the first event.
      ● In our case, the cluster went a while before rebalancing, and then had a bunch of rebalances,
      ● so we'll figure out the generation and cause of that first rebalance
  • 26. Start at the beginning
      Generation 3188
      2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
      2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
  • 27. What happened?
      ● It looks like one of the instances (B) dropped out of the group
      Generation 3188
      A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
      B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
      …
      2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
      A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
      …
  • 30. How do we react? (same Generation 3188 assignments as on the previous slide)
      ● B was active for 0_1 and standby for 0_2
      ● A had a standby for 0_1, so it will take over as active
      ● Need a new standby for 0_1, so C picks it up
      ● Need a new standby for 0_2, so A picks it up
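
The reasoning on these slides amounts to diffing two assignments. A sketch of that comparison follows, using the before and after assignments from the Generation 3188 slide as input; the function name `diff_assignments` and the dictionary layout are hypothetical.

```python
def diff_assignments(before, after):
    """Report which members left or joined and which active tasks changed owner."""
    left = sorted(set(before) - set(after))
    joined = sorted(set(after) - set(before))

    def owners(assignment):
        # Map each active task to the client that owns it.
        return {
            task: client
            for client, tasks in assignment.items()
            for task in tasks["active"]
        }

    old, new = owners(before), owners(after)
    moved = {
        task: (old[task], new.get(task))
        for task in sorted(old)
        if old[task] != new.get(task)
    }
    return {"left": left, "joined": joined, "moved_active": moved}

# Assignments copied from the slide (before and after B dropped out).
before = {
    "A": {"active": ["0_0"], "standby": ["0_1"]},
    "B": {"active": ["0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0"]},
}
after = {
    "A": {"active": ["0_0", "0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0", "0_1"]},
}
change = diff_assignments(before, after)
```

For this input the diff reports that B left and that active task 0_1 moved from B to A, which matches the bullets above.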
  • 33. B is back!
      ● Generation 3189: B is back in the cluster
      ● Leave everything where it is while B warms up on 0_1 and 0_2
      ● Schedule a probing rebalance to check on B in 10 minutes
      2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
      A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
      B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
      2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
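
To check whether a later rebalance lines up with a scheduled probing rebalance, the expected time can be computed from the log timestamp. The sketch below assumes the 10-minute interval mentioned on the slide, which matches the default of `probing.rebalance.interval.ms` (600000 ms); the function name is hypothetical.

```python
from datetime import datetime, timedelta

def next_probing_rebalance(log_timestamp, interval_ms=600_000):
    """Return the earliest expected time of the follow-up probing rebalance.

    log_timestamp uses the 'YYYY-MM-DD HH:MM:SS,mmm' format from the slides.
    """
    decided_at = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S,%f")
    return decided_at + timedelta(milliseconds=interval_ms)

# Timestamp of the "Decided on assignment … with followup probing rebalance" line.
expected = next_probing_rebalance("2022-09-14 14:50:44,552")
```

A rebalance at roughly this time with the reason "Scheduled probing rebalance" is then expected behaviour and not a new failure.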
  • 35. Checking in on B
      ● Generation 3190: probing rebalance to check on the progress of the warmup
      ● Not ready yet
      2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
      2022-09-14 15:00:44,552 INFO … Decided on assignment … with followup probing rebalance
      2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled
  • 38. B is ready for action!
      ● Generation 3191: B is warmed up!
      ● Make B active on 0_1 (and swap A back to standby)
      ● Drop extra standbys (0_2 on A and 0_1 on C)
      A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
      B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
      2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
      A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
      B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
  • 41. Bonus round
      ● Surprise rebalance a few ms later!
      ● Can't swap ownership of 0_1 in one rebalance
      ● As soon as A revokes 0_1, it triggers another rebalance so B can start processing
      Generation 3192
      2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
      A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
      B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
      C =[activeTasks: ([0_2]) standbyTasks: ([0_3])]
  • 43. Recap
      ● Although we saw five rebalances, there was really just one bad event.
      ● All the others were just the cluster healing itself
      ● Longest outage was detecting the failure.
        ○ Task 0_1 was down for session.timeout.ms when instance B timed out
      ● It's not always easy to identify the first failure based on timing, but you can eliminate probing and cooperative rebalances
      ● Sometimes there's more than one failure going on at the same time
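
Eliminating probing and cooperative rebalances can be done mechanically from the join reasons. The sketch below classifies the three reason strings that appear in the walkthrough; the category names and the fallback label "investigate" are hypothetical, and other Kafka versions may word the reasons differently.

```python
import re

# The slides show the reason both with and without a colon after "due to".
REASON_RE = re.compile(r"Request joining group due to:?\s*(?P<reason>.+)")

def classify(line):
    """Return 'probing', 'cooperative', 'investigate', or None if the line has no reason."""
    match = REASON_RE.search(line)
    if match is None:
        return None
    reason = match.group("reason").strip().lower()
    if "probing rebalance" in reason:
        return "probing"        # scheduled check on a warming-up instance
    if "tasks changing ownership" in reason:
        return "cooperative"    # second step of a cooperative hand-off
    return "investigate"        # anything else may be the original failure

lines = [
    "2022-09-14 14:33:44,552 INFO ... Request joining group due to: group is already rebalancing",
    "2022-09-14 15:00:44,552 INFO ... Request joining group due to: Scheduled probing rebalance",
    "2022-09-14 15:10:44,565 INFO ... Request joining group due to tasks changing ownership",
]
labels = [classify(line) for line in lines]
```

Only the lines labelled "investigate" need a closer look when hunting for the first failure.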
  • 44. Pop quiz!
      Scenario:
      ● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
      ● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
      ● Another rebalance 10 minutes later
      ● Cycle continues all day
      Is this:
      1. Bad
      2. (Mostly) Good
      3. Not enough information
      What will you do?
      1. Wait for another day
      2. Try restarting some/all instances
      3. Nothing: my on-call shift is over!
  • 46. (Alieh) Tuning the system (5-10 minutes) (old slide)
      https://docs.google.com/document/d/1XMVUlRJUHTuas2aUqP1nY_nOuorMM2w8JPhhzdBoszc/edit#heading=h.2j5e73a1tone
      Tuning in general is great, but ideally we'd tie it to the scenario somehow.
      ● more concurrent warmups -> reduce the total recovery time (eg for indeed it was 20 hours) -> didn't like moderate load for a long time; prefer higher load for a shorter time
      ● reduce task warmup time (by allowing higher throughput from the broker)
      ● improve HA -> more standbys
      ● consumer group protocol settings to avoid missed polls/heartbeats/etc. -> make it less sensitive (and less responsive)
      What to monitor
      ● poll, heartbeat
      ● state
      ● when to alert (SLA violation) vs. warn (frequent rebalances, etc.)
  • 47. Tuning: Sensitivity Tradeoff
      ● Configs
        ○ max.poll.interval.ms
        ○ heartbeat.interval.ms, session.timeout.ms
      ● Less Sensitive: Prevent unnecessary rebalances
        ○ Occasional long I/O
        ○ Long GC pauses
        ○ Flaky networks
      ● More Sensitive: Detect failures faster
        ○ Might get unnecessary rebalances
        ○ Fail over faster to meet uptime SLAs
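
One way to reason about the tradeoff is to write the three settings down and check them against each other. The sketch below uses illustrative values, not recommendations. It encodes the guidance from the Kafka consumer configuration docs that `heartbeat.interval.ms` must be lower than `session.timeout.ms` and typically no higher than one third of it. The function names, and the simplification that a crash is noticed after `session.timeout.ms` while a stalled poll loop is noticed after `max.poll.interval.ms`, are assumptions.

```python
def check_sensitivity(cfg):
    """Return a list of warnings about the heartbeat and session settings."""
    warnings = []
    heartbeat = cfg["heartbeat.interval.ms"]
    session = cfg["session.timeout.ms"]
    if heartbeat >= session:
        warnings.append("heartbeat.interval.ms must be lower than session.timeout.ms")
    elif heartbeat * 3 > session:
        warnings.append("heartbeat.interval.ms is above 1/3 of session.timeout.ms")
    return warnings

def detection_delay_ms(cfg):
    """Rough upper bounds on how long each kind of failure can go unnoticed."""
    return {
        "crashed_instance": cfg["session.timeout.ms"],
        "stalled_poll_loop": cfg["max.poll.interval.ms"],
    }

# Illustrative values only.
less_sensitive = {
    "heartbeat.interval.ms": 3_000,
    "session.timeout.ms": 45_000,
    "max.poll.interval.ms": 300_000,
}
more_sensitive = {
    "heartbeat.interval.ms": 3_000,
    "session.timeout.ms": 6_000,
    "max.poll.interval.ms": 60_000,
}
```

With these numbers, `more_sensitive` detects a crashed instance sooner but trips the one-third guideline, so a single delayed heartbeat is more likely to cause an unnecessary rebalance.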
  • 48. Tuning: Speed up probing rebalance phase
      ● Real Scenario!
        ○ App had 60 tasks
        ○ Warming up a task took 30-40 min
        ○ Migrating two tasks at a time
          60 tasks * 40 minutes / 2 warmups = 20 hours of probing rebalances
      ● Solution:
        ○ More concurrent warmups: max.warmup.replicas
          ■ Reduce the total recovery time (instead of having moderate load for a long time, have a higher load for a shorter time)
        ○ Reduce task warmup time (by allowing higher throughput from the broker)
          ■ max.poll.records, max.partition.fetch.bytes, fetch.max.bytes
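
The 20-hour figure follows from simple arithmetic, sketched below so the effect of raising `max.warmup.replicas` can be estimated. The model assumes every task has to be warmed up, that warmups proceed in rounds of `max.warmup.replicas` tasks, and that each round takes the worst-case warmup time; the function name is hypothetical.

```python
import math

def probing_recovery_hours(tasks, warmup_minutes, max_warmup_replicas):
    """Estimate total recovery time when tasks warm up in fixed-size rounds."""
    rounds = math.ceil(tasks / max_warmup_replicas)
    return rounds * warmup_minutes / 60

# The scenario from the slide: 60 tasks, 40 minutes each, 2 at a time.
slide_scenario = probing_recovery_hours(60, 40, 2)   # 20.0 hours
# The same workload with 6 concurrent warmups (an illustrative value).
faster = probing_recovery_hours(60, 40, 6)
```

The estimate ignores the extra load that concurrent warmups put on the brokers and instances, which is the cost side of the tradeoff described above.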
  • 49. What to monitor?
      ● last-poll-seconds-ago (consumer-metrics), compared against max.poll.interval.ms (consumer config)
      ● last-heartbeat-seconds-ago (consumer-coordinator-metrics), compared against session.timeout.ms and heartbeat.interval.ms (consumer configs)
      ● last-rebalance-seconds-ago (consumer-coordinator-metrics), compared against probing.rebalance.interval.ms (Streams config)
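
A sketch of how these metrics might feed the alert-versus-warn split mentioned earlier in the talk. It assumes the metric values have already been collected into a dictionary; the function name, the `warn_fraction` threshold, and the status labels are hypothetical choices, not part of Kafka.

```python
def poll_and_heartbeat_status(metrics, config, warn_fraction=0.5):
    """Compare 'seconds ago' metrics with their timeouts (configured in ms)."""
    checks = [
        ("last-poll-seconds-ago", "max.poll.interval.ms"),
        ("last-heartbeat-seconds-ago", "session.timeout.ms"),
    ]
    status = {}
    for metric, limit_key in checks:
        elapsed_ms = metrics[metric] * 1000
        limit_ms = config[limit_key]
        if elapsed_ms >= limit_ms:
            status[metric] = "alert"   # the member has likely been dropped from the group
        elif elapsed_ms >= warn_fraction * limit_ms:
            status[metric] = "warn"    # getting close to the timeout
        else:
            status[metric] = "ok"
    return status

# Illustrative values only.
config = {"max.poll.interval.ms": 300_000, "session.timeout.ms": 45_000}
metrics = {"last-poll-seconds-ago": 200, "last-heartbeat-seconds-ago": 2}
status = poll_and_heartbeat_status(metrics, config)
```

Here the poll metric is at two thirds of its limit, so it is flagged as a warning while the heartbeat is healthy.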
  • 50. Prediction: What will you see at the next Kafka Summit?
      ● KIP-848: The Next Generation of the Consumer Rebalance Protocol
        ○ Adds ability to compute assignments in the broker
      ● Add generation ID to all rebalance logs
  • 51. Go forth and rebalance! John Roesler john@confluent.io Alieh Saeedi asaeedi@confluent.io