Kafka Streams makes it very easy to run a distributed stream processing system: you just start multiple processes with the same "application id", point them at the same Kafka cluster, and they will form a distributed system on their own. Making a distributed system so easy to start up has an Achilles' heel: there's no natural learning curve that forces you to learn how to operate the system in order to stand it up.
Of all the pitfalls this situation creates, by far the most complex and confusing for users is the topic of rebalancing and task assignment. Kafka Streams partition and task assignment is especially complex because of advanced features like "smooth scale-out", which uses warm-up tasks and so-called probing rebalances. This talk aims to completely demystify the system from an operational perspective. You will learn what is really happening across Kafka and Kafka Streams, how to interpret the logs and metrics, and how to adjust the configs to achieve your desired outcomes.
4. Ice Breaker
How many people here use Kafka Streams?
How many of you are scared of rebalances?
5. Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING)
● Another rebalance 10 minutes later
● Cycle continues all day
6. Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING)
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
7. Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING)
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
14. Why are Rebalances so important?
❌ In the normal course of events, rebalances are fairly undesirable.
○ When partitions are moved from one consumer to another, the consumer loses its
current state;
■ if it was caching any data, it will need to refresh its caches
■ slowing down the application until the consumer sets up its state again.
✅ Rebalances provide the consumer group with high availability and elastic scalability
○ Easily and safely add and remove consumers
○ Survive consumer crash
Throughout this talk we will discuss how to safely handle rebalances and how to avoid
unnecessary ones.
15. How does Kafka make the rebalancing process less painful?
● 2.4: Optimistically continue processing during rebalances
○ Incremental Cooperative Protocol
● 2.6: Continue processing on stateful tasks while re-distributing state in the background
○ Smooth Scaling Protocol
21. Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
22. How to dig in?
● First challenge: logs are spread over all 10 different instances
○ Solution: collect all logs into a common logging platform
● Collating by timestamp helps, but it's not always easy to see the right causal ordering
○ Solution: line them up by Generation ID
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
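Pulling the generation ID out of each line can be automated. A minimal sketch, assuming the log format shown above (the class and regex are ours, not part of Kafka):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenerationExtractor {
    // Matches the "generationId=<n>" field in the consumer's join log line.
    private static final Pattern GENERATION = Pattern.compile("generationId=(\\d+)");

    // Returns the generation ID if the log line contains one.
    public static Optional<Integer> generationId(String logLine) {
        Matcher m = GENERATION.matcher(logLine);
        return m.find() ? Optional.of(Integer.parseInt(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "2022-09-14 14:33:44,991 INFO ... Successfully joined group "
                + "with generation Generation{generationId=3188, ...}";
        System.out.println(generationId(line).orElse(-1)); // prints 3188
    }
}
```

Grouping the collected log lines by this value recovers the causal ordering even when timestamps from different instances disagree.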
23. Cluster state in each generation
● Second challenge: Keeping track of which instances are in the cluster and the tasks they're assigned
○ Solution: find this log line: Assigned tasks […] including stateful [...] to clients as:...
1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]
f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]
0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]
24. Cluster state in each generation
● Second challenge: Keeping track of which instances are in the cluster and the tasks they're assigned
○ Solution: find this log line: Assigned tasks […] including stateful [...] to clients as:...
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
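These per-client lines can be parsed mechanically to build the table of cluster state for each generation. A sketch assuming the exact layout above (the pattern and names are our invention, not a Kafka API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AssignmentParser {
    // Parses a line like:
    //   A =[activeTasks: ([0_1, 0_2]) standbyTasks: ([0_0])]
    // into {client, active tasks, standby tasks}.
    private static final Pattern LINE = Pattern.compile(
            "(\\S+)\\s*=\\[activeTasks: \\(\\[(.*?)\\]\\) standbyTasks: \\(\\[(.*?)\\]\\)\\]");

    public static Map<String, List<String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) throw new IllegalArgumentException("unrecognized line: " + line);
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("client", List.of(m.group(1)));
        result.put("active", split(m.group(2)));
        result.put("standby", split(m.group(3)));
        return result;
    }

    private static List<String> split(String tasks) {
        return tasks.isBlank() ? List.of() : List.of(tasks.split(",\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(parse("B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]"));
    }
}
```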
25. Start at the beginning
● As with any troubleshooting, try to find the first event.
● In our case, the cluster ran for a while before rebalancing and then had a bunch of rebalances,
● so we'll figure out the generation and cause of that first one.
26. Start at the beginning
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
3188
27. What happened?
● It looks like one of the instances (B) dropped out of the group
3188
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
…
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
28. How do we react?
● B was active for 0_1 and standby for 0_2
3188
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
…
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
29. How do we react (0_1)?
● B was active for 0_1 and standby for 0_2
● A had a standby for 0_1, so it will take over as active
● Need a new standby for 0_1, so C picks it up
3188
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
…
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
30. How do we react (0_2)?
● B was active for 0_1 and standby for 0_2
● A had a standby for 0_1, so it will take over as active
● Need a new standby for 0_1, so C picks it up
● Need a new standby for 0_2, so A picks it up
3188
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
…
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
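The reaction worked through in these slides can be captured as a toy algorithm: promote a standby for each orphaned active task, then backfill missing standbys. This is a deliberate simplification of the real StreamsPartitionAssignor (which also balances load, weighs lag, and schedules warmups), so treat it only as an illustration of the idea:

```java
import java.util.*;

public class FailoverSketch {

    // assignment: instance -> {"active" -> tasks, "standby" -> tasks}
    public static Map<String, Map<String, Set<String>>> react(
            Map<String, Map<String, Set<String>>> assignment, String failed) {
        // Copy the surviving instances' assignment.
        Map<String, Map<String, Set<String>>> next = new TreeMap<>();
        for (var e : assignment.entrySet()) {
            if (e.getKey().equals(failed)) continue;
            next.put(e.getKey(), roles(e.getValue().get("active"), e.getValue().get("standby")));
        }
        // 1. Promote a standby for each task that lost its active copy.
        for (String task : assignment.get(failed).get("active")) {
            for (var roles : next.values()) {
                if (roles.get("standby").remove(task)) {
                    roles.get("active").add(task);
                    break;
                }
            }
        }
        // 2. Re-create a standby for every task that no longer has one,
        //    placed on an instance that holds no copy of that task.
        Set<String> allTasks = new TreeSet<>();
        Set<String> covered = new TreeSet<>();
        for (var roles : next.values()) {
            allTasks.addAll(roles.get("active"));
            covered.addAll(roles.get("standby"));
        }
        for (String task : allTasks) {
            if (covered.contains(task)) continue;
            for (var roles : next.values()) {
                if (!roles.get("active").contains(task) && !roles.get("standby").contains(task)) {
                    roles.get("standby").add(task);
                    break;
                }
            }
        }
        return next;
    }

    public static Map<String, Set<String>> roles(Set<String> active, Set<String> standby) {
        Map<String, Set<String>> m = new HashMap<>();
        m.put("active", new TreeSet<>(active));
        m.put("standby", new TreeSet<>(standby));
        return m;
    }

    public static void main(String[] args) {
        // The pre-failure assignment from the slides, then B fails.
        Map<String, Map<String, Set<String>>> before = new TreeMap<>();
        before.put("A", roles(Set.of("0_0"), Set.of("0_1")));
        before.put("B", roles(Set.of("0_1"), Set.of("0_2")));
        before.put("C", roles(Set.of("0_2"), Set.of("0_0")));
        System.out.println(react(before, "B"));
        // A ends up active on [0_0, 0_1] with standby [0_2];
        // C stays active on [0_2] with standbys [0_0, 0_1].
    }
}
```

Running it on the slide's scenario reproduces the generation-3188 assignment shown above.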
31. B is back!
● Generation 3189: B is back in the cluster
3189
2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
32. B is back!
● Generation 3189: B is back in the cluster
● Leave everything where it is while B warms up on 0_1 and 0_2
3189
2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
33. B is back!
● Generation 3189: B is back in the cluster
● Leave everything where it is while B warms up on 0_1 and 0_2
● Schedule a probing rebalance to check on B in 10 minutes
3189
2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
34. Checking in on B
● Generation 3190: probing rebalance to check on the progress of the warmup
3190
2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
2022-09-14 15:00:44,552 INFO … Decided on assignment … with followup probing rebalance
2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled
35. Checking in on B
● Generation 3190: probing rebalance to check on the progress of the warmup
● Not ready yet
3190
2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
2022-09-14 15:00:44,552 INFO … Decided on assignment … with followup probing rebalance
2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled
36. B is ready for action!
● Generation 3191: B is warmed up!
3191
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
37. B is ready for action!
● Generation 3191: B is warmed up!
● Make B active on 0_1 (and swap A back to standby)
3191
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
38. B is ready for action!
● Generation 3191: B is warmed up!
● Make B active on 0_1 (and swap A back to standby)
● Drop extra standbys (0_2 on A and 0_1 on C)
3191
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
39. Bonus round
● Surprise rebalance a few ms later!
3192
2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
40. Bonus round
● Surprise rebalance a few ms later!
● Can't swap ownership of 0_1 in one rebalance
2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
3192
41. Bonus round
● Surprise rebalance a few ms later!
● Can't swap ownership of 0_1 in one rebalance
● As soon as A revokes 0_1, it triggers another rebalance so B can start processing
2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
3192
42. Recap
● Although we saw five rebalances, there was really just one bad event.
● All the others were just the cluster healing itself
● Longest outage was detecting the failure.
○ Task 0_1 was down for session.timeout.ms when instance B timed out
43. Recap
● Although we saw five rebalances, there was really just one bad event.
● All the others were just the cluster healing itself
● Longest outage was detecting the failure.
○ Task 0_1 was down for session.timeout.ms when instance B timed out
● It's not always easy to identify the first failure based on timing, but you can rule out probing and cooperative rebalances as causes
● Sometimes there's more than one failure going on at the same time
44. Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. (Mostly) Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
47. Tuning: Sensitivity Tradeoff
● Configs
○ max.poll.interval.ms
○ heartbeat.interval.ms, session.timeout.ms
● Less Sensitive: Prevent unnecessary rebalances
○ Occasional long I/O
○ Long GC pauses
○ Flaky networks
● More Sensitive: Detect failures faster
○ Might get unnecessary rebalance
○ Failover faster to meet uptime SLAs
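A sketch of what a "less sensitive" tuning might look like. The values are illustrative, not recommendations, and plain string keys are used so the snippet runs without any Kafka dependency (the real configs are `ConsumerConfig` constants):

```java
import java.util.Properties;

public class SensitivityConfig {
    // Illustrative "less sensitive" consumer settings: tolerate long GC
    // pauses and flaky networks at the cost of slower failure detection.
    public static Properties lessSensitive() {
        Properties props = new Properties();
        // Allow up to 10 minutes between poll() calls before the member
        // is considered stuck (the default is 5 minutes).
        props.put("max.poll.interval.ms", "600000");
        // Allow 60 seconds of missed heartbeats before the member is
        // kicked from the group (the default is 45 seconds in recent versions).
        props.put("session.timeout.ms", "60000");
        // Heartbeat at roughly 1/3 of the session timeout, per the
        // usual recommendation.
        props.put("heartbeat.interval.ms", "20000");
        return props;
    }
}
```

Tightening the same three values moves you toward the "more sensitive" end: faster failover, but more false-positive rebalances.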
48. Tuning: Speed up probing rebalance phase
● Real Scenario!
○ App had 60 tasks
○ Warming up a task took 30-40 min
○ Migrating two tasks at a time
60 tasks * 40 minutes / 2 concurrent warmups = 20 hours of probing rebalances
● Solutions:
○ More concurrent warmups: max.warmup.replicas
■ Reduces the total recovery time (instead of having moderate load for a long time, have a higher load for a shorter time)
○ Reduce task warmup time (by allowing higher throughput from the broker)
■ max.poll.records, max.partition.fetch.bytes, fetch.max.bytes
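The back-of-the-envelope math above generalizes. The formula below is our rough model (real recovery time also depends on broker throughput and acceptable.recovery.lag), but it shows why raising max.warmup.replicas shrinks the probing phase proportionally:

```java
public class WarmupMath {
    // Rough total recovery time: rounds of probing rebalances, each round
    // warming up `concurrentWarmups` tasks for `minutesPerTask` minutes.
    public static double recoveryHours(int tasks, int minutesPerTask, int concurrentWarmups) {
        int rounds = (int) Math.ceil(tasks / (double) concurrentWarmups);
        return rounds * minutesPerTask / 60.0;
    }

    public static void main(String[] args) {
        // The scenario from the slide: 60 tasks, 40 min each, 2 at a time.
        System.out.println(recoveryHours(60, 40, 2));  // prints 20.0
        // Raising max.warmup.replicas to 10 cuts it to 4 hours.
        System.out.println(recoveryHours(60, 40, 10)); // prints 4.0
    }
}
```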
49. What to monitor?
● last-poll-seconds-ago (consumer-metrics): compare against max.poll.interval.ms (consumer config)
● last-heartbeat-seconds-ago (consumer-metrics): compare against session.timeout.ms and heartbeat.interval.ms (consumer configs)
● last-rebalance-seconds-ago (consumer-coordinator-metrics): for probing rebalances, compare against probing.rebalance.interval.ms (Streams config)
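One hypothetical way to act on these metric/config pairs: warn when a "seconds ago" metric gets close to its configured deadline. The 80% threshold and the method names here are our invention, not a Kafka recommendation:

```java
public class RebalanceAlerts {
    // Returns true when a "last X seconds ago" metric has consumed more
    // than `warnFraction` of its configured deadline.
    public static boolean nearDeadline(double secondsAgo, long deadlineMs, double warnFraction) {
        return secondsAgo * 1000.0 > warnFraction * deadlineMs;
    }

    public static void main(String[] args) {
        // last-poll-seconds-ago vs. max.poll.interval.ms (default 300000 ms):
        System.out.println(nearDeadline(290, 300_000, 0.8)); // prints true
        // last-heartbeat-seconds-ago vs. session.timeout.ms (45000 ms):
        System.out.println(nearDeadline(5, 45_000, 0.8));    // prints false
    }
}
```

Crossing the threshold is a "warn" (frequent rebalances likely soon); actually exceeding the deadline is the point at which to alert, since a rebalance will follow.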
50. Prediction: What will you see in the next Kafka
Summit?
● KIP-848: The Next Generation of the Consumer
Rebalance Protocol
○ Adds ability to compute assignments in the
broker
● Add generation id to all rebalance logs
51. Go forth and rebalance!
John Roesler
john@confluent.io
Alieh Saeedi
asaeedi@confluent.io