John Roesler, Confluent, Software Engineer
Kafka Streams cluster management is getting a huge overhaul.
We are tackling many of the hardest problems in distributed systems to solve the biggest pain points in managing Streams application clusters. Numerous improvements were recently completed or are in progress:
* Fast ownership transfer of tasks (KIP-429)
* Balanced assignment of tasks (KIP-441)
* Background state migration (KIP-441)
* High-availability Interactive Query (KIP-535)
* Graceful recovery from exceptions (KIP-572)
This talk is a deep dive into our solution to distributing and managing stateful workloads in Kafka Streams application clusters. We'll talk about the assignment algorithm itself, as well as how Streams is able to keep the cluster up for processing and serving queries while state gets migrated around the cluster in the background.
We'll go over the configurations and log messages as well as the behavior you'll observe when operating your cluster. Finally, we will talk about future opportunities for Streams cluster management and open the floor for discussion.
https://www.meetup.com/Dallas-Kafka/events/270468566/
4. “the quick brown fox”
“it was the best of times”
“it was the worst of times”
best brown fox it it of of quick the the the times times was was worst
1 best, 1 brown, 1 fox, 2 it, 2 of, 1 quick, 3 the, 2 times, 2 was, 1 worst
sentences → split → repartition → count → word-counts
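The word-count example on this slide can be sketched in plain Java to show what each stage computes. This is only an illustration using the standard library, not the Kafka Streams API; the class name `WordCountSketch` and the method `countWords` are made up for this sketch. Splitting on whitespace and lowercasing are assumptions as well.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Split each sentence into words, group identical words together
    // (the role the repartition step plays in the slide), and count each group.
    static Map<String, Long> countWords(List<String> sentences) {
        return sentences.stream()
                .flatMap(s -> Arrays.stream(s.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "the quick brown fox",
                "it was the best of times",
                "it was the worst of times");
        // Prints one "count word" pair per line, sorted by word.
        countWords(sentences).forEach((word, count) -> System.out.println(count + " " + word));
    }
}
```

In a real Streams application the grouping step is what forces the repartition: all records for the same word must land on the same task before they can be counted.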
19. Assignment Checklist
● Balance the overall number of tasks
● Balance the active tasks
● Balance stateful tasks
● For each task, assign standby to different hosts than active
21. Assignment Checklist
● Balance the overall number of tasks
● Balance the active tasks
● Balance stateful tasks
● For each task, assign standby to different hosts than active
● Balance partitions for each task across nodes
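To make two of the checklist items concrete, here is a toy round-robin sketch. It is not the Kafka Streams assignor and does not attempt the stateful-task or per-partition balancing items; it only shows active tasks spread evenly and each standby placed on a different host than its active copy. The names `ToyAssignor`, `Placement`, and `assign` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyAssignor {
    /** Where one task lives: the host running the active copy and the host holding the standby. */
    static final class Placement {
        final String activeHost;
        final String standbyHost;

        Placement(String activeHost, String standbyHost) {
            this.activeHost = activeHost;
            this.standbyHost = standbyHost;
        }
    }

    // Round-robin: task i goes active on host (i mod n) and standby on host (i + 1 mod n).
    // With at least two hosts, the standby never shares a host with its active copy,
    // and per-host active counts differ by at most one.
    static Map<String, Placement> assign(List<String> tasks, List<String> hosts) {
        if (hosts.size() < 2) {
            throw new IllegalArgumentException("need at least two hosts to separate standby from active");
        }
        Map<String, Placement> result = new LinkedHashMap<>();
        for (int i = 0; i < tasks.size(); i++) {
            String active = hosts.get(i % hosts.size());
            String standby = hosts.get((i + 1) % hosts.size());
            result.put(tasks.get(i), new Placement(active, standby));
        }
        return result;
    }
}
```

A real assignor also has to account for where state already exists, which this sketch ignores entirely.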
41. What if we lose a node? (Recovery)
[Diagram: copies of Task 0_0, 0_1, 1_0, 1_1, and 1_2 placed across Host 1, Host 2, and Host 3; Task 1_0, 1_2, and 1_1 are listed under Host 3]
42. What if we lose a node? (Recovery)
[Diagram: copies of Task 0_0, 0_1, 1_0, 1_1, and 1_2 placed across Host 1, Host 2, and Host 3; Task 1_0 and 1_2 are listed under Host 3]
43. Configs to care about
num.standby.replicas
acceptable.recovery.lag
probing.rebalance.interval.ms
max.warmup.replicas
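As a rough sketch, the four settings on this slide could be supplied as ordinary properties, using the dotted keys that Kafka Streams uses for them. The values below are illustrative only, not recommendations, and the class name `StreamsRecoveryConfigSketch` is hypothetical. The comments reflect my reading of KIP-441 and should be checked against the documentation for your Kafka version.

```java
import java.util.Properties;

public class StreamsRecoveryConfigSketch {
    static Properties recoveryConfigs() {
        Properties props = new Properties();
        // Number of standby copies kept for each stateful task.
        props.put("num.standby.replicas", "1");
        // Maximum changelog lag (in offsets) at which an instance still counts as caught up.
        props.put("acceptable.recovery.lag", "10000");
        // How long to wait before a probing rebalance checks whether warmups have caught up.
        props.put("probing.rebalance.interval.ms", "600000");
        // Upper bound on extra warmup replicas being restored at the same time.
        props.put("max.warmup.replicas", "2");
        return props;
    }

    public static void main(String[] args) {
        recoveryConfigs().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would be merged with the usual application settings before being passed to the `KafkaStreams` constructor.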
44. Other tips
Register a StateRestoreListener to monitor progress:
KafkaStreams#setGlobalStateRestoreListener
onRestoreStart(
TopicPartition topicPartition,
String storeName,
long startingOffset,
long endingOffset
);
onBatchRestored(
TopicPartition topicPartition,
String storeName,
long batchEndOffset,
long numRestored
);
onRestoreEnd(TopicPartition topicPartition, String storeName, long totalRestored);
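One way to turn these callbacks into a progress figure is sketched below. To stay free of Kafka dependencies, `RestoreProgress` is a hypothetical helper whose methods mirror the three callback signatures, with the `TopicPartition` replaced by a plain string. In a real application, a class implementing `StateRestoreListener` would delegate to something like it. Treating `batchEndOffset` as the current position is an approximation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RestoreProgress {
    /** Offsets tracked for one store partition being restored. */
    private static final class State {
        final long start;
        final long end;
        volatile long position;
        volatile boolean done;

        State(long start, long end) {
            this.start = start;
            this.end = end;
            this.position = start;
        }
    }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    private static String key(String partition, String storeName) {
        return storeName + "@" + partition;
    }

    public void onRestoreStart(String partition, String storeName, long startingOffset, long endingOffset) {
        states.put(key(partition, storeName), new State(startingOffset, endingOffset));
    }

    public void onBatchRestored(String partition, String storeName, long batchEndOffset, long numRestored) {
        State s = states.get(key(partition, storeName));
        if (s != null) {
            s.position = batchEndOffset;
        }
    }

    public void onRestoreEnd(String partition, String storeName, long totalRestored) {
        State s = states.get(key(partition, storeName));
        if (s != null) {
            s.done = true;
        }
    }

    /** Fraction of the offset range restored so far, clamped to [0, 1]; 0 if restoration never started. */
    public double fractionRestored(String partition, String storeName) {
        State s = states.get(key(partition, storeName));
        if (s == null) {
            return 0.0;
        }
        if (s.done || s.end <= s.start) {
            return 1.0;
        }
        double fraction = (double) (s.position - s.start) / (s.end - s.start);
        return Math.max(0.0, Math.min(1.0, fraction));
    }
}
```

The fraction could be exported as a metric or logged periodically to see how far each store is from being ready.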
45. Best log messages to watch out for
INFO Decided on assignment: {...} with followup probing rebalance
INFO Scheduled a followup probing rebalance for ... ms.
INFO Finished unstable assignment of tasks, a followup probing rebalance will be triggered.
INFO Decided on assignment: {...} with no followup probing rebalance
INFO Finished stable assignment of tasks, no followup rebalances required.
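A small sketch of how these messages might be used when scanning logs: classify each line as signalling a pending probing rebalance or a stable assignment. The match strings are taken from this slide, and the exact wording may differ between Kafka versions, so treat them as assumptions. `AssignmentLogSketch`, `Status`, and `classify` are hypothetical names.

```java
public class AssignmentLogSketch {
    enum Status { PROBING_REBALANCE_PENDING, STABLE, UNKNOWN }

    // Substring matching against the messages quoted on the slide.
    // The stable patterns are checked first; the pending patterns do not overlap with them.
    static Status classify(String logLine) {
        if (logLine.contains("with no followup probing rebalance")
                || logLine.contains("Finished stable assignment")) {
            return Status.STABLE;
        }
        if (logLine.contains("with followup probing rebalance")
                || logLine.contains("Scheduled a followup probing rebalance")
                || logLine.contains("Finished unstable assignment")) {
            return Status.PROBING_REBALANCE_PENDING;
        }
        return Status.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(
                "INFO Finished stable assignment of tasks, no followup rebalances required."));
    }
}
```

A run of pending lines that is eventually followed by a stable line suggests the warmup tasks have caught up and the assignment has converged.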