Kafka Streams: Perfectly Balanced as all things should be

Perfectly Balanced,
as All Streams Should Be
John Roesler
vvcephei@apache.org
https://s.apache.org/perfectly-balanced-streams

Problem 1: Cluster workload becomes skewed after adding nodes
Problem 2: Long restoration/rebalance pauses after adding nodes

builder
    .stream("sentences")
    .flatMapValues(whitespaceSplitter)
    .groupBy((k, v) -> v)
    .count()
    .toStream()
    .to("word-counts");
[Diagram: sentences → split → repartition → count → word-counts]

Input sentences:
“the quick brown fox”
“it was the best of times”
“it was the worst of times”
After split and repartition (grouped by word):
best brown fox it it of of quick the the the times times was was worst
Counts:
1 best, 1 brown, 1 fox, 2 it, 2 of, 1 quick, 3 the, 2 times, 2 was, 1 worst
[Diagram: sentences → split → repartition → count → word-counts]
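
The word-count example above can be reproduced without a Kafka cluster. The following is a minimal plain-Java sketch of the split, group-by-word, and count steps; it only mimics the data flow, it is not Kafka Streams code, and the class name WordCountSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for split -> group by word -> count.
// Hypothetical helper for illustration; no Kafka classes are used.
class WordCountSketch {
    static Map<String, Long> count(List<String> sentences) {
        // TreeMap keeps the words sorted, matching the order on the slide.
        Map<String, Long> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // "split": emit one record per whitespace-separated word.
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty sentence yields no words
                }
                // "repartition" + "count": records with the same word
                // end up at the same counter, which is incremented.
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList(
            "the quick brown fox",
            "it was the best of times",
            "it was the worst of times"));
        System.out.println(counts);
    }
}
```

In the real topology the grouping step is what forces the repartition topic: every occurrence of a word has to reach the same counter.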

[Diagram: with partitions, sentences-0 and sentences-1 each feed a split step (Subtopology 0), and repartition-0, repartition-1, and repartition-2 each feed a count step that writes to word-counts (Subtopology 1)]
[Diagram: each subtopology/partition pair is one task: Task 0_0 and Task 0_1 for the split steps, Task 1_0, Task 1_1, and Task 1_2 for the count steps]
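
The task names in the diagram follow a subtopology_partition pattern. As a small sketch of that naming, assuming one task per subtopology and input partition, the helper below (TaskIdSketch is a hypothetical name, not a Kafka Streams class) enumerates the task ids for given partition counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists task ids as "<subtopology>_<partition>".
class TaskIdSketch {
    // partitionsPerSubtopology[i] is the number of input partitions
    // read by subtopology i.
    static List<String> taskIds(int[] partitionsPerSubtopology) {
        List<String> ids = new ArrayList<>();
        for (int sub = 0; sub < partitionsPerSubtopology.length; sub++) {
            for (int p = 0; p < partitionsPerSubtopology[sub]; p++) {
                ids.add(sub + "_" + p);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Subtopology 0 reads 2 "sentences" partitions,
        // subtopology 1 reads 3 repartition partitions.
        System.out.println(taskIds(new int[] {2, 3}));
    }
}
```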

[Diagram: Task 0_0, Task 0_1, Task 1_0, Task 1_1, and Task 1_2, plus standby copies of Task 1_0, Task 1_1, and Task 1_2. Legend: Active Stateless, Active Stateful, Standby (Stateful)]
[Diagram: the active and standby tasks placed on Host 1 and Host 2]

Balance Checklist
● Balance the overall number of tasks

Balance Checklist
● Balance the active tasks

[Diagram: Task 0_0, Task 0_1, Task 0_2, Task 1_0, Task 1_1, and Task 1_2 placed on Host 1 and Host 2]

Balance Checklist
● Balance stateful tasks

[Diagram: Task 0_0, Task 0_1, and two copies each of Task 1_0, Task 1_1, and Task 1_2 placed on Host 1 and Host 2]
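
The three balance checklist items can be expressed as one check applied three times. The sketch below is an illustration under assumptions, not the Kafka Streams assignor: it treats an assignment as balanced when per-host counts differ by at most one, and the class and field names are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of an assignment, used only to illustrate the checklist.
class BalanceCheckSketch {
    static final class Task {
        final String id;
        final boolean active;   // false means standby
        final boolean stateful;

        Task(String id, boolean active, boolean stateful) {
            this.id = id;
            this.active = active;
            this.stateful = stateful;
        }
    }

    // True when the number of matching tasks per host differs by at most one.
    static boolean balanced(List<List<Task>> hosts, Predicate<Task> kind) {
        if (hosts.isEmpty()) {
            return true;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (List<Task> host : hosts) {
            long n = host.stream().filter(kind).count();
            min = Math.min(min, n);
            max = Math.max(max, n);
        }
        return max - min <= 1;
    }

    // Overall task count, active tasks, and stateful tasks must each balance.
    static boolean passesChecklist(List<List<Task>> hosts) {
        return balanced(hosts, t -> true)
            && balanced(hosts, t -> t.active)
            && balanced(hosts, t -> t.stateful);
    }
}
```

The "differ by at most one" threshold is a choice made for this sketch; the slides do not state an exact tolerance.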

Assignment Checklist
● For each task, assign the standby to a different host than the active

[Diagram: Task 1_0, Task 1_1, Task 1_2, Task 0_0, and two copies of Task 0_1 placed on Host 1 and Host 2]

Assignment Checklist
● For each task, assign the standby to a different host than the active
● Balance partitions for each task across nodes

[Diagram: Task 0_0, Task 0_1, and two copies each of Task 1_0, Task 1_1, and Task 1_2 placed on Host 1 and Host 2]
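
The first assignment checklist item (a standby never shares a host with its active task) is easy to verify mechanically. The following sketch assumes the assignment is given as two host-to-task-id maps; the class name and the map layout are hypothetical, and the second checklist item is not covered here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical check: reports standbys placed on the same host as their active.
class StandbyPlacementSketch {
    // Both maps go from host name to the task ids placed on that host.
    static List<String> violations(Map<String, Set<String>> activeByHost,
                                   Map<String, Set<String>> standbyByHost) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : standbyByHost.entrySet()) {
            Set<String> active = activeByHost.getOrDefault(
                e.getKey(), Collections.<String>emptySet());
            for (String task : e.getValue()) {
                if (active.contains(task)) {
                    // A standby next to its own active adds no redundancy.
                    out.add(task + "@" + e.getKey());
                }
            }
        }
        return out;
    }
}
```

An empty result means every standby would survive the loss of the host running its active task.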

[Diagram sequence: Host 3 joins Host 1 and Host 2, which hold Task 0_0, Task 0_1, and two copies each of Task 1_0, Task 1_1, and Task 1_2]
Solution: Warm up the new host before moving stateful tasks

[Diagram sequence: Host 3 gains copies of Task 1_2 and Task 1_0 alongside the existing copies on Host 1 and Host 2; in the final frame Task 1_0 remains on Host 3]
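
The warm-up idea can be sketched as a per-task decision for the new host. This is a deliberately simplified model under assumptions, not the actual assignor logic: a stateful task moves only if the new host's lag for it is within an acceptable bound, a limited number of other tasks are warmed up, and the rest wait for a later rebalance. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of "warm up before moving".
class WarmupSketch {
    static final class Plan {
        final List<String> moveNow = new ArrayList<>(); // new host is caught up
        final List<String> warmUp = new ArrayList<>();  // replicate state first
        final List<String> waiting = new ArrayList<>(); // over the warm-up cap

        // A follow-up rebalance is needed while anything is still pending.
        boolean needsFollowup() {
            return !warmUp.isEmpty() || !waiting.isEmpty();
        }
    }

    // lagOnNewHost: task id -> number of changelog records the new host is behind.
    static Plan plan(Map<String, Long> lagOnNewHost,
                     long acceptableLag,
                     int maxWarmups) {
        Plan plan = new Plan();
        for (Map.Entry<String, Long> e : lagOnNewHost.entrySet()) {
            if (e.getValue() <= acceptableLag) {
                plan.moveNow.add(e.getKey());
            } else if (plan.warmUp.size() < maxWarmups) {
                plan.warmUp.add(e.getKey());
            } else {
                plan.waiting.add(e.getKey());
            }
        }
        return plan;
    }
}
```

The point of the design is that the old host keeps processing the task while the new host catches up, so the move itself does not cause a long restoration pause.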

[Diagram: Host 1, Host 2, and Host 3 holding Task 0_0, Task 0_1, and two copies each of Task 1_0, Task 1_1, and Task 1_2]
What if we lose a node?
[Diagram sequence: the layout after a node is lost, with one copy of Task 1_1 missing and then restored]
What if we lose a node? (Recovery)
[Diagram sequence: additional copies of Task 1_2 and Task 1_1 appear on Host 3 and the tasks are rebalanced across the three hosts]
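
The node-loss scenario comes down to promoting a surviving standby when the host of an active task disappears. The sketch below models that with one standby per task; it is an illustration under that assumption, not the Kafka Streams implementation, and all names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical failover model: promote the standby of each lost active task.
class FailoverSketch {
    static final String NO_STANDBY =
        "(no standby: reassign and restore any state from the changelog)";

    // activeHost and standbyHost map task id -> host name.
    // Returns the new active host for every task after lostHost is gone.
    static Map<String, String> afterLosing(String lostHost,
                                           Map<String, String> activeHost,
                                           Map<String, String> standbyHost) {
        Map<String, String> next = new TreeMap<>();
        for (Map.Entry<String, String> e : activeHost.entrySet()) {
            String task = e.getKey();
            if (!e.getValue().equals(lostHost)) {
                next.put(task, e.getValue()); // unaffected by the loss
                continue;
            }
            String standby = standbyHost.get(task);
            boolean usable = standby != null && !standby.equals(lostHost);
            next.put(task, usable ? standby : NO_STANDBY);
        }
        return next;
    }
}
```

This is why the standby must live on a different host than the active: a standby on the lost host cannot be promoted.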

Configs to care about
num.standby.replicas
acceptable.recovery.lag
probing.rebalance.interval.ms
max.warmup.replicas
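
As a sketch of how these four settings might be supplied, the snippet below builds a java.util.Properties object using the dotted property names that Kafka Streams uses for them. The values are illustrative only, not recommendations, and required settings such as application.id and bootstrap.servers are left out.

```java
import java.util.Properties;

// Illustrative values only; tune them for your own workload.
class BalanceConfigSketch {
    static Properties balanceProps() {
        Properties props = new Properties();
        // Keep one standby copy of each stateful task.
        props.put("num.standby.replicas", "1");
        // A host this close to the changelog end counts as caught up.
        props.put("acceptable.recovery.lag", "10000");
        // How often to check whether warming-up hosts have caught up.
        props.put("probing.rebalance.interval.ms", "600000");
        // How many extra warm-up copies may exist at once.
        props.put("max.warmup.replicas", "2");
        return props;
    }

    public static void main(String[] args) {
        balanceProps().list(System.out);
    }
}
```

These properties would be merged into the configuration passed to the KafkaStreams constructor.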

Other tips
Register a StateRestoreListener to monitor progress:
KafkaStreams#setGlobalStateRestoreListener
    onRestoreStart(
        TopicPartition topicPartition,
        String storeName,
        long startingOffset,
        long endingOffset
    );
    onBatchRestored(
        TopicPartition topicPartition,
        String storeName,
        long batchEndOffset,
        long numRestored
    );
    onRestoreEnd(
        TopicPartition topicPartition,
        String storeName,
        long totalRestored
    );
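
One way to use these callbacks is to turn them into a restore-progress percentage per store partition. The sketch below keeps the three callback shapes but takes the partition as a String so that it has no Kafka dependency; that substitution, the class name, and the percentDone method are assumptions for illustration. A real listener would implement StateRestoreListener, receive a TopicPartition, and delegate to something like this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical progress tracker mirroring the three restore callbacks.
class RestoreProgressSketch {
    // key -> {startingOffset, endingOffset, recordsRestoredSoFar}
    private final Map<String, long[]> progress = new ConcurrentHashMap<>();

    private static String key(String partition, String storeName) {
        return storeName + "/" + partition;
    }

    void onRestoreStart(String partition, String storeName,
                        long startingOffset, long endingOffset) {
        progress.put(key(partition, storeName),
                     new long[] {startingOffset, endingOffset, 0L});
    }

    void onBatchRestored(String partition, String storeName,
                         long batchEndOffset, long numRestored) {
        long[] p = progress.get(key(partition, storeName));
        if (p != null) {
            p[2] += numRestored;
        }
    }

    void onRestoreEnd(String partition, String storeName, long totalRestored) {
        long[] p = progress.get(key(partition, storeName));
        if (p != null) {
            p[2] = totalRestored;
        }
    }

    // Percentage of the offset range restored so far; 100 if nothing to restore.
    double percentDone(String partition, String storeName) {
        long[] p = progress.get(key(partition, storeName));
        if (p == null) {
            return 0.0;
        }
        long total = p[1] - p[0];
        return total <= 0 ? 100.0 : Math.min(100.0, p[2] * 100.0 / total);
    }
}
```

The percentage is approximate: it compares a record count with an offset range, and the two can differ when the changelog has gaps.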

Best log messages to watch out for
INFO Decided on assignment: {...} with followup probing rebalance
INFO Scheduled a followup probing rebalance for ... ms.
INFO Finished unstable assignment of tasks, a followup probing rebalance will be triggered.
INFO Decided on assignment: {...} with no followup probing rebalance
INFO Finished stable assignment of tasks, no followup rebalances required.
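
When scanning logs for these messages, a simple classifier can tell whether the cluster is still warming up or has settled. The sketch below matches on substrings taken from the messages above; the exact wording may differ between Kafka versions, so treat the patterns as assumptions, and the class name is hypothetical.

```java
// Hypothetical classifier for the assignment log messages listed above.
class AssignmentLogSketch {
    enum State { STABLE, PROBING, UNKNOWN }

    static State classify(String line) {
        // Check the "stable" patterns first: "with no followup probing"
        // must not be mistaken for "with followup probing".
        if (line.contains("with no followup probing rebalance")
                || line.contains("Finished stable assignment")) {
            return State.STABLE;
        }
        if (line.contains("with followup probing rebalance")
                || line.contains("Scheduled a followup probing rebalance")
                || line.contains("Finished unstable assignment")) {
            return State.PROBING;
        }
        return State.UNKNOWN;
    }
}
```

A run of PROBING lines followed by a STABLE line suggests that the warm-ups finished and the final assignment was reached.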


John Roesler
vvcephei@apache.org
confluentcommunity.slack.com
kafka.apache.org/contact
https://s.apache.org/perfectly-balanced-streams
Questions?
