I will present the recent additions to Kafka (0.11.0) that achieve exactly-once semantics within its Streams API for stream processing use cases. This is achieved by leveraging the underlying idempotent and transactional client features. The main focus will be the specific semantics that Kafka distributed transactions enable in Streams, and the underlying mechanics that let Streams scale efficiently.
2. Outline
• Stream processing with Kafka
• Exactly-once for stream processing
• How Kafka Streams enabled exactly-once
3. Stream Processing with Kafka
[diagram: your app reads the Ads Clicks and Ads Displays topics, processes them against local state, and writes the Billing Updates and Fraud Suspects topics]
4. Stream Processing with Kafka
[diagram: the same pipeline, now showing the produce acks and the offset commit back to Kafka]
5. Stream Processing: Do it Yourself
while (isRunning) {
  // read some messages from Kafka
  inputMessages = consumer.poll(100);
  // do some processing…
  // send output messages back to Kafka, wait for the ack
  producer.send(outputMessages).get();
  // commit offsets for the processed messages
  consumer.commitSync();
}
6. DIY Stream Processing is Hard
• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing
8. Exactly-Once
An application property for stream processing,
.. that for each received record,
.. its processing results will be reflected exactly once,
.. even under failures
9. Error Scenario #1: Duplicate Writes
[diagram: the Streams app consumes topics A and B and produces to topics C and D; the ack for a produce never arrives]
10. Error Scenario #1: Duplicate Writes
[diagram: the app retries the send after the missing ack, duplicating the write]
producer config: retries = N (default = 0)
11. Error Scenario #2: Re-process
[diagram: the Streams app acks its produces to topics C and D, updates its state, and commits offsets for topics A and B]
12. Error Scenario #2: Re-process
[diagram: after a failure, the restarted app resumes from the last committed offsets with a stale state and re-processes records]
18. Exactly-Once, the Kafka Way! (0.11+)
• Building blocks to achieve exactly-once:
• Idempotence: de-duped sends, in order, per partition
• Transactions: atomic multiple sends across topic partitions
• Kafka Streams: exactly-once enabled with a single knob
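These building blocks surface as plain client and Streams configs. A minimal sketch, assuming the 0.11-era Java clients; the `transactional.id` value below is a made-up example, not from the talk:

```java
import java.util.Properties;

public class ExactlyOnceConfigs {
    public static void main(String[] args) {
        // Producer side: idempotence lets brokers de-dupe retried sends,
        // in order, per partition.
        Properties producerProps = new Properties();
        producerProps.put("enable.idempotence", "true");
        // Transactions: atomic multiple sends across topic partitions.
        // "my-app-txn-1" is an illustrative id.
        producerProps.put("transactional.id", "my-app-txn-1");

        // Streams side: the single knob that turns all of this on.
        Properties streamsProps = new Properties();
        streamsProps.put("processing.guarantee", "exactly_once");

        System.out.println(streamsProps.getProperty("processing.guarantee"));
    }
}
```

These are config fragments only; a broker and a topology are still needed around them.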
23. Kafka Streams DSL
public static void main(String[] args) {
  // specify the processing topology by first reading in a stream from a topic
  KStreamBuilder builder = new KStreamBuilder();
  KStream<String, String> words = builder.stream("topic1");
  // count the words in this stream as an aggregated table
  KTable<String, Long> counts = words.groupBy(..).count("Counts");
  // write the result table to a new topic
  counts.to("topic2");
  // create a streams client and start running it
  KafkaStreams streams = new KafkaStreams(builder, config);
  streams.start();
}
30. State Store
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream("topic2");
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.groupBy(…).count("store");
aggregated.to("topic3");
38. Exactly-Once with Kafka
[diagram: acked produces to sink topics C and D, offset commits for source topics A and B, and local state updates]
• Acked produce to sink topics
• Offset commit for source topics
• State update on processor
39. Exactly-Once with Kafka
• Acked produce to sink topics
• Offset commit for source topics
• State update on processor
40. Exactly-Once with Kafka
• Acked produce to sink topics
• Offset commit for source topics
• State update on processor
All or Nothing
41. Exactly-Once with Kafka Streams (0.11+)
• Acked produce to sink topics
• Offset commit for source topics
• State update on processor
42. Exactly-Once with Kafka Streams (0.11+)
• Acked produce to sink topics
• A batch of records sent to the offset topic
• State update on processor
43. Exactly-Once with Kafka Streams (0.11+)
• Acked produce to sink topics
• A batch of records sent to the offset topic
• A batch of records sent to changelog topics
44. Exactly-Once with Kafka Streams (0.11+)
• A batch of records sent to sink topics
• A batch of records sent to the offset topic
• A batch of records sent to changelog topics
45. Exactly-Once with Kafka Streams (0.11+)
• A batch of records sent to sink topics
• A batch of records sent to the offset topic
• A batch of records sent to changelog topics
All or Nothing
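Under the hood, this all-or-nothing step rides on the transactional producer API. Below is a sketch of one consume-process-produce cycle; it assumes a running broker, and the sink topic, group id, and class name are illustrative, not from the talk. In Streams, the changelog writes join the same transaction:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalLoop {
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, String> producer) {
        // Requires a transactional.id on the producer config.
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                // Processing results go to the sink topic ("sink-topic" is made up).
                producer.send(new ProducerRecord<>("sink-topic", record.key(), record.value()));
                // Track the next offset to consume for each source partition.
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
            }
            // Offsets ride in the same transaction as the output records...
            producer.sendOffsetsToTransaction(offsets, "my-group");
            // ...so sink writes and offset commits become visible atomically.
            producer.commitTransaction();
        }
    }
}
```

On failure, the transaction is aborted and downstream read-committed consumers never see the partial writes.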
71. Exactly-Once does NOT mean..
• the Two Generals problem can now be solved
• .. or the FLP result is proved wrong
• .. or TCP at the transport level is “perfect”
• .. or you can get distributed consensus in any setting
Thank you, and hello everyone, my name’s Guozhang. I’m very excited to be here and talk about …
Here is a quick spoiler alert of my talk.
I’ll start by giving some context about stream processing and what it looks like with Kafka. Then I will explain what exactly-once actually means for stream processing.
And finally, I will talk about what we have added in the latest release, 0.11, and how these building blocks are leveraged by Kafka’s stream processing API to help developers achieve exactly-once with Kafka.
So what does stream processing with Kafka look like? Here’s a concrete example: suppose you are building a real-time ads billing application with two input streams of data: one stream is a Kafka topic representing the events when an ad is displayed to some user, and the other stream is another Kafka topic representing the events when a user clicks on certain displayed ads.
These two streams of data are unbounded, since the front-end web servers will keep appending more and more events to these two Kafka topics.
The billing application’s goal is to calculate the cost of each clicked ad for the corresponding client, plus alert on potential fraud users, who may click on lots of ads in a very short period of time. The output of this application is also two data streams.
These may then be consumed by an offline data analytics system to generate dashboards, reports, and so on. So it forms the big picture of a pipeline.
And then it moves on to the next available event, and this loop repeats.
In practice, though, developers would not commit on each single record, for performance reasons; instead they only do that for a batch of records.
So how do we write this fetch-process-produce loop in code?
It seems OK at first look, but only until you deploy your code to production.
Data may arrive out of order, and computation and state may need to be partitioned for distributed processing.
And more importantly, your app could fail at any time: a bad config, a human error in deployment, a bug in your code.
You then need to worry about all these lower-level details about consistency and high-availability.
And all these issues would soon add up and cost you much more time than coding a first draft of your streaming applications.
Today, I would like to focus on a single slice of this iceberg, which is, how would you achieve correctness of your application along with fault tolerance.
It is not a network transport-level guarantee, but really an app-level property: given a stream processing application..
You lose that committed offset forever; hence, even though this record has completed processing, Kafka would not know that any more.
Bottom line: because of Kafka’s at-least-once semantics, we can have duplicated writes and duplicated processing.
This approach looks nice at first glance, but once you start doing it, you realize that since each application’s output could be another application’s input, and since each application can potentially generate duplicated writes to its output topics,
you would end up doing deduplication at each stage of your streaming computation pipeline. And it soon becomes a maintenance headache.
So, is there a better solution than this?
Strengthen Kafka itself and provide the building blocks to help developers write streaming applications that achieve exactly-once.
It supports event-at-a-time stateful processing and handles out-of-order data arrival.
Instead of an external store, Kafka Streams adds a local state store associated with the stateful processors to maintain the running state.
In terms of distributed processing, the Kafka Streams API has a tight integration with Kafka’s topic partitions.
For a typical Kafka streaming application
Idempotence is also enabled so that, within a single lifetime of the producer, its resent duplicated data can be detected and rejected by the brokers.
In practice, your streaming pipeline may not be completely within the closed world of Kafka.
You might send a request to some REST proxy during processing, or simply need to pipe your end results into another data system:
HDFS, S3, Elasticsearch, or a JDBC sink.
Distributed systems are all about trade-offs.
Effective availability: the empirically measured percentage of successful requests over some period, often measured in “9s”.
Algorithmic availability: a liveness property of an algorithm where every request to a non-failing node must eventually return a valid response.
The CAP theorem is only concerned with algorithmic availability. An algorithmic availability of 100% does not guarantee an effective availability of 100%: the algorithmic availability from the CAP theorem only applies if both the implementation and the execution of the algorithm are without error. In practice, most outages of an AP system are not due to network issues, which the algorithm can handle, but rather to implementation defects, user errors, misconfiguration, resource limits, and misbehaving clients.
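Since effective availability is usually quoted in “9s”, the count of nines is just a function of the measured success ratio. A small sketch; the class and method names are mine:

```java
public class Nines {
    // Number of leading nines in the success ratio, e.g. 999/1000 -> 3 nines.
    static int nines(long successes, long total) {
        double failureRate = 1.0 - (double) successes / total;
        // Guard the perfect case, where -log10(0) would be infinite.
        if (failureRate <= 0) return Integer.MAX_VALUE;
        // Small epsilon absorbs floating-point error at exact boundaries.
        return (int) Math.floor(-Math.log10(failureRate) + 1e-9);
    }

    public static void main(String[] args) {
        System.out.println(nines(999, 1000));     // 99.9%  -> 3
        System.out.println(nines(99999, 100000)); // 99.999% -> 5
    }
}
```

This is the empirical measure; it says nothing about whether the algorithm behind the service is available in the CAP sense.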