Video recording: https://www.youtube.com/watch?v=o7zSLNiTZbA
Slides of my talk at Berlin Buzzwords in June 2016.
Abstract:
"In the past few years Apache Kafka has established itself as the world's most popular real-time, large-scale messaging system. It is used across a wide range of industries by thousands of companies such as Netflix, Cisco, PayPal, Twitter, and many others.
In this session I will introduce the audience to Kafka Streams, which is the latest addition to the Apache Kafka project. Kafka Streams is a stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a high-level DSL for writing stream processing applications. As such it is the most convenient yet scalable option to process and analyze data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' lightweight library approach with heavier, framework-based tools such as Apache Storm and Spark Streaming, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka."
Introducing Kafka Streams, the new stream processing library of Apache Kafka, Berlin Buzzwords 2016
Slide 1
Introducing Kafka Streams
Michael Noll
michael@confluent.io
Berlin Buzzwords, June 06, 2016
miguno
Slide 7
Stream processing in Real Life™
…before Kafka Streams
…somewhat exaggerated
…but perhaps not that much
Slide 14
(Photo caption: Taken at a session at ApacheCon: Big Data, Hungary, September 2015)
Slide 15
Kafka Streams
stream processing made simple
Slide 16
Kafka Streams
• Powerful yet easy-to-use Java library
• Part of open source Apache Kafka, introduced in v0.10, May 2016
• Source code: https://github.com/apache/kafka/tree/trunk/streams
• Build your own stream processing applications that are
• highly scalable
• fault-tolerant
• distributed
• stateful
• able to handle late-arriving, out-of-order data
• <more on this later>
Slide 21
When to use Kafka Streams (as of Kafka 0.10)
Recommended use cases
• Application Development
• “Fast Data” apps (small or big data)
• Reactive and stateful applications
• Linear streams
• Event-driven systems
• Continuous transformations
• Continuous queries
• Microservices
Questionable use cases
• Data Science / Data Engineering
• “Heavy lifting”
• Data mining
• Non-linear, branching streams (graphs)
• Machine learning, number crunching
• What you’d do in a data warehouse
Slide 22
Alright, can you show me some code now?
• API option 1: Kafka Streams DSL (declarative)

KStream<Integer, Integer> input = builder.stream("numbers-topic");

// Stateless computation
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);

// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .reduceByKey((v1, v2) -> v1 + v2, "sum-of-odds");
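For context, a minimal sketch of the boilerplate around such a DSL snippet, assuming the 0.10-era API (application id, broker address, and serde settings are illustrative, not from the slide):

// Build the topology with the DSL snippet shown above.
KStreamBuilder builder = new KStreamBuilder();
// ... define input / doubled / sumOfOdds as above ...

// Configure and start the application.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "numbers-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.Integer().getClass().getName());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass().getName());

KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(props));
streams.start();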
Slide 23
Alright, can you show me some code now?
• API option 2: low-level Processor API (imperative)
Labels on the slide’s code: Startup, Process a record, Periodic action, Shutdown
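The code on this slide did not survive extraction; here is a minimal sketch, assuming the 0.10-era Processor interface, that maps to the four labels above (class name and logic are illustrative):

public class MyProcessor implements Processor<String, String> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {    // Startup
        this.context = context;
        context.schedule(1000);                     // request periodic punctuate() calls
    }

    @Override
    public void process(String key, String value) { // Process a record
        context.forward(key, value.toUpperCase());  // e.g. transform and emit downstream
    }

    @Override
    public void punctuate(long timestamp) {         // Periodic action
        context.commit();
    }

    @Override
    public void close() {                           // Shutdown
        // release any resources acquired in init()
    }
}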
Slide 24
API != everything
API, coding
Operations, debugging, …
“Full stack” evaluation
Slide 25
Slide 26
Kafka Streams outsources hard problems to Kafka
Slide 27
How do I install Kafka Streams?
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>0.10.0.0</version>
</dependency>
• There is and there should be no “install”.
• It’s a library. Add it to your app like any other library.
Slide 29
Do I need to install a CLUSTER to run my apps?
• No, you don’t. Kafka Streams allows you to stay lean and lightweight.
• Unlearn bad habits: “do cool stuff with data” != “must have cluster”
Slide 30
How do I package and deploy my apps? How do I …?
Slide 32
How do I package and deploy my apps? How do I …?
• Whatever works for you. Stick to what you/your company think is the best way.
• Why? Because an app that uses Kafka Streams is…a normal Java app.
• Your Ops/SRE/InfoSec teams may finally start to love, not hate, you.
Slide 40
Streams meet Tables
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simpl
http://docs.confluent.io/3.0.0/streams/concepts.html#duality-of-streams-and-tables
Slide 41
Streams meet Tables – in the Kafka Streams DSL
Input records over time: (alice, 2), (bob, 10), (alice, 3)
After the first alice record:
• KStream = interprets data as record stream → “Alice clicked 2 times.”
• KTable = interprets data as changelog stream (~ a continuously updated materialized view) → “Alice clicked 2 times.”
Slide 42
Streams meet Tables – in the Kafka Streams DSL
Input records over time: (alice, 2), (bob, 10), (alice, 3)
After the second alice record:
• KStream = interprets data as record stream → “Alice clicked 2+3 = 5 times.”
• KTable = interprets data as changelog stream (~ a continuously updated materialized view) → “Alice clicked 3 times.” (the earlier value 2 is overwritten)
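As a minimal illustration of the two interpretations, assuming the 0.10-era KStreamBuilder (the topic name is illustrative):

// The same Kafka topic, read with two different semantics.
KStreamBuilder builder = new KStreamBuilder();

// Record stream: every record is an independent fact (INSERT).
KStream<String, Long> clickStream = builder.stream("user-clicks");

// Changelog stream: a record overwrites the previous value for its key (UPDATE).
KTable<String, Long> clickTable = builder.table("user-clicks");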
Slide 43
Streams meet Tables – in the Kafka Streams DSL
User purchase records (KStream):            (alice, eggs), (bob, lettuce), (alice, milk)
User profile/location information (KTable): (alice, paris), (bob, zurich), (alice, berlin)
After the first records have arrived:
• KStream → “Alice bought eggs.”
• KTable → “Alice is currently in Paris.”
Slide 44
Streams meet Tables – in the Kafka Streams DSL
User purchase records (KStream):            (alice, eggs), (bob, lettuce), (alice, milk)
User profile/location information (KTable): (alice, paris), (bob, zurich), (alice, berlin)
After all records have arrived:
• KStream → “Alice bought eggs and milk.” (shows me ALL values for a key)
• KTable → “Alice is currently in Berlin.” (shows me the LATEST value for a key; Paris has been overwritten)
Slide 45
Streams meet Tables – in the Kafka Streams DSL
Slide 46
Streams meet Tables – in the Kafka Streams DSL
• JOIN example: compute user clicks by region via KStream.leftJoin(KTable) (see the sketch below)
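The code listing on this slide did not survive extraction; below is a minimal sketch of such a join, assuming 0.10-era DSL method names (topic, store, and variable names are illustrative):

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// User clicks and user regions, both keyed by user name.
KStream<String, Long> userClicks = builder.stream(stringSerde, longSerde, "user-clicks");
KTable<String, String> userRegions = builder.table(stringSerde, stringSerde, "user-regions");

KTable<String, Long> clicksPerRegion = userClicks
    // Enrich each click with the user's current region (may be null for unknown users).
    .leftJoin(userRegions, (clicks, region) ->
        new KeyValue<>(region == null ? "UNKNOWN" : region, clicks))
    // Re-key the stream from user to region.
    .map((user, regionWithClicks) -> regionWithClicks)
    // Sum the clicks per region into a continuously updated table.
    .reduceByKey((c1, c2) -> c1 + c2, stringSerde, longSerde, "clicks-per-region");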
Slide 47
Streams meet Tables – in the Kafka Streams DSL
• JOIN example: compute user clicks by region via KStream.leftJoin(KTable)
• Even simpler in Scala because, unlike Java, it natively supports tuples.
Slide 48
Streams meet Tables – in the Kafka Streams DSL
• JOIN example: compute user clicks by region via KStream.leftJoin(KTable)
Dataflow of the example:
Input KStream:         (alice, 13), (bob, 5), …
leftJoin() w/ KTable:  (alice, (europe, 13)), (bob, (europe, 5)), …  → KStream
map():                 (europe, 13), (europe, 5), …                  → KStream
reduceByKey(_ + _):    (europe, 13), then updated to (europe, 18), … → KTable
Slide 49
Streams meet Tables – in the Kafka Streams DSL
Slide 50
Streams meet Tables – in the Kafka Streams DSL
Slide 52
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Also inherits Kafka’s security model, e.g. to encrypt data-in-transit
• Uses Kafka as its internal messaging layer, too
Slide 53
Native Kafka integration
• Reading data from Kafka
• Writing data to Kafka
Slide 54
Native Kafka integration
• You can configure both Kafka Streams and the underlying Kafka clients
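The configuration listing on this slide did not survive extraction; a minimal sketch, assuming the 0.10-era StreamsConfig (all names and values are illustrative):

Properties props = new Properties();
// Kafka Streams' own settings:
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
// Settings of the underlying Kafka clients can be passed through, too:
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");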
Slide 55
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Also inherits Kafka’s security model, e.g. to encrypt data-in-transit
• Uses Kafka as its internal messaging layer, too
• Highly scalable
• Fault-tolerant
• Elastic
Slide 60
Kafka Streams outsources hard problems to Kafka
Slide 61
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Also inherits Kafka’s security model, e.g. to encrypt data-in-transit
• Uses Kafka as its internal messaging layer, too
• Highly scalable
• Fault-tolerant
• Elastic
• Stateful and stateless computations (e.g. joins, aggregations)
Slide 62
Stateful computations
• Stateful computations like aggregations or joins require state
• We already showed a join example in the previous slides.
• Windowing a stream is stateful, too, but let’s ignore this for now.
• State stores in Kafka Streams
• Typically: key-value stores
• Pluggable implementation: RocksDB (default), in-memory, your own …
• State stores are per stream task for isolation (think: share-nothing)
• State stores are local for best performance
• State stores are replicated to Kafka for elasticity and for fault-tolerance
Slide 63
Execution model
State stores
(This is a bit simplified.)
Slide 65
Execution model
State stores
(This is a bit simplified.)
(Diagram: example records (charlie, 3), (bob, 1), (alice, 1), (alice, 2) flowing into the state stores)
Slide 66
Execution model
State stores
(This is a bit simplified.)
Slide 67
Execution model
State stores
(This is a bit simplified.)
(Diagram: example records (alice, 1), (alice, 2))
Slide 68
Kafka Streams outsources hard problems to Kafka
Slide 69
Stateful computations
• Kafka Streams DSL: abstracts state stores away from you
• Stateful operations include
• count(), reduceByKey(), aggregate(), …
• Low-level Processor API: direct access to state stores
• Very flexible but more manual work for you
Slide 70
Stateful computations
• Use the low-level Processor API to interact directly with state stores
Labels on the slide’s code: Get the store, Use the store
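The code itself did not survive extraction; a minimal sketch, assuming the 0.10-era Processor API (the store name and counting logic are illustrative):

public class CountingProcessor implements Processor<String, String> {
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        // Get the store: it must have been attached to this processor
        // when the topology was built.
        store = (KeyValueStore<String, Long>) context.getStateStore("counts-store");
    }

    @Override
    public void process(String key, String value) {
        // Use the store: keep a running count per key.
        Long count = store.get(key);
        store.put(key, count == null ? 1L : count + 1);
    }

    @Override
    public void punctuate(long timestamp) {}

    @Override
    public void close() {}
}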
Slide 71
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Also inherits Kafka’s security model, e.g. to encrypt data-in-transit
• Uses Kafka as its internal messaging layer, too
• Highly scalable
• Fault-tolerant
• Elastic
• Stateful and stateless computations
• Time model
Slide 74
Time
• You configure the desired time semantics through timestamp extractors
• Default extractor yields event-time semantics
• Extracts embedded timestamps of Kafka messages (introduced in v0.10)
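For illustration, a minimal sketch of a custom extractor, assuming the 0.10-era TimestampExtractor interface (the MyEvent payload type is hypothetical):

// Assumes record values carry their own event-time field.
public class MyEventTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        return ((MyEvent) record.value()).getEventTimeMs();  // MyEvent is hypothetical
    }
}

// Registered via configuration:
props.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, MyEventTimestampExtractor.class);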
Slide 75
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Also inherits Kafka’s security model, e.g. to encrypt data-in-transit
• Uses Kafka as its internal messaging layer, too
• Highly scalable
• Fault-tolerant
• Elastic
• Stateful and stateless computations
• Time model
• Windowing
Slide 76
Windowing
• Tumbling time windows: TimeWindows.of(3000)
• Hopping time windows: TimeWindows.of(3000).advanceBy(1000)
Slide 77
Windowing use case: monitoring (1m/5m/15m averages)
Confluent Control Center for Kafka
Slide 78
Key features in 0.10
• Native, 100%-compatible Kafka integration
• Also inherits Kafka’s security model, e.g. to encrypt data-in-transit
• Uses Kafka as its internal messaging layer, too
• Highly scalable
• Fault-tolerant
• Elastic
• Stateful and stateless computations
• Time model
• Windowing
• Supports late-arriving and out-of-order data
• Millisecond processing latency, no micro-batching
• At-least-once processing guarantees (exactly-once is in the works)
Slide 80
Where to go from here?
• Kafka Streams is available in Apache Kafka 0.10 and Confluent Platform 3.0
• http://kafka.apache.org/
• http://www.confluent.io/download (free + enterprise versions, tar/zip/deb/rpm)
• Kafka Streams demos at https://github.com/confluentinc/examples
• Java 7, Java 8+ with lambdas, and Scala
• WordCount, Joins, Avro integration, Top-N computation, Windowing, …
• Apache Kafka documentation: http://kafka.apache.org/documentation.html
• Confluent documentation: http://docs.confluent.io/3.0.0/streams/
• Quickstart, Concepts, Architecture, Developer Guide, FAQ
• Join our bi-weekly Ask Me Anything sessions on Kafka Streams
• Contact me at michael@confluent.io for details
Slide 81
Some of the things to come
• Exactly-once semantics
• Queryable state – tap into the state of your applications
• SQL interface
• Listen to and collaborate with the developer community
• Your feedback counts a lot! Share it via users@kafka.apache.org
Tomorrow’s keynote (09:30 AM) by Neha Narkhede, co-founder and CTO of Confluent:
“Application development and data in the emerging world of stream processing”
Slide 82
Want to contribute to Kafka and open source? Join the Kafka community: http://kafka.apache.org/
…in a great team with the creators of Kafka? Confluent is hiring: http://confluent.io/
Questions, comments? Tweet with #bbuzz and /cc to @ConfluentInc
Editor’s Notes
Let’s say you want to brew yourself a delicious coffee (= something of value to you).
So when you’re at home, what do you need?
First of all, you need water (data) because it is a key ingredient for coffee. So how do you get the water? Right – as a continuous stream through your water pipes (Kafka). Putting pipes in place typically requires some upfront work but it’s a crucial stepping stone that unlocks and enables many things – such as brewing coffee.
But while having convenient access to the water through water pipes is necessary, it is not sufficient. You need something to transform the water into the tasty coffee.
So we need a coffee machine (Kafka Streams). And this coffee machine should (1) plug right into your water pipes and (2) brew wonderful coffee without forcing you to take a 6-month barista class.
One of the things you will realize: As a developer, you’ll be forced to become an Ops person, too, and you’ll need to become that very quickly.
There’s a tremendous discrepancy between how we want to work and how we actually do work.
When you are designing a new tool (like Kafka Streams) or when you are evaluating an existing one, you should ask yourself: “Is this tool part of the solution, or part of the problem?”
What do we gain if we make complex things simple, easy, and fun? One gain is developer efficiency.
Apache Kafka has made the streaming of data (= the transport and sharing of data) simple, easy, and fun.
With Kafka Streams we set out to do the same for the *processing* of stream data.
Basic workflow is:
Get the input data into Kafka, e.g. via Kafka Connect (part of Apache Kafka) or via your own applications that write data to Kafka.
Process the data with Kafka Streams, and write the results back to Kafka.
FYI: Some people have begun using the low-level Processor API to port their Apache Storm code to Kafka Streams.
If you ARE running at large scale, what is the cost for doing so (notably operations)?
If you ARE NOT running at large scale, what is the tool’s minimum investment? Is the tool cost-effective for small to medium scale, too?
Same questions if you set out to design and build a tool.
In some sense, Kafka Streams exploits economies of scale. Projects like Mesos or Kubernetes, for example, are totally focused on resource management and scheduling, and they’ll always do a better job here than a tool that’s focused on stream processing. Kafka Streams should rather allow for “composition” with these deployment tools and resource managers (think: Unix philosophy) rather than being strongly opinionated and imposing any such choices on you.
Stream: ordered, re-playable, fault-tolerant sequence of immutable data records
Data records: key-value pairs (closely related to Kafka’s key-value messages)
Processor topology: computational logic of an app’s data processing
Defined via the Kafka Streams DSL or the low-level Processor API
Stream processor: a node in the topology, represents a processing step
As a user, you will only be exposed to these nuts n’ bolts if you use the Processor API. The Kafka Streams DSL hides this from you.
Stream partitions and stream tasks are the logical units of parallelism
Stream partition: totally ordered sequence of data records; maps to a Kafka topic
A data record in the stream maps to a Kafka message from that topic
The keys of data records determine the partitioning of data in both Kafka and Kafka Streams, i.e. how data is routed to specific partitions
A processor topology is scaled by breaking it into multiple stream tasks, based on the number of input stream partitions
Stream tasks work in isolation, i.e. independently from each other
Each stream task has its own local, fault-tolerant state
KStream: records are interpreted as INSERTs (since no record replaces any existing record)
KTable: records are interpreted as UPDATEs (since any existing row with the same key is overwritten)
Note: We ignore Bob in this diagram. Bob is only shown to highlight that there is generally more data than just Alice’s.
You can go back and forth between streams and tables.
If you need to turn a stream into a table, you must specify the semantics of this transformation, for example when doing aggregations. These semantics can’t be second-guessed by a stream processing tool – for example, do you want to add or to multiply the numbers in a stream? Only you know.
In contrast, turning a table into a stream is straightforward – we can simply iterate through every key/row in the table.
To summarize: some operations such as map() retain the shape (e.g. a stream will stay a stream), some operations change the shape (e.g. a stream will become a table).
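As a tiny sketch of going back and forth, assuming the 0.10-era DSL (the store name is illustrative):

// Stream -> table: you must supply the aggregation semantics (here: addition).
KTable<String, Long> table = stream.reduceByKey((v1, v2) -> v1 + v2, "sum-store");

// Table -> stream: simply iterate through the table's changelog.
KStream<String, Long> changelog = table.toStream();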
A notable aspect of KTable is that it is being continuously updated “behind the scenes” as new data arrives to its input topic. This means that the join operations will automatically be using the latest available data for the KTable.
Careful: note how time flows in this diagram!
A proper notion and model of time is crucial for stream processing.