As more and more organizations and individual users turn to Apache Flink for their streaming workloads, there is growing demand for additional out-of-the-box functionality. On one hand, there is demand for more low-level APIs that allow for more control, while on the other, users ask for more high-level additions that make the common cases easier to express. This talk presents the new concepts added to the DataStream API in Flink 1.2 and the upcoming Flink 1.3 release that try to reconcile these goals. We will cover, among other things, the ProcessFunction, a new low-level stream processing primitive that gives the user full control over how each event is processed and can register and react to timers; changes in the windowing logic that allow for more flexible windowing strategies; side outputs; and new features concerning the Flink connectors.
7. Common Use-Case Skeleton A
On each incoming element:
• update some state
• register a callback for a moment in the future
When that moment comes:
• check a condition and perform a certain action, e.g. emit an element
8. The Flink 1.1 way
Use built-in windowing:
• + Expressive
• + A lot of functionality out-of-the-box
• - Not always intuitive
• - Overkill for simple cases
Write your own operator:
• - Too many things to account for in Flink 1.1
9. The Flink 1.2 way: ProcessFunction
Gives access to all basic building blocks:
• Events
• Fault-tolerant, Consistent State
• Timers (event- and processing-time)
10. The Flink 1.2 way: ProcessFunction
Simple yet powerful API:
/**
* Process one element from the input stream.
*/
void processElement(I value, Context ctx, Collector<O> out) throws Exception;
/**
* Called when a timer set using {@link TimerService} fires.
*/
void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception;
11. The Flink 1.2 way: ProcessFunction
The arguments of the two calls above:
• out: a Collector to emit result values
• ctx in processElement():
1. get the timestamp of the element
2. interact with the TimerService to query the current time and to register timers
• ctx in onTimer(): do the above, plus query whether we are operating on event or processing time
13. ProcessFunction: example
Requirements:
• maintain counts per incoming key, and
• emit the key/count pair if no element came for the key in the last 100 ms (in event time)
14. ProcessFunction: example
Implementation sketch:
• Store the count, key and last mod timestamp in
a ValueState (scoped by key)
• For each record:
• update the counter and the last mod timestamp
• register a timer 100ms from “now” (in event time)
• When the timer fires:
• check the callback’s timestamp against the last mod time for the key, and
• emit the key/count pair if they match
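Stripped of the Flink APIs, the core of this sketch can be modeled in plain Java. The class and method names below are made up for illustration, and event time is advanced by hand rather than by watermarks:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountTimeoutModel {
    static class Entry { long count; long lastModified; }

    private final Map<String, Entry> state = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    // "processElement": update the count and last-modified timestamp for the key
    void onElement(String key, long eventTime) {
        Entry e = state.computeIfAbsent(key, k -> new Entry());
        e.count++;
        e.lastModified = eventTime;
        // (in Flink we would also register a timer for eventTime + 100 here)
    }

    // "onTimer": a timer registered at time t fires at t + 100
    void onTimer(String key, long timerTime) {
        Entry e = state.get(key);
        // emit only if no newer element arrived since this timer was registered
        if (e != null && timerTime == e.lastModified + 100) {
            emitted.add(key + "=" + e.count);
        }
    }

    List<String> emitted() { return emitted; }

    public static void main(String[] args) {
        CountTimeoutModel m = new CountTimeoutModel();
        m.onElement("a", 0);    // timer would fire at t=100
        m.onElement("a", 50);   // timer would fire at t=150; the t=100 timer is now stale
        m.onTimer("a", 100);    // stale: 50 + 100 != 100, nothing emitted
        m.onTimer("a", 150);    // matches: emit the pair
        System.out.println(m.emitted());  // [a=2]
    }
}
```

The stale-timer check (timerTime == lastModified + 100) is what lets newer elements silently invalidate older timers, which is exactly the trick the slides' implementation uses.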
15. ProcessFunction: example
public class MyProcessFunction extends
RichProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {
@Override
public void open(Configuration parameters) throws Exception {
// register our state with the state backend
}
@Override
public void processElement(Tuple2<String, String> value, Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// update our state and register a timer
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
// check the state for the key and emit a result if needed
}
}
16. ProcessFunction: example
public class MyProcessFunction extends
RichProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {
private ValueState<CountWithTimestamp> state;
@Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(
new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
}
}
17. ProcessFunction: example
public class MyProcessFunction extends
RichProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {
@Override
public void processElement(Tuple2<String, String> value, Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
CountWithTimestamp current = state.value();
if (current == null) {
current = new CountWithTimestamp();
current.key = value.f0;
}
current.count++;
current.lastModified = ctx.timestamp();
state.update(current);
ctx.timerService().registerEventTimeTimer(current.lastModified + 100);
}
}
18. ProcessFunction: example
public class MyProcessFunction extends
RichProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {
@Override
public void onTimer(long timestamp, OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) throws Exception {
CountWithTimestamp result = state.value();
if (timestamp == result.lastModified) {
out.collect(new Tuple2<String, Long>(result.key, result.count));
}
}
}
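The slides never show the CountWithTimestamp class that backs the ValueState; a minimal version, with field names taken from the snippets above, could be:

```java
// Minimal state holder for the running example (not shown on the slides;
// field names follow the usages in processElement/onTimer).
public class CountWithTimestamp {
    public String key;          // the record's key (value.f0)
    public long count;          // how many elements we have seen for this key
    public long lastModified;   // event-time timestamp of the last update
}
```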
19. ProcessFunction: example
If your stream is not keyed, you can always group on a dummy key.
BEWARE: this results in a parallelism of 1.
stream.keyBy("id")
.process(new MyProcessFunction())
20. ProcessFunction: miscellaneous
CoProcessFunction for low-level joins:
• Applied on two input streams
• Has two processElement() methods (processElement1() and processElement2()), one for each input stream
Upcoming releases may further enhance the
ProcessFunction/CoProcessFunction
Planning to transform all CEP operators to ProcessFunctions
22. Common Use-Case Skeleton B
On each incoming element:
• extract some info from the element (e.g. key)
• query an external storage system (DB or KV-store) for additional info
• emit an enriched version of the input element
23. The Flink 1.1 way
Write a MapFunction that queries the DB:
• + Simple
• - Slow (synchronous access) and/or
• - Requires high parallelism (more tasks)
Write your own operator:
• - Too many things to account for in Flink 1.1
28. The Flink 1.2 way: AsyncFunction
Requirement:
• a client that supports asynchronous requests
Flink handles the rest:
• integration of async I/O with the DataStream API
• fault tolerance
• order of emitted elements
• correct time semantics (event/processing time)
29. The Flink 1.2 way: AsyncFunction
Simple API:
/**
* Trigger async operation for each stream input.
*/
void asyncInvoke(IN input, AsyncCollector<OUT> collector) throws Exception;
API call:
/**
* Example async function call.
*/
DataStream<...> result = AsyncDataStream.(un)orderedWait(stream,
new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);
30. The Flink 1.2 way: AsyncFunction
[Diagram: incoming element E5, the AsyncWaitOperator with its queue of promises P1–P4, and the Emitter thread]
AsyncWaitOperator:
• a queue of “Promises”
• a separate thread (the Emitter)
31. The Flink 1.2 way: AsyncFunction
[Diagram: E5 arrives at the AsyncWaitOperator]
• Wrap E5 in a “promise” P5
• Put P5 in the queue
• Call asyncInvoke(E5, P5)
32. The Flink 1.2 way: AsyncFunction
[Diagram: P5 sits in the queue while asyncInvoke(E5, P5) runs]
asyncInvoke(value, asyncCollector):
• a user-defined function
• value: the input element
• asyncCollector: the collector of the result (when the query returns)
33. The Flink 1.2 way: AsyncFunction
[Diagram: a callback attached to P5 completes the promise when the query returns]
Future<String> future = client.query(E5);
future.thenAccept((String result) -> {
P5.collect(
Collections.singleton(
new Tuple2<>(E5, result)));
});
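Outside Flink, the same promise-with-callback pattern can be written with the standard library's CompletableFuture. The client and its query() method below are simulated stand-ins, not a real Flink or database API:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncClientSketch {
    // Simulated asynchronous client: returns a future immediately instead of blocking.
    static CompletableFuture<String> query(String key) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key);
    }

    public static void main(String[] args) {
        CompletableFuture<String> future = query("E5");
        // Attach a callback instead of blocking: it runs when the promise completes.
        future.thenAccept(result -> System.out.println("E5 -> " + result))
              .join();   // demo only: wait for the callback before the JVM exits
    }
}
```

The crucial point from the notes applies here too: if query() were synchronous, attaching a callback would gain nothing, because the calling thread would already have blocked inside the call.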
35. The Flink 1.2 way: AsyncFunction
[Diagram: the Emitter polls the promise queue]
Emitter:
• a separate thread
• polls the queue for completed promises (blocking)
• emits elements downstream
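The queue-plus-emitter mechanics can be sketched with a BlockingQueue of futures: one promise per element is enqueued, completed asynchronously, and drained head-first, which gives orderedWait-style (arrival-order) output. A simplified model, not the actual operator code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class EmitterSketch {
    static List<String> run(String... elements) throws InterruptedException {
        BlockingQueue<CompletableFuture<String>> pending = new LinkedBlockingQueue<>();
        ExecutorService client = Executors.newFixedThreadPool(4);

        // "asyncInvoke": wrap each element in a promise and fire the async request
        for (String element : elements) {
            CompletableFuture<String> promise = new CompletableFuture<>();
            pending.add(promise);
            client.submit(() -> promise.complete("enriched-" + element));
        }

        // Emitter: take the head of the queue and block until that promise is done,
        // so elements leave in arrival order regardless of completion order
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < elements.length; i++) {
            emitted.add(pending.take().join());
        }
        client.shutdown();
        return emitted;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run("E1", "E2", "E3"));  // [enriched-E1, enriched-E2, enriched-E3]
    }
}
```

Unordered emission would instead hand each completed promise to the emitter as soon as it finishes, which is the difference the next slides illustrate.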
36. The Flink 1.2 way: AsyncFunction
DataStream<Tuple2<String, String>> result =
AsyncDataStream.(un)orderedWait(stream,
new MyAsyncFunction(),
1000, TimeUnit.MILLISECONDS,
100);
• new MyAsyncFunction(): our AsyncFunction
• 1000, TimeUnit.MILLISECONDS: a timeout, the max time until a request is considered failed
• 100: the capacity, the max number of in-flight requests
38. The Flink 1.2 way: AsyncFunction
DataStream<Tuple2<String, String>> result =
AsyncDataStream.(un)orderedWait(stream,
new MyAsyncFunction(),
1000, TimeUnit.MILLISECONDS,
100);
Ideally: [Diagram: elements E1–E4 and their promises P1–P4 reach the Emitter in arrival order]
39. The Flink 1.2 way: AsyncFunction
DataStream<Tuple2<String, String>> result =
AsyncDataStream.unorderedWait(stream,
new MyAsyncFunction(),
1000, TimeUnit.MILLISECONDS,
100);
Realistically: [Diagram: the Emitter’s output is ordered based on which request finished first]
40. The Flink 1.2 way: AsyncFunction
[Diagram: elements E1–E4 and their promises P1–P4 at the Emitter]
unorderedWait: emit results in order of completion
orderedWait: emit results in order of arrival
Always: watermarks never overtake elements, and vice versa
43.
One day of hands-on Flink training
One day of conference
Tickets are on sale
Call for Papers is already open
Please visit our website: http://sf.flink-forward.org
Follow us on Twitter: @FlinkForward
My name is Kostas Kloudas and I am here to talk to you about some of the latest extensions of Flink’s streaming APIs.
A bit about me: I am a Flink committer and a software engineer at data Artisans...
So, enough with the introductions, let’s cut to the chase. As the streaming space and the Flink community grow, Flink grows with them.
This has led to a number of cool new features being added in Flink 1.2.
These features range from:
• engine enhancements to support better performance and features like rescaling,
• low-level abstractions that allow for easier interaction with Flink’s low-level mechanisms, and
• higher-level API enhancements that make the implementation of common cases easier.
In this talk I will focus on the two highlighted features, namely: ....
An example could be that you have your recommendation system, and users that navigate from an item to its related/recommended ones.
In this case, and to adjust your recommendation algorithm, you can have a “rule” that says if the user does not purchase the related clicked item within X sec,
send a signal to the recommendation system that the recommendation was not good
For those of you familiar with the Flink APIs, you can imagine this as a flatMap with the ability to register and react to timers.
Not always intuitive: you do not want to think about assigners, triggers, functions, etc. when all you want is something as simple as a flatMap with a timer.
For the 2nd: if windowing is overkill, imagine a custom operator.
So, that was in Flink 1.1 and these remain valid approaches also in Flink 1.2. But, to make things easier, Flink 1.2 ships with a new abstraction called the ProcessFunction, that was introduced to cover precisely these cases.
The ProcessFunction is a low-level stream processing operation, which gives access to the basic building blocks of all (acyclic) streaming applications:
events (stream elements)
state (fault tolerant, consistent)
timers (event time and processing time)
Again, you can imagine it as a flatmap with access to state and timers.
Focusing on the arguments of each of the calls:...
Emphasize that time stands for both event and processing time.
This example is copied from our documentation for which I will provide a link at the end of the slides (but you can always use your favorite search engine to look for ProcessFunction in Flink)
... Following the same pattern as before...
Let’s focus a bit on the “synchronous access” part and see what this stands for.
As shown in the figure, synchronous access means that after sending a request for key a, you have to wait for the response, before being able to send the next request for key b.
In the figure, with brown we show the waiting time, and we can see that this can easily dominate throughput and latency.
To address the problems of synchronous access, the asynchronous pattern allows for multiplexing requests and responses, so that you send requests for a, b, c, etc.
and, at the same time, receive the responses as they arrive, without waiting between consecutive requests.
This is exactly the pattern that AsyncIO implements. And in order to leverage its capabilities, the only requirement it imposes is:
If you have this, then Flink will provide the rest, such as...
The API of the async function requires the implementation of a single method ... Which is the one that triggers an async operation for each input element.
And to integrate it into your program, you will have to write something like the following:
We will see more about the details of these methods in the following slides.
So now that we have the 10,000-foot view of async I/O, let’s see how this works:
This is the diagram of our AsyncWaitOperator, the operator that runs our asyncFunction.
As we can see, it is composed of a queue of ”Promises” and a separate thread, the “Emitter”, which is responsible for sending elements (e.g. the received responses) downstream.
A ”promise” is an asynchronous abstraction which “promises” to have a value in the future.
This queue is the queue of PENDING promises, e.g. our pending requests.
A ”promise” is an asynchronous abstraction which “promises” to have a value in the future.
On this promise, we can attach a callback, which will be triggered upon completion of the requested action, i.e.
When the promise has a concrete value (or completes with an exception)
The CLIENT should be asynchronous. If not, the call will block in query() and we get the same synchronous pattern as before.
As operations are served asynchronously, the order of the output elements will not match that of their respective input elements; it depends on how fast the storage system serves each individual request.
To control the order of the emitted elements, Flink can operate in two modes: