1. Stephan Ewen, Kostas Tzoumas
Flink committers
co-founders @ data Artisans
@StephanEwen, @kostas_tzoumas
Flink-0.10
2. What is Flink
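[Figure: the Flink stack]
Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Cascading (WiP)
Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Tez, Embedded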
3. Flink 0.10 Summary
Focus on operational readiness
• high availability
• monitoring
• integration with other systems
First-class support for event time
Refined DataStream API: easy and powerful
4. Improved DataStream API
Stream data analysis differs from batch data analysis by introducing time
Streams are unbounded and produce data over time
As simple as the batch API if you handle time in a simple way
Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc.)
5. Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
6. Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
.keyBy("location")
.timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
.sum("numVehicles")
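The sliding window above is 15 minutes long and is evaluated every 5 minutes, so each event belongs to three overlapping windows. The following is a minimal plain-Java sketch of that assignment, not Flink's implementation; the class `SlidingWindows` and the method `windowStarts` are hypothetical names, and window starts are assumed to be aligned to multiples of the slide.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindows {

    /**
     * Returns the start timestamps of every sliding window [start, start + size)
     * that contains the given timestamp. Assumes window starts are aligned to
     * multiples of the slide; all values use the same time unit.
     */
    public static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // An event at minute 17 falls into the windows starting at minutes 15, 10 and 5.
        System.out.println(windowStarts(17 * minute, 15 * minute, 5 * minute));
        // prints [900000, 600000, 300000]
    }
}
```

With a size of 15 and a slide of 5, every event contributes to size / slide = 3 window results, which is why a sliding sum is more expensive than a tumbling one.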
7. Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
.keyBy("location")
.timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
.trigger(new Threshold(200))
.sum("numVehicles")
8. Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
.filter { evt => isIntersection(evt.location) }
.keyBy("location")
.timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
.trigger(new Threshold(200))
.sum("numVehicles")
.keyBy( evt => evt.location.grid )
.mapWithState { (evt, state: Option[Model]) => {
  val model = state.getOrElse(new Model())
  (model.classify(evt), Some(model.update(evt)))
}}
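The `mapWithState` call above takes a function from (event, optional state) to (output, optional new state), applied per key. A rough plain-Java sketch of that contract follows. It is only an illustration under stated assumptions: `KeyedStateSketch` and `Result` are hypothetical names, state lives in an in-memory map, and nothing here is checkpointed the way Flink's state would be.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical stand-in for a keyed, stateful map; not Flink's implementation. */
public class KeyedStateSketch<K, IN, OUT, S> {

    /** What the user function returns: an output and the state to keep for the key. */
    public static final class Result<OUT, S> {
        final OUT output;
        final Optional<S> newState;

        public Result(OUT output, Optional<S> newState) {
            this.output = output;
            this.newState = newState;
        }
    }

    private final Map<K, S> stateByKey = new HashMap<>();
    private final Function<IN, K> keySelector;
    private final BiFunction<IN, Optional<S>, Result<OUT, S>> userFunction;

    public KeyedStateSketch(Function<IN, K> keySelector,
                            BiFunction<IN, Optional<S>, Result<OUT, S>> userFunction) {
        this.keySelector = keySelector;
        this.userFunction = userFunction;
    }

    /** Looks up the state for the event's key, calls the user function, stores the new state. */
    public OUT process(IN event) {
        K key = keySelector.apply(event);
        Optional<S> current = Optional.ofNullable(stateByKey.get(key));
        Result<OUT, S> result = userFunction.apply(event, current);
        if (result.newState.isPresent()) {
            stateByKey.put(key, result.newState.get());
        } else {
            stateByKey.remove(key); // an empty state clears the entry for this key
        }
        return result.output;
    }

    public static void main(String[] args) {
        // Running count per word: the state is the count seen so far for that key.
        KeyedStateSketch<String, String, Integer, Integer> counter =
            new KeyedStateSketch<String, String, Integer, Integer>(
                word -> word,
                (word, state) -> {
                    int count = state.orElse(0) + 1;
                    return new Result<Integer, Integer>(count, Optional.of(count));
                });
        System.out.println(counter.process("a")); // prints 1
        System.out.println(counter.process("a")); // prints 2
        System.out.println(counter.process("b")); // prints 1
    }
}
```

Keeping the state behind a key lookup is what lets the same user function run in parallel: each key's state is only ever touched by the events for that key.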
9. IoT / Mobile Applications
Events occur on devices
Events stored in a log (Queue / Log)
Events analyzed in a data streaming system (Stream Analysis)
13. IoT / Mobile Applications
Out of order !!!
First burst of events
Second burst of events
14. IoT / Mobile Applications
Event time windows
Arrival time windows
Instant event-at-a-time
Flink supports out-of-order (event time) windows, arrival time windows (and mixtures of both), plus low-latency processing.
First burst of events
Second burst of events
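To make the difference between event time windows and arrival time windows concrete, here is a small plain-Java sketch that counts the same events per tumbling window twice: once by the time they occurred and once by the time they arrived. The class name `TimeSemanticsSketch`, the window size of 10, and the sample timestamps are all made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class TimeSemanticsSketch {

    /** Counts timestamps per tumbling window; the map key is the window start. */
    public static Map<Long, Integer> countPerWindow(long[] timestamps, long windowSize) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long start = ts - Math.floorMod(ts, windowSize);
            counts.merge(start, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Six events; the last one occurred during the first burst (event time 4)
        // but reached the system late, after the second burst (arrival time 17).
        long[] eventTimes   = {1, 2, 3, 11, 12, 4};
        long[] arrivalTimes = {5, 6, 7, 15, 16, 17};

        // Grouped by when the events happened: the late event counts for the first burst.
        System.out.println(countPerWindow(eventTimes, 10));   // prints {0=4, 10=2}
        // Grouped by when the events arrived: the late event is attributed to the second burst.
        System.out.println(countPerWindow(arrivalTimes, 10)); // prints {0=3, 10=3}
    }
}
```

Only the event time grouping reproduces the two bursts as they actually occurred on the devices; the arrival time grouping reflects delays in the queue as well.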
15.
We need a global clock that runs on event time instead of processing time.
16.
[Figure: animated dataflow diagram. Labels: "This is a source", "This is our window operator", "This is the current event time", "This is a watermark". The numeric values shown in the diagram (0, 1, 2) are not recoverable as text.]
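One way to picture the event-time clock is an operator that remembers the latest watermark per input channel and treats the minimum as its current event time. The plain-Java sketch below illustrates that idea and is not Flink's implementation. The class `WatermarkSketch` is hypothetical, windows are tumbling, and a watermark `w` is assumed to promise that no later event carries a timestamp below `w`; late events are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical window operator that advances its clock with watermarks. */
public class WatermarkSketch {
    private final long[] channelWatermarks;                            // latest watermark per input
    private final long windowSize;
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>(); // window start -> event count

    public WatermarkSketch(int numChannels, long windowSize) {
        this.channelWatermarks = new long[numChannels];
        Arrays.fill(this.channelWatermarks, Long.MIN_VALUE);
        this.windowSize = windowSize;
    }

    /** Buffers an event in the tumbling window that covers its event time. */
    public void onEvent(long eventTime) {
        long start = eventTime - Math.floorMod(eventTime, windowSize);
        openWindows.merge(start, 1, Integer::sum);
    }

    /** The operator's event time is the slowest of its input channels. */
    public long currentEventTime() {
        return Arrays.stream(channelWatermarks).min().getAsLong();
    }

    /** Records a watermark and returns the windows it completes, as "start:count". */
    public List<String> onWatermark(int channel, long watermark) {
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermark);
        long now = currentEventTime();
        List<String> fired = new ArrayList<>();
        // A window [start, start + size) is complete once event time has reached its end.
        while (!openWindows.isEmpty() && openWindows.firstKey() + windowSize <= now) {
            Map.Entry<Long, Integer> window = openWindows.pollFirstEntry();
            fired.add(window.getKey() + ":" + window.getValue());
        }
        return fired;
    }

    public static void main(String[] args) {
        WatermarkSketch operator = new WatermarkSketch(2, 10);
        operator.onEvent(3);
        operator.onEvent(7);
        operator.onEvent(12);
        System.out.println(operator.onWatermark(0, 10)); // prints [] (channel 1 has not reported yet)
        System.out.println(operator.onWatermark(1, 11)); // prints [0:2]
    }
}
```

Because the clock is driven by watermarks rather than the wall clock, the result for window 0 is the same whether the events arrive quickly or after a long delay.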
21. High Availability and Consistency
No single point of failure any more
Exactly-once processing semantics across the pipeline
Checkpoints/fault tolerance are decoupled from windows
Allows for highly flexible window implementations
[Figure: ZooKeeper ensemble, multiple masters, failover]
22. Operator State
Stateless operators
ds.filter(_ != 0)
System state
ds.keyBy("id").timeWindow(Time.of(5, SECONDS)).reduce(…)
User-defined state
public class CounterSum extends RichReduceFunction<Long> {
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}
39. Batch and Streaming
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.keyBy("word")
.window(GlobalWindows.create())
.trigger(new EOFTrigger())
.sum("count")
Batch Word Count in the DataStream API
40. Batch and Streaming
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.keyBy("word")
.window(GlobalWindows.create())
.trigger(new EOFTrigger())
.sum("count")
Batch Word Count in the DataSet API
val text: DataSet[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.groupBy("word")
.sum("count")
41. Batch and Streaming
[Figure: batch and streaming on one runtime]
DataSet: Relational Optimizer
Batch Parameters: pipelined and blocking operators, schedule lazily, recompute whole operators, fully buffered streams
DataStream: Window Optimization
Streaming Parameters: pipelined and windowed operators, schedule eagerly, periodic checkpoints
Streaming Dataflow Runtime: streaming data movement, stateful operations, DAG recovery, DAG resource management
45. Monitoring
Live system metrics and user-defined accumulators/statistics
Monitoring REST API for custom monitoring tools
GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators
{ "id": "dceafe2df1f57a1206fcb907cb38ad97", "user-accumulators": [
{ "name":"avglen", "type":"DoubleCounter", "value":"123.03259440000001" },
{ "name":"genwords", "type":"LongCounter", "value":"75000000" } ] }
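A custom monitoring tool could poll that endpoint with any HTTP client. Below is a minimal sketch using the JDK's built-in client (Java 11 or later). The class `AccumulatorClient` and its method names are hypothetical; only the `/jobs/<job id>/accumulators` path and the `flink-m:8081` host come from the request shown above, and the response is returned as a raw JSON string without parsing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for polling the accumulators of one job. */
public class AccumulatorClient {

    /** Builds the accumulators URL for a job, following the path shown on the slide. */
    public static URI accumulatorsUri(String hostAndPort, String jobId) {
        return URI.create("http://" + hostAndPort + "/jobs/" + jobId + "/accumulators");
    }

    /** Sends the GET request and returns the response body (JSON) as a string. */
    public static String fetchAccumulators(String hostAndPort, String jobId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(accumulatorsUri(hostAndPort, jobId))
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Only builds the URL here; calling fetchAccumulators needs a reachable job manager.
        System.out.println(accumulatorsUri("flink-m:8081", "7684be6004e4e955c2a558a9bc463f65"));
    }
}
```

Polling this URL on a timer and feeding the `user-accumulators` values into an existing dashboard is one plausible way to use the API from outside Flink.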
46. Flink 0.10 Summary
Focus on operational readiness
• high availability
• monitoring
• integration with other systems
First-class support for event time
Refined DataStream API: easy and powerful
Editor's Notes
People previously made the case that high throughput and low latency are mutually exclusive.