SlideShare uma empresa Scribd logo
1 de 61
Stream & Batch Processing
with Apache Flink
Robert Metzger
@rmetzger_
rmetzger@apache.org
Munich Apache Flink Meetup,
November 11, 2015
A bit of history
From incubation until now
2
3
Apr 2014 Jun 2015Dec 2014
0.70.60.5 0.90.9-m1 0.10
Nov 2015
Top level
0.8
Community growth
Flink is one of the largest and most active Apache
big data projects with well over 120 contributors
4
Flink meetups around the globe
5
Organizations at Flink Forward
6
Featured in
7
The streaming era
Flink is all about
8
9
batch
event
based
need new systems
well served
10
Streaming is the biggest change in
data infrastructure since Hadoop
What is stream processing
 Real-world data is unbounded and is
pushed to systems
 Right now: people are using the batch
paradigm for stream analysis (there was
no good stream processor available)
 New systems (Flink, Kafka) embrace
streaming nature of data
11
Web server Kafka topic
Stream processing
12
Flink is a stream processor with many faces
Streaming dataflow runtime
Flink's streaming runtime
13
Requirements for a stream processor
 Low latency
• Fast results (milliseconds)
 High throughput
• handle large data amounts (millions of events
per second)
 Exactly-once guarantees
• Correct results, also in failure cases
 Programmability
• Intuitive APIs
14
Pipelining
15
Basic building block to “keep the data moving”
• Low latency
• Operators push
data forward
• Data shipping as
buffers, not tuple-
wise
• Natural handling
of back-pressure
Fault Tolerance in streaming
 at least once: ensure all operators see all
events
• Storm: Replay stream in failure case
 Exactly once: Ensure that operators do
not perform duplicate updates to their
state
• Flink: Distributed Snapshots
• Spark: Micro-batches on batch runtime
16
Flink’s Distributed Snapshots
 Lightweight approach of storing the state
of all operators without pausing the
execution
 high throughput, low latency
 Implemented using barriers flowing
through the topology
17
Kafka
Consumer
offset = 162
Element
Counter
value = 152
Operator
stateData Stream
barrier
Before barrier =
part of the snapshot
After barrier =
Not in snapshot
(backup till next snapshot)
DataStream API
18
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.keyBy("word")
.window(GlobalWindows.create())
.trigger(new EOFTrigger())
.sum("count")
Batch Word Count in the DataStream API
Batch Word Count
in the DataSet API
19
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.keyBy("word")
.window(GlobalWindows.create())
.trigger(new EOFTrigger())
.sum("count")
val text: DataSet[String] = …;
text
.flatMap { line => line.split(" ") }
.map { word => new WordCount(word, 1) }
.groupBy("word")
.sum("count")
Best of all worlds for streaming
 Low latency
• Thanks to pipelined engine
 Exactly-once guarantees
• Distributed Snapshots
 High throughput
• Controllable checkpointing overhead
 Programmability
• APIs similar to those known from the batch world
20
Throughput of distributed grep
21
Data
Generator
“grep”
operator
30 machines, 120 cores
0
20,000,000
40,000,000
60,000,000
80,000,000
100,000,000
120,000,000
140,000,000
160,000,000
180,000,000
200,000,000
Flink, no fault
tolerance
Flink, exactly
once (5s)
Storm, no
fault tolerance
Storm, micro-
batches
aggregate throughput
of 175 million
elements per second
aggregate throughput
of 9 million elements
per second
• Flink achieves 20x
higher throughput
• Flink throughput
almost the same
with and without
exactly-once
Aggregate throughput for stream record
grouping
22
0
10,000,000
20,000,000
30,000,000
40,000,000
50,000,000
60,000,000
70,000,000
80,000,000
90,000,000
100,000,000
Flink, no
fault
tolerance
Flink,
exactly
once
Storm, no
fault
tolerance
Storm, at
least once
aggregate throughput
of 83 million elements
per second
8,6 million elements/s
309k elements/s  Flink achieves 260x
higher throughput with
fault tolerance
30 machines,
120 cores Network
transfer
Latency in stream record grouping
23
Data
Generator
Receiver:
Throughput /
Latency measure
• Measure time for a record to
travel from source to sink
0.00
5.00
10.00
15.00
20.00
25.00
30.00
Flink, no
fault
tolerance
Flink, exactly
once
Storm, at
least once
Median latency
25 ms
1 ms
0.00
10.00
20.00
30.00
40.00
50.00
60.00
Flink, no
fault
tolerance
Flink,
exactly
once
Storm, at
least
once
99th percentile
latency
50 ms
24
Exactly-Once with YARN Chaos Monkey
 Validate exactly-once guarantees with
state-machine
25
Performance: Summary
26
Continuous
streaming
Latency-bound
buffering
Distributed
Snapshots
High Throughput &
Low Latency
With configurable throughput/latency tradeoff
“Faces” of Flink
27
Faces of a stream processor
28
Stream
processing
Batch
processing
Machine Learning at scale
Graph Analysis
Streaming dataflow runtime
The Flink Stack
29
Streaming dataflow runtime
Specialized
Abstractions
/ APIs
Core APIs
Flink Core
Runtime
Deployment
The Flink Stack
30
Streaming dataflow runtime
DataSet (Java/Scala) DataStream (Java/Scala)
Experimental
Python API also
available
Data Source
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
API independent Dataflow
Graph representation
Batch Optimizer Graph Builder
Batch is a special case of streaming
 Batch: run a bounded stream (data set) on
a stream processor
 Form a global window over the entire data
set for join or grouping operations
31
Batch-specific optimizations
 Managed memory on- and off-heap
• Operators (join, sort, …) with out-of-core
support
• Optimized serialization stack for user-types
 Cost-based Optimizer
• Job execution depends on data size
32
The Flink Stack
33
Streaming dataflow runtime
Specialized
Abstractions
/ APIs
Core APIs
Flink Core
Runtime
Deployment
DataSet (Java/Scala) DataStream
FlinkML: Machine Learning
 API for ML pipelines inspired by scikit-learn
 Collection of packaged algorithms
• SVM, Multiple Linear Regression, Optimization, ALS, ...
34
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
pipeline.fit(trainingData)
val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
Gelly: Graph Processing
 Graph API and library
 Packaged algorithms
• PageRank, SSSP, Label Propagation, Community
Detection, Connected Components
35
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
Graph<Long, Long, NullValue> graph = ...
DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run(
new LabelPropagation<Long>(30)).getVertices();
verticesWithCommunity.print();
env.execute();
Flink Stack += Gelly, ML
36
Gelly
ML
DataSet (Java/Scala) DataStream
Streaming dataflow runtime
Integration with other systems
37
SAMOA
DataSet DataStream
HadoopM/R
GoogleDataflow
Cascading
Storm
Zeppelin
• Use Hadoop Input/Output Formats
• Mapper / Reducer implementations
• Hadoop’s FileSystem implementations
• Run applications implemented against Google’s Data Flow API
on premise with Flink
• Run Cascading jobs on Flink, with almost no code change
• Benefit from Flink’s vastly better performance than
MapReduce
• Interactive, web-based data exploration
• Machine learning on data streams
• Compatibility layer for running Storm code
• FlinkTopologyBuilder: one line replacement for
existing jobs
• Wrappers for Storm Spouts and Bolts
• Coming soon: Exactly-once with Storm
• Build-in serializer support for
Hadoop’s Writable type
Deployment options
Gelly
Table
ML
SAMOA
DataSet(Java/Scala)DataStream
Hadoop
LocalClusterYARNTezEmbedded
Dataflow
Dataflow
MRQL
Table
Cascading
Streamingdataflowruntime
Storm
Zeppelin
• Start Flink in your IDE / on your machine
• Local debugging / development using the
same code as on the cluster
• “bare metal” standalone installation of Flink
on a cluster
• Flink on Hadoop YARN (Hadoop 2.2.0+)
• Restarts failed containers
• Support for Kerberos-secured YARN/HDFS
setups
The full stack
39
Gelly
Table
ML
SAMOA
DataSet (Java/Scala) DataStream
HadoopM/R
Local Cluster Yarn Tez Embedded
Dataflow
Dataflow(WiP)
MRQL
Table
Cascading
Streaming dataflow runtime
Storm(WiP)
Zeppelin
Flink Use-cases and
production users
40
Bouygues Telecom
41
Bouygues Telecom
42
Bouygues Telecom
43
Capital One
44
45
Closing
46
What is currently happening?
 Features for upcoming 0.10 release:
• Master High Availability
• Vastly improved monitoring GUI
• Watermarks / Event time processing /
Windowing rework
 The next talk by Matthias
• Stable streaming API (no more “beta” flag)
 Next release: 1.0
• API stability
47
How do I get started?
48
Mailing Lists: (news | user | dev)@flink.apache.org
Twitter: @ApacheFlink
Blogs: flink.apache.org/blog, data-artisans.com/blog/
IRC channel: irc.freenode.net#flink
Start Flink on YARN in 4 commands:
# get the hadoop2 package from the Flink download page at
# http://flink.apache.org/downloads.html
wget <download url>
tar xvzf flink-0.9.1-bin-hadoop2.tgz
cd flink-0.9.1/
./bin/flink run -m yarn-cluster -yn 4 ./examples/flink-java-
examples-0.9.1-WordCount.jar
flink.apache.org 49
• Check out the slides: http://flink-
forward.org/?post_type=session
• Video recordings on YouTube, “Flink Forward”
channel
Appendix
50
51
52
53
54
Managed (off-heap) memory and out-of-
core support
55Memory runs out
Cost-based Optimizer
56
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
broadcast forward
Combine
GroupRed
sort
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
Join
Hybrid Hash
buildHT probe
hash-part [0] hash-part [0]
hash-part [0,1]
GroupRed
sort
forward
Best plan
depends on
relative sizes
of input files
57
case class Path (from: Long, to:
Long)
val tc = edges.iterate(10) {
paths: DataSet[Path] =>
val next = paths
.join(edges)
.where("to")
.equalTo("from") {
(path, edge) =>
Path(path.from, edge.to)
}
.union(paths)
.distinct()
next
}
Optimizer
Type extraction
stack
Task
scheduling
Dataflow
metadata
Pre-flight (Client)
JobManager
TaskManagers
Data
Source
orders.tbl
Filter
Map
DataSourc
e
lineitem.tbl
Join
Hybrid Hash
build
HT
probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
Dataflow
Graph
deploy
operators
track
intermediate
results
LocalCluster:YARN,Standalone
Iterative processing in Flink
Flink offers built-in iterations and delta
iterations to execute ML and graph
algorithms efficiently
58
map
join sum
ID1
ID2
ID3
Example: Matrix Factorization
59
Factorizing a matrix with
28 billion ratings for
recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html
60
Batch
aggregation
ExecutionGraph
JobManager
TaskManager 1
TaskManager 2
M1
M2
RP1RP2
R1
R2
1
2 3a
3b
4a
4b
5a
5b
"Blocked" result partition
61
Streaming
window
aggregation
ExecutionGraph
JobManager
TaskManager 1
TaskManager 2
M1
M2
RP1RP2
R1
R2
1
2 3a
3b
4a
4b
5a
5b
"Pipelined" result partition

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...
 
From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4From Apache Flink® 1.3 to 1.4
From Apache Flink® 1.3 to 1.4
 
Baymeetup-FlinkResearch
Baymeetup-FlinkResearchBaymeetup-FlinkResearch
Baymeetup-FlinkResearch
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016Apache Flink Berlin Meetup May 2016
Apache Flink Berlin Meetup May 2016
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
A Data Streaming Architecture with Apache Flink (berlin Buzzwords 2016)
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink MeetupCommunity Update May 2016 (January - May) | Berlin Apache Flink Meetup
Community Update May 2016 (January - May) | Berlin Apache Flink Meetup
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Data Analysis With Apache Flink
Data Analysis With Apache FlinkData Analysis With Apache Flink
Data Analysis With Apache Flink
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Apache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya MeetupApache Flink @ Tel Aviv / Herzliya Meetup
Apache Flink @ Tel Aviv / Herzliya Meetup
 
Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016Apache Flink: Streaming Done Right @ FOSDEM 2016
Apache Flink: Streaming Done Right @ FOSDEM 2016
 

Semelhante a Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Integrations and Use Cases

Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
Stephan Ewen
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
Kostas Tzoumas
 

Semelhante a Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Integrations and Use Cases (20)

Apache Flink: Past, Present and Future
Apache Flink: Past, Present and FutureApache Flink: Past, Present and Future
Apache Flink: Past, Present and Future
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 

Mais de Robert Metzger

Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
Robert Metzger
 
Flink Community Update February 2015
Flink Community Update February 2015Flink Community Update February 2015
Flink Community Update February 2015
Robert Metzger
 
Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.
Robert Metzger
 

Mais de Robert Metzger (16)

How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
How to Contribute to Apache Flink (and Flink at the Apache Software Foundation)
 
dA Platform Overview
dA Platform OverviewdA Platform Overview
dA Platform Overview
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin MeetupApache Flink Community Updates November 2016 @ Berlin Meetup
Apache Flink Community Updates November 2016 @ Berlin Meetup
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community Update
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community Update
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 
Berlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community UpdateBerlin Apache Flink Meetup May 2015, Community Update
Berlin Apache Flink Meetup May 2015, Community Update
 
Flink Community Update April 2015
Flink Community Update April 2015Flink Community Update April 2015
Flink Community Update April 2015
 
Apache Flink Community Update March 2015
Apache Flink Community Update March 2015Apache Flink Community Update March 2015
Apache Flink Community Update March 2015
 
Flink Community Update February 2015
Flink Community Update February 2015Flink Community Update February 2015
Flink Community Update February 2015
 
Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.Compute "Closeness" in Graphs using Apache Giraph.
Compute "Closeness" in Graphs using Apache Giraph.
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Apache Flink Meetup Munich (November 2015): Flink Overview, Architecture, Integrations and Use Cases

  • 1. Stream & Batch Processing with Apache Flink Robert Metzger @rmetzger_ rmetzger@apache.org Munich Apache Flink Meetup, November 11, 2015
  • 2. A bit of history From incubation until now 2
  • 3. 3 Apr 2014 Jun 2015Dec 2014 0.70.60.5 0.90.9-m1 0.10 Nov 2015 Top level 0.8
  • 4. Community growth Flink is one of the largest and most active Apache big data projects with well over 120 contributors 4
  • 5. Flink meetups around the globe 5
  • 8. The streaming era Flink is all about 8
  • 10. 10 Streaming is the biggest change in data infrastructure since Hadoop
  • 11. What is stream processing  Real-world data is unbounded and is pushed to systems  Right now: people are using the batch paradigm for stream analysis (there was no good stream processor available)  New systems (Flink, Kafka) embrace streaming nature of data 11 Web server Kafka topic Stream processing
  • 12. 12 Flink is a stream processor with many faces Streaming dataflow runtime
  • 14. Requirements for a stream processor  Low latency • Fast results (milliseconds)  High throughput • handle large data amounts (millions of events per second)  Exactly-once guarantees • Correct results, also in failure cases  Programmability • Intuitive APIs 14
  • 15. Pipelining 15 Basic building block to “keep the data moving” • Low latency • Operators push data forward • Data shipping as buffers, not tuple- wise • Natural handling of back-pressure
  • 16. Fault Tolerance in streaming  at least once: ensure all operators see all events • Storm: Replay stream in failure case  Exactly once: Ensure that operators do not perform duplicate updates to their state • Flink: Distributed Snapshots • Spark: Micro-batches on batch runtime 16
  • 17. Flink’s Distributed Snapshots  Lightweight approach of storing the state of all operators without pausing the execution  high throughput, low latency  Implemented using barriers flowing through the topology 17 Kafka Consumer offset = 162 Element Counter value = 152 Operator stateData Stream barrier Before barrier = part of the snapshot After barrier = Not in snapshot (backup till next snapshot)
  • 18. DataStream API 18 case class WordCount(word: String, count: Int) val text: DataStream[String] = …; text .flatMap { line => line.split(" ") } .map { word => new WordCount(word, 1) } .keyBy("word") .window(GlobalWindows.create()) .trigger(new EOFTrigger()) .sum("count") Batch Word Count in the DataStream API
  • 19. Batch Word Count in the DataSet API 19 case class WordCount(word: String, count: Int) val text: DataStream[String] = …; text .flatMap { line => line.split(" ") } .map { word => new WordCount(word, 1) } .keyBy("word") .window(GlobalWindows.create()) .trigger(new EOFTrigger()) .sum("count") val text: DataSet[String] = …; text .flatMap { line => line.split(" ") } .map { word => new WordCount(word, 1) } .groupBy("word") .sum("count")
  • 20. Best of all worlds for streaming  Low latency • Thanks to pipelined engine  Exactly-once guarantees • Distributed Snapshots  High throughput • Controllable checkpointing overhead  Programmability • APIs similar to those known from the batch world 20
  • 21. Throughput of distributed grep 21 Data Generator “grep” operator 30 machines, 120 cores 0 20,000,000 40,000,000 60,000,000 80,000,000 100,000,000 120,000,000 140,000,000 160,000,000 180,000,000 200,000,000 Flink, no fault tolerance Flink, exactly once (5s) Storm, no fault tolerance Storm, micro- batches aggregate throughput of 175 million elements per second aggregate throughput of 9 million elements per second • Flink achieves 20x higher throughput • Flink throughput almost the same with and without exactly-once
  • 22. Aggregate throughput for stream record grouping 22 0 10,000,000 20,000,000 30,000,000 40,000,000 50,000,000 60,000,000 70,000,000 80,000,000 90,000,000 100,000,000 Flink, no fault tolerance Flink, exactly once Storm, no fault tolerance Storm, at least once aggregate throughput of 83 million elements per second 8,6 million elements/s 309k elements/s  Flink achieves 260x higher throughput with fault tolerance 30 machines, 120 cores Network transfer
  • 23. Latency in stream record grouping 23 Data Generator Receiver: Throughput / Latency measure • Measure time for a record to travel from source to sink 0.00 5.00 10.00 15.00 20.00 25.00 30.00 Flink, no fault tolerance Flink, exactly once Storm, at least once Median latency 25 ms 1 ms 0.00 10.00 20.00 30.00 40.00 50.00 60.00 Flink, no fault tolerance Flink, exactly once Storm, at least once 99th percentile latency 50 ms
  • 24. 24
  • 25. Exactly-Once with YARN Chaos Monkey  Validate exactly-once guarantees with state-machine 25
  • 28. Faces of a stream processor 28 Stream processing Batch processing Machine Learning at scale Graph Analysis Streaming dataflow runtime
  • 29. The Flink Stack 29 Streaming dataflow runtime Specialized Abstractions / APIs Core APIs Flink Core Runtime Deployment
  • 30. The Flink Stack 30 Streaming dataflow runtime DataSet (Java/Scala) DataStream (Java/Scala) Experimental Python API also available Data Source orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] GroupRed sort forward API independent Dataflow Graph representation Batch Optimizer Graph Builder
  • 31. Batch is a special case of streaming  Batch: run a bounded stream (data set) on a stream processor  Form a global window over the entire data set for join or grouping operations 31
  • 32. Batch-specific optimizations  Managed memory on- and off-heap • Operators (join, sort, …) with out-of-core support • Optimized serialization stack for user-types  Cost-based Optimizer • Job execution depends on data size 32
  • 33. The Flink Stack 33 Streaming dataflow runtime Specialized Abstractions / APIs Core APIs Flink Core Runtime Deployment DataSet (Java/Scala) DataStream
  • 34. FlinkML: Machine Learning  API for ML pipelines inspired by scikit-learn  Collection of packaged algorithms • SVM, Multiple Linear Regression, Optimization, ALS, ... 34 val trainingData: DataSet[LabeledVector] = ... val testingData: DataSet[Vector] = ... val scaler = StandardScaler() val polyFeatures = PolynomialFeatures().setDegree(3) val mlr = MultipleLinearRegression() val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr) pipeline.fit(trainingData) val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
  • 35. Gelly: Graph Processing  Graph API and library  Packaged algorithms • PageRank, SSSP, Label Propagation, Community Detection, Connected Components 35 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); Graph<Long, Long, NullValue> graph = ... DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run( new LabelPropagation<Long>(30)).getVertices(); verticesWithCommunity.print(); env.execute();
  • 36. Flink Stack += Gelly, ML 36 Gelly ML DataSet (Java/Scala) DataStream Streaming dataflow runtime
  • 37. Integration with other systems 37 SAMOA DataSet DataStream HadoopM/R GoogleDataflow Cascading Storm Zeppelin • Use Hadoop Input/Output Formats • Mapper / Reducer implementations • Hadoop’s FileSystem implementations • Run applications implemented against Google’s Data Flow API on premise with Flink • Run Cascading jobs on Flink, with almost no code change • Benefit from Flink’s vastly better performance than MapReduce • Interactive, web-based data exploration • Machine learning on data streams • Compatibility layer for running Storm code • FlinkTopologyBuilder: one line replacement for existing jobs • Wrappers for Storm Spouts and Bolts • Coming soon: Exactly-once with Storm • Build-in serializer support for Hadoop’s Writable type
  • 38. Deployment options Gelly Table ML SAMOA DataSet(Java/Scala)DataStream Hadoop LocalClusterYARNTezEmbedded Dataflow Dataflow MRQL Table Cascading Streamingdataflowruntime Storm Zeppelin • Start Flink in your IDE / on your machine • Local debugging / development using the same code as on the cluster • “bare metal” standalone installation of Flink on a cluster • Flink on Hadoop YARN (Hadoop 2.2.0+) • Restarts failed containers • Support for Kerberos-secured YARN/HDFS setups
  • 39. The full stack 39 Gelly Table ML SAMOA DataSet (Java/Scala) DataStream HadoopM/R Local Cluster Yarn Tez Embedded Dataflow Dataflow(WiP) MRQL Table Cascading Streaming dataflow runtime Storm(WiP) Zeppelin
  • 45. 45
  • 47. What is currently happening?  Features for upcoming 0.10 release: • Master High Availability • Vastly improved monitoring GUI • Watermarks / Event time processing / Windowing rework  The next talk by Matthias • Stable streaming API (no more “beta” flag)  Next release: 1.0 • API stability 47
  • 48. How do I get started? 48 Mailing Lists: (news | user | dev)@flink.apache.org Twitter: @ApacheFlink Blogs: flink.apache.org/blog, data-artisans.com/blog/ IRC channel: irc.freenode.net#flink Start Flink on YARN in 4 commands: # get the hadoop2 package from the Flink download page at # http://flink.apache.org/downloads.html wget <download url> tar xvzf flink-0.9.1-bin-hadoop2.tgz cd flink-0.9.1/ ./bin/flink run -m yarn-cluster -yn 4 ./examples/flink-java- examples-0.9.1-WordCount.jar
  • 49. flink.apache.org 49 • Check out the slides: http://flink- forward.org/?post_type=session • Video recordings on YouTube, “Flink Forward” channel
  • 51. 51
  • 52. 52
  • 53. 53
  • 54. 54
  • 55. Managed (off-heap) memory and out-of- core support 55Memory runs out
  • 56. Cost-based Optimizer 56 DataSource orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe broadcast forward Combine GroupRed sort DataSource orders.tbl Filter Map DataSource lineitem.tbl Join Hybrid Hash buildHT probe hash-part [0] hash-part [0] hash-part [0,1] GroupRed sort forward Best plan depends on relative sizes of input files
  • 57. 57 case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction stack Task scheduling Dataflow metadata Pre-flight (Client) JobManager TaskManagers Data Source orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph deploy operators track intermediate results LocalCluster:YARN,Standalone
  • 58. Iterative processing in Flink Flink offers built-in iterations and delta iterations to execute ML and graph algorithms efficiently 58 map join sum ID1 ID2 ID3
  • 59. Example: Matrix Factorization 59 Factorizing a matrix with 28 billion ratings for recommendations More at: http://data-artisans.com/computing-recommendations-with-flink.html

Notas do Editor

  1. There are two different data processing problems. The first is when you have collected some data and want to make some sense from it, for example build a report. The second is when you want to react to events that are happening in the world, for example a credit card transaction. And these two are fundamentally different: the first one is akin to polling, while the second one is push-based. There are many tools for the first kind of problem, including Apache Flink. However, while the second is as important and common as the first, for a long time we have not had good tools to deal with it. This is now changing.
  2. GCE 30 instances with 4 cores and 15 GB of memory each. Flink master from July, 24th, Storm 0.9.3. All the code used for the evaluation can be found here. Flink 1.5 million elements per second per core Aggregate Throughput in cluster 182 million elements per second. Storm 82,000 elements per second per core Aggregate 0.57 million elements per second Storm with Acknowledge 4,700 elements per second per core, Latency 30-120 milliseconds Trident: 75,000 elements per second per core
  3. Flink 720,000 events per second per core 690,000 with checkpointing activated Storm With at-least-once: 2,600 events per second per core
  4. Flink 720,000 events per second per core 690,000 with checkpointing activated Storm With at-least-once: 2,600 events per second per core
  5. Flink 0 Buffer timeout: latency median 0 msec, 99 %tile 20 msec 24,500 events per second per core
  6. People previously made the case that high throughput and low latency are mutually exclusive