SlideShare uma empresa Scribd logo
1 de 38
Stream processing with
Apache Flink™
Kostas Tzoumas
@kostas_tzoumas
The rise of stream processing
2
Why streaming
3
Data
Warehouse
Batch
Data availability Streaming
- Strict schema
- Load rate
- BI access
- Some schema
- Load rate
- Programmable
- Some schema
- Ingestion rate
- Programmable
2008 20152000
- Which data?
- When?
- Who?
What does streaming enable?
1. Data integration 2. Low latency applications
4
• Fresh recommendations,
fraud detection, etc
• Internet of Things, intelligent
manufacturing
• Results “right here, right now”
cf. Kleppmann: "Turning the DB
inside out with Samza"
3. Batch < Streaming
New stack next to/inside Hadoop
5
Files
Batch
processors
High-latency
apps
Event streams
Stream
processors
Low-latency
apps
Streaming data architectures
6
Stream platform architecture
7
- Gather and backup streams
- Offer streams for consumption
- Provide stream recovery
- Analyze and correlate streams
- Create derived streams and state
- Provide these to upstream systems
Server
logs
Trxn
logs
Sensor
logs
Upstream
systems
Example: Bouygues Telecom
8
Apache Flink primer
9
What is Flink
10
Gelly
Table
ML
SAMOA
DataSet (Java/Scala) DataStream (Java/Scala)
HadoopM/R
Local Cluster Yarn
Tez
Embedded
Dataflow
Dataflow(WiP)
MRQL
Table
Cascading(WiP)
Streaming dataflow
runtime
Storm(WiP)
Zeppelin
Motivation for Flink
11
An engine that can natively support all these workloads.
Flink
Stream
processing
Batch
processing
Machine Learning at scale
Graph Analysis
Stream processing in Flink
12
What is a stream processor?
1. Pipelining
2. Stream replay
3. Operator state
4. Backup and restore
5. High-level APIs
6. Integration with batch
7. High availability
8. Scale-in and scale-out
13
Basics
State
App development
Large deployments
See http://data-artisans.com/stream-processing-with-flink.html
Pipelining
14
Basic building block to “keep the data moving”
Note: pipelined systems do not
usually transfer individual tuples,
but buffers that batch several tuples!
Operator state
 User-defined state
• Flink transformations (map/reduce/etc) are long-running operators, feel
free to keep around objects
• Hooks to include in system's checkpoint
 Windowed streams
• Time, count, data-driven windows
• Managed by the system (currently WiP)
 Managed state (WiP)
• State interface for operators
• Backed up and restored by the system with pluggable state backend
(HDFS, Ignite, Cassandra, …)
15
Streaming fault tolerance
 Ensure that operators see all events
• “At least once”
• Solved by replaying a stream from a checkpoint,
e.g., from a past Kafka offset
 Ensure that operators do not perform
duplicate updates to their state
• “Exactly once”
• Several solutions
16
Exactly once approaches
 Discretized streams (Spark Streaming)
• Treat streaming as a series of small atomic computations
• “Fast track” to fault tolerance, but does not separate business
logic from recovery
 MillWheel (Google Cloud Dataflow)
• State update and derived events committed as atomic
transaction to a high-throughput transactional store
• Needs a very high-throughput transactional store 
 Chandy-Lamport distributed snapshots (Flink)
17
Distributed snapshots in Flink
Super-impose checkpointing mechanism on
execution instead of using execution as the
checkpointing mechanism
18
19
JobManager
Register checkpoint
barrier on master
Replay will start from here
20
JobManagerBarriers “push” prior events
(assumes in-order delivery in
individual channels)
Operator checkpointing
starting
Operator checkpointing
finished
Operator checkpointing in
progress
21
JobManager Operator checkpointing takes
snapshot of state after data
prior to barrier have updated
the state. Checkpoints
currently one-off and
synchronous, WiP for
incremental and
asynchronous
State backup
Pluggable mechanism. Currently
either JobManager (for small state) or
file system (HDFS/Tachyon). WiP for
in-memory grids
22
JobManager
Operators with many inputs
need to wait for all barriers to
pass before they checkpoint
their state
23
JobManager
State snapshots at sinks
signal successful end of this
checkpoint
At failure,
recover last
checkpointed
state and
restart
sources from
last barrier
guarantees at
least once
State backup
Benefits of Flink’s approach
 Data processing does not block
• Can checkpoint at any interval you like to balance overhead/recovery
time
 Separates business logic from recovery
• Checkpointing interval is a config parameter, not a variable in the
program (as in discretization)
 Can support richer windows
• Session windows, event time, etc
 Best of all worlds: true streaming latency, exactly-once semantics,
and low overhead for recovery
24
DataStream API
25
case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
.groupBy("word").sum("frequency")
.print()
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
.map(word => Word(word,1))}
.groupBy("word").sum("frequency")
.print()
DataSet API (batch):
DataStream API (streaming):
Roadmap
 Short-term (3-6 months)
• Graduate DataStream API from beta
• Fully managed window and user-defined state with pluggable
backends
• Table API for streams (towards StreamSQL)
 Long-term (6+ months)
• Highly available master
• Dynamic scale in/out
• FlinkML and Gelly for streams
• Full batch + stream unification
26
Closing
27
tl;dr: what was this about?
 Streaming is the next logical step in data infrastructure
 Many new "fast data" platforms are being built next to or
inside Hadoop – will need a stream processor
 The case for Flink as a stream processor
• Proper engine foundation
• Attractive APIs and libraries
• Integration with batch
• Large (and growing!) community
28
Apache Flink: community
29
One of the most active big
data projects after one year
in the Apache Software
Foundation
I Flink, do you? 
30
If you find this exciting,
get involved and start a discussion on Flink‘s mailing list,
or stay tuned by
subscribing to news@flink.apache.org,
following flink.apache.org/blog, and
@ApacheFlink on Twitter
31
flink-forward.org
Spark & Friends meetup
June 16
Bay Area Flink meetup
June 17
Appendix
32
Discretized streams
33
Job Job Job
state
logical result
stream
input
stream
while (true) {
// get next X seconds of data
// compute next stream and state
}
Unit of fault tolerance is
mini-batch
Problems of mini-batch
 Latency
• Each mini-batch schedules a new job, loads user libraries,
establishes DB connections, etc
 Programming model
• Does not separate business logic from recovery –
changing the mini-batch size changes query results
 Power
• Keeping and updating state across mini-batches only
possible by immutable computations
34
Windowing
35
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
Integration with batch
 Currently cannot mix DataSet & DataStream programs
 However, DataStream programs can read batch sources, they
are just finite streams 
 Goal is to evolve DataStream to a batch/stream-agnostic API
36
DataSet (Java/Scala/Python) DataStream (Java/Scala)
Streaming dataflow runtime
E.g.: Non-native iterations
37
Step Step Step Step Step
Client
for (int i = 0; i < maxIterations; i++) {
// Execute MapReduce job
}
E.g.: Non-native streaming
38
stream
discretizer
Job Job Job Jobwhile (true) {
// get next few records
// issue batch job
}

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Stream Processing made simple with Kafka
Stream Processing made simple with KafkaStream Processing made simple with Kafka
Stream Processing made simple with Kafka
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 

Destaque

Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
DataWorks Summit
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
DataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
DataWorks Summit
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
DataWorks Summit
 

Destaque (20)

Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Dell Cloud Manager Overview
Dell Cloud Manager OverviewDell Cloud Manager Overview
Dell Cloud Manager Overview
 
Dell Storage Management
Dell Storage ManagementDell Storage Management
Dell Storage Management
 
Dell Networking Wired, Wireless and Security Solutions Lab
Dell Networking Wired, Wireless and Security Solutions LabDell Networking Wired, Wireless and Security Solutions Lab
Dell Networking Wired, Wireless and Security Solutions Lab
 
John Kenevey, Open Compute "Open Compute Project: history, value proposition...
John Kenevey, Open Compute  "Open Compute Project: history, value proposition...John Kenevey, Open Compute  "Open Compute Project: history, value proposition...
John Kenevey, Open Compute "Open Compute Project: history, value proposition...
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]
 
Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010
 

Semelhante a Flexible and Real-Time Stream Processing with Apache Flink

Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
Kostas Tzoumas
 

Semelhante a Flexible and Real-Time Stream Processing with Apache Flink (20)

Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Data Stream Processing with Apache Flink
Data Stream Processing with Apache FlinkData Stream Processing with Apache Flink
Data Stream Processing with Apache Flink
 
GOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache FlinkGOTO Night Amsterdam - Stream processing with Apache Flink
GOTO Night Amsterdam - Stream processing with Apache Flink
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Flink Streaming Berlin Meetup
Flink Streaming Berlin MeetupFlink Streaming Berlin Meetup
Flink Streaming Berlin Meetup
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Apache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream ProcessorApache Flink(tm) - A Next-Generation Stream Processor
Apache Flink(tm) - A Next-Generation Stream Processor
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
 
Flink 0.10 - Upcoming Features
Flink 0.10 - Upcoming FeaturesFlink 0.10 - Upcoming Features
Flink 0.10 - Upcoming Features
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
 
Flink Streaming @BudapestData
Flink Streaming @BudapestDataFlink Streaming @BudapestData
Flink Streaming @BudapestData
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 

Mais de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Flexible and Real-Time Stream Processing with Apache Flink

  • 1. Stream processing with Apache Flink™ Kostas Tzoumas @kostas_tzoumas
  • 2. The rise of stream processing 2
  • 3. Why streaming 3 Data Warehouse Batch Data availability Streaming - Strict schema - Load rate - BI access - Some schema - Load rate - Programmable - Some schema - Ingestion rate - Programmable 2008 20152000 - Which data? - When? - Who?
  • 4. What does streaming enable? 1. Data integration 2. Low latency applications 4 • Fresh recommendations, fraud detection, etc • Internet of Things, intelligent manufacturing • Results “right here, right now” cf. Kleppmann: "Turning the DB inside out with Samza" 3. Batch < Streaming
  • 5. New stack next to/inside Hadoop 5 Files Batch processors High-latency apps Event streams Stream processors Low-latency apps
  • 7. Stream platform architecture 7 - Gather and backup streams - Offer streams for consumption - Provide stream recovery - Analyze and correlate streams - Create derived streams and state - Provide these to upstream systems Server logs Trxn logs Sensor logs Upstream systems
  • 10. What is Flink 10 Gelly Table ML SAMOA DataSet (Java/Scala) DataStream (Java/Scala) HadoopM/R Local Cluster Yarn Tez Embedded Dataflow Dataflow(WiP) MRQL Table Cascading(WiP) Streaming dataflow runtime Storm(WiP) Zeppelin
  • 11. Motivation for Flink 11 An engine that can natively support all these workloads. Flink Stream processing Batch processing Machine Learning at scale Graph Analysis
  • 13. What is a stream processor? 1. Pipelining 2. Stream replay 3. Operator state 4. Backup and restore 5. High-level APIs 6. Integration with batch 7. High availability 8. Scale-in and scale-out 13 Basics State App development Large deployments See http://data-artisans.com/stream-processing-with-flink.html
  • 14. Pipelining 14 Basic building block to “keep the data moving” Note: pipelined systems do not usually transfer individual tuples, but buffers that batch several tuples!
  • 15. Operator state  User-defined state • Flink transformations (map/reduce/etc) are long-running operators, feel free to keep around objects • Hooks to include in system's checkpoint  Windowed streams • Time, count, data-driven windows • Managed by the system (currently WiP)  Managed state (WiP) • State interface for operators • Backed up and restored by the system with pluggable state backend (HDFS, Ignite, Cassandra, …) 15
  • 16. Streaming fault tolerance  Ensure that operators see all events • “At least once” • Solved by replaying a stream from a checkpoint, e.g., from a past Kafka offset  Ensure that operators do not perform duplicate updates to their state • “Exactly once” • Several solutions 16
  • 17. Exactly once approaches  Discretized streams (Spark Streaming) • Treat streaming as a series of small atomic computations • “Fast track” to fault tolerance, but does not separate business logic from recovery  MillWheel (Google Cloud Dataflow) • State update and derived events committed as atomic transaction to a high-throughput transactional store • Needs a very high-throughput transactional store   Chandy-Lamport distributed snapshots (Flink) 17
  • 18. Distributed snapshots in Flink Super-impose checkpointing mechanism on execution instead of using execution as the checkpointing mechanism 18
  • 19. 19 JobManager Register checkpoint barrier on master Replay will start from here
  • 20. 20 JobManagerBarriers “push” prior events (assumes in-order delivery in individual channels) Operator checkpointing starting Operator checkpointing finished Operator checkpointing in progress
  • 21. 21 JobManager Operator checkpointing takes snapshot of state after data prior to barrier have updated the state. Checkpoints currently one-off and synchronous, WiP for incremental and asynchronous State backup Pluggable mechanism. Currently either JobManager (for small state) or file system (HDFS/Tachyon). WiP for in-memory grids
  • 22. 22 JobManager Operators with many inputs need to wait for all barriers to pass before they checkpoint their state
  • 23. 23 JobManager State snapshots at sinks signal successful end of this checkpoint At failure, recover last checkpointed state and restart sources from last barrier guarantees at least once State backup
  • 24. Benefits of Flink’s approach  Data processing does not block • Can checkpoint at any interval you like to balance overhead/recovery time  Separates business logic from recovery • Checkpointing interval is a config parameter, not a variable in the program (as in discretization)  Can support richer windows • Session windows, event time, etc  Best of all worlds: true streaming latency, exactly-once semantics, and low overhead for recovery 24
  • 25. DataStream API 25 case class Word (word: String, frequency: Int) val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .groupBy("word").sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .groupBy("word").sum("frequency") .print() DataSet API (batch): DataStream API (streaming):
  • 26. Roadmap  Short-term (3-6 months) • Graduate DataStream API from beta • Fully managed window and user-defined state with pluggable backends • Table API for streams (towards StreamSQL)  Long-term (6+ months) • Highly available master • Dynamic scale in/out • FlinkML and Gelly for streams • Full batch + stream unification 26
  • 28. tl;dr: what was this about?  Streaming is the next logical step in data infrastructure  Many new "fast data" platforms are being built next to or inside Hadoop – will need a stream processor  The case for Flink as a stream processor • Proper engine foundation • Attractive APIs and libraries • Integration with batch • Large (and growing!) community 28
  • 29. Apache Flink: community 29 One of the most active big data projects after one year in the Apache Software Foundation
  • 30. I Flink, do you?  30 If you find this exciting, get involved and start a discussion on Flink‘s mailing list, or stay tuned by subscribing to news@flink.apache.org, following flink.apache.org/blog, and @ApacheFlink on Twitter
  • 31. 31 flink-forward.org Spark & Friends meetup June 16 Bay Area Flink meetup June 17
  • 33. Discretized streams 33 Job Job Job state logical result stream input stream while (true) { // get next X seconds of data // compute next stream and state } Unit of fault tolerance is mini-batch
  • 34. Problems of mini-batch  Latency • Each mini-batch schedules a new job, loads user libraries, establishes DB connections, etc  Programming model • Does not separate business logic from recovery – changing the mini-batch size changes query results  Power • Keeping and updating state across mini-batches only possible by immutable computations 34
  • 36. Integration with batch  Currently cannot mix DataSet & DataStream programs  However, DataStream programs can read batch sources, they are just finite streams   Goal is to evolve DataStream to a batch/stream-agnostic API 36 DataSet (Java/Scala/Python) DataStream (Java/Scala) Streaming dataflow runtime
  • 37. E.g.: Non-native iterations 37 Step Step Step Step Step Client for (int i = 0; i < maxIterations; i++) { // Execute MapReduce job }
  • 38. E.g.: Non-native streaming 38 stream discretizer Job Job Job Jobwhile (true) { // get next few records // issue batch job }

Notas do Editor

  1. What are the technologies that enable streaming? The open source leaders in this space is Apache Kafka (that solves the integration problem), and Apache Flink (that solves the analytics problem, removing the final barrier). Kafka and Flink combined can remove the batch barriers from the infrastructure, creating a truly real-time analytics platform.
  2. Other data points Google (cloud dataflow) Hortonworks Cloudera Adatao Concurrent Confluent We have been part of this open source movement with Apache Flink. Flink is a streaming dataflow engine that can run in Hadoop clusters. Flink has grown a lot over the past year both in terms of code and community. We have added domain-specific libraries, a streaming API with streaming backend support, etc, etc. Tremendous growth. Flink has also grown in community. The project is by now a very established Apache project, it has more than 140 contributors (placing it at the top 5 of Apache big data projects), and several companies are starting to experiment with it. At data Artisans we are supporting two production installations (ResearchGate and Bouygues Telecom), and are helping a number of companies that are testing Flink (e.g., Spotify, King.com, Amadeus, and a group at Yahoo). Huawei and Intel have started contributing to Flink, and interest in vendors is picking up (e.g., Adatao, Huawei, Hadoop vendors). All of this is the result of purely organic growth with very little marketing investment from data Artisans.