@s_kontopoulos
Streaming Analytics: State of The Art
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Who am I?
GitHub: skonto
Twitter: @s_kontopoulos
Senior Software Engineer @ Lightbend, Fast Data Team
Contributor at Apache Flink
SlideShare: stavroskontopoulos
stavroskontopoulos
Agenda
- Streaming Analytics - What & Why & How
- Streaming Platforms - Streaming Engines
- Code examples & Demo
Insights
Data Insight: A conclusion or piece of information that can be used to take
action and optimize a decision-making process.
Customer Insight: A non-obvious understanding about your customers, which if
acted upon, has the potential to change their behaviour for mutual benefit
Customer insight, Wikipedia
DATA → INFO → INSIGHTS → ACTIONS
The Gap
DATA INSIGHTS
Streaming Analytics - Bridging the Gap
[Diagram: DATA FLOW. Data Input Flow (sensors, mobile apps, etc.) → Collect → Analyze → Data Output Flow (alarms, visualizations, ML scoring, etc.), with a Permanent Store]
Streaming Analytics
“Streaming Analytics is the acquisition and analysis of the data at the moment it
streams into the system. It is a process done in a near real-time (NRT) fashion and
analysis results trigger specific actions for the system to execute.”
● No constraints or deadlines of the kind that exist in real-time (RT) systems
● End-to-end processing delay varies and depends on the application (< 1 ms
to minutes)
Big Data vs Fast Data
● Data in motion is the key characteristic.
● Fast Data is the new Big Data!
Two categories of systems: batch vs
streaming systems.
Common Use Cases
Image: Lightbend Inc.
Speed?
Image: Lightbend Inc.
Batch Data Pipeline
[Diagram: New Data → Analysis → Batch View. Traditional MapReduce paradigm]
Image: Lightbend Inc.
Streaming Data Pipeline
In memory processing as data flows...
[Diagram: New Data → Analysis → Near-Real-Time View, on a Streaming Platform. Engines shown: Apache Flink, Akka Streams, Apache Kafka Streams]
Streaming Platforms
A streaming platform is an ecosystem/environment that supports building and running streaming
applications. At its core it uses a streaming engine. Example components:
● A durable pub/sub component to fetch or store data
● A streaming engine
● A registry for storing metadata about the data, such as the data format
Streaming Platforms - Some Examples
- Fast Data Platform (https://www.lightbend.com/products/fast-data-platform)
- Confluent Enterprise (https://www.confluent.io/product/confluent-platform)
- Da-Platform-2 (https://data-artisans.com/da-platform-2)
- Databricks Platform (https://databricks.com/product/unified-analytics-platform)
- IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/)
- MapR Streams (https://mapr.com/products/mapr-streams/)
- Pravega (http://pravega.io)
...
Streaming Engine - the Core
A streaming engine provides the basic capabilities for developing and deploying
streaming applications.
Some systems, such as Kafka Streams or Akka Streams, are just libraries and
don’t cover deployment effectively.
Streaming Engine - Key Features I
● Fault Tolerance
● Processing Guarantees
● Checkpointing
● Streaming SQL
● Batch - Streaming API
● Language Integration (Python, Java, Scala, R)
● State Management, User Session State
● Locality Awareness
● Backpressure
Streaming Engine - Key Features II
● Multi-Scheduler Support: YARN, Mesos, Kubernetes
● Micro batching vs Data Flow
● ML, Graph, CEP
● Connectors (Sources, Sinks)
● Memory - Disk management (shuffling)
● Security (Kerberos etc)
DataFlow Execution Model
The user defines computations/operations (map, flatMap, etc.) on data sets
(bounded or not) as a DAG. The data sets are treated as immutable
distributed data. The DAG is shipped to the nodes where the data lie, the computation is
executed there and the results are sent back to the user.
[Images: Spark model example; Flink model (FLIP-6)]
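As a purely local analogy for the model described above, the sketch below uses the JDK's own java.util.stream API rather than Spark or Flink. Operators such as flatMap and filter only describe the pipeline over an immutable input; nothing runs until the terminal collect step. The class name LocalWordCount is hypothetical, and a real engine would additionally partition the data and ship the plan to the cluster nodes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical local word count that mirrors the map/flatMap style of a dataflow job. */
class LocalWordCount {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            // flatMap: one input line becomes many lower-cased words
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            // drop empty tokens produced by leading whitespace
            .filter(word -> !word.isEmpty())
            // terminal operation: only now is the whole pipeline executed
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }
}
```

The input list is never modified; each step produces a new logical data set, which is the property the slide refers to as immutability.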
Streaming Engine - Which one to choose?
Some engines to consider...
Image: Lightbend Inc.
The Modern Enterprise Fast Data Architecture
[Diagram components: Fast Data Apps, Microservices, ML, BI, Data Lake; Streaming Platform (pub/sub, streaming engine, etc.); Permanent Storage (HDFS, S3...); Cluster scheduler (YARN, Standalone, Kubernetes, Mesos); Infrastructure (on premise, cloud); Operations, Monitoring, Security, Governance]
Example Fast Data Architecture for the Enterprise
Image: Lightbend Inc.
Analyzing Data Streams
Processing infinite data streams imposes certain restrictions compared to batch
processing:
- We may need to trade accuracy for space and time costs, e.g. use
approximate algorithms or sketches such as count-min to summarize stream
data.
- Streaming jobs must operate 24/7 and be able to adapt to code
changes, failures and load variance.
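To make the accuracy-for-space trade-off concrete, here is a minimal count-min sketch. It is only a sketch under stated assumptions: the class name, the seeded hash mixing and the chosen dimensions are illustrative and not taken from the talk or from any particular library. Memory stays fixed at depth × width counters regardless of how many distinct keys the stream contains, and the estimate can overcount but never undercounts.

```java
import java.util.Random;

/** Illustrative count-min sketch: approximate per-key counts in fixed memory. */
class CountMinSketch {
    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] table;
    private final int[] seeds;      // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Map a key to a counter index in the given row (simple seeded bit mixing).
    private int bucket(String key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Record one occurrence of the key: increment one counter per row. */
    void add(String key) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(key, row)]++;
        }
    }

    /** Estimated count: the minimum over all rows, which limits the effect of collisions. */
    long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(key, row)]);
        }
        return min;
    }
}
```

A wider table lowers the expected overcount and more rows lower the probability of a bad estimate, so both parameters are tuning knobs between memory and accuracy.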
Analyzing Data Streams
● Data flows from one or more sources through the engine and is written to one
or more sinks.
● Two cases for processing:
○ Single event processing: event transformation, trigger an alarm on an error event
○ Event aggregations: summary statistics, group-by, join and similar queries. For example,
compute the average temperature over the last 5 minutes from a sensor data stream.
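The two processing cases can be contrasted in a few lines. The sketch below is an assumption-laden illustration, not code from the talk: the class and method names are hypothetical, timestamps are in milliseconds and are assumed to arrive in order. The alarm check needs no state, while the 5-minute average has to keep and evict past readings.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative contrast between single-event processing and event aggregation. */
class LastFiveMinutesAverage {
    static final long WINDOW_MS = 5 * 60 * 1000L;

    private static final class Reading {
        final long timestampMs;
        final double celsius;

        Reading(long timestampMs, double celsius) {
            this.timestampMs = timestampMs;
            this.celsius = celsius;
        }
    }

    private final Deque<Reading> buffer = new ArrayDeque<>();
    private double sum = 0.0;

    /** Single-event processing: the decision depends only on the current event. */
    static boolean isAlarm(double celsius, double threshold) {
        return celsius > threshold;
    }

    /** Event aggregation: add a reading and return the average of the last 5 minutes. */
    double addAndAverage(long timestampMs, double celsius) {
        buffer.addLast(new Reading(timestampMs, celsius));
        sum += celsius;
        // Evict readings that are 5 minutes old or older relative to the newest one.
        while (buffer.peekFirst().timestampMs <= timestampMs - WINDOW_MS) {
            sum -= buffer.removeFirst().celsius;
        }
        return sum / buffer.size();
    }
}
```

The buffer is the state that a streaming engine would have to checkpoint and restore for the aggregation, which is why state management appears among the key engine features.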
Analyzing Data Streams
● Event aggregation introduces the concept of windowing with respect to the notion of
time selected:
○ Event time (the time at which events happen): important for most use cases where context and
correctness matter at the same time. Examples: billing applications, anomaly detection.
○ Processing time (the time at which events are observed during processing): for use cases where we only care
about what is processed in a window. Example: accumulated clicks on a page per second.
○ System arrival or ingestion time (the time at which events arrive at the streaming system).
● Ideally event time = processing time. In reality there is skew.
Analyzing Data Streams
● Windows come in different flavors:
○ Tumbling windows discretize a stream into non-overlapping windows.
○ Sliding windows slide over the stream of data, so consecutive windows can overlap.
Images: https://flink.apache.org/news/2015/12/04/Introducing-windows.html
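The difference between the two window flavors comes down to how a timestamp is mapped to windows. The helper below is a minimal sketch with hypothetical names; it assumes windows of the form [start, start + size) aligned to the epoch, which is a common convention but not something the slides specify.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative window assignment for tumbling and sliding windows. */
class WindowAssigner {
    /** Start of the single tumbling window of length sizeMs that contains timestampMs. */
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /**
     * Starts of all sliding windows (length sizeMs, advancing by slideMs)
     * that contain timestampMs, newest window first.
     */
    static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // A window [start, start + sizeMs) contains the event while start > timestampMs - sizeMs.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a 60-second window sliding every 30 seconds, each event lands in two windows, so the engine keeps roughly twice the state of the tumbling case.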
Analyzing Data Streams
● Watermarks: indicate that no elements with a timestamp older than or equal to
the watermark timestamp should arrive for the specific window of data. They mark
the progress of event time.
● Triggers: decide when the window is evaluated or purged. They affect latency and
the state kept.
● Late data: a threshold defines how late data can be compared to the current
watermark value.
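One simple way to produce watermarks is to trail the largest event timestamp seen so far by a fixed bound. The sketch below assumes that strategy; the class name, the two parameters and the firing rule are illustrative simplifications rather than the exact semantics of any engine. It shows the three concepts together: the watermark tracks event-time progress, a window can fire once the watermark passes its end, and data is treated as too late once the allowed lateness has also elapsed.

```java
/** Illustrative watermark tracker with a fixed out-of-orderness bound (assumed strategy). */
class BoundedOutOfOrdernessWatermark {
    private final long maxOutOfOrdernessMs;
    private final long allowedLatenessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedOutOfOrdernessWatermark(long maxOutOfOrdernessMs, long allowedLatenessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    /** Observe an event timestamp and return the current watermark; it never moves backwards. */
    long observe(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return currentWatermark();
    }

    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    /** Trigger rule in this sketch: a window may fire once the watermark reaches its end. */
    boolean canFire(long windowEndMs) {
        return currentWatermark() >= windowEndMs;
    }

    /** Data for a window is too late once the watermark passes window end plus allowed lateness. */
    boolean isTooLate(long windowEndMs) {
        return currentWatermark() >= windowEndMs + allowedLatenessMs;
    }
}
```

A larger out-of-orderness bound admits more stragglers at the cost of higher latency, while the allowed lateness decides how long window state must be retained after firing.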
Analyzing Data Streams
● Recent advances in streaming (such as the concept of watermarks) are the
result of pioneering work:
○ MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB
2013.
○ The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803
● The-world-beyond-batch-streaming-101 (Tyler Akidau)
● The-world-beyond-batch-streaming-102 (Tyler Akidau)
Analyzing Data Streams
● Apache Beam is the open source successor of Google’s
DataFlow.
● Provides the advanced semantics needed for the current
needs in streaming applications.
● Google DataFlow, Apache Flink, Apache Spark follow
that model.
(https://beam.apache.org/documentation/runners/capability-matrix)
Streams meet distributed log - I
Streams fit naturally with the idea of the distributed log (e.g. Kafka Streams is integrated
with Kafka, and Dell/EMC’s Pravega* uses the stream as a storage primitive on top of
Apache BookKeeper).
*Pravega is an open-source streaming storage system.
Streams meet distributed log - II
Distributed log possible use cases:
● Implement external services (micro-services)
● Implement internal operations (e.g. Kafka Streams shuffling, fault tolerance)
Processing Guarantees
Many things can go wrong…
● At-most-once
● At-least-once
● Exactly-once
What are the boundaries?
Within the streaming engine?
How about end-to-end including sources and sinks?
How about side effects like calling an external service?
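The end-to-end question above is often answered by combining at-least-once delivery with an idempotent sink, so that replays after a failure do not change the result twice. The following is a minimal in-memory sketch of that idea under loud assumptions: the class name is hypothetical, every event carries a unique id, and the set of seen ids would have to be stored durably together with the total in a real system.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative idempotent sink: duplicates caused by replays are detected by event id. */
class IdempotentSink {
    private final Set<String> seenEventIds = new HashSet<>();
    private long total = 0;

    /** Applies the event unless its id was already applied; returns true if state changed. */
    boolean write(String eventId, long amount) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery, e.g. a replay after a failure
        }
        total += amount;
        return true;
    }

    long total() {
        return total;
    }
}
```

This gives an exactly-once effect on the sink's state only. Side effects such as calls to an external service remain outside this boundary unless that service is idempotent as well.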
Table Stream Duality
Stream → table: The aggregation of a stream of updates over time yields a
table.
Table → stream: The observation of changes to a table over time yields a
stream.
Why is this useful?
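A small sketch can show both directions of the duality at once. The names below (ClickCountTable, the key=value changelog format) are hypothetical and chosen only for illustration. Folding updates into a map builds the table, and recording every change to that map yields a stream again.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative table built from a stream of updates, plus the changelog it emits. */
class ClickCountTable {
    private final Map<String, Long> table = new LinkedHashMap<>();
    private final List<String> changelog = new ArrayList<>();

    /** Stream to table: fold one (key, delta) update into the per-key aggregate. */
    void update(String key, long delta) {
        long newValue = table.merge(key, delta, Long::sum);
        // Table to stream: every change to the table is itself an event.
        changelog.add(key + "=" + newValue);
    }

    /** Current table value for the key, or 0 if the key was never updated. */
    long get(String key) {
        return table.getOrDefault(key, 0L);
    }

    List<String> changelog() {
        return changelog;
    }
}
```

Replaying the changelog from the start rebuilds the table, which is one reason the duality is useful for fault tolerance and for serving queryable state.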
Streaming SQL Queries
Semantics? How do we define a join on an unbounded stream? A table join?
There is joint work on this (Apache Flink):
https://docs.google.com/document/d/1wrla8mF_mmq-NW9sdJHYVgMyZsgCmHumJJ5f5WUzTiM/
Streaming Applications - Spark Structured Streaming API
[Code screenshot: create a Spark session and read from a Kafka topic]
Streaming Applications - Spark Structured Streaming API
[Code screenshot: sensor metadata; emit complete output to the console for every window update based on event time; set up a trigger]
Streaming Applications - Flink Streaming API
[Code screenshot: custom source; initial sensor values]
Streaming Applications - Flink Streaming API
[Code screenshot: watermark generation; create some random data]
Streaming Applications - Flink Streaming API
[Code screenshot: create a windowed keyed stream; apply a function per window]
Kafka Streams vs Beam Model
- A trigger is more of an operational aspect than a business parameter
such as the window length. How often the computation is updated (affecting
latency and state size) is a non-functional requirement.
- A Table covers both the case of immutable data and the case of updatable
data.
Kafka Streams vs Beam Model
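// Sum values per key in 2-minute windows, keeping window state for one day.
KTable<Windowed<String>, Long> aggregated = inputStream
    .groupByKey()
    .reduce((aggValue, newValue) -> aggValue + newValue,
            TimeWindows.of(TimeUnit.MINUTES.toMillis(2))
                       .until(TimeUnit.DAYS.toMillis(1) /* keep for one day */),
            "queryStoreName");

// How often updates are emitted downstream is set through configuration
// (commit interval and record cache size), not in the window definition.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100 /* milliseconds */);
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);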
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model
Source code: http://bit.ly/2yhDCeN
Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/streaming-analytics/references.md

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 

Último (20)

Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 

Streaming analytics state of the art

  • 1. @s_kontopoulos Streaming Analytics: State of The Art Stavros Kontopoulos Senior Software Engineer @ Lightbend, M.Sc.
  • 2. Who am I?
      Senior Software Engineer @ Lightbend, Fast Data Team
      Contributor at Apache Flink
      Handles: skonto, s_kontopoulos
      SlideShare: stavroskontopoulos
  • 3. Agenda
      - Streaming Analytics - What & Why & How
      - Streaming Platforms - Streaming Engines
      - Code examples & Demo
  • 4. Insights
      Data Insight: A conclusion or piece of information that can be used to take action and optimize a decision-making process.
      Customer Insight: "A non-obvious understanding about your customers, which if acted upon, has the potential to change their behaviour for mutual benefit" (Customer insight, Wikipedia).
      DATA → INFO → INSIGHTS → ACTIONS
  • 6. Streaming Analytics - Bridging the Gap
      Diagram labels (DATA FLOW): Data Input Flow (sensors, mobile apps, etc.); Collect; Analyze; Permanent Store; Data Output Flow (alarms, visualizations, ML scoring, etc.)
  • 7. Streaming Analytics
      “Streaming Analytics is the acquisition and analysis of the data at the moment it streams into the system. It is a process done in a near real-time (NRT) fashion and analysis results trigger specific actions for the system to execute.”
      ● No constraints or deadlines in the way they exist in RT systems
      ● Processing delay (end-to-end) varies and depends on the application (< 1 ms to minutes)
  • 8. Big Data vs Fast Data
      Two categories of systems: batch vs streaming systems.
      ● Data in motion is the key characteristic.
      ● Fast Data is the new Big Data!
  • 11. Batch Data Pipeline
      Traditional MapReduce paradigm.
      Diagram labels: New Data; Analysis; Batch View. Image: Lightbend Inc.
  • 12. Streaming Data Pipeline
      In-memory processing as data flows...
      Diagram labels: New Data; Analysis; NR-Time View; Streaming Platform (Apache Flink, Akka Streams, Apache Kafka Streams)
  • 13. Streaming Platforms
      A streaming platform is an ecosystem/environment that supports building and running streaming applications. At its core it uses a streaming engine. Examples of tools:
      ● A durable pub/sub component to fetch or store data
      ● A streaming engine
      ● A registry for storing metadata about the data, such as the data format
  • 14. Streaming Platforms - Some Examples
      - Fast Data Platform (https://www.lightbend.com/products/fast-data-platform)
      - Confluent Enterprise (https://www.confluent.io/product/confluent-platform)
      - Da-Platform-2 (https://data-artisans.com/da-platform-2)
      - Databricks Platform (https://databricks.com/product/unified-analytics-platform)
      - IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/)
      - MapR Streams (https://mapr.com/products/mapr-streams/)
      - Pravega (http://pravega.io)
      ...
  • 15. Streaming Engine - the Core
      A streaming engine provides the basic capabilities for developing and deploying streaming applications. Some systems, like Kafka Streams or Akka Streams, are just libraries and don’t cover deployment effectively.
  • 16. Streaming Engine - Key Features I
      ● Fault Tolerance
      ● Processing Guarantees
      ● Checkpointing
      ● Streaming SQL
      ● Batch - Streaming API
      ● Language Integration (Python, Java, Scala, R)
      ● State Management, User Session State
      ● Locality Awareness
      ● Backpressure
  • 17. Streaming Engine - Key Features II
      ● Multi-Scheduler Support: Yarn, Mesos, Kubernetes
      ● Micro batching vs Data Flow
      ● ML, Graph, CEP
      ● Connectors (Sources, Sinks)
      ● Memory - Disk management (shuffling)
      ● Security (Kerberos etc.)
  • 18. DataFlow Execution Model
      The user defines computations/operations (map, flatMap, etc.) on the data-sets (bounded or not) as a DAG. The data-sets are treated as immutable distributed data. The DAG is shipped to the nodes where the data lie, the computation is executed there, and the results are sent back to the user.
      Diagram labels: Spark model example; Flink model (FLIP-6)
  • 19. Streaming Engine - Which one to choose?
      Some engines to consider... Image: Lightbend Inc.
  • 20. The Modern Enterprise Fast Data Architecture
      Diagram labels: Infrastructure (on premise, cloud); Cluster scheduler (Yarn, Standalone, Kubernetes, Mesos); Fast Data Apps; Micro Services; ML; Operations; Monitoring; Security; Governance; Permanent Storage (HDFS, S3...); Streaming Platform (pub/sub, streaming engine, etc.); BI; Data Lake
  • 21. Example Fast Data Architecture for the Enterprise
      Image: Lightbend Inc.
  • 22. Analyzing Data Streams
      Processing infinite data streams imposes certain restrictions compared to batch processing:
      - We may need to trade off accuracy against space and time costs, e.g. by using approximate algorithms or sketches (such as count-min) to summarize stream data.
      - Streaming jobs need to run 24/7 and must be able to adapt to code changes, failures and load variance.
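The accuracy-for-space trade-off mentioned on slide 22 can be illustrated with a count-min sketch. The following is a minimal sketch and not code from the talk: the class name, the seeded hash mixing and the parameters (`depth`, `width`) are assumptions chosen for illustration. It keeps a fixed-size table of counters regardless of how many distinct keys the stream contains, and its estimates can overshoot the true count but never undershoot it.

```java
import java.util.Random;

/** Minimal count-min sketch: approximate per-key counts in fixed memory. */
final class CountMinSketch {
    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] counts;  // depth x width counter table
    private final int[] seeds;      // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Map a key to a column in the given row (simple bit mixing, illustrative only).
    private int bucket(String key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    /** Record one occurrence of the key. */
    void add(String key) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(key, row)]++;
        }
    }

    /** Estimated count: never below the true count, possibly above it. */
    long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(key, row)]);
        }
        return min;
    }
}
```

A larger `width` lowers the chance of collisions (and therefore overestimation), while a larger `depth` lowers the chance that every row collides at once; both cost memory.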
  • 23. Analyzing Data Streams
      ● Data flows from one or more sources through the engine and is written to one or more sinks.
      ● Two cases for processing:
        ○ Single event processing: event transformation, triggering an alarm on an error event.
        ○ Event aggregations: summary statistics, group-by, join and similar queries. For example, compute the average temperature for the last 5 minutes from a sensor data stream.
  • 24. Analyzing Data Streams
      ● Event aggregation introduces the concept of windowing with respect to the notion of time selected:
        ○ Event time (the time that events happen): important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.
        ○ Processing time (the time events are observed during processing): use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.
        ○ System arrival or ingestion time (the time that events arrive at the streaming system).
      ● Ideally event time = processing time. In reality there is skew.
  • 25. Analyzing Data Streams
      ● Windows come in different flavors:
        ○ Tumbling windows: discretize a stream into non-overlapping windows.
        ○ Sliding windows: slide over the stream of data.
      Images: https://flink.apache.org/news/2015/12/04/Introducing-windows.html
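The difference between the two window flavors on slide 25 comes down to how a timestamp is assigned to windows. The sketch below is illustrative only: `WindowAssigner` and its method names are hypothetical, windows are taken to be half-open intervals `[start, start + size)` aligned to the epoch, and timestamps are in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowAssigner {
    /** Start of the single tumbling window that contains the timestamp. */
    static long tumblingStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Starts of all sliding windows [start, start + sizeMs) that contain the timestamp. */
    static List<Long> slidingStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is aligned to the slide and not after the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a 10-second window sliding every 5 seconds, an event at 12,000 ms falls into the windows starting at 10,000 and 5,000, whereas a 5-second tumbling window assigns it only to the window starting at 10,000. When the slide equals the size, the sliding case degenerates to the tumbling one.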
  • 26. Analyzing Data Streams
      ● Watermarks: indicate that no elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data. They mark the progress of event time.
      ● Triggers: decide when the window is evaluated or purged. They affect latency and the amount of state kept.
      ● Late data: provide a threshold for how late data can be compared to the current watermark value.
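How watermarks, window evaluation and late data relate (slide 26) can be sketched with a small tracker. This is a simplified illustration under stated assumptions, not any engine's actual API: the watermark is taken to be the largest event timestamp seen minus a fixed out-of-orderness bound, windows are half-open `[start, end)`, and `WatermarkTracker` with its parameters is a hypothetical name.

```java
final class WatermarkTracker {
    private final long maxOutOfOrdernessMs; // how far events may arrive out of order
    private final long allowedLatenessMs;   // extra time late events are still accepted
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkTracker(long maxOutOfOrdernessMs, long allowedLatenessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.allowedLatenessMs = allowedLatenessMs;
    }

    /** Advance event time using a newly observed event timestamp. */
    void observe(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
    }

    /** Current watermark, or Long.MIN_VALUE before any event was observed. */
    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    /** A window ending at windowEndMs can be evaluated once the watermark covers its last timestamp. */
    boolean canFire(long windowEndMs) {
        return currentWatermark() >= windowEndMs - 1;
    }

    /** Events for this window are too late once the watermark also passes the lateness threshold. */
    boolean isTooLate(long windowEndMs) {
        return currentWatermark() >= windowEndMs - 1 + allowedLatenessMs;
    }
}
```

A larger out-of-orderness bound delays window results (higher latency) but drops fewer events; a larger allowed lateness keeps window state around for longer.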
  • 27. Analyzing Data Streams
      ● Recent advances in streaming (like the concept of watermarks) are a result of pioneering work:
        ○ MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
        ○ The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803.
      ● The-world-beyond-batch-streaming-101 (Tyler Akidau)
      ● The-world-beyond-batch-streaming-102 (Tyler Akidau)
  • 28. Analyzing Data Streams
      ● Apache Beam is the open source successor of Google’s DataFlow.
      ● Provides the advanced semantics needed for the current needs in streaming applications.
      ● Google DataFlow, Apache Flink, Apache Spark follow that model. (https://beam.apache.org/documentation/runners/capability-matrix)
  • 29. Streams meet distributed log - I
      Streams fit naturally with the idea of the distributed log (e.g. Kafka Streams is integrated with Kafka, and Dell/EMC’s Pravega* uses the stream as a storage primitive on top of Apache BookKeeper).
      *Pravega is an open-source streaming storage system.
  • 30. Streams meet distributed log - II
      Distributed log possible use cases:
      ● Implement external services (micro-services)
      ● Implement internal operations (e.g. Kafka Streams shuffling, fault-tolerance)
  • 31. Processing Guarantees
      Many things can go wrong…
      ● At-most once
      ● At-least once
      ● Exactly once
      What are the boundaries? Within the streaming engine? How about end-to-end, including sources and sinks? How about side effects like calling an external service?
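One common way to reason about the end-to-end question on slide 31 is to pair at-least-once delivery with an idempotent sink, so that redelivered events do not change the result. The sketch below illustrates that idea only and is not taken from the talk: `IdempotentSink` is a hypothetical class, each event is assumed to carry a unique id, and the set of seen ids is kept in memory (a real sink would have to persist it together with the result).

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentSink {
    private final Set<String> seenEventIds = new HashSet<>();
    private long total = 0;

    /**
     * Apply the event once. Returns false when the event id was already
     * applied, which is what happens on redelivery after a failure.
     */
    boolean write(String eventId, long amount) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate delivery: ignore
        }
        total += amount;
        return true;
    }

    long total() {
        return total;
    }
}
```

With this sink, replaying event `e1` after a failure leaves the total unchanged, so the observable result looks as if each event was processed exactly once. Side effects that cannot be deduplicated this way, such as calls to an external service, remain outside the guarantee.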
  • 32. Table Stream Duality
      Stream → table: The aggregation of a stream of updates over time yields a table.
      Table → stream: The observation of changes to a table over time yields a stream.
      Why is this useful?
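Both directions of the duality on slide 32 can be shown with a small in-memory example. This is a minimal sketch with hypothetical names (`ChangelogTable`, `upsert`, `changes`): applying a stream of key/value updates yields a table holding the latest value per key, and recording every change made to that table yields a stream again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangelogTable {
    private final Map<String, Long> table = new HashMap<>();
    private final List<String> changelog = new ArrayList<>();

    /** Stream -> table: apply one update; the table keeps the latest value per key. */
    void upsert(String key, long value) {
        Long previous = table.put(key, value);
        // Table -> stream: emit a change record only when the row actually changed.
        if (previous == null || previous != value) {
            changelog.add(key + "=" + value);
        }
    }

    /** Copy of the current table contents. */
    Map<String, Long> snapshot() {
        return new HashMap<>(table);
    }

    /** Copy of the stream of changes observed so far. */
    List<String> changes() {
        return new ArrayList<>(changelog);
    }
}
```

Replaying the change stream into an empty table reproduces the same table, which is why a changelog can serve both as a recovery mechanism and as a way to feed the table's contents to downstream consumers.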
  • 33. Streaming SQL
      Query semantics? How do we define a join on an unbounded stream? A table join?
      There is joint work on this, with Apache Flink among the participants: https://docs.google.com/document/d/1wrla8mF_mmq-NW9sdJHYVgMyZsgCmHumJJ5f5WUzTiM/
  • 34. Streaming Applications - Spark Structured Streaming API
      Code annotations: create a Spark session and read from a Kafka topic.
  • 35. Streaming Applications - Spark Structured Streaming API
      Code annotations: sensor metadata; emit complete output to the console for every window update based on event time; set up a trigger.
  • 36. Streaming Applications - Flink Streaming API
      Code annotations: custom source; initial sensor values.
  • 37. Streaming Applications - Flink Streaming API
      Code annotations: watermark generation; create some random data.
  • 38. Streaming Applications - Flink Streaming API
      Code annotations: create a windowed keyed stream; apply a function per window.
  • 39. Kafka Streams vs Beam Model
      - A trigger is more of an operational aspect than a business parameter like the window length. How often the computation is updated (which affects latency and state size) is a non-functional requirement.
      - A Table covers both the case of immutable data and the case of updatable data.
  • 40. Kafka Streams vs Beam Model
      KTable<Windowed<String>, Long> aggregated = inputStream
          .groupByKey()
          .reduce((aggValue, newValue) -> aggValue + newValue,
                  TimeWindows.of(TimeUnit.MINUTES.toMillis(2))
                             .until(TimeUnit.DAYS.toMillis(1) /* keep for one day */),
                  "queryStoreName");
      props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100 /* milliseconds */);
      props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
      https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model