4. @s_kontopoulos
Insights
Data Insight: a conclusion or piece of information which can be used to take
action and optimize a decision-making process.
Customer Insight: a non-obvious understanding about your customers which, if
acted upon, has the potential to change their behaviour for mutual benefit.
(Customer insight, Wikipedia)
[Diagram: DATA → INFO → INSIGHTS → ACTIONS]
6. @s_kontopoulos
Streaming Analytics - Bridging the Gap
[Diagram: Data Input Flow (sensors, mobile apps, etc) → Collect → Analyze →
Data Output Flow (alarms, visualizations, ML scoring, etc), backed by a
Permanent Store]
7. @s_kontopoulos
Streaming Analytics
“Streaming Analytics is the acquisition and analysis of data at the moment it
streams into the system. It is a process done in a near real-time (NRT) fashion,
and the analysis results trigger specific actions for the system to execute.“
● No hard constraints or deadlines of the kind found in real-time (RT) systems
● The end-to-end processing delay varies and depends on the application (< 1 ms
to minutes)
8. @s_kontopoulos
Big Data vs Fast Data
● Data in motion is the key characteristic.
● Fast Data is the new Big Data!
There are two categories of systems: batch vs streaming systems.
12. @s_kontopoulos
Streaming Data Pipeline
In-memory processing as data flows...
[Diagram: New Data → Streaming Platform (Apache Kafka) → Analysis (Apache Flink,
Akka Streams, Kafka Streams) → NR-Time View]
13. @s_kontopoulos
Streaming Platforms
It's an ecosystem/environment that supports building and running streaming
applications. At its core it uses a streaming engine. Typical components:
● A durable pub/sub component to fetch or store data
● A streaming engine
● A registry for storing data metadata, such as the data format
14. @s_kontopoulos
Streaming Platforms - Some Examples
- Fast Data Platform (https://www.lightbend.com/products/fast-data-platform)
- Confluent Enterprise (https://www.confluent.io/product/confluent-platform)
- Da-Platform-2 (https://data-artisans.com/da-platform-2)
- Databricks Platform (https://databricks.com/product/unified-analytics-platform)
- IBM Streams (https://www.ibm.com/analytics/us/en/technology/stream-computing/)
- MapR Streams (https://mapr.com/products/mapr-streams/)
- Pravega (http://pravega.io)
...
15. @s_kontopoulos
Streaming Engine - the Core
A streaming engine provides the basic capabilities for developing and deploying
streaming applications.
Some systems, like Kafka Streams or Akka Streams, are just libraries and don't
cover deployment effectively.
16. @s_kontopoulos
Streaming Engine - Key Features I
● Fault-Tolerance
● Processing Guarantees
● Checkpointing
● Streaming SQL
● Batch & Streaming APIs
● Language Integration (Python, Java, Scala, R)
● State Management (e.g. user session state)
● Locality Awareness
● Backpressure
17. @s_kontopoulos
Streaming Engine - Key Features II
● Multi-Scheduler Support: Yarn, Mesos, Kubernetes
● Micro batching vs Data Flow
● ML, Graph, CEP
● Connectors (Sources, Sinks)
● Memory/Disk Management (shuffling)
● Security (Kerberos etc)
18. @s_kontopoulos
DataFlow Execution Model
The user defines computations/operations (map, flatMap, etc.) on the data sets
(bounded or not) as a DAG. The data sets are treated as immutable distributed
data. The DAG is shipped to the nodes where the data lie, the computation is
executed there, and the results are sent back to the user.
[Figures: Spark model example; Flink model (FLIP-6)]
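The execution model above can be sketched in a few lines of Python. The names (`make_plan`, `execute`) are illustrative, not any real engine's API: a plan is built first as a chain of operators over an immutable data set, and only then executed over each partition.

```python
# A toy dataflow plan: an ordered chain of operators (a linear DAG) built
# first, then executed over an immutable partition of the data.
# All names here are illustrative, not taken from any real engine.

def make_plan(*operators):
    """Build the logical plan: just an ordered list of operator functions."""
    return list(operators)

def execute(plan, partition):
    """Run the plan over one partition; the input is never mutated."""
    data = list(partition)
    for op in plan:
        data = op(data)
    return data

# A map (double each element) followed by a filter, as in a map/flatMap chain.
plan = make_plan(
    lambda xs: [x * 2 for x in xs],       # map
    lambda xs: [x for x in xs if x > 4],  # filter
)
print(execute(plan, [1, 2, 3]))  # [6]
```

A real engine additionally splits the data into many partitions and runs the same plan on each node in parallel; the sketch keeps only the "plan first, execute later" separation.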
20. @s_kontopoulos
The Modern Enterprise Fast Data Architecture
[Diagram: Infrastructure (on premise, cloud) → Cluster scheduler (Yarn,
Standalone, Kubernetes, Mesos) → Streaming Platform (pub/sub, streaming engine,
etc) → Fast Data Apps, Microservices, ML; cross-cutting concerns: Operations,
Monitoring, Security, Governance; Permanent Storage (HDFS, S3, ...), BI, Data
Lake]
22. @s_kontopoulos
Analyzing Data Streams
Processing infinite data streams imposes certain restrictions compared to batch
processing:
- We may need to trade off accuracy against space and time costs, e.g. use
approximate algorithms or sketches (such as count-min) for summarizing stream
data.
- Streaming jobs need to operate 24/7 and must be able to adapt to code
changes, failures and load variance.
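The accuracy-for-space trade-off can be made concrete with a toy count-min sketch. This is a minimal sketch assuming MD5-derived hashing; real deployments use a library implementation with width and depth tuned to the desired error bounds.

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: fixed memory, estimates never under-count."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # One independent-ish hash function per row, derived from MD5.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # Taking the minimum over rows limits the damage from hash
        # collisions; the estimate may over-count, but never under-counts.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # at least 3, and typically exactly 3
```

The table is depth × width counters regardless of how many distinct items flow through; that constant footprint is exactly what makes sketches attractive for 24/7 stream summarization.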
23. @s_kontopoulos
Analyzing Data Streams
● Data flows from one or more sources through the engine and is written to one
or more sinks.
● Two cases for processing:
○ Single event processing: event transformation, e.g. triggering an alarm on an error event
○ Event aggregations: summary statistics, group-by, join and similar queries. For example,
compute the average temperature over the last 5 minutes from a sensor data stream.
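The "average temperature over the last 5 minutes" example can be sketched as a tumbling-window aggregation. For clarity it is computed here over a finite batch of (event_time_seconds, temperature) pairs; a real engine performs the same grouping incrementally as events arrive.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows

def window_averages(events):
    """events: iterable of (event_time_seconds, temperature) pairs.
    Returns {window_start: average temperature in that window}."""
    acc = defaultdict(lambda: [0.0, 0])   # window_start -> [sum, count]
    for ts, temp in events:
        start = ts - ts % WINDOW_SECONDS  # the window this event falls into
        acc[start][0] += temp
        acc[start][1] += 1
    return {start: s / n for start, (s, n) in sorted(acc.items())}

events = [(10, 20.0), (200, 22.0), (301, 30.0)]
print(window_averages(events))  # {0: 21.0, 300: 30.0}
```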
24. @s_kontopoulos
Analyzing Data Streams
● Event aggregation introduces the concept of windowing, with respect to the
selected notion of time:
○ Event time (the time that events happen): Important for most use cases where context and
correctness matter at the same time. Example: billing applications, anomaly detection.
○ Processing time (the time events are observed during processing): use cases where I only
care about what I process in a window. Example: accumulated clicks on a page per second.
○ System Arrival or Ingestion time (the time that events arrived at the streaming system).
● Ideally, event time = processing time. In reality, there is skew.
25. @s_kontopoulos
Analyzing Data Streams
● Windows come in different flavors:
○ Tumbling windows discretize a stream into non-overlapping windows.
○ Sliding windows slide over the stream of data and may overlap.
Images: https://flink.apache.org/news/2015/12/04/Introducing-windows.html
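The difference between the two flavors shows up in window assignment: a tumbling window owns each event exclusively, while a sliding window whose size exceeds its slide assigns each event to several overlapping windows. A minimal sketch, with times in arbitrary integer units and half-open windows [start, end):

```python
def tumbling_window(ts, size):
    """The single window [start, start + size) containing timestamp ts."""
    start = ts - ts % size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    """All windows of the given size, starting every `slide` units,
    that contain timestamp ts (size >= slide assumed)."""
    windows = []
    start = ts - ts % slide     # the latest window start at or before ts
    while start > ts - size:    # walk back through the overlapping windows
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

print(tumbling_window(7, size=5))            # [(5, 10)]
print(sliding_windows(7, size=10, slide=5))  # [(0, 10), (5, 15)]
```

With size == slide the sliding case degenerates to a tumbling window, which is why many engines implement tumbling windows as a special case of sliding ones.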
26. @s_kontopoulos
Analyzing Data Streams
● Watermarks: a watermark indicates that no elements with a timestamp older
than or equal to the watermark timestamp should still arrive for the specific
window of data. It marks the progress of event time.
● Triggers: decide when the window is evaluated or purged. They affect latency
and the state kept.
● Late data: provide a threshold for how late data may be, compared to the
current watermark value.
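A minimal sketch of how these three pieces interact, assuming a watermark derived as "max event time seen minus a fixed out-of-orderness bound" and a trigger that fires a window once the watermark passes its end. Real engines make all three pluggable, and usually buffer (rather than drop) data within an allowed-lateness bound.

```python
SIZE = 10      # tumbling window size, in event-time units
LATENESS = 2   # allowed out-of-orderness used to derive the watermark

def run(events):
    """events: iterable of (event_time, value). Returns {window_start: sum}."""
    open_windows = {}              # window_start -> running sum (kept state)
    fired = {}
    watermark = float("-inf")
    for ts, value in events:
        watermark = max(watermark, ts - LATENESS)  # event-time progress
        if ts <= watermark:
            continue               # late data: simply dropped in this sketch
        start = ts - ts % SIZE
        open_windows[start] = open_windows.get(start, 0) + value
        # Trigger: evaluate and purge every window fully behind the watermark.
        for s in [s for s in open_windows if s + SIZE <= watermark]:
            fired[s] = open_windows.pop(s)
    fired.update(open_windows)     # end of input: flush remaining windows
    return fired

# The event at time 2 arrives after the watermark passed 2, so it is late.
print(run([(1, 1), (3, 1), (12, 1), (25, 1), (2, 1)]))  # {0: 2, 10: 1, 20: 1}
```

Note how the trigger choice directly controls both latency (windows fire only once the watermark passes them) and state size (fired windows are purged), matching the bullet above.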
27. @s_kontopoulos
Analyzing Data Streams
● Recent advances in streaming (like the concept of watermarks) are the
result of pioneering work:
○ MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB
2013.
○ The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803
● The world beyond batch: Streaming 101 (Tyler Akidau)
● The world beyond batch: Streaming 102 (Tyler Akidau)
28. @s_kontopoulos
Analyzing Data Streams
● Apache Beam is the open-source successor of Google's
Dataflow.
● It provides the advanced semantics needed by current
streaming applications.
● Google Dataflow, Apache Flink and Apache Spark follow
that model
(https://beam.apache.org/documentation/runners/capability-matrix)
29. @s_kontopoulos
Streams meet distributed log - I
Streams fit naturally with the idea of the distributed log (e.g. Kafka Streams
is integrated with Kafka, and Dell/EMC's Pravega* uses the stream as a storage
primitive on top of Apache BookKeeper).
*Pravega is an open-source streaming storage system.
30. @s_kontopoulos
Streams meet distributed log - II
Possible use cases for the distributed log:
● Implementing external services (micro-services)
● Implementing internal operations (e.g. Kafka Streams shuffling, fault
tolerance)
31. @s_kontopoulos
Processing Guarantees
Many things can go wrong…
● At-most once
● At-least once
● Exactly once
What are the boundaries?
Within the streaming engine?
How about end-to-end including sources and sinks?
How about side effects like calling an external service?
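"Exactly once" inside an engine is typically built from at-least-once delivery plus deduplication or idempotent writes at the boundary; side effects such as calls to an external service are exactly what this cannot retract. A minimal sketch of a deduplicating sink (the names are hypothetical; a real sink would persist the seen-ID set transactionally together with the output):

```python
class DedupSink:
    """Makes at-least-once delivery look exactly-once to downstream state."""

    def __init__(self):
        self.seen_ids = set()  # would be persisted with the output in reality
        self.total = 0

    def write(self, event_id, amount):
        if event_id in self.seen_ids:
            return False       # duplicate redelivery after a failure: ignore
        self.seen_ids.add(event_id)
        self.total += amount
        return True

sink = DedupSink()
for event_id, amount in [("e1", 5), ("e2", 7), ("e1", 5)]:  # "e1" redelivered
    sink.write(event_id, amount)
print(sink.total)  # 12, not 17: the duplicate changed nothing
```

This also illustrates the boundary question on the slide: the trick works for state the sink owns, but an external service called once per `write` would still see the duplicate.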
32. @s_kontopoulos
Table Stream Duality
Stream → table: the aggregation of a stream of updates over time yields a
table.
Table → stream: the observation of changes to a table over time yields a
stream.
Why is this useful?
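The duality itself fits in a few lines of Python, using a plain dict as the "table" (a sketch; deletions and keyed compaction are ignored for brevity):

```python
def stream_to_table(changelog):
    """Aggregate a stream of (key, value) updates over time into a table."""
    table = {}
    for key, value in changelog:
        table[key] = value     # later updates win
    return table

def table_to_stream(snapshots):
    """Observe successive table states and emit the changes as a stream."""
    stream, previous = [], {}
    for snapshot in snapshots:
        for key, value in snapshot.items():
            if previous.get(key) != value:
                stream.append((key, value))
        previous = snapshot
    return stream

changelog = [("a", 1), ("b", 2), ("a", 3)]
print(stream_to_table(changelog))  # {'a': 3, 'b': 2}
print(table_to_stream([{"a": 1}, {"a": 1, "b": 2}, {"a": 3, "b": 2}]))
# [('a', 1), ('b', 2), ('a', 3)]
```

Why is this useful? The same data can be consumed either way: as latest state (the table, e.g. for queries) or as full history (the stream, e.g. for replay and fault tolerance), which is exactly how Kafka Streams' KTable/KStream pair works.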
33. @s_kontopoulos
Streaming SQL Queries
Semantics? How do we define a join on an unbounded stream? A table join?
There is joint work on this:
https://docs.google.com/document/d/1wrla8mF_mmq-NW9sdJHYVgMyZsgCmHumJJ5f5WUzTiM/
Apache Flink
35. @s_kontopoulos
Streaming Applications - Spark Structured Streaming API
[Code screenshot annotations: sensor metadata; emit complete output for every
window update, based on event time, to the console; set up a trigger]
39. @s_kontopoulos
Kafka Streams vs Beam Model
- A trigger is more of an operational aspect, compared to business parameters
like the window length. How often the computation is updated (affecting
latency and state size) is a non-functional requirement.
- A table covers both the case of immutable data and the case of updatable
data.
40. @s_kontopoulos
Kafka Streams vs Beam Model
// Windowed aggregation with the (pre-1.0) Kafka Streams DSL: sum the values
// per key over 2-minute windows, retaining window state for one day.
KTable<Windowed<String>, Long> aggregated = inputStream
    .groupByKey()
    .reduce((aggValue, newValue) -> aggValue + newValue,
        TimeWindows.of(TimeUnit.MINUTES.toMillis(2))
            .until(TimeUnit.DAYS.toMillis(1) /* keep for one day */),
        "queryStoreName");

// The commit interval and record cache size control how often downstream
// updates are emitted: Kafka Streams' analogue of a trigger.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100 /* milliseconds */);
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model