Data streaming
1.
2. Alberto Paro
Master's Degree in Computer Science Engineering at Politecnico di Milano
Big Data Practice Leader at NTT DATA Italia
Author of 4 books about ElasticSearch (versions 1 to 7.x), plus 6 technical reviews
Big Data trainer, developer, and consultant on big data technologies (Akka,
Playframework, Apache Spark, Reactive Programming) and NoSQL (Accumulo,
HBase, Cassandra, ElasticSearch, Kafka and MongoDB)
Evangelist for the Scala and Scala.js languages
3. SUMMARY
• Why?
• Architectures
• Message Brokers
• Streaming Frameworks
• Streaming Libraries
Data Streaming: Architetture e principali soluzioni - 16 Giugno 2020 (A.Paro)
4. THE START OF THE JOURNEY
WHY
STREAMING
PROCESSING
5. NEED FOR STREAMING
Business
• Real-time processing/unbounded data processing is
key to winning (e.g. banking, finance, … sports)
• No longer tied to nightly batch processing.
• Real-time processing reduces time-to-market.
• Fast feedback on customers (e.g. campaign
monitoring)
• Real-time balancing of resources (demand-
response)
Technical
• Many of the systems we want to monitor and
understand produce a continuous stream of events,
such as heartbeats, machine metrics, and GPS signals.
• Distribute data processing over time (no more big
batch jobs, if possible)
• Reduce the processing power needed in big data
environments
• Application decoupling: separation of concerns on
data (the Kafka way)
• Manage backpressure in the application flow.
6. STANDARD STREAMING FLOW
• Source
• Message Broker
• Streaming Engine
• Destination
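The four stages above can be sketched in plain Python. This is only an illustration of how the pieces hand data to each other, not a real broker or engine: the `Broker` class, the sensor events, and the squaring transformation are all hypothetical stand-ins.

```python
from collections import deque


def source():
    """Source: emits raw events (hypothetical sensor readings)."""
    for i in range(5):
        yield {"sensor": "s1", "value": i}


class Broker:
    """Message broker stand-in: a FIFO buffer that decouples producer and consumer."""

    def __init__(self):
        self._queue = deque()

    def publish(self, message):
        self._queue.append(message)

    def consume(self):
        # Hand out messages in arrival order until the buffer is empty.
        while self._queue:
            yield self._queue.popleft()


def streaming_engine(messages):
    """Streaming engine: transforms each message as it arrives."""
    for message in messages:
        yield {**message, "value_squared": message["value"] ** 2}


def run_pipeline():
    broker = Broker()
    for event in source():
        broker.publish(event)
    destination = []  # Destination: a list standing in for external storage
    for result in streaming_engine(broker.consume()):
        destination.append(result)
    return destination


results = run_pipeline()
```

In a real deployment the source and the engine run concurrently and the broker persists messages; here everything runs in one process to keep the flow visible.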
9. RABBITMQ
• RabbitMQ is an open-source message-broker software
(sometimes called message-oriented middleware) that
originally implemented the Advanced Message Queuing
Protocol (AMQP) and has since been extended with a plug-in
architecture to support Streaming Text Oriented Messaging
Protocol (STOMP), MQ Telemetry Transport (MQTT), and
other protocols.
• The RabbitMQ server program is written in the Erlang
programming language and is built on the Open Telecom
Platform framework for clustering and failover.
• Client libraries to interface with the broker are available for all
major programming languages.
• Rabbit Technologies Ltd. originally developed RabbitMQ.
Rabbit Technologies started as a joint venture between LShift
and CohesiveFT in 2007, and was acquired in April 2010 by
SpringSource, a division of VMware. The project became part
of Pivotal Software in May 2013.
10. APACHE KAFKA
• Kafka Streams is a client library of Kafka for real-time stream
processing and analyzing data stored in Kafka brokers.
• The Streams API allows an application to act as a stream
processor, consuming an input stream from one or more
topics and producing an output stream to one or more output
topics, effectively transforming the input streams to output
streams.
• In Kafka a stream processor is anything that takes continual
streams of data from input topics, performs some processing
on this input, and produces continual streams of data to
output topics.
• It is possible to do simple processing directly using the
producer and consumer APIs.
• For more complex transformations, the Streams API allows
building applications that do non-trivial processing, such as
computing aggregations over streams or joining streams
together.
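The consume, process, produce loop of a stream processor can be illustrated without Kafka at all. The sketch below is plain Python, not the Kafka Streams API: the input "topic" is just a list of key/value pairs, and `count_by_key` is a hypothetical name for a running aggregation that emits an updated count downstream for every input record.

```python
from collections import defaultdict


def count_by_key(input_topic):
    """Consume (key, value) records and emit a running count per key."""
    counts = defaultdict(int)
    for key, _value in input_topic:
        counts[key] += 1
        # Every update is produced downstream, similar to a changelog stream.
        yield key, counts[key]


input_topic = [("page-a", "click"), ("page-b", "click"), ("page-a", "click")]
output_topic = list(count_by_key(input_topic))
```

The state (`counts`) lives inside the processor, which is what makes aggregations and joins "non-trivial" compared with stateless record-by-record processing.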
12. APACHE PULSAR
• Apache Pulsar is an open-source distributed pub-sub messaging
system originally created at Yahoo.
• Like Kafka, Pulsar uses the concept of topics and subscriptions to
create order from large amounts of streaming data in a scalable
and low-latency manner. In addition to publish and subscribe,
Pulsar can support point-to-point message queuing from a single
API. Like Kafka, the project relies on ZooKeeper for metadata storage and coordination, and it
also uses Apache BookKeeper for durable, ordered message storage.
• The creators of Pulsar say they developed it to address several
shortcomings of existing open source messaging systems. It has
been running in production at Yahoo since 2014 and was open
sourced in 2016. Pulsar is backed by a commercial open source
outfit called Streamlio.
• Pulsar’s strengths include multi-tenancy, geo-replication,
strong durability guarantees, and high message throughput, as well as a
single API for both queuing and publish-subscribe messaging.
Scaling a Pulsar cluster is as easy as adding nodes.
13. ONLY THE MOST USED
STREAMING
FRAMEWORKS
14. FRAMEWORKS – 10 QUICK LIST
Apache Spark
Apache Flink
Apache Samza
Apache Storm
Apache Kafka
Apache Flume
Apache NiFi
Apache Ignite
Apache Apex
Apache Beam
15. SPARK STREAMING – OLD STYLE
• Stream converted into micro-batches
• No watermarks
• No event-time management
• Can lose data in several ways
Old School
16. SPARK STRUCTURED STREAMING
• Real streaming (windowing, triggers,
watermarks)
• Natively DataFrames (no RDDs)
• Processing with event time, handling late
data
• End-to-end guarantees
Introduced in Spark 2.x
17. APACHE FLINK
• Flink relies on a streaming execution model, which is an
intuitive fit for processing unbounded datasets.
• Streaming execution is continuous processing on data that is
continuously produced. Alignment between the type of
dataset and the type of execution model offers many
advantages with regard to accuracy and performance.
• It provides results that are accurate, even in the case of out-
of-order or late-arriving data
• It is stateful and fault-tolerant and can seamlessly recover
from failures while maintaining exactly-once application state
• It performs at large scale, running on thousands of nodes with
very good throughput and latency characteristics
• Flink guarantees exactly-once semantics for stateful
computations.
18. APACHE NIFI
• Apache NiFi supports powerful and scalable directed graphs
of data routing, transformation, and system mediation logic.
• Web-based user interface:
▪ Seamless experience between design, control, feedback,
and monitoring.
• Highly configurable:
▪ Loss tolerant vs guaranteed delivery, Low latency vs high
throughput, Dynamic prioritization, Flow can be modified
at runtime, Back pressure.
• Designed for extension:
▪ Build your own processors and more, enables rapid
development and effective testing.
• Security:
▪ SSL, SSH, HTTPS, encrypted content, etc.
▪ Multi-tenant authorization and internal
authorization/policy management
19. APACHE IGNITE
• Apache Ignite In-Memory Data Fabric is a high-performance,
integrated and distributed in-memory platform for computing
and transacting on large-scale data sets in real-time, orders of
magnitude faster than possible with traditional disk-based or
flash-based technologies.
• You can view Ignite as a collection of independent, well-
integrated, in-memory components geared to improve
performance and scalability of your application. Some of these
components include:
• Advanced Clustering, Data Grid
• SQL Grid, Streaming & CEP
• Compute Grid, Service Grid
• Ignite File System, Distributed Data Structures
• Distributed Messaging, Distributed Events
• Hadoop Accelerator, Spark Shared RDDs
20. APACHE BEAM
• Apache Beam is an open source, unified programming model that
you can use to create a data processing pipeline.
• You start by building a program that defines the pipeline using one
of the open source Beam SDKs.
• The pipeline is then executed by one of Beam’s supported
distributed processing back-ends, which include Apache Apex,
Apache Flink, Apache Spark, and Google Cloud Dataflow.
• Apache Beam provides an advanced unified programming model,
allowing you to implement batch and streaming data processing
jobs that can run on any execution engine.
• Apache Beam is:
▪ UNIFIED - Use a single programming model for both batch
and streaming use cases.
▪ PORTABLE - Execute pipelines on multiple execution
environments, including Apache Apex, Apache Flink, Apache
Spark, and Google Cloud Dataflow.
▪ EXTENSIBLE - Write and share new SDKs, IO connectors, and
transformation libraries.
22. ONLY THE MOST USED
STREAMING
LIBRARIES
23. REACTIVE STREAMS
“Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.”
In the future, all data processing will be managed by streams.
Adoptions:
Akka Streams
MongoDB
Ratpack
Reactive Rabbit – driver for RabbitMQ/AMQP
Spring and Pivotal Project Reactor
Netflix RxJava
Slick 3.0
Vert.x 3.0
24.
25. BACK PRESSURE CONCEPTS
The main players in managing flow are Publishers and
Subscribers (Consumers)
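A minimal sketch of demand-driven back pressure between a Publisher and a Subscriber follows. It is plain Python that loosely mirrors the Reactive Streams roles (`subscribe`, `request(n)`, `on_next`, `on_complete`), not an implementation of the specification, and the class names are assumptions made for illustration. The key idea is that the publisher only emits as many items as the subscriber has requested.

```python
class Subscription:
    """Link between one publisher and one subscriber; carries the demand."""

    def __init__(self, items, subscriber):
        self._items = iter(items)
        self._subscriber = subscriber
        self._done = False

    def request(self, n):
        """Emit at most n items: the subscriber, not the publisher, sets the pace."""
        for _ in range(n):
            if self._done:
                return
            try:
                item = next(self._items)
            except StopIteration:
                self._done = True
                self._subscriber.on_complete()
                return
            self._subscriber.on_next(item)


class Publisher:
    def __init__(self, items):
        self._items = items

    def subscribe(self, subscriber):
        subscriber.on_subscribe(Subscription(self._items, subscriber))


class SlowSubscriber:
    """Subscriber that asks for data only when it is ready for more."""

    def __init__(self):
        self.received = []
        self.completed = False
        self.subscription = None

    def on_subscribe(self, subscription):
        self.subscription = subscription

    def on_next(self, item):
        self.received.append(item)

    def on_complete(self):
        self.completed = True


subscriber = SlowSubscriber()
Publisher(range(5)).subscribe(subscriber)

subscriber.subscription.request(2)       # ready for two items only
after_first_request = list(subscriber.received)

subscriber.subscription.request(10)      # ready for the rest
```

Nothing is pushed until `request` is called, so a slow consumer can never be flooded by a fast producer.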
Data Streaming is real-time / unbounded data processing.
▪ Real-time processing and analytics bears the promise of making organizations more
▪ Many of the systems we want to monitor and understand produce a continuous stream of events, such as heartbeats, ocean currents, machine metrics, and GPS signals.
▪ Even analysis of sporadic events such as website traffic can benefit from a streaming data approach.
▪ There are many potential advantages of handling data as streams, but until recently this method was somewhat difficult to do well.
▪ Streaming data and real-time analytics formed a fairly specialized undertaking rather than a widespread approach.
▪ Apache Spark:
▪ Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation.
▪ It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
▪ The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
▪ Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
▪ Besides supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Spark Streaming
Spark Streaming is a separate library in Spark to process continuously flowing streaming data. It provides us with the DStream API, which is powered by Spark RDDs. A DStream divides the data received from the streaming source into chunks, each represented as an RDD, processes them, and then sends the results to the destination. Cool, right?!
Structured Streaming
From the Spark 2.x release onwards, Structured Streaming came into the picture. Built on the Spark SQL library, Structured Streaming is another way to handle streaming with Spark. This model of streaming is based on the DataFrame and Dataset APIs. Hence, with this library, we can easily apply any SQL query (using the DataFrame API) or Scala operations (using the Dataset API) on streaming data.
Okay, so that was the summarized theory for both ways of streaming in Spark. Now we need to compare the two.
Distinctions
1. Real Streaming
What does real streaming imply? Data that is unbounded and is processed as soon as it is received from the source. This definition can be satisfied (more or less).
If we talk about Spark Streaming, this is not the case. Spark Streaming works on something we call a micro-batch. The stream pipeline is registered with some operations, and Spark polls the source after every batch duration (defined in the application). A batch is then created from the received data, i.e. each incoming record belongs to a batch of the DStream. Each batch represents an RDD.
Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it differs from Spark Streaming in ways that bring it closer to real streaming. Structured Streaming exposes no batch concept in its API, even though its default engine also executes in micro-batches. The data received in a trigger is appended to the continuously flowing data stream. Each row of the data stream is processed and the result is updated in the unbounded result table. How you want your result (all the results, updated results only, or new results only) depends on the output mode of your operations (Complete, Update, Append).
Winner of this round: Structured Streaming.
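The difference between output modes is easier to see on a toy result table. The following is a plain-Python sketch, not Spark code: `run_triggers` is a hypothetical helper that keeps a running word count and records what would be emitted after each trigger in Complete mode (the whole result table) and in Update mode (only the rows changed by that trigger). Append mode is left out because, for an aggregation, it needs a watermark to decide when a row is final.

```python
from collections import Counter


def run_triggers(triggers, mode):
    """Return what a running word count would emit after each trigger."""
    result_table = Counter()  # stands in for the unbounded result table
    emitted = []
    for new_rows in triggers:
        changed = set()
        for word in new_rows:
            result_table[word] += 1
            changed.add(word)
        if mode == "complete":
            # Complete: the entire result table is written out every time.
            emitted.append(dict(result_table))
        elif mode == "update":
            # Update: only rows that changed during this trigger are written.
            emitted.append({w: result_table[w] for w in changed})
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return emitted


triggers = [["a", "b"], ["a"]]
complete_output = run_triggers(triggers, "complete")
update_output = run_triggers(triggers, "update")
```

After the second trigger, Complete mode re-emits the unchanged count for `b`, while Update mode emits only the new count for `a`.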
2. RDD vs. DataFrames/DataSet
Another distinction is the API used by each streaming model. As summarized above, Spark Streaming works on the DStream API, which internally uses RDDs, while Structured Streaming uses the DataFrame and Dataset APIs to perform streaming operations. So it is a straight comparison between using RDDs or DataFrames. There are several blogs available which compare DataFrames and RDDs in terms of `performance` and `ease of use.` All those comparisons lead to one result: DataFrames are more optimized in terms of processing and provide more options for aggregations and other operations, with a variety of functions available (many more functions are now supported natively in Spark 2.4).
So Structured Streaming wins here with flying colors.
3. Processing With the Event Time, Handling Late Data
One big issue in the streaming world is how to process data according to event-time. Event-time is the time when the event actually happened. The source is not required to provide data to the streaming engine in real time: there may be latency between generating the data and handing it over to the processing engine.
Spark Streaming has no option to work on the data using event-time. It only works with the timestamp at which Spark received the data. Based on this ingestion timestamp, Spark Streaming puts the data in the current batch even if the event was generated earlier and belongs to an earlier batch. This may result in less accurate information, which is effectively equivalent to data loss.
On the other hand, Structured Streaming can process data on the basis of event-time when the timestamp of the event is included in the data received. This is a major feature introduced in Structured Streaming: it processes the data according to the time it was generated in the real world. With this, we can handle data coming in late and get more accurate results.
With event-time handling of late data, Structured Streaming outweighs Spark Streaming.
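The contrast between grouping by arrival time and grouping by event-time can be sketched in plain Python. This is a simplified model, not Spark's implementation: the window length, the allowed lateness, and the sample events are assumptions, and the watermark is advanced per record here, whereas a real engine typically advances it per trigger.

```python
from collections import defaultdict

WINDOW = 10           # window length in seconds (assumed)
ALLOWED_LATENESS = 5  # how far the watermark trails the newest event (assumed)


def by_arrival(events):
    """DStream-style: group each record by the batch in which it arrived."""
    batches = defaultdict(list)
    for _event_time, arrival_time, value in events:
        batches[arrival_time // WINDOW * WINDOW].append(value)
    return dict(batches)


def by_event_time(events):
    """Group by event-time window; drop records older than the watermark."""
    windows = defaultdict(list)
    dropped = []
    max_event_time = 0
    for event_time, _arrival_time, value in events:  # iterated in arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = event_time // WINDOW * WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append(value)  # its window has already been finalized
        else:
            windows[window_start].append(value)
    return dict(windows), dropped


# (event_time, arrival_time, value)
events = [
    (1, 2, "a"),    # on time
    (12, 13, "b"),  # on time
    (8, 14, "c"),   # late: happened in window 0, arrived during window 10
    (27, 28, "d"),  # on time; moves the watermark to 22
    (3, 29, "e"),   # too late: window 0 ended before the watermark
]

arrival_batches = by_arrival(events)
event_windows, dropped = by_event_time(events)
```

Grouping by arrival puts `c` in the wrong batch, while event-time windows place it back in window 0. The watermark bounds how long a window waits for stragglers, which is why `e` is discarded rather than reopening a finalized window.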
4. End-to-End Guarantees
Every application requires fault tolerance and end-to-end guarantees of data delivery. Whenever the application fails, it must be able to restart from the same point where it failed in order to avoid data loss and duplication. To provide fault tolerance, Spark Streaming and Structured Streaming both use checkpointing to save the progress of a job. But this approach alone still has many holes which may cause data loss.
Beyond checkpointing, Structured Streaming imposes two conditions for recovering from any error:
▪ The source must be replayable.
▪ The sinks must support idempotent operations to support reprocessing in case of failures.
With replayable sources and idempotent sinks, Spark Structured Streaming provides end-to-end, exactly-once semantics. Way to go Structured Streaming!
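Why these two conditions are enough can be shown with a small simulation. The sketch below is plain Python with hypothetical classes (`ReplayableSource`, `IdempotentSink`) and a dictionary standing in for the checkpoint. The job crashes after writing a record but before checkpointing it, then restarts from the checkpoint: the record is processed twice, yet the sink ends up with exactly one copy.

```python
class ReplayableSource:
    """Source that can be re-read from any offset (like a log-based broker)."""

    def __init__(self, records):
        self._records = list(records)

    def read_from(self, offset):
        return list(enumerate(self._records))[offset:]


class IdempotentSink:
    """Sink keyed by offset: writing the same record twice has no extra effect."""

    def __init__(self):
        self.rows = {}

    def write(self, offset, value):
        self.rows[offset] = value


def run(source, sink, checkpoint, fail_after=None):
    """Process from the checkpointed offset; optionally crash before checkpointing."""
    processed = 0
    for offset, value in source.read_from(checkpoint["offset"]):
        sink.write(offset, value.upper())
        processed += 1
        if fail_after is not None and processed == fail_after:
            raise RuntimeError("simulated crash after write, before checkpoint")
        checkpoint["offset"] = offset + 1


source = ReplayableSource(["a", "b", "c"])
sink = IdempotentSink()
checkpoint = {"offset": 0}

try:
    run(source, sink, checkpoint, fail_after=2)  # crashes on the second record
except RuntimeError:
    pass
offset_after_crash = checkpoint["offset"]

run(source, sink, checkpoint)  # restart: replays the second record, then continues
```

Without a replayable source, the second record could not be re-read after the crash; without an idempotent sink, it would be stored twice.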
5. Restricted or Flexible
Sink: The destination of a streaming operation. It can be external storage, a simple output to console, or any action.
With Spark Streaming, there is no restriction on the type of sink. Here we have the method foreachRDD to perform some action on the stream. This method hands us the RDD created by each batch, one by one, and we can perform any action on it, like saving to storage or performing some computations. We can cache an RDD and perform multiple actions on it as well (even sending the data to multiple databases).
But in Structured Streaming, until v2.3, we had a limited number of output sinks; with one sink, only one operation could be performed, and we could not save the output to multiple external storages. To use a custom sink, the user needed to implement ForeachWriter. Spark 2.4 brought a new sink called foreachBatch. This sink gives us the resultant output table as a DataFrame, which we can use to perform our custom operations.
With this new sink, the `restricted` Structured Streaming is now more `flexible`, which gives it an edge over Spark Streaming and its flexible sinks.
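The shape of a per-batch callback can be sketched without Spark. In Spark 2.4, the function passed to foreachBatch receives the batch output and a batch identifier. The plain-Python analogue below uses a hypothetical driver, `for_each_batch`, and two in-memory stand-ins for external stores to show one batch being written to several destinations.

```python
def for_each_batch(micro_batches, handler):
    """Hypothetical driver: call handler(batch, batch_id) once per micro-batch."""
    for batch_id, batch in enumerate(micro_batches):
        handler(batch, batch_id)


database_rows = []  # stand-in for a first external store
archive_files = {}  # stand-in for a second external store


def write_to_both(batch, batch_id):
    cached = list(batch)  # materialize once, then reuse for both writes
    database_rows.extend(cached)
    archive_files[f"batch-{batch_id}"] = cached


for_each_batch([[1, 2], [3]], write_to_both)
```

The batch identifier is what lets a sink deduplicate writes if the same batch is delivered again after a failure.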
Flink relies on a streaming execution model, which is an intuitive fit for processing unbounded datasets.
▪ Streaming execution is continuous processing on data that is continuously produced. Alignment between the type of dataset and the type of execution model offers many advantages with regard to accuracy and performance.
▪ It provides results that are accurate, even in the case of out-of-order or late-arriving data
▪ It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state
▪ It performs at large scale, running on thousands of nodes with very good throughput and latency characteristics
▪ Flink guarantees exactly-once semantics for stateful computations.