Alberto Paro
 Master's Degree in Computer Science Engineering at Politecnico di Milano
 Big Data Practice Leader at NTT DATA Italia
 Author of 4 books on Elasticsearch, from 1.x to 7.x, plus 6 technical reviews
 Big Data trainer, developer and consultant on big data technologies (Akka,
Play Framework, Apache Spark, reactive programming) and NoSQL (Accumulo,
 HBase, Cassandra, Elasticsearch, Kafka and MongoDB)
 Evangelist for the Scala and Scala.js languages
SUMMARY
• Why?
• Architectures
• Message Brokers
• Streaming Frameworks
• Streaming Libraries
Data Streaming: Architectures and main solutions - 16 June 2020 (A. Paro)
THE START OF THE JOURNEY
WHY STREAM PROCESSING
NEED FOR STREAMING
Business
• Real-time processing of unbounded data is a key winning factor (e.g. banking, finance, sports)
• No longer tied to nightly batch processing
• Real-time processing reduces time-to-market
• Fast feedback on customers (e.g. campaign monitoring)
• Real-time balancing of resources (demand-response)
• Many of the systems we want to monitor and understand emit continuous streams of events, such as heartbeats, machine metrics and GPS signals
Technical
• Distribute data processing over time (no more big batch jobs, where possible)
• Reduce the processing power needed in big data environments
• Application decoupling: separation of concerns on data (the Kafka way)
• Manage backpressure in the application flow
STANDARD STREAMING FLOW
• Source
• Message Broker
• Streaming Engine
• Destination
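As a toy illustration of these four stages (all names here are hypothetical, and an in-process queue stands in for a real message broker), the flow can be sketched in a few lines of Python:

```python
from queue import Queue

def source(events):
    """Source: emits raw events (e.g. sensor readings)."""
    for e in events:
        yield e

def streaming_engine(event):
    """Streaming engine: a per-event transformation."""
    return {"value": event, "doubled": event * 2}

def run_pipeline(events):
    broker = Queue()           # message broker: decouples source from engine
    for e in source(events):   # the source publishes into the broker
        broker.put(e)
    destination = []           # destination: a sink collecting results
    while not broker.empty():  # the engine consumes, transforms, writes out
        destination.append(streaming_engine(broker.get()))
    return destination

results = run_pipeline([1, 2, 3])  # three transformed events reach the sink
```

In a real deployment each stage is a separate process, and the broker (Kafka, RabbitMQ, Pulsar) is what lets them scale and fail independently.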
CONFLUENT KAFKA-LIKE ARCHITECTURE
TOP THREE
MESSAGE
BROKERS
RABBITMQ
• RabbitMQ is an open-source message-broker software
(sometimes called message-oriented middleware) that
originally implemented the Advanced Message Queuing
Protocol (AMQP) and has since been extended with a plug-in
architecture to support Streaming Text Oriented Messaging
Protocol (STOMP), MQ Telemetry Transport (MQTT), and
other protocols.
• The RabbitMQ server program is written in the Erlang
programming language and is built on the Open Telecom
Platform framework for clustering and failover.
• Client libraries to interface with the broker are available for all
major programming languages.
• Rabbit Technologies Ltd. originally developed RabbitMQ.
Rabbit Technologies started as a joint venture between LShift
and CohesiveFT in 2007, and was acquired in April 2010 by
SpringSource, a division of VMware. The project became part
of Pivotal Software in May 2013.
APACHE KAFKA
• Kafka Streams is a client library of Kafka for real-time stream
processing and analyzing data stored in Kafka brokers.
• The Streams API allows an application to act as a stream
processor, consuming an input stream from one or more
topics and producing an output stream to one or more output
topics, effectively transforming the input streams to output
streams.
• In Kafka a stream processor is anything that takes continual
streams of data from input topics, performs some processing
on this input, and produces continual streams of data to
output topics.
• It is also possible to do simple processing directly with the
producer and consumer APIs.
• Kafka Streams, however, allows building applications that do non-trivial
processing, such as computing aggregations off of streams or joining
streams together.
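As a hedged sketch of that idea (plain Python lists stand in for Kafka topics; this is not the Kafka Streams API), here is a word-count processor that consumes records from an input topic and produces a changelog of running counts to an output topic:

```python
from collections import defaultdict

def stream_processor(input_topic):
    """Consume a stream of records from an input topic, maintain a
    stateful aggregation (word count), and produce each count update
    to an output topic, changelog-style."""
    counts = defaultdict(int)
    output_topic = []
    for record in input_topic:                         # consume
        for word in record.split():
            counts[word] += 1                          # update state
            output_topic.append((word, counts[word]))  # produce update
    return output_topic

updates = stream_processor(["hello kafka", "hello streams"])
# each word emits its new running count as it is seen
```

In Kafka Streams the same shape would be expressed declaratively (consume a KStream, group, count into a KTable), with the state store and the output topic managed by the library.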
APACHE KAFKA
APACHE PULSAR
• Apache Pulsar is an open-source distributed pub-sub messaging
system originally created at Yahoo.
• Like Kafka, Pulsar uses the concept of topics and subscriptions to
create order from large amounts of streaming data in a scalable
and low-latency manner. In addition to publish and subscribe,
Pulsar can support point-to-point message queuing from a single
API. Like Kafka, the project relies on ZooKeeper (in Pulsar's case for
coordination and metadata), and it uses Apache BookKeeper for durable,
ordered message storage.
• The creators of Pulsar say they developed it to address several
shortcomings of existing open source messaging systems. It has
been running in production at Yahoo since 2014 and was open
sourced in 2016. Pulsar is backed by a commercial open source
outfit called Streamlio.
• Pulsar's strengths include multi-tenancy, geo-replication, strong
durability guarantees, high message throughput, and a single API for
both queuing and publish-subscribe messaging.
Scaling a Pulsar cluster is as easy as adding additional nodes.
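What gives Pulsar queue semantics from the same API is its "shared" subscription type, where messages on one topic are dispatched round-robin across the subscription's consumers. A toy sketch of that dispatch rule (names are illustrative; this is not the Pulsar client API):

```python
from itertools import cycle

def shared_subscription(messages, consumers):
    """Pulsar-style 'shared' subscription sketch: messages from one
    topic are dealt round-robin to the subscription's consumers,
    yielding point-to-point queue semantics on a pub-sub topic."""
    assignments = {c: [] for c in consumers}
    for msg, consumer in zip(messages, cycle(consumers)):
        assignments[consumer].append(msg)
    return assignments

shared_subscription(["m1", "m2", "m3", "m4"], ["c1", "c2"])
# each consumer gets every other message, like workers on a queue
```

An "exclusive" subscription, by contrast, would route every message to a single consumer, giving classic fan-out pub-sub when several such subscriptions exist on the topic.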
ONLY THE MOST USED
STREAMING
FRAMEWORKS
FRAMEWORKS – A QUICK LIST OF 10
 Apache Spark
 Apache Flink
 Apache Samza
 Apache Storm
 Apache Kafka
 Apache Flume
 Apache Nifi
 Apache Ignite
 Apache Apex
 Apache Beam
SPARK STREAMING – OLD STYLE
• Stream converted into micro-batches
• No watermarks
• No time-based event management
• Can lose data in several ways
Old school
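The micro-batch model can be sketched as follows (batched by count for brevity; the real DStream API batched by a fixed time interval):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded stream into fixed-size micro-batches, the way
    classic Spark Streaming (DStreams) chopped a stream into small RDDs."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch   # each batch is processed as a small batch job
            batch = []
    if batch:
        yield batch       # flush the final partial batch

list(micro_batches(range(7), 3))
```

The limitations above follow from this shape: an event's position in a batch depends on arrival time, not event time, so late or out-of-order data cannot be assigned to the right logical window.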
SPARK STRUCTURED STREAMING
• Real streaming (windowing, triggers,
watermarks)
• Natively DataFrames (no RDDs)
• Processes with event time, handling late
data
• End-to-end exactly-once guarantees
Introduced in Spark 2.x
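A simplified sketch of what event-time windows plus a watermark mean (tumbling windows; events arriving later than the watermark allows are dropped; real Structured Streaming is richer, e.g. it emits window updates incrementally):

```python
def windowed_counts(events, window_size, max_lateness):
    """Count events per (tumbling event-time window, key).
    Each event is a (event_time, key) pair, in arrival order.
    The watermark trails the max event time seen by max_lateness;
    events older than the watermark are considered too late."""
    watermark = 0
    counts = {}  # (window_start, key) -> count
    for event_time, key in events:
        watermark = max(watermark, event_time - max_lateness)
        if event_time < watermark:
            continue  # too late: beyond the watermark, discarded
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] = counts.get((window_start, key), 0) + 1
    return counts

# late event (2, "a") arrives after time 12 has pushed the watermark to 7
windowed_counts([(1, "a"), (12, "a"), (2, "a"), (30, "a"), (3, "a")],
                window_size=10, max_lateness=5)
```

The watermark is the trade-off knob: a larger `max_lateness` accepts more out-of-order data but forces the engine to keep window state open longer.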
APACHE FLINK
• Flink relies on a streaming execution model, which is an
intuitive fit for processing unbounded datasets.
• Streaming execution is continuous processing of data that is
continuously produced; this alignment between the type of
dataset and the type of execution model offers many
advantages with regard to accuracy and performance.
• It provides results that are accurate, even in the case of out-
of-order or late-arriving data
• It is stateful and fault-tolerant and can seamlessly recover
from failures while maintaining exactly-once application state
• It performs at large scale, running on thousands of nodes with
very good throughput and latency characteristics
• Flink guarantees exactly-once semantics for stateful
computations.
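The checkpoint-and-replay idea behind exactly-once state can be sketched as follows (purely illustrative, not Flink's API: a simulated crash rolls back to the last checkpointed offset and state, so no event is counted twice or lost):

```python
def process_with_checkpoints(events, checkpoint_every, fail_at=None):
    """Stateful running sum over a replayable event log, with periodic
    checkpoints of (next offset, state). A failure at offset `fail_at`
    restores the last checkpoint and replays from its offset, so each
    event contributes to the final state exactly once."""
    checkpoint = (0, 0)          # (next offset to read, running sum)
    offset, state = checkpoint
    while offset < len(events):
        if offset == fail_at:    # simulated crash: recover from checkpoint
            fail_at = None
            offset, state = checkpoint
            continue
        state += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, state)   # atomically snapshot progress
    return state

# the crash at offset 3 does not change the result: still 1+2+3+4+5
process_with_checkpoints([1, 2, 3, 4, 5], checkpoint_every=2, fail_at=3)
```

The key property is that the offset and the state are checkpointed together; restoring one without the other would double-count or drop the events processed since the snapshot.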
APACHE NIFI
• Apache NiFi supports powerful and scalable directed graphs
of data routing, transformation, and system mediation logic.
• Web-based user interface:
▪ Seamless experience between design, control, feedback,
and monitoring.
• Highly configurable:
▪ Loss tolerant vs guaranteed delivery, Low latency vs high
throughput, Dynamic prioritization, Flow can be modified
at runtime, Back pressure.
• Designed for extension:
▪ Build your own processors and more, enables rapid
development and effective testing.
• Security:
▪ SSL, SSH, HTTPS, encrypted content, etc.
▪ Multi-tenant authorization and internal
authorization/policy management
APACHE IGNITE
• Apache Ignite In-Memory Data Fabric is a high-performance,
integrated and distributed in-memory platform for computing
and transacting on large-scale data sets in real-time, orders of
magnitude faster than possible with traditional disk-based or
flash-based technologies.
• You can view Ignite as a collection of independent, well-
integrated, in-memory components geared to improve
performance and scalability of your application. Some of these
components include:
• Advanced Clustering, Data Grid
• SQL Grid, Streaming & CEP
• Compute Grid, Service Grid
• Ignite File System, Distributed Data Structures
• Distributed Messaging, Distributed Events
• Hadoop Accelerator, Spark Shared RDDs
APACHE BEAM
• Apache Beam is an open source, unified programming model that
you can use to create a data processing pipeline.
• You start by building a program that defines the pipeline using one
of the open source Beam SDKs.
• The pipeline is then executed by one of Beam’s supported
distributed processing back-ends, which include Apache Apex,
Apache Flink, Apache Spark, and Google Cloud Dataflow.
• Apache Beam provides an advanced unified programming model,
allowing you to implement batch and streaming data processing
jobs that can run on any execution engine.
• Apache Beam is:
▪ UNIFIED - Use a single programming model for both batch
and streaming use cases.
▪ PORTABLE - Execute pipelines on multiple execution
environments, including Apache Apex, Apache Flink, Apache
Spark, and Google Cloud Dataflow.
▪ EXTENSIBLE - Write and share new SDKs, IO connectors, and
transformation libraries.
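The "unified" idea can be sketched in a few lines: one pipeline definition, run unchanged over a bounded collection (batch) or a lazy, potentially unbounded iterator (streaming). This is a conceptual toy, not the Beam SDK:

```python
def build_pipeline(transforms):
    """One pipeline definition, applied unchanged to a bounded list
    (batch) or to an unbounded lazy iterator (streaming)."""
    def run(source):
        for element in source:
            for fn in transforms:
                element = fn(element)   # apply transforms in order
            yield element               # lazy: works on infinite sources
    return run

pipeline = build_pipeline([lambda x: x * 2, lambda x: x + 1])

batch_result = list(pipeline([1, 2, 3]))      # bounded: [3, 5, 7]
stream_iter = pipeline(iter(range(10**9)))    # unbounded: evaluated lazily
first = next(stream_iter)                     # 0 * 2 + 1 == 1
```

In Beam the equivalent separation is between the pipeline (a graph of PTransforms) and the runner (Flink, Spark, Dataflow, ...) that executes it; here the "runner" is just the choice of source.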
APACHE BEAM
ONLY THE MOST USED
STREAMING
LIBRARIES
REACTIVE STREAMS
"Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure."
In the future, all data processing will be managed by streams.
Adopters:
 Akka Streams
 MongoDB
 Ratpack
 Reactive Rabbit – driver for RabbitMQ/AMQP
 Spring and Pivotal Project Reactor
 Netflix RxJava
 Slick 3.0
 Vert.x 3.0
BACK PRESSURE CONCEPTS
 The main players in managing flow are Publishers and
Subscribers (Consumers)
 Dropping
 Buffer overflow
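Demand-based flow control avoids both of those failure modes: the subscriber signals how many elements it can handle, and the publisher never sends more than was requested. A synchronous toy sketch (the real Reactive Streams interfaces are asynchronous and mediated by a Subscription; names here are illustrative):

```python
class Publisher:
    """Holds the data and emits at most the requested demand."""
    def __init__(self, data):
        self.data = list(data)

    def request(self, n):
        out, self.data = self.data[:n], self.data[n:]
        return out

class Subscriber:
    """Pulls elements in small batches: a fast publisher can never
    overflow the subscriber, and nothing has to be dropped."""
    def __init__(self, batch=2):
        self.batch = batch
        self.received = []

    def run(self, publisher):
        while True:
            elements = publisher.request(self.batch)  # signal demand
            if not elements:
                break                                  # stream exhausted
            self.received.extend(elements)

sub = Subscriber(batch=2)
sub.run(Publisher(range(5)))   # all five elements arrive, two at a time
```

In the actual specification the publisher pushes asynchronously but only up to the outstanding demand signalled via `Subscription.request(n)`; the pull loop above is the simplest way to show the same invariant.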
https://doc.akka.io/docs/alpakka/current/
ZIO STREAM
LIBRARY STREAM COMPARISON
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Último (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Data streaming

  • 1.
  • 2. Alberto Paro  Master's Degree in Computer Science Engineering at Politecnico di Milano  Big Data Practice Leader at NTT DATA Italia  Author of 4 books about ElasticSearch, covering versions 1 to 7.x, plus 6 tech reviews  Big Data trainer, developer, and consultant on Big Data technologies (Akka, Play Framework, Apache Spark, Reactive Programming) and NoSQL (Accumulo, HBase, Cassandra, ElasticSearch, Kafka, and MongoDB)  Evangelist for the Scala and Scala.js languages
  • 3. SUMMARY • Why? • Architectures • Message Brokers • Streaming Frameworks • Streaming Libraries Data Streaming: Architetture e principali soluzioni - 16 Giugno 2020 (A.Paro)
  • 4. T H E S T A R T O F T H E J O U R N E Y WHY STREAMING PROCESSING
  • 5. NEED FOR STREAMING • Real-time/unbounded data processing is a key winning factor (e.g. banking, finance, sports) • No longer tied to nightly batch processing • Real-time processing reduces time-to-market • Fast feedback on customers (e.g. campaign monitoring) • Real-time balancing of resources (demand-response) • Many of the systems we want to monitor and understand emit continuous streams of events such as heartbeats, machine metrics, and GPS signals • Distribute data processing over time (no more big batch jobs, if possible) • Reduce the processing power needed in big data environments • Application decoupling: separation of concerns on data (the Kafka way) • Manage backpressure in the application flow Business Technical Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 6. STANDARD STREAMING FLOW • Source • Message Broker • Streaming Engine • Destination Data Streaming : Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
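The four stages above can be sketched as a minimal in-memory pipeline. This is a hypothetical illustration only: the queue stands in for a real broker such as Kafka, RabbitMQ, or Pulsar, and the transform stands in for a streaming engine.

```python
import queue

def source():
    """Source: emits raw events (here, a hypothetical sensor)."""
    for i in range(5):
        yield {"sensor": "s1", "value": i}

def run_pipeline():
    broker = queue.Queue()      # Message broker: decouples source from engine
    for event in source():      # Source publishes into the broker
        broker.put(event)
    destination = []            # Destination: e.g. a database or search index
    while not broker.empty():   # Streaming engine: consume, transform, sink
        event = broker.get()
        event["doubled"] = event["value"] * 2
        destination.append(event)
    return destination

results = run_pipeline()
```

The point of the broker stage is that the source and the engine never talk directly; either side can be scaled or restarted independently.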
  • 7. CONFLUENT KAFKA LIKE Data Streaming : Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 8. T O P T H R E E MESSAGE BROKERS
  • 9. RABBITMQ • RabbitMQ is an open-source message broker (sometimes called message-oriented middleware) that originally implemented the Advanced Message Queuing Protocol (AMQP) and has since been extended with a plug-in architecture to support the Streaming Text Oriented Messaging Protocol (STOMP), MQ Telemetry Transport (MQTT), and other protocols. • The RabbitMQ server is written in Erlang and is built on the Open Telecom Platform framework for clustering and failover. • Client libraries to interface with the broker are available for all major programming languages. • Rabbit Technologies Ltd. originally developed RabbitMQ. Rabbit Technologies started as a joint venture between LShift and CohesiveFT in 2007, and was acquired in April 2010 by SpringSource, a division of VMware. The project became part of Pivotal Software in May 2013. Data Streaming : Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 10. APACHE KAFKA • Kafka Streams is a Kafka client library for real-time stream processing and analysis of data stored in Kafka brokers. • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams into output streams. • In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics. • It is possible to do simple processing directly using the producer and consumer APIs. • The Streams API allows building applications that do non-trivial processing, computing aggregations over streams or joining streams together. Data Streaming : Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
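What a Kafka Streams processor does can be sketched without a broker: consume records from an input topic, update a keyed state store, and emit each update to an output topic. Below, topics are modelled as plain Python lists purely for illustration; the real Streams API (Java/Scala) works with KStream/KTable against live brokers.

```python
from collections import defaultdict

def word_count_processor(input_topic):
    """Sketch of a word-count stream processor: topic in, topic out."""
    counts = defaultdict(int)   # local state store, keyed by word
    output_topic = []
    for record in input_topic:  # continual consumption, here a finite list
        for word in record.split():
            counts[word] += 1
            # every state update flows downstream as a (key, value) record
            output_topic.append((word, counts[word]))
    return output_topic

out = word_count_processor(["hello streams", "hello kafka"])
```

Note the changelog-style output: the same key can appear several times, each record carrying the latest count, which is exactly how a KTable update stream behaves.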
  • 11. APACHE KAFKA Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 12. APACHE PULSAR • Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo. • Like Kafka, Pulsar uses the concept of topics and subscriptions to create order from large amounts of streaming data in a scalable and low-latency manner. In addition to publish and subscribe, Pulsar can support point-to-point message queuing from a single API. Like Kafka, the project relies on ZooKeeper for coordination and metadata, and it uses Apache BookKeeper for durable, ordered message storage. • The creators of Pulsar say they developed it to address several shortcomings of existing open-source messaging systems. It has been running in production at Yahoo since 2014 and was open-sourced in 2016. Pulsar is backed by a commercial open-source outfit called Streamlio. • Pulsar's strengths include multi-tenancy, geo-replication, strong durability guarantees, high message throughput, and a single API for both queuing and publish-subscribe messaging. Scaling a Pulsar cluster is as easy as adding additional nodes. Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 13. O N L Y T H E M O S T U S E D STREAMING FRAMEWORKS
  • 14. FRAMEWORKS – 10 QUICK LIST  Apache Spark  Apache Flink  Apache Samza  Apache Storm  Apache Kafka  Apache Flume  Apache Nifi  Apache Ignite  Apache Apex  Apache Beam Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 15. SPARK STREAMING – OLD STYLE • Stream converted into micro-batches • No watermarks • No time-based event management • Can lose data in several ways Old School Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
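The micro-batch model above can be sketched in a few lines: the engine polls the source every batch interval and wraps whatever arrived into one batch (an RDD, in DStream terms). This hypothetical simulation shows why it groups records by arrival time rather than event time, so a late event lands in the wrong batch.

```python
def micro_batches(arrivals, batch_interval):
    """arrivals: list of (arrival_time, record); returns one batch per interval."""
    batches = {}
    for arrival_time, record in arrivals:
        # records are bucketed by when they ARRIVED, not when they happened
        batch_id = int(arrival_time // batch_interval)
        batches.setdefault(batch_id, []).append(record)
    return [batches[k] for k in sorted(batches)]

arrivals = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (2.1, "d")]
result = micro_batches(arrivals, batch_interval=1.0)
```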
  • 16. SPARK STRUCTURED STREAMING • Real streaming (windowing, triggers, watermarks) • Natively DataFrames (no RDDs) • Processes by event time, handling late data • End-to-end exactly-once guarantee Introduced in Spark 2.x Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
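The event-time-plus-watermark model that distinguishes Structured Streaming can be sketched as follows. This is a hypothetical simulation, not Spark code: events are assigned to windows by the time they happened, the watermark trails the maximum event time seen by an allowed lateness, and a window only finalises once the watermark passes its end, so moderately late data is still counted.

```python
def windowed_counts(events, window, lateness):
    """events: (event_time, key) pairs in arrival order."""
    open_windows = {}   # (window_start, key) -> running count
    closed = {}
    max_event_time = 0.0
    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - lateness
        start = int(event_time // window) * window
        if start + window <= watermark:
            continue                    # beyond the watermark: dropped
        open_windows[(start, key)] = open_windows.get((start, key), 0) + 1
        for (w_start, k) in list(open_windows):
            if w_start + window <= watermark:   # window finalised
                closed[(w_start, k)] = open_windows.pop((w_start, k))
    closed.update(open_windows)         # flush windows still open at the end
    return closed

# the 3.0 event arrives AFTER 11.0 but is still within the watermark
events = [(1.0, "a"), (2.0, "a"), (11.0, "a"), (3.0, "a")]
counts = windowed_counts(events, window=10.0, lateness=5.0)
```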
  • 17. APACHE FLINK • Flink relies on a streaming execution model, which is an intuitive fit for processing unbounded datasets. • Streaming execution is continuous processing on data that is continuously produced, and the alignment between the type of dataset and the type of execution model offers many advantages with regard to accuracy and performance. • It provides results that are accurate, even in the case of out-of-order or late-arriving data • It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state • It performs at large scale, running on thousands of nodes with very good throughput and latency characteristics • Flink guarantees exactly-once semantics for stateful computations. Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
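The checkpoint-and-replay idea behind Flink's exactly-once stateful processing can be sketched as below. This is a hypothetical simulation, not Flink's actual snapshot protocol: operator state is snapshotted together with the input position, so after a crash the job rewinds to the last checkpoint and recomputes, and the final state reflects each record exactly once.

```python
def run_with_checkpoints(records, crash_at=None):
    """Sum a stream of numbers; optionally crash once at position crash_at."""
    state = {"sum": 0}
    checkpoint = ({"sum": 0}, 0)            # (state snapshot, input offset)
    pos = 0
    while pos < len(records):
        if crash_at is not None and pos == crash_at:
            # failure: restore the last consistent (state, offset) pair
            state, pos = dict(checkpoint[0]), checkpoint[1]
            crash_at = None                 # crash only once
            continue
        state["sum"] += records[pos]
        pos += 1
        if pos % 2 == 0:                    # checkpoint every 2 records
            checkpoint = (dict(state), pos)
    return state["sum"]
```

Because state and offset are restored together, replayed records do not double-count: the run with a crash produces the same total as the run without one.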
  • 18. APACHE NIFI • Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. • Web-based user interface: ▪ Seamless experience between design, control, feedback, and monitoring. • Highly configurable: ▪ Loss tolerant vs guaranteed delivery, Low latency vs high throughput, Dynamic prioritization, Flow can be modified at runtime, Back pressure. • Designed for extension: ▪ Build your own processors and more, enables rapid development and effective testing. • Security: • SSL, SSH, HTTPS, encrypted content, etc... • Multi-tenant authorization and internal authorization/policy management
  • 19. APACHE IGNITE • Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash-based technologies. • You can view Ignite as a collection of independent, well-integrated, in-memory components geared to improve performance and scalability of your application. Some of these components include: • Advanced Clustering, Data Grid • SQL Grid, Streaming & CEP • Compute Grid, Service Grid • Ignite File System, Distributed Data Structures • Distributed Messaging, Distributed Events • Hadoop Accelerator, Spark Shared RDDs Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 20. APACHE BEAM • Apache Beam is an open source, unified programming model that you can use to create a data processing pipeline. • You start by building a program that defines the pipeline using one of the open source Beam SDKs. • The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. • Apache Beam provides an advanced unified programming model, allowing you to implement batch and streaming data processing jobs that can run on any execution engine. • Apache Beam is: ▪ UNIFIED - Use a single programming model for both batch and streaming use cases. ▪ PORTABLE - Execute pipelines on multiple execution environments, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. ▪ EXTENSIBLE - Write and share new SDKs, IO connectors, and transformation libraries. Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
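Beam's core idea, a pipeline defined once and executed by interchangeable runners, can be sketched in miniature. This is a hypothetical toy, not the Beam SDK: the "pipeline" is just a declarative list of transforms, and the "runner" is whatever executes that list (a real pipeline would be submitted to Flink, Spark, or Dataflow).

```python
def par_do(fn):
    """Declare an element-wise transform (loosely analogous to Beam's ParDo)."""
    return ("pardo", fn)

def pipeline(*transforms):
    """A pipeline is just an ordered description of transforms."""
    return list(transforms)

def direct_runner(p, data):
    """A trivial local runner: applies each declared transform in order."""
    for kind, fn in p:
        assert kind == "pardo"
        data = [fn(x) for x in data]
    return data

# the same pipeline object could be handed to any runner
p = pipeline(
    par_do(lambda x: x * 2),
    par_do(lambda x: x + 1),
)
out = direct_runner(p, [1, 2, 3])
```

The separation between declaring the pipeline and running it is exactly what makes Beam portable across execution engines.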
  • 22. O N L Y T H E M O S T U S E D STREAMING LIBRARIES
  • 23. REACTIVE STREAMS “Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.” In the future, all data processing will be managed by streams. Adoptions:  Akka Streams  MongoDB  Ratpack  Reactive Rabbit – driver for RabbitMQ/AMQP  Spring and Pivotal Project Reactor  Netflix RxJava  Slick 3.0  Vert.x 3.0 Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 24.
  • 25. BACK PRESSURE CONCEPTS  The main players in managing flow are Publishers and Subscribers (Consumers) Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
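The Publisher/Subscriber interplay is pull-based: the Subscriber signals demand via request(n) and the Publisher never emits more than was requested, so a slow consumer cannot be overwhelmed. The sketch below is a hypothetical single-threaded model of the Reactive Streams Subscription contract, not a conformant implementation.

```python
class Subscription:
    def __init__(self, items, subscriber):
        self._items = iter(items)
        self._subscriber = subscriber

    def request(self, n):
        # Publisher side: emit at most n elements, honouring demand
        for _ in range(n):
            try:
                self._subscriber.on_next(next(self._items))
            except StopIteration:
                self._subscriber.on_complete()
                return

class SlowSubscriber:
    def __init__(self):
        self.received = []
        self.done = False

    def on_subscribe(self, subscription):
        self._sub = subscription
        self._sub.request(2)       # initial demand: only two elements

    def on_next(self, item):
        self.received.append(item)

    def on_complete(self):
        self.done = True

    def pull_more(self, n):
        self._sub.request(n)       # ask for more only when ready

sub = SlowSubscriber()
sub.on_subscribe(Subscription(range(5), sub))
# at this point only two elements have been delivered
sub.pull_more(10)                  # now the rest flows, then on_complete
```

Contrast this with the dropping and buffer-overflow strategies on the next slide: with demand signalling, neither is needed because excess elements are simply never produced.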
  • 26. BACK PRESSURE CONCEPTS  Dropping  Buffer overflow
  • 27. BACK-PRESSURE CONCEPTS https://doc.akka.io/docs/alpakka/current/ Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 28. ZIO STREAM Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 30. LIBRARY STREAM COMPARISON Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)
  • 31. ZIO STREAM Data Streaming: Architetture e principali soluzioni - 16 giugno 2020 (A.Paro)

Editor's Notes

  1. Data Streaming is real-time / unbounded data processing. ▪  Real-time processing and analytics bears the promise of making organizations more ▪  Many of the systems we want to monitor and understand emit a continuous stream of events like heartbeats, ocean currents, machine metrics, GPS signals. ▪  Even analysis of sporadic events such as website traffic can benefit from a streaming data approach. ▪  There are many potential advantages of handling data as streams, but until recently this method was somewhat difficult to do well. ▪  Streaming data and real-time analytics formed a fairly specialized undertaking rather than a widespread approach.
  2. ▪ Apache Spark: ▪  Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. ▪  It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. ▪  The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application. ▪  Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming. ▪  Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
  3. Spark Streaming Spark Streaming is a separate library in Spark to process continuously flowing streaming data. It provides us with the DStream API, which is powered by Spark RDDs. DStreams provide us data divided into chunks as RDDs received from the source of streaming to be processed and, after processing, sends it to the destination. Cool, right?! Structured Streaming From the Spark 2.x release onwards, Structured Streaming came into the picture. Built on the Spark SQL library, Structured Streaming is another way to handle streaming with Spark. This model of streaming is based on Dataframe and Dataset APIs. Hence, with this library, we can easily apply any SQL query (using the DataFrame API) or Scala operations (using DataSet API) on streaming data. Okay, so that was the summarized theory for both ways of streaming in Spark. Now we need to compare the two. Distinctions 1. Real Streaming What does real streaming imply? Data which is unbounded and is being processed upon being received from the source. This definition is satisfiable (more or less). If we talk about Spark Streaming, this is not the case. Spark Streaming works on something we call a micro batch. The stream pipeline is registered with some operations and Spark polls the source after every batch duration (defined in the application) and then a batch is created of the received data, i.e. each incoming record belongs to a batch of DStream. Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinction from the Spark Streaming which makes it more inclined towards real streaming. In Structured Streaming, there is no batch concept. The received data in a trigger is appended to the continuously flowing data stream. Each row of the data stream is processed and the result is updated into the unbounded result table. 
How you want your result (updated, new result only, or all the results) depends on the mode of your operations (Complete, Update, Append). Winner of this round: Structured Streaming. 2. RDD vs. DataFrames/DataSet Another distinction can be the use case of different APIs in both streaming models. In summary, we read that Spark Streaming works on the DStream API, which is internally using RDDs, and Structured Streaming uses DataFrame and Dataset APIs to perform streaming operations. So, it is a straight comparison between using RDDs or DataFrames. There are several blogs available which compare DataFrames and RDDs in terms of `performance` and `ease of use.` This is a good read for RDD v/s Dataframes. All those comparisons lead to one result: that DataFrames are more optimized in terms of processing and provide more options for aggregations and other operations with a variety of functions available (many more functions are now supported natively in Spark 2.4). So Structured Streaming wins here with flying colors. 3. Processing With Event Time, Handling Late Data One big issue in the streaming world is how to process data according to event time. Event time is the time when the event actually happened. It is not necessary for the source of the streaming engine to provide data in real time. There may be latencies in data generation and handing over the data to the processing engine. There is no such option in Spark Streaming to work on the data using event time. It only works with the timestamp when the data is received by Spark. Based on the ingestion timestamp, Spark Streaming puts the data in a batch even if the event was generated early and belonged to an earlier batch, which may result in less accurate information as it is equal to data loss. On the other hand, Structured Streaming provides the functionality to process data on the basis of event time when the timestamp of the event is included in the data received. 
This is a major feature introduced in Structured Streaming which provides a different way of processing the data according to the time of data generation in the real world. With this, we can handle data coming in late and get more accurate results. With event-time handling of late data, Structured Streaming outweighs Spark Streaming. 4. End-to-End Guarantees Every application requires fault tolerance and end-to-end guarantees of data delivery. Whenever the application fails, it must be able to restart from the same point where it failed in order to avoid data loss and duplication. To provide fault tolerance, Spark Streaming and Structured Streaming both use checkpointing to save the progress of a job. But this approach still has many holes which may cause data loss. Other than checkpointing, Structured Streaming has applied two conditions to recover from any error: The source must be replayable. The sinks must support idempotent operations to support reprocessing in case of failures. Here's a link to the docs to learn more. With restricted sinks, Spark Structured Streaming always provides end-to-end, exactly-once semantics. Way to go Structured Streaming! 5. Restricted or Flexible Sink: The destination of a streaming operation. It can be external storage, a simple output to console, or any action. With Spark Streaming, there is no restriction on the type of sink. Here we have the method foreachRDD to perform some action on the stream. This method returns us the RDDs created by each batch one-by-one and we can perform any actions over them, like saving to storage or performing some computations. We can cache an RDD and perform multiple actions on it as well (even sending the data to multiple databases). But in Structured Streaming, until v2.3, we had a limited number of output sinks and, with one sink, only one operation could be performed and we could not save the output to multiple external storages. 
To use a custom sink, the user needed to implement ForeachWriter. But here comes Spark 2.4, and with it we get a new sink called foreachBatch. This sink gives us the resultant output table as a DataFrame and hence we can use this DataFrame to perform our custom operations. With this new sink, the `restricted` Structured Streaming is now more `flexible`, which gives it an edge over Spark Streaming and other flexible sinks.
  4. Flink relies on a streaming execution model, which is an intuitive fit for processing unbounded datasets. ▪  Streaming execution is continuous processing on data that is continuously produced, and the alignment between the type of dataset and the type of execution model offers many advantages with regard to accuracy and performance. ▪  It provides results that are accurate, even in the case of out-of-order or late-arriving data ▪  It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state ▪  It performs at large scale, running on thousands of nodes with very good throughput and latency characteristics ▪  Flink guarantees exactly-once semantics for stateful computations.
  7. Spark SQL (DB, Json, case class)