SlideShare uma empresa Scribd logo
1 de 37
Baixar para ler offline
© 2017 Hazelcast Inc.
Stream processing
and real-time data pipelines
1
Vladimir Schreiner | vladimir@hazelcast.com | CS HUG Prague | 29th June 2018
2
© 2017 Hazelcast Inc.
What is Hazelcast?
HAZELCAST IMDG is a Distributed
In-Memory Store
Cache, KV-Store, Messaging
Implements standard collections API
in distributed way
Clients in Java, .NET, NodeJS, Go,
Pythoon, C++
From 2008
HAZELCAST JET is a general
purpose distributed data
processing engine built on
Hazelcast and based on directed
acyclic graphs (DAGs) to model data
flow for low latency batch and stream
processing.
From 2015
© 2017 Hazelcast Inc.
Stream Processing
Why should I bother?
© 2017 Hazelcast Inc.
What is stream processing?
Data Processing: Massage the data when moving
from place to place.
On-Line systems – request/response, small volumes, low-latency
Batch Processing – data in / data out, big volumes, huge latency
Stream processing – data in / data out, big volumes, low-latency
© 2017 Hazelcast Inc.
From Batch to Stream
© 2017 Hazelcast Inc.
From Batch to Stream
© 2017 Hazelcast Inc.
When to Use Stream Processing
• Real-time analytics
• Monitoring, Fraud, Anomalies, Pattern detection,
Prediction
• Event-Driven Architectures
• Real-Time ETL
• Moving batch tasks to near real-time
• Continuous data
• Consistent Resource Consumption (1GB/sec -> 86TB/day)
© 2017 Hazelcast Inc.
Streaming is nothing brand new
Complex Event Processing – early 2000‘s
Materialised Views – 1998 (Oracle 8i)
UNIX Pipelines?
Scale is what‘s new!
© 2017 Hazelcast Inc.
What modern SPE can do for you?
• Offers high-level API to implement the processing pipeline
• map, filter, groupBy, aggregate, join …
• Offers connectors to read and write the data
• Kafka, HDFS, JDBC, JMS …
• You implement the data pipeline and submit it to the SPE
• Executes the pipeline in a parallel and distributed environment
• Moves the data through the pipeline (partitioning, shuffling, backpressure)
• Is fault-tolerant (survives failures) and elastic
• Monitoring, diagnostics etc.
© 2017 Hazelcast Inc.
England – Belgium
England amazing! #ENGBEL ENG win +1, BEL lost +1
Belgium hopeless, it‘s so bad #ENGBEL ENG win +2, BEL lost + 2
My guess is draw for England - Belgium #ENGBEL Both draw + 1
© 2017 Hazelcast Inc.
Streaming and Hadoop
Major Evolutionary Steps
© 2017 Hazelcast Inc.
1st Gen: MapReduce
• Google File System Paper, 2003
https://static.googleusercontent.com/media/research.google.com/en//archi
ve/gfs-sosp2003.pdf
• MapReduce paper, 2004
https://static.googleusercontent.com/media/research.google.com/en//archi
ve/mapreduce-osdi04.pdf
• Apache Hadoop founded by Doug Cutting and Mike Cafarella at Yahoo
2006
• Commercial Open Source Distros: Cloudera, Hortonworks, MapR
• Lots of additions to the ecosystem
© 2017 Hazelcast Inc.
• Apache Spark started as a research project at UC Berkeley in
the AMPLab, which focuses on big data analytics, in 2010.
• Goal was to design a programming model that supports a much wider
class of applications than MapReduce. Introduced DAG
2nd Gen: Spark
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
© 2017 Hazelcast Inc.
• Fault-Tolerance of Spark designed with for batch
• Spark Streaming Paper, 2012. Stream as a sequence of micro-batches
• Spark has now moved on to DataFrames, Tungsten and Spark Streaming
and its architecture continues to evolve so it is also 3rd Gen (Continuous).
2nd Gen: Spark Streaming
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
© 2017 Hazelcast Inc.
3rd Gen: Continuous streaming
• DAG based
• Streaming based. Not micro-batch
• Batch is a simply streaming with bounds
• Learns from previous systems
• Informed by academic papers in the last decade, such as the
Google[1][2] but many more
• Plethora of Choice: Apache Storm, Twitter Heron, Apache Flink,
Kafka Streams, Google DataFlow, Hazelcast Jet, Spark Continuous
Streaming
[1] MillWheel: Fault-Tolerant Stream Processing at Internet Scale
[2] FlumeJava: Easy, Efficient Data-Parallel Pipelines
© 2017 Hazelcast Inc.
Big Data Trends
© 2017 Hazelcast Inc.
Hadoop – Declining Interest
© 2017 Hazelcast Inc.
Hadoop – Obsolete?
https://www.gartner.com/newsroom/id/3809163
© 2017 Hazelcast Inc.
Spark Dominant but Stable
© 2017 Hazelcast Inc.
The Rest
© 2017 Hazelcast Inc.
Challenges of Stream Processing
© 2017 Hazelcast Inc.
1 - Infinite input
• Some operations need bounded data (aggregate, join)
• Slowest response within last 10 secs x Slowest response
since the app was started
• Windows - a bounded view on top of an infinite stream.
Scope of a computation defined in code. Not in the data
(batch) or by operational settings (micro batch for
streaming)!
© 2017 Hazelcast Inc.
Tumbling windows
• Continuous stream divided into discrete parts.
• Defined by a time duration or record count.
© 2017 Hazelcast Inc.
Sliding windows
• Fixed-size view that slides with the time advancing.
• Fixed-size - they can overlap.
• Defined by a size (time duration, record count) and shift
© 2017 Hazelcast Inc.
Session windows
• Burst of activity followed by period of inactivity (timeout).
• Collects the activity belonging to the same session.
• It’s dynamic - no fixed start or duration.
© 2017 Hazelcast Inc.
2 – Late Events
• Wall clock time x Event time
• Event time – more natural, unordered data
• Long or short waiting? Latency, memory X correctnes.
• Heuristic helping you assessing window completenes
© 2017 Hazelcast Inc.
3 - Fault Tolerance
• Streaming jobs can be running for a long time! What if
something crashes?
• System back-ups!
• It’s not that simple: you still need to coordinate the states of
different steps in the computation: Distributed snapshots
• Chandy-Lamport distributed snapshotting algorithm[1]
[1] - https://dl.acm.org/citation.cfm?id=214456
© 2017 Hazelcast Inc.
Distributed snapshots
Reader Writer
Reader
Reader
Reader
Writer
Snapshot Done!
© 2017 Hazelcast Inc.
4 - Complexity
There is a lot of moving parts:
• Replayable data source (2x Kafka + 3x Zookeeper)
• Stream Processing Engine (2x)
• Cluster manager (YARN, Mesos, Kubernetes)
• Highly-Available Snapshot Storage (HDFS) (3x)
• Data sink for results
You Are Not Google, are you?
© 2017 Hazelcast Inc.
Jet Favours Speed
*
* Spark had all performance options turn on including Tungsten https://hazelcast.com/resources/jet-0-4-streaming-benchmark/
© 2017 Hazelcast Inc.
Jet Favours Simplicity
© 2017 Hazelcast Inc.
Jet Application Deployment Options
IMDG IMDG
• No separate process to manage
• Great for microservices
• Great for OEM
• Simplest for Ops – nothing extra
Embedded
Application
Java API
Application
Java API
Application
Java API
Java Client
Application
Client-Server
Jet
Jet Jet
Jet
Jet Jet
Java Client
Application
Java Client
Application
Java Client
Application
• Separate Jet Cluster
• Scale Jet independent of applications
• Isolate Jet from application server lifecycle
• Managed by Ops
© 2017 Hazelcast Inc.
Our game prediction!
© 2017 Hazelcast Inc.
Flight Telemetry
https://github.com/hazelcast/hazelcast-jet-demos
© 2017 Hazelcast Inc.
Flight Telemetry
https://github.com/hazelcast/hazelcast-jet-demos
© 2017 Hazelcast Inc.
Demo Applications
Real-time Image Recognition Twitter Cryptocurrency Sentiment Analysis
Recognizes images present in the webcam video
input with a model trained with
CIFAR-10 dataset.
Twitter content is analyzed in real time to calculate
cryptocurrency trend list with popularity index.
Real-Time Road Traffic Analysis
And Prediction
Real-time Sports Betting Engine
Continuously computes linear regression models
from current traffic. Uses the trend from week ago
to predict traffic now
This is a simple example of a sports book and is a
good introduction to the Pipeline API.
It also uses Hazelcast IMDG as an in-memory data
store.
Flight Telemetry Market Data Ingest
Reads a stream of telemetry data from ADB-S on all
commercial aircraft flying anywhere in the world.
There is typically 5,000 - 6,000 aircraft at any point
in time. This is then filtered, aggregated and certain
features are enriched and displayed in Grafana.
Uploads a stream of stock market data (prices) from
a Kafka topic into an IMDG map. Data is analyzed as
part of the upload process, calculating the moving
averages to detect buy/sell indicators. Input data
here is manufactured to ensure such indicators
exist, but this is easy to reconnect to real input.
Markov Chain Generator Real-Time Trade Processing
Generates a Markov Chain with probabilities based
on supplied classical books.
Processes immutable events from an event bus
(Kafka) to update storage optimized for querying
and reading (IMDG).
© 2017 Hazelcast Inc.
Hazelcast Jet. Try it!
36
• High Performance | Industry Leading Performance
• Works great with Hazelcast IMDG | Source, Sink, Enrichment
• Very simple to program | Leverages existing standards
• Very simple to deploy | embed 10MB jar or Client Server
• Works in every Cloud | Same as Hazelcast IMDG
• For Developers by Developers | Code it
© 2017 Hazelcast Inc.
Questions?
Version 0.6.1 is the current release with 0.7 coming in September
aiming for 1.0 this year
http://jet.hazelcast.org
https://github.com/hazelcast/hazelcast-jet-demos/

Mais conteúdo relacionado

Mais procurados

Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityDataWorks Summit
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovBig Data Spain
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analyticsDataWorks Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...DataWorks Summit
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightDataWorks Summit
 
AWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAmazon Web Services
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkDataWorks Summit
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataDatabricks
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 

Mais procurados (20)

Data Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data SecurityData Gloveboxes: A Philosophy of Data Science Data Security
Data Gloveboxes: A Philosophy of Data Science Data Security
 
Tame that Beast
Tame that BeastTame that Beast
Tame that Beast
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Large-scaled telematics analytics
Large-scaled telematics analyticsLarge-scaled telematics analytics
Large-scaled telematics analytics
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an...
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsight
 
AWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS Experience
 
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Implementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache SparkImplementing the Lambda Architecture efficiently with Apache Spark
Implementing the Lambda Architecture efficiently with Apache Spark
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DB
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 

Semelhante a Stream Processing and Real-Time Data Pipelines

SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDATAVERSITY
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSpark Summit
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Rajit Saha
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analyticskgshukla
 
True Reusable Code - DevSum2016
True Reusable Code - DevSum2016True Reusable Code - DevSum2016
True Reusable Code - DevSum2016Eduard Lazar
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...DataStax
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics systemModusOptimum
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsPetr Novotný
 

Semelhante a Stream Processing and Real-Time Data Pipelines (20)

SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
True Reusable Code - DevSum2016
True Reusable Code - DevSum2016True Reusable Code - DevSum2016
True Reusable Code - DevSum2016
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics system
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 

Último

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 

Último (20)

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 

Stream Processing and Real-Time Data Pipelines

  • 1. © 2017 Hazelcast Inc. Stream processing and real-time data pipelines 1 Vladimir Schreiner | vladimir@hazelcast.com | CS HUG Prague | 29th June 2018 2
  • 2. © 2017 Hazelcast Inc. What is Hazelcast? HAZELCAST IMDG is a Distributed In-Memory Store Cache, KV-Store, Messaging Implements standard collections API in distributed way Clients in Java, .NET, NodeJS, Go, Pythoon, C++ From 2008 HAZELCAST JET is a general purpose distributed data processing engine built on Hazelcast and based on directed acyclic graphs (DAGs) to model data flow for low latency batch and stream processing. From 2015
  • 3. © 2017 Hazelcast Inc. Stream Processing Why should I bother?
  • 4. © 2017 Hazelcast Inc. What is stream processing? Data Processing: Massage the data when moving from place to place. On-Line systems – request/response, small volumes, low-latency Batch Processing – data in / data out, big volumes, huge latency Stream processing – data in / data out, big volumes, low-latency
  • 5. © 2017 Hazelcast Inc. From Batch to Stream
  • 6. © 2017 Hazelcast Inc. From Batch to Stream
  • 7. © 2017 Hazelcast Inc. When to Use Stream Processing • Real-time analytics • Monitoring, Fraud, Anomalies, Pattern detection, Prediction • Event-Driven Architectures • Real-Time ETL • Moving batch tasks to near real-time • Continuous data • Consistent Resource Consumption (1GB/sec -> 86TB/day)
  • 8. © 2017 Hazelcast Inc. Streaming is nothing brand new Complex Event Processing – early 2000‘s Materialised Views – 1998 (Oracle 8i) UNIX Pipelines? Scale is what‘s new!
  • 9. © 2017 Hazelcast Inc. What modern SPE can do for you? • Offers high-level API to implement the processing pipeline • map, filter, groupBy, aggregate, join … • Offers connectors to read and write the data • Kafka, HDFS, JDBC, JMS … • You implement the data pipeline and submit it to the SPE • Executes the pipeline in a parallel and distributed environment • Moves the data through the pipeline (partitioning, shuffling, backpressure) • Is fault-tolerant (survives failures) and elastic • Monitoring, diagnostics etc.
  • 10. © 2017 Hazelcast Inc. England – Belgium England amazing! #ENGBEL ENG win +1, BEL lost +1 Belgium hopeless, it‘s so bad #ENGBEL ENG win +2, BEL lost + 2 My guess is draw for England - Belgium #ENGBEL Both draw + 1
  • 11. © 2017 Hazelcast Inc. Streaming and Hadoop Major Evolutionary Steps
  • 12. © 2017 Hazelcast Inc. 1st Gen: MapReduce • Google File System Paper, 2003 https://static.googleusercontent.com/media/research.google.com/en//archi ve/gfs-sosp2003.pdf • MapReduce paper, 2004 https://static.googleusercontent.com/media/research.google.com/en//archi ve/mapreduce-osdi04.pdf • Apache Hadoop founded by Doug Cutting and Mike Cafarella at Yahoo 2006 • Commercial Open Source Distros: Cloudera, Hortonworks, MapR • Lots of additions to the ecosystem
  • 13. © 2017 Hazelcast Inc. • Apache Spark started as a research project at UC Berkeley in the AMPLab, which focuses on big data analytics, in 2010. • Goal was to design a programming model that supports a much wider class of applications than MapReduce. Introduced DAG 2nd Gen: Spark http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  • 14. © 2017 Hazelcast Inc. • Fault-Tolerance of Spark designed with for batch • Spark Streaming Paper, 2012. Stream as a sequence of micro-batches • Spark has now moved on to DataFrames, Tungsten and Spark Streaming and its architecture continues to evolve so it is also 3rd Gen (Continuous). 2nd Gen: Spark Streaming http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  • 15. © 2017 Hazelcast Inc. 3rd Gen: Continuous streaming • DAG based • Streaming based. Not micro-batch • Batch is a simply streaming with bounds • Learns from previous systems • Informed by academic papers in the last decade, such as the Google[1][2] but many more • Plethora of Choice: Apache Storm, Twitter Heron, Apache Flink, Kafka Streams, Google DataFlow, Hazelcast Jet, Spark Continuous Streaming [1] MillWheel: Fault-Tolerant Stream Processing at Internet Scale [2] FlumeJava: Easy, Efficient Data-Parallel Pipelines
  • 16. © 2017 Hazelcast Inc. Big Data Trends
  • 17. © 2017 Hazelcast Inc. Hadoop – Declining Interest
  • 18. © 2017 Hazelcast Inc. Hadoop – Obsolete? https://www.gartner.com/newsroom/id/3809163
  • 19. © 2017 Hazelcast Inc. Spark Dominant but Stable
  • 20. © 2017 Hazelcast Inc. The Rest
  • 21. © 2017 Hazelcast Inc. Challenges of Stream Processing
  • 22. © 2017 Hazelcast Inc. 1 - Infinite input • Some operations need bounded data (aggregate, join) • Slowest response within last 10 secs x Slowest response since the app was started • Windows - a bounded view on top of an infinite stream. Scope of a computation defined in code. Not in the data (batch) or by operational settings (micro batch for streaming)!
  • 23. © 2017 Hazelcast Inc. Tumbling windows • Continuous stream divided into discrete parts. • Defined by a time duration or record count.
  • 24. © 2017 Hazelcast Inc. Sliding windows • Fixed-size view that slides with the time advancing. • Fixed-size - they can overlap. • Defined by a size (time duration, record count) and shift
  • 25. © 2017 Hazelcast Inc. Session windows • Burst of activity followed by period of inactivity (timeout). • Collects the activity belonging to the same session. • It’s dynamic - no fixed start or duration.
  • 26. © 2017 Hazelcast Inc. 2 – Late Events • Wall clock time x Event time • Event time – more natural, unordered data • Long or short waiting? Latency, memory X correctnes. • Heuristic helping you assessing window completenes
  • 27. © 2017 Hazelcast Inc. 3 - Fault Tolerance • Streaming jobs can be running for a long time! What if something crashes? • System back-ups! • It’s not that simple: you still need to coordinate the states of different steps in the computation: Distributed snapshots • Chandy-Lamport distributed snapshotting algorithm[1] [1] - https://dl.acm.org/citation.cfm?id=214456
  • 28. © 2017 Hazelcast Inc. Distributed snapshots Reader Writer Reader Reader Reader Writer Snapshot Done!
  • 29. © 2017 Hazelcast Inc. 4 - Complexity There is a lot of moving parts: • Replayable data source (2x Kafka + 3x Zookeeper) • Stream Processing Engine (2x) • Cluster manager (YARN, Mesos, Kubernetes) • Highly-Available Snapshot Storage (HDFS) (3x) • Data sink for results You Are Not Google, are you?
  • 30. © 2017 Hazelcast Inc. Jet Favours Speed * * Spark had all performance options turn on including Tungsten https://hazelcast.com/resources/jet-0-4-streaming-benchmark/
  • 31. © 2017 Hazelcast Inc. Jet Favours Simplicity
  • 32. © 2017 Hazelcast Inc. Jet Application Deployment Options IMDG IMDG • No separate process to manage • Great for microservices • Great for OEM • Simplest for Ops – nothing extra Embedded Application Java API Application Java API Application Java API Java Client Application Client-Server Jet Jet Jet Jet Jet Jet Java Client Application Java Client Application Java Client Application • Separate Jet Cluster • Scale Jet independent of applications • Isolate Jet from application server lifecycle • Managed by Ops
  • 33. © 2017 Hazelcast Inc. Our game prediction!
  • 34. © 2017 Hazelcast Inc. Flight Telemetry https://github.com/hazelcast/hazelcast-jet-demos © 2017 Hazelcast Inc. Flight Telemetry https://github.com/hazelcast/hazelcast-jet-demos
  • 35. © 2017 Hazelcast Inc. Demo Applications Real-time Image Recognition Twitter Cryptocurrency Sentiment Analysis Recognizes images present in the webcam video input with a model trained with CIFAR-10 dataset. Twitter content is analyzed in real time to calculate cryptocurrency trend list with popularity index. Real-Time Road Traffic Analysis And Prediction Real-time Sports Betting Engine Continuously computes linear regression models from current traffic. Uses the trend from week ago to predict traffic now This is a simple example of a sports book and is a good introduction to the Pipeline API. It also uses Hazelcast IMDG as an in-memory data store. Flight Telemetry Market Data Ingest Reads a stream of telemetry data from ADB-S on all commercial aircraft flying anywhere in the world. There is typically 5,000 - 6,000 aircraft at any point in time. This is then filtered, aggregated and certain features are enriched and displayed in Grafana. Uploads a stream of stock market data (prices) from a Kafka topic into an IMDG map. Data is analyzed as part of the upload process, calculating the moving averages to detect buy/sell indicators. Input data here is manufactured to ensure such indicators exist, but this is easy to reconnect to real input. Markov Chain Generator Real-Time Trade Processing Generates a Markov Chain with probabilities based on supplied classical books. Processes immutable events from an event bus (Kafka) to update storage optimized for querying and reading (IMDG).
  • 36. © 2017 Hazelcast Inc. Hazelcast Jet. Try it! 36 • High Performance | Industry Leading Performance • Works great with Hazelcast IMDG | Source, Sink, Enrichment • Very simple to program | Leverages existing standards • Very simple to deploy | embed 10MB jar or Client Server • Works in every Cloud | Same as Hazelcast IMDG • For Developers by Developers | Code it
  • 37. © 2017 Hazelcast Inc. Questions? Version 0.6.1 is the current release with 0.7 coming in September aiming for 1.0 this year http://jet.hazelcast.org https://github.com/hazelcast/hazelcast-jet-demos/