Designing a streaming application that processes data from one or two streams is easy. Any streaming framework that provides scalability, high throughput, and fault tolerance would work. But when the number of streams grows into the hundreds or thousands, managing them can become daunting. How would you share resources among thousands of streams, all running 24×7? How would you manage their state, apply advanced streaming operations, and add or delete streams without restarting? This talk explains common scenarios and shows techniques that can handle thousands of streams using Spark Structured Streaming.
2. Knoldus Inc.
Blue Pill / Red Pill :
The Matrix of thousands
of data streams
#UnifiedDataAnalytics #SparkAISummit
3. About Me
● My name is Himanshu Gupta
● Lead Consultant at Knoldus Inc.
● Twitter: @himanshug735
● LinkedIn: https://www.linkedin.com/in/himanshu-gupta-25189629/
7. Benefits of Real-Time Data
● In 2014, real-time data analysis reduced the crude mortality rate from 7.75% to 6.42% at Queen Alexandra Hospital in Portsmouth and University Hospital Coventry.
● The world's largest hedge fund, Bridgewater, uses Twitter for real-time economic modeling.
11. Stream Data
● Splitting data into thousands of separate streams is a resource-intensive process.
● Since each stream requires dedicated resources, the number of streams a system can support is limited by the resources available.
● However, if combined, streams can be managed much more efficiently.
● Also, starting and stopping a stream becomes easy, since the data is managed as a group.
12. Group Data
For example, consider a power plant that has hundreds of devices emitting data in real time. The data contains information about different parameters of each device, such as temperature, speed, etc. Since the data comes from one source (the power plant), it is a good candidate for grouping into a single stream.
13. Output
Since Kafka is being used, combining data from the different streams produces a single keyed topic, where each key represents one device of the power plant from the previous example.
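Conceptually, the keyed output can be sketched in plain Python (no Kafka client needed; the device names and record shapes below are illustrative assumptions, not from the talk):

```python
# Sketch: combine records from many per-device "streams" into one
# logical stream keyed by device id, the way they would land in a
# single Kafka topic. Device names and fields are illustrative.
from collections import defaultdict

def combine_streams(per_device_readings):
    """Flatten many per-device streams into one keyed stream of
    (key, value) records, like Kafka messages keyed by device id."""
    combined = []
    for device_id, readings in per_device_readings.items():
        for reading in readings:
            combined.append((device_id, reading))
    return combined

def group_by_key(records):
    """Recover the per-device view from the combined keyed stream."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return dict(grouped)

streams = {
    "turbine-1": [{"temperature": 410, "speed": 3000}],
    "turbine-2": [{"temperature": 395, "speed": 2950}],
}
records = combine_streams(streams)  # one stream, keyed by device id
assert group_by_key(records) == streams
```

Because every record carries its device id as the key, one physical stream can multiplex thousands of logical ones, and the original per-device view is recoverable at any time by grouping on the key.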
15. Use Spark
Since the introduction of Structured Streaming in Apache Spark 2.0, the way streams are processed has changed a lot, as it has brought many features that were earlier unheard of.
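As a minimal sketch of the idea (assuming a running Spark with the `spark-sql-kafka` connector available; the bootstrap server, topic name, and schema below are placeholders, not from the talk), a single Structured Streaming query can consume the one combined, keyed topic and aggregate per device:

```python
# Sketch only: requires pyspark plus the spark-sql-kafka connector on
# the classpath; "devices" stands in for the combined topic above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.appName("thousand-streams").getOrCreate()

schema = StructType([
    StructField("temperature", DoubleType()),
    StructField("speed", DoubleType()),
])

# One streaming query serves every device: the Kafka key identifies
# the logical stream, so thousands of devices share the same resources.
readings = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "devices")                       # placeholder
    .load()
    .select(
        col("key").cast("string").alias("device_id"),
        from_json(col("value").cast("string"), schema).alias("reading"),
    )
)

per_device_avg = readings.groupBy("device_id").agg(
    avg("reading.temperature").alias("avg_temperature")
)

query = (
    per_device_avg.writeStream.outputMode("complete")
    .format("console")
    .start()
)
```

The point of the sketch is that adding a device does not add a query: new keys simply appear in the aggregation output.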
17. Store Data
● Storing data might look like an easy task, but it is not.
● After the analysis of multiple data sources is done, it is difficult to materialize the result and save it to different locations.
● Also, it must be saved in such a way that retrieving the data later remains easy.
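One common way to save one result to several locations is to fan each processed micro-batch out to all sinks from a single callback, similar in spirit to Spark's `foreachBatch` hook. The sketch below keeps that idea dependency-free by using in-memory lists as stand-ins for real sinks; the sink names and record shapes are illustrative assumptions:

```python
# Dependency-free sketch: persist one processed micro-batch to several
# sinks from a single callback, in the spirit of Spark's foreachBatch.

class InMemorySink:
    """Stand-in for a real sink (Kafka topic, database table, etc.)."""
    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, batch):
        self.rows.extend(batch)

def write_micro_batch(batch, batch_id, sinks):
    """Called once per micro-batch; writes the same batch to every
    sink, tagging each row with the batch id so that reprocessing a
    batch can be detected and kept idempotent."""
    tagged = [dict(row, batch_id=batch_id) for row in batch]
    for sink in sinks:
        sink.write(tagged)

hot_store = InMemorySink("hot-store")    # fast lookups
cold_store = InMemorySink("cold-store")  # cheap long-term storage
batch = [{"device": "turbine-1", "avg_temperature": 402.5}]
write_micro_batch(batch, batch_id=0, sinks=[hot_store, cold_store])
assert hot_store.rows == cold_store.rows
```

Keeping the fan-out in one place means every sink sees the same batch boundary, which is what makes later retrieval (and reconciliation between stores) easy.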