5. Use case: IoT Device Monitoring
IoT devices feed an event stream into several downstream tasks:
- ETL into long-term storage: prevent data loss, prevent duplicates
- Status monitoring: handle late data, process based on event time
- Interactively debug issues: consistency
- Anomaly detection: learn models offline, use online + continuous learning
6. Pain points with DStreams
1. Processing with event time, dealing with late data
- DStream API exposes batch time, hard to incorporate event time
2. Interoperating streaming with batch AND interactive
- RDD/DStream have similar APIs, but still require translation
- Hard to interoperate with DataFrames and Datasets
3. Reasoning about end-to-end guarantees
- Requires carefully constructing sinks that handle failures correctly
- Data consistency in the storage while being updated
8. The simplest way to perform streaming analytics
is not having to reason about streaming at all
9. Model
Trigger: every 1 sec
[Diagram: a timeline with triggers at 1, 2, 3; at each trigger the query runs over the input table grown to "data up to 1", "data up to 2", "data up to 3"]
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
- usual map/filter/reduce
- new window, session ops
10. Model
Trigger: every 1 sec
[Diagram: at each trigger (1, 2, 3), the query turns the input table ("data up to t") into a result table, and the "output for data up to t" is written in complete mode]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
11. Model
Trigger: every 1 sec
[Diagram: the same query, but after each trigger only the delta output is written to the sink]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows that changed in the result since the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
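A toy worked example (plain Scala, hypothetical word counts) of what each mode would emit after a trigger, given the previous and current result tables:

// Result table after trigger 1 and after trigger 2 (hypothetical counts)
val prev = Map("Cat" -> 2, "Dog" -> 1)
val curr = Map("Cat" -> 3, "Dog" -> 1, "Cow" -> 1)

val complete = curr                                                   // the full result table
val delta    = curr.filter { case (k, v) => prev.get(k) != Some(v) }  // rows that changed
val append   = curr.filter { case (k, _) => !prev.contains(k) }       // brand-new rows only
// complete == Map(Cat -> 3, Dog -> 1, Cow -> 1)
// delta    == Map(Cat -> 3, Cow -> 1)
// append   == Map(Cow -> 1)

Note how, for a running count, an updated row is not a new row: append mode would silently drop the change to "Cat", which is why it is not feasible for this kind of query (the caveat above).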
13. Batch ETL with DataFrame
// Read from JSON file
input = ctxt.read
  .format("json")
  .load("source-path")

// Select some devices
result = input
  .select("device", "signal")
  .where("signal > 15")

// Write to Parquet file
result.write
  .format("parquet")
  .save("dest-path")
14. Streaming ETL with DataFrame
// Read from JSON file stream
input = ctxt.read
  .format("json")
  .stream("source-path")

// Select some devices
result = input
  .select("device", "signal")
  .where("signal > 15")

// Write to Parquet file stream
result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")
15. Streaming ETL with DataFrame
input = ctxt.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

read…stream() creates a streaming DataFrame; it does not start any of the computation.
write…startStream() defines where & how to output the data, and starts the processing.
16. Streaming ETL with DataFrame
input = ctxt.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

[Diagram: the input is an append-only table; with append output mode, each trigger (1, 2, 3) writes only the new rows in the result of that trigger to the output]
17. Continuous Aggregations

// Continuously compute average signal of each type of device
input.groupBy("device-type")
  .avg("signal")

// Continuously compute average signal of each type of device
// over the last 10 minutes of event time
input.groupBy(
    window("event-time", "10 minutes"),
    "device-type")
  .avg("signal")

- Windowing is just a type of aggregation
- Simple API for event-time based windowing
18. Query Management
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Each query has a unique name for keeping track
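To keep several concurrent queries apart, each can be given a name when it is started. A minimal sketch in the style of the snippet above; queryName() is an assumption here, not shown on the slide:

query = result.write
  .format("parquet")
  .queryName("device-etl")  // assumed API: unique name used to track this query
  .outputMode("append")
  .startStream("dest-path")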
19. Logically:
DataFrame operations on a table
(i.e. as easy to understand as batch)
Physically:
Spark automatically runs the query in a streaming fashion
(i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
20. Structured Streaming
High-level streaming API built on the Spark SQL engine
Runs the same computation as batch queries in Datasets/DataFrames
Event time, windowing, sessions, sources & sinks
End-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
- Aggregate data in a stream, then serve using JDBC
- Add, remove, change queries at runtime
- Build and apply ML models
21. Advantages over DStreams
1. Processing with event time, dealing with late data
2. Exactly the same API for batch, streaming, and interactive
3. End-to-end exactly-once guarantees from the system
4. Performance through SQL optimizations
- Logical plan optimizations, Tungsten, Codegen, etc.
- Faster state management for stateful stream processing
23. Batch Execution on Spark SQL
[Diagram: DataFrame/Dataset → Logical Plan (abstract representation of the query) → Planner → Execution Plan → RDD job; the Planner performs logical plan optimization and execution plan generation, with Tungsten code generation]
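To see this pipeline on a concrete query, explain() prints the plans the planner produces. A minimal sketch (assumes a SparkSession named spark; paths are placeholders):

val df = spark.read.json("source-path")
  .select("device", "signal")
  .where("signal > 15")
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.explain(true)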
24. Continuous Incremental Execution
Planner extended to be aware of the streaming logical plans
Planner generates a continuous series of incremental execution plans, each processing the next chunk of streaming data
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plans 1, 2, 3, 4, …]
25. Streaming Source
A streaming data source where
- Records can be uniquely identified by an offset
- An arbitrary segment of the stream data can be read based on an offset range
- Files, Kafka, Kinesis supported
More restricted than DStream Receivers
Can ensure end-to-end exactly-once guarantees
[Diagram: Logical Plan = Sources → Operations → Sink]
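A minimal sketch of that contract, as a hypothetical Scala trait with records modeled as plain strings (Spark's actual internal Source API differs in detail):

trait StreamingSource {
  // Highest offset currently available in the stream
  def latestOffset(): Long

  // Read the segment of records with offsets in (start, end]; must be
  // repeatable, so a restarted execution can re-read exactly the same data
  def getBatch(start: Long, end: Long): Seq[String]
}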
26. Streaming Sink
Allows output to be pushed to external storage
Each batch of output should be written to storage atomically and idempotently
Both are needed to ensure end-to-end exactly-once guarantees AND data consistency
[Diagram: Logical Plan = Sources → Operations → Sink]
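The matching sketch for the sink side (again a hypothetical trait, not Spark's exact internal API):

trait StreamingSink {
  // Write one batch of output; the write must be atomic (all or nothing)
  // and idempotent (re-delivering the same batchId must not duplicate data)
  def addBatch(batchId: Long, data: Seq[String]): Unit
}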
27. Incremental Execution
Every trigger interval:
- Planner asks source for the next chunk of input data
- Generates an execution plan with that input
- Generates the output data
- Hands it to the sink for pushing out
[Diagram: the Planner turns the Logical Plan (Sources → Ops → Sink) into an Incremental Execution Plan: get new data as input, generate output, push output]
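Putting the loop body together, using the hypothetical StreamingSource and StreamingSink traits sketched above (all names are illustrative):

// One trigger interval: returns the new offset to resume from next time
def runTrigger(source: StreamingSource, sink: StreamingSink,
               lastOffset: Long, batchId: Long): Long = {
  val newOffset = source.latestOffset()                  // ask source for the next chunk
  val input     = source.getBatch(lastOffset, newOffset)
  val output    = input.filter(_.nonEmpty)               // stand-in for the planned query
  sink.addBatch(batchId, output)                         // hand the output to the sink
  newOffset
}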
28. Offset Tracking with WAL
Planner saves the next offset range in a write-ahead log (WAL) on HDFS/S3 before incremental execution starts
[Diagram: the Offset WAL records processed offsets and in-progress offsets; the offset range is saved to the WAL before processing starts]
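Extending the trigger sketch above with a hypothetical WAL interface; the key point is that the offset range is durably logged before any processing begins:

trait OffsetWAL {
  def append(batchId: Long, start: Long, end: Long): Unit  // durable write to HDFS/S3
  def lastEntry(): Option[(Long, Long, Long)]              // (batchId, start, end)
}

def runTriggerWithWAL(wal: OffsetWAL, source: StreamingSource, sink: StreamingSink,
                      lastOffset: Long, batchId: Long): Long = {
  val newOffset = source.latestOffset()
  wal.append(batchId, lastOffset, newOffset)  // log the offset range BEFORE processing
  sink.addBatch(batchId, source.getBatch(lastOffset, newOffset))
  newOffset
}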
29. Recovery from WAL
After a failure, the restarted planner recovers the last offset range from the WAL and restarts the failed execution
Output is exactly the same as it would have been without the failure
[Diagram: the restarted Planner recovers the in-progress offset range from the Offset WAL and restarts the computation]
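Recovery is then a replay of the in-progress WAL entry; because getBatch is repeatable and addBatch is idempotent in the sketches above, the replay produces exactly the output the failed run would have:

def recover(wal: OffsetWAL, source: StreamingSource, sink: StreamingSink): Unit =
  wal.lastEntry().foreach { case (batchId, start, end) =>
    sink.addBatch(batchId, source.getBatch(start, end))  // safe to re-deliver
  }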
31. Stateful Stream Processing
Streaming aggregations require maintaining intermediate "state" data across batches
[Diagram: compute aggregates 1 on "Cat, Dog, Cat" → state {Cat: 2, Dog: 1}; compute aggregates 2 on "Cat, Cow" → state {Cat: 3, Dog: 1, Cow: 1}; compute aggregates 3 on "Dog" continues from that state]
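The diagram boils down to a running fold over batches; a plain-Scala sketch of the same counts (no Spark API involved):

// The three batches from the diagram
val batches = Seq(Seq("Cat", "Dog", "Cat"), Seq("Cat", "Cow"), Seq("Dog"))

// Each trigger folds its batch into the counts carried over from the previous one
val finalCounts = batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
  batch.foldLeft(state)((s, w) => s + (w -> (s.getOrElse(w, 0) + 1)))
}
// finalCounts == Map(Cat -> 3, Dog -> 2, Cow -> 1)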
32. State Management
State needs to be fault-tolerant
[Diagram: the same three aggregation steps; to recover compute aggregates 3 after a failure, both its input ("Dog") and the previous state {Cat: 3, Dog: 1, Cow: 1} are needed]
33. Old State Management with DStreams
DStreams (i.e. updateStateByKey, mapWithState) represented state data as RDDs
Leveraged RDD lineage for fault tolerance, and RDD checkpointing to keep that lineage from growing unbounded
RDD checkpointing saves all state data to HDFS/S3
- Inefficient when update rates are low
- No "incremental" checkpointing
34. New State Management: State Store
State Store API: any versioned key-value store that
- Allows a set of key-value updates to be transactionally committed with a version number (i.e. incremental checkpointing)
- Allows a specific version of the key-value data to be retrieved
[Diagram: Incremental Execution 1 commits state store v1; Incremental Execution 2 reads v1 and commits v2; Incremental Execution 3 continues from v2]
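A minimal sketch of that contract as a hypothetical trait (values modeled as Longs for simplicity):

trait StateStore {
  // Transactionally commit a set of key-value updates as one version; only
  // the updates need to be persisted, which makes checkpointing incremental
  def commit(version: Long, updates: Map[String, Long]): Unit

  // Retrieve the key-value data as of a specific committed version
  def getVersion(version: Long): Map[String, Long]
}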
35. HDFS-backed State Store
Implementation of the State Store API in Spark 2.0: an in-memory hashmap backed by files in HDFS
- Each executor has hashmap(s) containing versioned data
- Updates committed as delta files in HDFS/S3
- Delta files periodically collapsed into snapshots to improve recovery
[Diagram: a Spark cluster with a driver and executors; each executor holds a versioned key-value hashmap (k1 → v1, k2 → v2), backed by delta files in HDFS]
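A sketch of why snapshots help recovery: rebuilding version v means starting from the newest snapshot at or before v and replaying only the deltas after it (hypothetical file layout, modeled here as in-memory maps):

def loadVersion(v: Long,
                snapshots: Map[Long, Map[String, Long]],   // version -> full state
                deltas: Map[Long, Map[String, Long]]       // version -> updates only
               ): Map[String, Long] = {
  val snapVer = snapshots.keys.filter(_ <= v).max          // assumes a base snapshot exists
  ((snapVer + 1) to v).foldLeft(snapshots(snapVer)) { (state, ver) =>
    state ++ deltas.getOrElse(ver, Map.empty)              // apply that version's updates
  }
}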
36. Fault Recovery of HDFS State Store
The State Store version is also stored in the WAL
On fault recovery:
- Recover input data from the source using the last offsets
- Recover state data using the last store version
[Diagram: the restarted incremental execution reads the in-progress entry from the WAL, recovering input from the sources and state from the store files]
39. Plan
Spark 2.1+
- More support for late data
- Dynamic scaling
- Source and sink public API
  - Extends DataSource API
- More sources and sinks
- ML integrations
Spark XXX
- "True streaming" engine
  - API already agnostic to micro-batch
40. Simple and fast real-time analytics
• Develop Productively
• Execute Efficiently
• Update Automatically

Questions?
Learn More
Today, 1:50-2:30 AMA
Follow me @tathadas