5. Use case: IoT Device Monitoring
IoT devices feed an event stream into several downstream tasks:
- ETL into long-term storage: prevent data loss, prevent duplicates
- Status monitoring: handle late data, process based on event time
- Interactively debug issues: consistency
- Anomaly detection: learn models offline, use online + continuous learning
6. Pain points with DStreams
1. Processing with event time, dealing with late data
- DStream API exposes batch time, hard to incorporate event time
2. Interoperating streaming with batch AND interactive
- RDD/DStream have similar APIs, but still require translation
- Hard to interoperate with DataFrames and Datasets
3. Reasoning about end-to-end guarantees
- Requires carefully constructing sinks that handle failures correctly
- Data consistency in the storage while being updated
8. The simplest way to perform streaming analytics
is not having to reason about streaming at all
9. Model
Trigger: every 1 sec
[Diagram: a timeline with triggers at 1, 2, 3; at each trigger the query runs over the input table grown to "data up to 1", "data up to 2", "data up to 3"]
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
- usual map/filter/reduce
- new window, session ops
10. Model
Trigger: every 1 sec
[Diagram: at each trigger (1, 2, 3), the query turns the input table ("data up to t") into a result table, and the "output for data up to t" is written in complete mode]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
11. Model
Trigger: every 1 sec
[Diagram: the same query, but after each trigger only the delta output is written to the sink]
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows that changed in the result since the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
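A toy worked example (plain Scala, hypothetical word counts) of what each mode would emit after a trigger, given the previous and current result tables:

// Result table after trigger 1 and after trigger 2 (hypothetical counts)
val prev = Map("Cat" -> 2, "Dog" -> 1)
val curr = Map("Cat" -> 3, "Dog" -> 1, "Cow" -> 1)

val complete = curr                                                   // the full result table
val delta    = curr.filter { case (k, v) => prev.get(k) != Some(v) }  // rows that changed
val append   = curr.filter { case (k, _) => !prev.contains(k) }       // brand-new rows only
// complete == Map(Cat -> 3, Dog -> 1, Cow -> 1)
// delta    == Map(Cat -> 3, Cow -> 1)
// append   == Map(Cow -> 1)

Note how, for a running count, an updated row is not a new row: append mode would silently drop the change to "Cat", which is why it is not feasible for this kind of query (the caveat above).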
13. Batch ETL with DataFrame
// Read from JSON file
input = ctxt.read
  .format("json")
  .load("source-path")

// Select some devices
result = input
  .select("device", "signal")
  .where("signal > 15")

// Write to Parquet file
result.write
  .format("parquet")
  .save("dest-path")
14. Streaming ETL with DataFrame
// Read from JSON file stream
input = ctxt.read
  .format("json")
  .stream("source-path")

// Select some devices
result = input
  .select("device", "signal")
  .where("signal > 15")

// Write to Parquet file stream
result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")
15. Streaming ETL with DataFrame
input = ctxt.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

read…stream() creates a streaming DataFrame; it does not start any of the computation.
write…startStream() defines where & how to output the data, and starts the processing.
16. Streaming ETL with DataFrame
input = ctxt.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

[Diagram: the input is an append-only table; with append output mode, each trigger (1, 2, 3) writes only the new rows in the result of that trigger to the output]
17. Continuous Aggregations

// Continuously compute average signal of each type of device
input.groupBy("device-type")
  .avg("signal")

// Continuously compute average signal of each type of device
// over the last 10 minutes of event time
input.groupBy(
    window("event-time", "10 minutes"),
    "device-type")
  .avg("signal")

- Windowing is just a type of aggregation
- Simple API for event-time based windowing
18. Query Management
query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Each query has a unique name for keeping track
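To keep several concurrent queries apart, each can be given a name when it is started. A minimal sketch in the style of the snippet above; queryName() is an assumption here, not shown on the slide:

query = result.write
  .format("parquet")
  .queryName("device-etl")  // assumed API: unique name used to track this query
  .outputMode("append")
  .startStream("dest-path")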
19. Logically:
DataFrame operations on a table
(i.e. as easy to understand as batch)
Physically:
Spark automatically runs the query in a streaming fashion
(i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
20. Structured Streaming
High-level streaming API built on the Spark SQL engine
Runs the same computation as batch queries in Datasets/DataFrames
Event time, windowing, sessions, sources & sinks
End-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
- Aggregate data in a stream, then serve using JDBC
- Add, remove, change queries at runtime
- Build and apply ML models
21. Advantages over DStreams
1. Processing with event time, dealing with late data
2. Exactly the same API for batch, streaming, and interactive
3. End-to-end exactly-once guarantees from the system
4. Performance through SQL optimizations
- Logical plan optimizations, Tungsten, Codegen, etc.
- Faster state management for stateful stream processing
23. Batch Execution on Spark SQL
[Diagram: DataFrame/Dataset → Logical Plan (abstract representation of the query) → Planner → Execution Plan → RDD job; the Planner performs logical plan optimization and execution plan generation, with Tungsten code generation]
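To see this pipeline on a concrete query, explain() prints the plans the planner produces. A minimal sketch (assumes a SparkSession named spark; paths are placeholders):

val df = spark.read.json("source-path")
  .select("device", "signal")
  .where("signal > 15")
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.explain(true)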
24. Continuous Incremental Execution
Planner extended to be aware of the streaming logical plans
Planner generates a continuous series of incremental execution plans, each processing the next chunk of streaming data
[Diagram: DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plans 1, 2, 3, 4, …]
25. Streaming Source
A streaming data source where
- Records can be uniquely identified by an offset
- An arbitrary segment of the stream data can be read based on an offset range
- Files, Kafka, Kinesis supported
More restricted than DStream Receivers
Can ensure end-to-end exactly-once guarantees
[Diagram: Logical Plan = Sources → Operations → Sink]
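A minimal sketch of that contract, as a hypothetical Scala trait with records modeled as plain strings (Spark's actual internal Source API differs in detail):

trait StreamingSource {
  // Highest offset currently available in the stream
  def latestOffset(): Long

  // Read the segment of records with offsets in (start, end]; must be
  // repeatable, so a restarted execution can re-read exactly the same data
  def getBatch(start: Long, end: Long): Seq[String]
}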
26. Streaming Sink
Allows output to be pushed to external storage
Each batch of output should be written to storage atomically and idempotently
Both are needed to ensure end-to-end exactly-once guarantees AND data consistency
[Diagram: Logical Plan = Sources → Operations → Sink]
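The matching sketch for the sink side (again a hypothetical trait, not Spark's exact internal API):

trait StreamingSink {
  // Write one batch of output; the write must be atomic (all or nothing)
  // and idempotent (re-delivering the same batchId must not duplicate data)
  def addBatch(batchId: Long, data: Seq[String]): Unit
}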
27. Incremental Execution
Every trigger interval:
- Planner asks source for the next chunk of input data
- Generates an execution plan with that input
- Generates the output data
- Hands it to the sink for pushing out
[Diagram: the Planner turns the Logical Plan (Sources → Ops → Sink) into an Incremental Execution Plan: get new data as input, generate output, push output]
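Putting the loop body together, using the hypothetical StreamingSource and StreamingSink traits sketched above (all names are illustrative):

// One trigger interval: returns the new offset to resume from next time
def runTrigger(source: StreamingSource, sink: StreamingSink,
               lastOffset: Long, batchId: Long): Long = {
  val newOffset = source.latestOffset()                  // ask source for the next chunk
  val input     = source.getBatch(lastOffset, newOffset)
  val output    = input.filter(_.nonEmpty)               // stand-in for the planned query
  sink.addBatch(batchId, output)                         // hand the output to the sink
  newOffset
}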
28. Offset Tracking with WAL
Planner saves the next offset range in a write-ahead log (WAL) on HDFS/S3 before incremental execution starts
[Diagram: the Offset WAL records processed offsets and in-progress offsets; the offset range is saved to the WAL before processing starts]
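Extending the trigger sketch above with a hypothetical WAL interface; the key point is that the offset range is durably logged before any processing begins:

trait OffsetWAL {
  def append(batchId: Long, start: Long, end: Long): Unit  // durable write to HDFS/S3
  def lastEntry(): Option[(Long, Long, Long)]              // (batchId, start, end)
}

def runTriggerWithWAL(wal: OffsetWAL, source: StreamingSource, sink: StreamingSink,
                      lastOffset: Long, batchId: Long): Long = {
  val newOffset = source.latestOffset()
  wal.append(batchId, lastOffset, newOffset)  // log the offset range BEFORE processing
  sink.addBatch(batchId, source.getBatch(lastOffset, newOffset))
  newOffset
}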
29. Recovery from WAL
After a failure, the restarted planner recovers the last offset range from the WAL and restarts the failed execution
Output is exactly the same as it would have been without the failure
[Diagram: the restarted Planner recovers the in-progress offset range from the Offset WAL and restarts the computation]
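Recovery is then a replay of the in-progress WAL entry; because getBatch is repeatable and addBatch is idempotent in the sketches above, the replay produces exactly the output the failed run would have:

def recover(wal: OffsetWAL, source: StreamingSource, sink: StreamingSink): Unit =
  wal.lastEntry().foreach { case (batchId, start, end) =>
    sink.addBatch(batchId, source.getBatch(start, end))  // safe to re-deliver
  }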
31. Stateful Stream Processing
Streaming aggregations require maintaining intermediate "state" data across batches
[Diagram: compute aggregates 1 on "Cat, Dog, Cat" → state {Cat: 2, Dog: 1}; compute aggregates 2 on "Cat, Cow" → state {Cat: 3, Dog: 1, Cow: 1}; compute aggregates 3 on "Dog" continues from that state]
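The diagram boils down to a running fold over batches; a plain-Scala sketch of the same counts (no Spark API involved):

// The three batches from the diagram
val batches = Seq(Seq("Cat", "Dog", "Cat"), Seq("Cat", "Cow"), Seq("Dog"))

// Each trigger folds its batch into the counts carried over from the previous one
val finalCounts = batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
  batch.foldLeft(state)((s, w) => s + (w -> (s.getOrElse(w, 0) + 1)))
}
// finalCounts == Map(Cat -> 3, Dog -> 2, Cow -> 1)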
32. State Management
State needs to be fault-tolerant
[Diagram: the same three aggregation steps; to recover compute aggregates 3 after a failure, both its input ("Dog") and the previous state {Cat: 3, Dog: 1, Cow: 1} are needed]
33. Old State Management with DStreams
DStreams (i.e. updateStateByKey, mapWithState) represented state data as RDDs
Leveraged RDD lineage for fault tolerance, and RDD checkpointing to keep that lineage from growing unbounded
RDD checkpointing saves all state data to HDFS/S3
- Inefficient when update rates are low
- No "incremental" checkpointing
34. New State Management: State Store
State Store API: any versioned key-value store that
- Allows a set of key-value updates to be transactionally committed with a version number (i.e. incremental checkpointing)
- Allows a specific version of the key-value data to be retrieved
[Diagram: Incremental Execution 1 commits state store v1; Incremental Execution 2 reads v1 and commits v2; Incremental Execution 3 continues from v2]
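A minimal sketch of that contract as a hypothetical trait (values modeled as Longs for simplicity):

trait StateStore {
  // Transactionally commit a set of key-value updates as one version; only
  // the updates need to be persisted, which makes checkpointing incremental
  def commit(version: Long, updates: Map[String, Long]): Unit

  // Retrieve the key-value data as of a specific committed version
  def getVersion(version: Long): Map[String, Long]
}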
35. HDFS-backed State Store
Implementation of the State Store API in Spark 2.0: an in-memory hashmap backed by files in HDFS
- Each executor has hashmap(s) containing versioned data
- Updates committed as delta files in HDFS/S3
- Delta files periodically collapsed into snapshots to improve recovery
[Diagram: a Spark cluster with a driver and executors; each executor holds a versioned key-value hashmap (k1 → v1, k2 → v2), backed by delta files in HDFS]
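A sketch of why snapshots help recovery: rebuilding version v means starting from the newest snapshot at or before v and replaying only the deltas after it (hypothetical file layout, modeled here as in-memory maps):

def loadVersion(v: Long,
                snapshots: Map[Long, Map[String, Long]],   // version -> full state
                deltas: Map[Long, Map[String, Long]]       // version -> updates only
               ): Map[String, Long] = {
  val snapVer = snapshots.keys.filter(_ <= v).max          // assumes a base snapshot exists
  ((snapVer + 1) to v).foldLeft(snapshots(snapVer)) { (state, ver) =>
    state ++ deltas.getOrElse(ver, Map.empty)              // apply that version's updates
  }
}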
36. Fault Recovery of HDFS State Store
The State Store version is also stored in the WAL
On fault recovery:
- Recover input data from the source using the last offsets
- Recover state data using the last store version
[Diagram: the restarted incremental execution reads the in-progress entry from the WAL, recovering input from the sources and state from the store files]
39. Plan
Spark 2.1+
- More support for late data
- Dynamic scaling
- Source and sink public API
  - Extends DataSource API
- More sources and sinks
- ML integrations
Spark XXX
- "True streaming" engine
  - API already agnostic to micro-batch
40. Simple and fast real-time analytics
• Develop Productively
• Execute Efficiently
• Update Automatically

Questions?
Learn More
Today, 1:50-2:30 AMA
Follow me @tathadas