Tathagata “TD” Das
Designing Structured
Streaming Pipelines
How to Architect Things Right
#UnifiedAnalytics #SparkAISummit
@tathadas
Structured Streaming
Distributed stream processing built on SQL engine
High throughput, second-scale latencies
Fault-tolerant, exactly-once
Great set of connectors
Philosophy: Data streams are unbounded tables
Users write batch-like code on a table
Spark will automatically run code incrementally on streams
Structured Streaming @ Databricks
1000s of customer streaming apps
in production on Databricks
1000+ trillion rows processed
in production
Structured Streaming
Example
Read JSON data from Kafka
Parse nested JSON
Store in structured Parquet table
Get end-to-end failure guarantees
ETL
Anatomy of a Streaming Query
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
.load()
Source
Specify where to read data from
Built-in support for Files / Kafka /
Kinesis*
Can include multiple sources of
different types using join() / union(), as sketched below
*Available only on Databricks Runtime
creates a DataFrame
key value topic partition offset timestamp
[binary] [binary] topicA 0 345 1486087873
[binary] [binary] topicB 3 2890 1486086721
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json($"json", schema).as("data"))
Anatomy of a Streaming Query
Transformations
Cast bytes from Kafka to a string,
parse it as JSON, and generate
nested columns
100s of built-in, optimized SQL
functions like from_json
user-defined functions, lambdas,
function literals with map, flatMap…
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json($"json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/")
Anatomy of a Streaming Query
Sink
Write transformed output to
external storage systems
Built-in support for Files / Kafka
Use foreach to execute arbitrary
code with the output data
Some sinks are transactional and
exactly once (e.g. files)
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers",...)
.option("subscribe", "topic")
.load()
.selectExpr("cast (value as string) as json")
.select(from_json($"json", schema).as("data"))
.writeStream
.format("parquet")
.option("path", "/parquetTable/")
.trigger(Trigger.ProcessingTime("1 minute"))
.option("checkpointLocation", "…")
.start()
Anatomy of a Streaming Query
Processing Details
Trigger: when to process data
- Fixed interval micro-batches
- As fast as possible micro-batches
- Continuously (new in Spark 2.3)
Checkpoint location: for tracking the
progress of the query
Spark automatically streamifies!
Spark SQL converts the batch-like query to a series of incremental execution plans operating on new batches of data
[Diagram: DataFrames/Datasets/SQL → Logical Plan (Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet) → Optimized Plan with codegen, off-heap, etc. (Kafka Source → Optimized Operators → Parquet Sink) → Series of Incremental Execution Plans, processing new data at t = 1, t = 2, t = 3]
Anatomy of a Streaming Query
Raw data from Kafka available
as structured data in seconds,
ready for querying
My past talks: Deep dives
A Deep Dive into Stateful Stream Processing in Structured Streaming (Spark + AI Summit 2018)
A Deep Dive Into Structured Streaming (Spark Summit 2016)
This talk: Streaming Design Patterns
Zoomed-out view of your data pipeline
[Diagram: DATA → STRUCTURED STREAMING (complex ETL) → INSIGHTS]
Another streaming design pattern talk?
Most talks:
Focus on a pure streaming engine
Explain one way of achieving the end goal
This talk:
Spark is more than a streaming engine
Spark has multiple ways of achieving the end goal, with tunable performance, cost, and quality
This talk:
How to think about design
Common design patterns
How we are making this easier
Streaming Pipeline Design
[Diagram: Data streams → ???? → Insights]
What? Why? How?
What?
What is your input?
What is your data?
What format and system is
your data in?
What is your output?
What results do you need?
What throughput and
latency do you need?
Why?
Why do you want this output in this way?
Who is going to take actions based on it?
When and how are they going to consume it?
humans?
computers?
Why? Common mistakes!
#1 "I want my dashboard with counts to be updated every second"
No point in updating every second if humans are going to take actions in minutes or hours
#2 "I want to generate automatic alerts with up-to-the-last-second count" (but my input data is often delayed)
No point in taking fast actions on low-quality data and results
Why? Common mistakes!
#3 "I want to train machine learning models on the results" (but my results are in a key-value store)
Key-value stores are not great for the large, repeated data scans that machine learning workloads perform
How?
How to process the data?
How to store the results?
Streaming Design Patterns
What? Why? How?
Pattern 1: ETL
What?
Input: unstructured input stream from files, Kafka, etc.
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
Output: structured tabular data
Why?
Query latest structured data interactively or with periodic jobs
P1: ETL
What?
Convert unstructured input to structured tabular data
Latency: few minutes
Why?
Query latest structured data interactively or with periodic jobs
How?
Process: Use Structured Streaming query to transform unstructured, dirty data
Run 24/7 on a cluster with default trigger
Store: Save to structured scalable storage that supports data skipping, etc.
E.g.: Parquet, ORC, or even better, Delta Lake
P1: ETL
How?
Store: Save to Delta Lake
Read with snapshot guarantees while writes are in progress
Concurrently reprocess data with full ACID guarantees
Coalesce small files into larger files
Fix mistakes in existing data
P1: Cheaper ETL
What?
Convert unstructured input to structured tabular data
Latency: few minutes → hours
Not have clusters up 24/7
Why?
Query latest data interactively or with periodic jobs
Cheaper solution
How?
Process: Still use a Structured Streaming query!
Run the streaming query with "trigger.once" to process all available data since the last batch
Set up an external schedule (every few hours?) to periodically start a cluster and run one batch, as sketched below
P1: Query faster than ETL!
What?
Latency: hours → seconds
Why?
Query latest up-to-the-last-second data interactively
How?
Query data in Kafka directly using Spark SQL
Can process up to the last records received by Kafka when the query was started
Pattern 2: Key-value output
What?
Input: new data for each key
{ "key1": "value1" }
{ "key1": "value2" }
{ "key2": "value3" }
Output: updated values for each key
Aggregations (sum, count, …)
Sessionizations
KEY LATEST VALUE
key1 value2
key2 value3
Why?
Lookup latest value for key (dashboards, websites, etc.)
OR
Summary tables for querying interactively or with periodic jobs
P2.1: Key-value output for lookup
What?
Generate updated values for keys
Latency: seconds/minutes
Why?
Lookup latest value for key
How?
Process: Use Structured Streaming with stateful operations for aggregation
Store: Save in key-value stores optimized for single-key lookups
P2.2: Key-value output for analytics
What?
Generate updated values for keys
Latency: seconds/minutes
Why?
Lookup latest value for key
Summary tables for analytics
How?
Process: Use Structured Streaming with stateful operations for aggregation
Store: Save in Delta Lake!
Managed Delta Lake supports upserts using SQL MERGE
P2.2: Key-value output for analytics
How?
Managed Delta Lake supports upserts using SQL MERGE
Coming to OSS Delta Lake soon!

streamingDataFrame.writeStream
  .foreachBatch { (batchOutput: DataFrame, batchId: Long) =>
    batchOutput.createOrReplaceTempView("batchOutput")
    batchOutput.sparkSession.sql("""
      MERGE INTO deltaTable USING batchOutput
      WHEN MATCHED THEN UPDATE ...
      WHEN NOT MATCHED THEN INSERT ...
    """)
  }.start()
P2.2: Key-value output for analytics
How? - Caveats
Stateful aggregation requires setting a watermark to drop very late data
Dropping some data leads to some inaccuracies, as the sketch below illustrates
P2.3: Key-value aggregations for analytics
What?
Generate aggregated values for keys
Latency: hours/days
Do not drop any late data
Why?
Summary tables for analytics
Correct results
How?
Process: ETL to structured table (no stateful aggregation)
Store: Save in Delta Lake
Post-process: Aggregate after all delayed data received
Pattern 3: Joining multiple inputs
What?
Input: Multiple data streams based on a common key
{ "id": 14, "name": "td", "v": 100..
{ "id": 23, "name": "by", "v": -10..
{ "id": 57, "name": "sz", "v": 34..
…
01:06:45 WARN id = 1 , update failed
01:06:45 INFO id=23, update success
01:06:57 INFO id=87: update postpo
…
Output: Combined information (id, update, value)
P3.1: Joining fast and slow data
What + Why?
Input: One fast stream of facts and one slow stream of dimension changes
Output: Fast stream enriched by data from the slow stream
Example: product sales info enriched by more product info
How?
ETL the slow stream to a dimension table
Join the fast stream with snapshots of the dimension table, as sketched below
[Diagram: FAST ETL + SLOW ETL → DIMENSION TABLE → JOIN → COMBINED TABLE]
P3.1: Joining fast and slow data
How? - Caveats
Structured Streaming by default does not reload the dimension table snapshot
Changes made by the slow ETL won't be seen until restart
Better Solution
Store the dimension table in Delta Lake
Delta Lake's versioning allows changes to be detected and the snapshot automatically reloaded without restart**
** available only in Managed Delta Lake in Databricks Runtime
37
How? - Caveats
Treat it as a "joining fast and fast data"
Better Solution
Delays in updates to dimension table can
cause joining with stale dimension data
E.g. sale of a product received even before product
table has any info on the product
P3.2: Joining fast and fast data
What + Why?
Input: Two fast streams where either stream may be delayed
Output: Combined info even if one stream is delayed relative to the other
Example: ad impressions and ad clicks
How?
Use stream-stream joins in Structured Streaming
Data will be buffered as state
Watermarks define how long to buffer before giving up on matching
See my previous deep dive talk for more info; a sketch follows below
Pattern 4: Change data capture
What?
Input: Change data based on a primary key
INSERT a, 1
INSERT b, 2
UPDATE a, 3
DELETE b
INSERT b, 4
Output: Final table after changes
key value
a 3
b 4
Why?
End-to-end replication of transactional tables into analytical tables
P4: Change data capture
How?
Use Structured Streaming and Delta Lake!
In each batch, apply changes to the Delta table using MERGE
MERGE in Managed Delta Lake supports UPDATE, INSERT and DELETE
Coming soon to OSS Delta Lake!

streamingDataFrame.writeStream
  .foreachBatch { (batchOutput: DataFrame, batchId: Long) =>
    batchOutput.createOrReplaceTempView("batchOutput")
    batchOutput.sparkSession.sql("""
      MERGE INTO deltaTable USING batchOutput
      WHEN MATCHED ... THEN UPDATE ...
      WHEN MATCHED ... THEN DELETE
      WHEN NOT MATCHED THEN INSERT ...
    """)
  }.start()

See Databricks docs for more info
Pattern 5: Writing to multiple outputs
What? One input, multiple outputs, e.g.:
Raw logs + summaries (TABLE 1: id, val1, val2; TABLE 2: id, sum, count)
Summary for analytics + summary for lookup
Table change log + updated table
P5: Serial or Parallel?
How?
Parallel: two independent queries read the raw input
Reads input twice, may have to parse the data twice
Cheap or expensive depends on size of raw input + parsing cost
Serial: first query writes table 1, second query reads table 1 to build table 2
Writes table 1 and reads it again
Cheap or expensive depends on the size and format of table 1
Higher latency
Both arrangements are sketched below
P5: Combination!
How?
Combo 1: Multiple streaming queries
Do expensive parsing once, write to table 1
Do cheaper follow-up processing from table 1
Still writing + reading table 1
Good for:
ETL + multiple levels of summaries
Change log + updated table
Combo 2: Single query + foreachBatch
Compute once, write multiple times
Cheaper, but loses the exactly-once guarantee

streamingDataFrame.writeStream
  .foreachBatch { (batchSummaryData: DataFrame, batchId: Long) =>
    // write summary to Delta Lake
    // write summary to key value store
  }.start()

A filled-in sketch follows below
Production Pipelines @ Databricks
[Diagram: raw streams → STRUCTURED STREAMING jobs → Bronze Tables → Silver Tables → Gold Tables → STREAMING ANALYTICS and REPORTING]
Production Pipelines @ Databricks
Productionizing a single pipeline requires the following:
Unit testing
Integration testing
Deploying
Monitoring
Upgrading
See Burak's talk: Productizing Structured Streaming Jobs
Production Pipelines @ Databricks
But we really need to productionize the entire DAG of pipelines!
Future direction
Declarative APIs to
specify and manage
entire DAGs of pipelines
Delta Pipelines
A small library on top of Apache Spark that lets users express their whole pipeline as a dataflow graph:

source(
  name = "raw",
  df = spark.readStream…)

sink(
  name = "warehouse",
  (df) => df.writeStream…)

flow(
  name = "ingest",
  df = input("raw")
    .withColumn(…),
  dest = "warehouse",
  latency = "5 minutes")
Delta Pipelines
Specify expectations with Delta tables
Unit test the DAG with sample input data
Integration test the DAG with production data
Deploy/upgrade new code for the entire DAG
Roll back code and/or data when needed
Monitor the entire DAG in one place
STAY TUNED!
Thanks for listening!
AMA at 4:00 PM today
#UnifiedAnalytics #SparkAISummit