3. Agenda
What is a Data Pipeline?
Technology Overview
Scenario 1: Ingest from Azure Storage
Scenario 2: Ingest from SQL Server
Scenario 3: Ingest streaming data
5. Defining Data Pipeline (General)
A set of jobs that move and process data from one place to another.
6. Defining Data Pipeline (Typical Use)
The process of bringing data into a data lake or data warehouse, including cleaning, enriching, and transforming the data.
7. Data Lake Defined
Big Data Capable: store first, evaluate and model later
Data Zones: Raw, Enriched, Curated / Certified
Ready for Analysts: query layer and other analytic tools access the data
8. Data Warehouse Defined
Structured Data: processed and modeled for analytics use
Interactive query: analysts can get answers to questions quickly
BI tool support: reporting tools can query efficiently
10. Data Ingestion Decisions
Do we use Azure Data Factory or Synapse Pipelines?
How do we schedule and orchestrate job steps?
How do we monitor job success?
Do we attempt to validate data quality?
Is any field-level encryption required?
14. Synapse Capabilities
Serverless Apache Spark for data processing and exploration
Synapse Pipelines for no-code or low-code data ingestion
Serverless SQL for easy querying
Dedicated SQL for high-performance analytic queries using an MPP database
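As a rough illustration of "Serverless SQL for easy querying", here is a minimal sketch that runs an ad-hoc query over Parquet files in the lake from Python via pyodbc. The workspace endpoint, credentials, container, and path are hypothetical placeholders; OPENROWSET is the serverless SQL mechanism for querying files in place.

```python
# Hedged sketch: query Parquet files in the data lake through the Synapse
# serverless SQL endpoint. Server name, login, and storage URL are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;UID=sqladmin;PWD=<secret>"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/enriched/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```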
16. Synapse Data Lake Ingest
Sources → Azure Data Lake Storage → Synapse Spark
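To make the flow concrete, a minimal PySpark sketch that reads raw files landed in Azure Data Lake Storage and writes them to the enriched zone. The storage account, container, and paths are illustrative placeholders, not the presenter's actual environment.

```python
# Minimal Synapse Spark (PySpark) sketch: read raw CSV files from ADLS Gen2
# and land them in the enriched zone as Parquet. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/sales/"
enriched_path = "abfss://datalake@mystorageaccount.dfs.core.windows.net/enriched/sales/"

# Read everything in the raw zone, inferring the schema on first load.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(raw_path))

# Light cleanup before landing in the enriched zone.
df = df.dropDuplicates()

df.write.mode("overwrite").parquet(enriched_path)
```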
17. Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows.
⮚ Fast, general-purpose data processing
⮚ Simple code for distributed processing
⮚ Many options to develop and run
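To show what "simple code for distributed processing" means in practice, a small PySpark aggregation over a hypothetical orders dataset (path and column names are assumptions). The same few lines run unchanged on a single node or a large cluster; Spark handles the distribution.

```python
# Hedged sketch: daily revenue from an assumed "orders" Parquet dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet(
    "abfss://datalake@mystorageaccount.dfs.core.windows.net/enriched/orders/"
)

daily_revenue = (orders
                 .groupBy("order_date")                      # distributed group-by
                 .agg(F.sum("amount").alias("revenue"))      # distributed aggregation
                 .orderBy("order_date"))

daily_revenue.show(10)
```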
21. Ingest from SQL Server
How can I keep the table schema?
How will I maintain this as new tables get added?
How will I deal with new or removed columns?
Can I do a full reload of every table for every run?
Is it outside of our Azure virtual network?
Can private endpoint be easily configured?
Do I need to add specific IPs to an allow list?
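One way to address the schema and maintenance questions above, sketched with PySpark over JDBC: each table is read with its source schema intact and fully reloaded into the raw zone. Server, database, credentials, and the table list are placeholders, and the sketch assumes the SQL Server JDBC driver is available on the cluster; in practice the table list would be driven by a metadata table or an INFORMATION_SCHEMA query so new tables are picked up without code changes.

```python
# Hedged sketch: full reload of selected SQL Server tables into the raw zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
props = {"user": "loader", "password": "<secret>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

tables = ["dbo.Customers", "dbo.Orders"]  # illustrative; normally discovered dynamically

for table in tables:
    # Spark derives the DataFrame schema from the table's column types,
    # so added or removed columns are reflected on the next full load.
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=props)
    target = ("abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/sqlserver/"
              + table.replace(".", "_") + "/")
    df.write.mode("overwrite").parquet(target)
```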
25. Why Kafka?
Apache Kafka is a scalable message broker / distributed log. Producers can quickly publish and move on while data is persisted for all consumers. A reliable place to stream events, decoupled from the destination.
26. Distributed Log (message broker): Apache Kafka
Decouple producer and consumer
Durable storage
Low latency
High scalability
27. Hub for streaming data
Diagram: user data is posted to Apache Kafka / Event Hubs, which feeds the data lake, a user dashboard, and a real-time report.
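A minimal kafka-python producer sketch of the hub pattern: the application publishes a user event to a topic and moves on, while downstream consumers (data lake loader, dashboard, real-time report) read independently. The broker address, topic name, and event fields are assumptions.

```python
# Hedged sketch: publish a user event to a Kafka topic and continue.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": 42, "action": "page_view", "page": "/pricing"}
producer.send("user-events", value=event)   # returns quickly; the broker persists the event
producer.flush()                            # block until buffered events are delivered
```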
28. What is Spark Structured Streaming?
"The simplest way to perform streaming analytics is not having to reason about streaming at all" - Tathagata Das “TD”
A table that is constantly appended with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
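A hedged Structured Streaming sketch of the "constantly appended table" idea: each micro-batch of Kafka events is appended to a Parquet table in the lake. The broker, topic, and output paths are placeholders, and the sketch assumes the Spark Kafka connector (spark-sql-kafka) is available on the cluster.

```python
# Hedged sketch: stream events from Kafka into an ever-growing lake table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "user-events")
          .load()
          .select(F.col("value").cast("string").alias("json"),
                  F.col("timestamp")))

query = (events.writeStream
         .format("parquet")
         .option("path", "abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/user_events/")
         .option("checkpointLocation", "abfss://datalake@mystorageaccount.dfs.core.windows.net/checkpoints/user_events/")
         .outputMode("append")   # each micro-batch appends new rows to the table
         .start())
```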
37. Session Feedback Surveys
In the pursuit of making our conferences even better, we need to hear your
feedback about this session.
Here’s How -
▪ Simply go to the Whova App on your smartphone
▪ Go to the conference homepage
▪ Scroll down to ‘Additional Resources’ and click ‘Surveys’.
▪ Click ‘Session Feedback’.
▪ Scroll down to click on this session title.
▪ Complete the session feedback survey.
▪ Finally, click ‘Submit’