SlideShare a Scribd company logo
1 of 38
Download to read offline
Data Pipelines with Azure Synapse:
Real-life scenarios and solutions
Dustin Vannoy
dustin@dustinvannoy.com
Dustin
Consultant – Data Engineer
/in/dustinvannoy
youtube.com/DustinVannoy
dustinvannoy.com
Data Engineering SD meetup
Technologies
➢ Azure & AWS
➢ Apache Spark
➢ Apache Kafka
➢ Azure Synapse Analytics
➢ Python & Scala
Vannoy
@dustinvannoy
Agenda
What is a Data Pipeline?
Technology Overview
Scenario 1: Ingest from Azure Storage
Scenario 2: Ingest from SQL Server
Scenario 3: Ingest streaming data
What is a Data
Pipeline?
Defining Data Pipeline (General)
A set of jobs that process data
from one place to another.
Defining Data Pipeline (Typical Use)
The process of bringing data into a
data lake or data warehouse,
including cleaning, enriching, and
transforming data.
Data Lake Defined
Big Data Capable
Store first, evaluate
and model later
Data Zones Ready for Analysts
Query layer, other
analytic tools access
Raw
Enriched
Curated / Certified
Data Warehouse Defined
Structured Data
Processed and
modeled for
analytics use
Interactive query
Analysts can get
answers to questions
quickly
BI tool support
Reporting tools can
query efficiently
Curate
Enrich
Clean
Make Available
Collect
Data Ingestion Decisions
Do we use Azure Data Factory or Synapse Pipelines?
How do we schedule and orchestrate job steps?
How do we monitor job success?
Do we attempt to validate data quality?
Any field level encryption required?
Technology Overview
Data Lake Storage, Gen 2
• Built on Azure Blob Storage
• Hadoop compatible access
• Optimized for cloud analytics
• Low cost: $
Managed Apache Spark
Synapse Pipelines
Serverless & Dedicated SQL
Data Explorer
AZURE SYNAPSE ANALYTICS
Serverless Apache Spark for data
processing and exploration
Synapse Pipelines for no-code or
low-code data ingestion
Serverless SQL for easy querying
Dedicated SQL for high
performance analytic queries
using MPP database
Synapse Capabilities
Ingest from Azure
Storage
Synapse Data Lake Ingest
Sources
Azure Data Lake
Storage
Synapse Spark
Why Spark?
Big data and the
cloud changed our
mindset. We want
tools that scale
easily as data size
grows.
⮚ Fast, general purpose data
processing
⮚ Simple code for
distributed processing
⮚ Many options to develop
and run
Simple code, parallel compute
Worker
Controller
Worker
Worker
Worker
Demo
Azure
Storage
Ingest
Ingest from SQL
Server
Ingest from SQL Server
How can I keep the table schema?
How will I maintain this as new tables get added?
How will I deal with new or removed columns?
Can I do a full reload of every table for every run?
Is it outside of our Azure virtual network?
Can private endpoint be easily configured?
Do I need to add specific IPs to an allow list?
Demo
SQL Server
Ingest
Ingest from Event
Stream
Synapse Spark Streaming
Apache Kafka
Synapse Spark
Sources Data Lake Storage
Why Kafka? Apache Kafka is a
scalable message broker
/ distributed log.
Producers can quickly
publish and move on
while data is persisted
for all consumers.
Reliable place to
stream events;
decoupled from
destination
Distributed Log (message broker)
Decouple producer and consumer
Durable storage
Low-latency
High scalability
Apache Kafka
Hub for streaming data
Data Lake
Post data
User Dashboard
Real-time report
User data
Apache Kafka / Event Hubs
What is Spark Structured Streaming?
"The simplest way to perform streaming analytics is not having to
reason about streaming at all"
A table that is constantly appended with each micro-batch
- Tathagata Das “TD”
Reference: https://youtu.be/rl8dIzTpxrI
Structured Streaming - Read
df = spark.readStream
.format("kafka")
.options(**consumer_config)
.load()
Structured Streaming - Write
df.writeStream
.format("kafka")
.options(**producer_config)
.option("checkpointLocation","/tmp/cp001")
.start()
Structured Streaming –Checkpoint
df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation","/chkpnt/dq1")
.start("/tmp/demo_out"))
Structured Streaming – Output Mode
df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation","/chkpnt/dq1")
.start("/tmp/demo_out"))
Spark Streaming Benefits
● Re-use Spark batch code
● Stateful streaming and joins
● Mature with many integrations
● Kafka or Event Hubs not required
Demo
Ingest
Event
Stream
Final Thoughts
Session Feedback Surveys
In the pursuit of making our conferences even better, we need to hear your
feedback about this session.
Here’s How -
▪ Simply go to the Whova App on your smartphone
▪ Go to the conference homepage
▪ Scroll down to ‘Additional Resources’ and click ‘Surveys’.
▪ Click ‘Session Feedback’.
▪ Scroll down to click on this session title.
▪ Complete the session feedback survey.
▪ Finally, click ‘Submit’
DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf

More Related Content

Similar to DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf

Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Data Con LA
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikHostedbyConfluent
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...DataStax Academy
 
Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaAttunity
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache KafkaJoe Stein
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsTimothy Spann
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Solution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorSolution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorBlueData, Inc.
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream AnalyticsMarco Parenzan
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 

Similar to DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf (20)

Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
Streaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache KafkaStreaming Data Ingest and Processing with Apache Kafka
Streaming Data Ingest and Processing with Apache Kafka
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Solution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorSolution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline Accelerator
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 

Recently uploaded

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 

Recently uploaded (20)

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 

DustinVannoy_DataPipelines_AzureDataConf_Dec22.pdf

  • 1. Data Pipelines with Azure Synapse: Real-life scenarios and solutions Dustin Vannoy dustin@dustinvannoy.com
  • 2. Dustin Consultant – Data Engineer /in/dustinvannoy youtube.com/DustinVannoy dustinvannoy.com Data Engineering SD meetup Technologies ➢ Azure & AWS ➢ Apache Spark ➢ Apache Kafka ➢ Azure Synapse Analytics ➢ Python & Scala Vannoy @dustinvannoy
  • 3. Agenda What is a Data Pipeline? Technology Overview Scenario 1: Ingest from Azure Storage Scenario 2: Ingest from SQL Server Scenario 3: Ingest streaming data
  • 4. What is a Data Pipeline?
  • 5. Defining Data Pipeline (General) A set of jobs that process data from one place to another.
  • 6. Defining Data Pipeline (Typical Use) The process of bringing data into a data lake or data warehouse, including cleaning, enriching, and transforming data.
  • 7. Data Lake Defined Big Data Capable Store first, evaluate and model later Data Zones Ready for Analysts Query layer, other analytic tools access Raw Enriched Curated / Certified
  • 8. Data Warehouse Defined Structured Data Processed and modeled for analytics use Interactive query Analysts can get answers to questions quickly BI tool support Reporting tools can query efficiently
  • 10. Data Ingestion Decisions Do we use Azure Data Factory or Synapse Pipelines? How do we schedule and orchestrate job steps? How do we monitor job success? Do we attempt to validate data quality? Any field level encryption required?
  • 12. Data Lake Storage, Gen 2 • Built on Azure Blob Storage • Hadoop compatible access • Optimized for cloud analytics • Low cost: $
  • 13. Managed Apache Spark Synapse Pipelines Serverless & Dedicated SQL Data Explorer AZURE SYNAPSE ANALYTICS
  • 14. Serverless Apache Spark for data processing and exploration Synapse Pipelines for no-code or low-code data ingestion Serverless SQL for easy querying Dedicated SQL for high performance analytic queries using MPP database Synapse Capabilities
  • 16. Synapse Data Lake Ingest Sources Azure Data Lake Storage Synapse Spark
  • 17. Why Spark? Big data and the cloud changed our mindset. We want tools that scale easily as data size grows. ⮚ Fast, general purpose data processing ⮚ Simple code for distributed processing ⮚ Many options to develop and run
  • 18. Simple code, parallel compute Worker Controller Worker Worker Worker
  • 21. Ingest from SQL Server How can I keep the table schema? How will I maintain this as new tables get added? How will I deal with new or removed columns? Can I do a full reload of every table for every run? Is it outside of our Azure virtual network? Can private endpoint be easily configured? Do I need to add specific IPs to an allow list?
  • 24. Synapse Spark Streaming Apache Kafka Synapse Spark Sources Data Lake Storage
  • 25. Why Kafka? Apache Kafka is a scalable message broker / distributed log. Producers can quickly publish and move on while data is persisted for all consumers. Reliable place to stream events; decoupled from destination
  • 26. Distributed Log (message broker) Decouple producer and consumer Durable storage Low-latency High scalability Apache Kafka
  • 27. Hub for streaming data Data Lake Post data User Dashboard Real-time report User data Apache Kafka / Event Hubs
  • 28. What is Spark Structured Streaming? "The simplest way to perform streaming analytics is not having to reason about streaming at all" A table that is constantly appended with each micro-batch - Tathagata Das “TD” Reference: https://youtu.be/rl8dIzTpxrI
  • 29. Structured Streaming - Read df = spark.readStream .format("kafka") .options(**consumer_config) .load()
  • 30. Structured Streaming - Write df.writeStream .format("kafka") .options(**producer_config) .option("checkpointLocation","/tmp/cp001") .start()
  • 32. Structured Streaming – Output Mode df.writeStream .format("delta") .outputMode("append") .option("checkpointLocation","/chkpnt/dq1") .start("/tmp/demo_out"))
  • 33. Spark Streaming Benefits ● Re-use Spark batch code ● Stateful streaming and joins ● Mature with many integrations ● Kafka or Event Hubs not required
  • 36.
  • 37. Session Feedback Surveys In the pursuit of making our conferences even better, we need to hear your feedback about this session. Here’s How - ▪ Simply go to the Whova App on your smartphone ▪ Go to the conference homepage ▪ Scroll down to ‘Additional Resources’ and click ‘Surveys’. ▪ Click ‘Session Feedback’. ▪ Scroll down to click on this session title. ▪ Complete the session feedback survey. ▪ Finally, click ‘Submit’