3. Agenda
What is a Data Pipeline?
Technology Overview
Scenario 1: Ingest from Azure Storage
Scenario 2: Ingest from SQL Server
Scenario 3: Ingest streaming data
5. Defining Data Pipeline (General)
A set of jobs that move and process data from one place to another.
6. Defining Data Pipeline (Typical Use)
The process of bringing data into a data lake or data warehouse, including cleaning, enriching, and transforming the data.
7. Data Lake Defined
Big Data Capable: store first, evaluate and model later
Data Zones: Raw, Enriched, Curated / Certified
Ready for Analysts: query layer and other analytic tools access the data
8. Data Warehouse Defined
Structured Data: processed and modeled for analytics use
Interactive query: analysts can get answers to questions quickly
BI tool support: reporting tools can query efficiently
10. Data Ingestion Decisions
Do we use Azure Data Factory or Synapse Pipelines?
How do we schedule and orchestrate job steps?
How do we monitor job success?
Do we attempt to validate data quality?
Is any field-level encryption required?
14. Synapse Capabilities
Serverless Apache Spark for data processing and exploration
Synapse Pipelines for no-code or low-code data ingestion
Serverless SQL for easy querying
Dedicated SQL for high-performance analytic queries using an MPP database
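As a rough illustration of "Serverless SQL for easy querying", here is a minimal sketch that runs an ad-hoc query over Parquet files in the lake from Python via pyodbc. The workspace endpoint, credentials, container, and path are hypothetical placeholders; OPENROWSET is the serverless SQL mechanism for querying files in place.

```python
# Hedged sketch: query Parquet files in the data lake through the Synapse
# serverless SQL endpoint. Server name, login, and storage URL are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;UID=sqladmin;PWD=<secret>"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/enriched/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```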
16. Synapse Data Lake Ingest
Sources → Azure Data Lake Storage → Synapse Spark
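To make the flow concrete, a minimal PySpark sketch that reads raw files landed in Azure Data Lake Storage and writes them to the enriched zone. The storage account, container, and paths are illustrative placeholders, not the presenter's actual environment.

```python
# Minimal Synapse Spark (PySpark) sketch: read raw CSV files from ADLS Gen2
# and land them in the enriched zone as Parquet. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/sales/"
enriched_path = "abfss://datalake@mystorageaccount.dfs.core.windows.net/enriched/sales/"

# Read everything in the raw zone, inferring the schema on first load.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(raw_path))

# Light cleanup before landing in the enriched zone.
df = df.dropDuplicates()

df.write.mode("overwrite").parquet(enriched_path)
```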
17. Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows.
⮚ Fast, general-purpose data processing
⮚ Simple code for distributed processing
⮚ Many options to develop and run
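To show what "simple code for distributed processing" means in practice, a small PySpark aggregation over a hypothetical orders dataset (path and column names are assumptions). The same few lines run unchanged on a single node or a large cluster; Spark handles the distribution.

```python
# Hedged sketch: daily revenue from an assumed "orders" Parquet dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet(
    "abfss://datalake@mystorageaccount.dfs.core.windows.net/enriched/orders/"
)

daily_revenue = (orders
                 .groupBy("order_date")                      # distributed group-by
                 .agg(F.sum("amount").alias("revenue"))      # distributed aggregation
                 .orderBy("order_date"))

daily_revenue.show(10)
```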
21. Ingest from SQL Server
How can I keep the table schema?
How will I maintain this as new tables get added?
How will I deal with new or removed columns?
Can I do a full reload of every table for every run?
Is it outside of our Azure virtual network?
Can private endpoint be easily configured?
Do I need to add specific IPs to an allow list?
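One way to address the schema and maintenance questions above, sketched with PySpark over JDBC: each table is read with its source schema intact and fully reloaded into the raw zone. Server, database, credentials, and the table list are placeholders, and the sketch assumes the SQL Server JDBC driver is available on the cluster; in practice the table list would be driven by a metadata table or an INFORMATION_SCHEMA query so new tables are picked up without code changes.

```python
# Hedged sketch: full reload of selected SQL Server tables into the raw zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
props = {"user": "loader", "password": "<secret>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

tables = ["dbo.Customers", "dbo.Orders"]  # illustrative; normally discovered dynamically

for table in tables:
    # Spark derives the DataFrame schema from the table's column types,
    # so added or removed columns are reflected on the next full load.
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=props)
    target = ("abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/sqlserver/"
              + table.replace(".", "_") + "/")
    df.write.mode("overwrite").parquet(target)
```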
25. Why Kafka?
Apache Kafka is a scalable message broker / distributed log. Producers can quickly publish and move on while data is persisted for all consumers. A reliable place to stream events, decoupled from the destination.
26. Distributed Log (message broker): Apache Kafka
Decouple producer and consumer
Durable storage
Low latency
High scalability
27. Hub for streaming data
Diagram: user data is posted to Apache Kafka / Event Hubs, which feeds the data lake, a user dashboard, and a real-time report.
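A minimal kafka-python producer sketch of the hub pattern: the application publishes a user event to a topic and moves on, while downstream consumers (data lake loader, dashboard, real-time report) read independently. The broker address, topic name, and event fields are assumptions.

```python
# Hedged sketch: publish a user event to a Kafka topic and continue.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": 42, "action": "page_view", "page": "/pricing"}
producer.send("user-events", value=event)   # returns quickly; the broker persists the event
producer.flush()                            # block until buffered events are delivered
```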
28. What is Spark Structured Streaming?
"The simplest way to perform streaming analytics is not having to reason about streaming at all" - Tathagata Das “TD”
A table that is constantly appended with each micro-batch
Reference: https://youtu.be/rl8dIzTpxrI
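A hedged Structured Streaming sketch of the "constantly appended table" idea: each micro-batch of Kafka events is appended to a Parquet table in the lake. The broker, topic, and output paths are placeholders, and the sketch assumes the Spark Kafka connector (spark-sql-kafka) is available on the cluster.

```python
# Hedged sketch: stream events from Kafka into an ever-growing lake table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "user-events")
          .load()
          .select(F.col("value").cast("string").alias("json"),
                  F.col("timestamp")))

query = (events.writeStream
         .format("parquet")
         .option("path", "abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/user_events/")
         .option("checkpointLocation", "abfss://datalake@mystorageaccount.dfs.core.windows.net/checkpoints/user_events/")
         .outputMode("append")   # each micro-batch appends new rows to the table
         .start())
```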
37. Session Feedback Surveys
In the pursuit of making our conferences even better, we need to hear your
feedback about this session.
Here’s How -
▪ Simply go to the Whova App on your smartphone
▪ Go to the conference homepage
▪ Scroll down to ‘Additional Resources’ and click ‘Surveys’.
▪ Click ‘Session Feedback’.
▪ Scroll down to click on this session title.
▪ Complete the session feedback survey.
▪ Finally, click ‘Submit’