Cloud ETL w/Azure Data Factory & Modern Data Warehouse
1. Cloud-First ETL in the Modern Data Warehouse w/Azure Data Factory
Sr. Program Manager
Azure Data Management
https://github.com/kromerm/Azure-Data-Week-ADF
2. A fully-managed data integration service in the cloud
AZURE DATA FACTORY
- PRODUCTIVE: Drag & Drop UI, Codeless Data Movement
- HYBRID: Orchestrate where your data lives, Lift SSIS packages to Azure
- SCALABLE: Serverless scalability with no infrastructure to manage
- TRUSTED: Certified compliant Data Movement
3. Modernize your enterprise data warehouse at scale
AZURE DATA FACTORY
Sources:
- On-premises data: Oracle, SQL, Teradata, fileshares, SAP
- Cloud data: Azure, AWS, GCP
- SaaS data: Salesforce, Workday, Dynamics
Stages:
- INGEST: Azure Data Factory
- STORE: Azure Blob Storage
- PREP & TRAIN: Azure Databricks
- MODEL & SERVE: Azure SQL Data Warehouse (via Polybase), Azure Analysis Services
- Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure SQL Database and Azure Data Lake to allow customers to tailor the above architecture to meet their unique needs.
Orchestrate with Azure Data Factory
4. Lift your SQL Server Integration Services (SSIS) packages to Azure
[Diagram: on premises, Microsoft SQL Server Integration Services with SQL Server and on-premises data sources; in the cloud, Azure Data Factory with the SSIS Integration Runtime (SSIS Cloud ETL) in a VNET, SQL DB Managed Instance, and cloud data sources.]
5. Control Flow in ADF Pipeline Builder
Coordinate pipeline activities into finite execution steps to enable looping,
conditionals and chaining while separating data transformations into
individual data flows
[Diagram: a Trigger (Event, Wall Clock, or On Demand) starts My Pipeline 1. Activity 1 passes params to Activity 2 on success and to an "On Error" Activity 3 on error. A For Each… step runs My Pipeline 2 (Activity 1, Activity 2, …), and Activity 4 follows on success, with params.]
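The chaining, conditional, and looping behavior described on this slide can be sketched in plain Python. This is only an illustration of the idea, not ADF's API: `Activity`, `run_pipeline`, `for_each`, and `copy_step` are hypothetical names, and real pipelines are authored in the ADF UI or as JSON definitions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Activity:
    """Hypothetical stand-in for a pipeline activity (not an ADF class)."""
    name: str
    run: Callable[[Dict[str, Any]], Dict[str, Any]]  # takes params, returns params
    on_success: Optional["Activity"] = None          # next step when run() succeeds
    on_error: Optional["Activity"] = None            # next step when run() raises


def run_pipeline(first: Activity, params: Dict[str, Any]) -> List[str]:
    """Follow the success/error chain and return the names of executed activities."""
    executed: List[str] = []
    current: Optional[Activity] = first
    while current is not None:
        executed.append(current.name)
        try:
            params = current.run(params)
            current = current.on_success
        except Exception as exc:
            # Pass the error text along as a parameter to the error branch.
            params = {**params, "error": str(exc)}
            current = current.on_error
    return executed


def for_each(items: List[Any], child: Callable[[Any], List[str]]) -> List[List[str]]:
    """Looping: invoke a child pipeline once per item."""
    return [child(item) for item in items]


def copy_step(params: Dict[str, Any]) -> Dict[str, Any]:
    """Illustrative activity body that fails when asked to."""
    if params.get("fail"):
        raise RuntimeError("copy failed")
    return {**params, "rows": 10}


def build_pipeline() -> Activity:
    on_error = Activity("Activity 3 (On Error)", lambda p: p)
    next_step = Activity("Activity 2", lambda p: p)
    return Activity("Activity 1", copy_step, on_success=next_step, on_error=on_error)


if __name__ == "__main__":
    print(run_pipeline(build_pipeline(), {}))
    print(run_pipeline(build_pipeline(), {"fail": True}))
    print(for_each(["a.csv", "b.csv"],
                   lambda f: run_pipeline(build_pipeline(), {"file": f})))
```

Keeping each activity's body separate from the wiring (`on_success`, `on_error`) mirrors the slide's point that control flow is coordinated apart from the data transformations themselves.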
11. ADF Integration Runtime (IR)
ADF compute environment with multiple capabilities:
- Activity dispatch & monitoring
- Data movement
- SSIS package execution
To integrate data flow and control flow across the
enterprise's hybrid cloud, customers can instantiate
multiple IR instances for different network environments:
- On premises (similar to DMG in ADF V1)
- In public cloud
- Inside VNet
Brings a consistent provisioning and monitoring experience
across the network environments
Portal Application & SDK
Azure Data Factory Service
- Self-Hosted IR: Data Movement & Activity Dispatch on-prem, Cloud, VNET
- Azure IR: Data Movement & Activity Dispatch in Azure Public Network, SSIS (VNET coming soon)
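As a rough sketch of how the network environment determines which runtime type handles data movement and activity dispatch, the split listed on this slide can be written as a small lookup. The enum and function names are hypothetical and exist only for illustration; they are not part of any ADF SDK.

```python
from enum import Enum


class Network(Enum):
    ON_PREMISES = "on premises"
    PUBLIC_CLOUD = "public cloud"
    VNET = "inside VNet"


class Runtime(Enum):
    SELF_HOSTED = "Self-Hosted IR"
    AZURE = "Azure IR"


def pick_runtime(network: Network) -> Runtime:
    """Map a network environment to the IR type shown for it on the slide."""
    if network is Network.PUBLIC_CLOUD:
        # Data movement and dispatch in the Azure public network use the Azure IR.
        return Runtime.AZURE
    # On-premises and VNet environments are listed under the Self-Hosted IR.
    return Runtime.SELF_HOSTED


if __name__ == "__main__":
    for env in Network:
        print(f"{env.value}: {pick_runtime(env).value}")
```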
12. Azure Data Factory Patterns
- Azure SQL Database Change Tracking
- Data Flow: Star Schema Fact & Dimension Table Loading
- Pipeline Master / Child Controller Pattern
13. Microsoft ETL/ELT Services in Azure
• Azure-SSIS IR: Managed cluster of Azure VMs (nodes) dedicated to running your SSIS packages and no other activities
• You can scale it up/out by specifying the node size/number of nodes in the cluster
• You can bring your own Azure SQL Database (DB)/Managed Instance (MI) server to host the catalog of SSIS projects/packages (SSISDB) that will be attached to it
• You can join it to a Virtual Network (VNet) that is connected to your on-prem network to enable on-prem data access
• Once provisioned, you can enter your Azure SQL DB/MI server endpoint in SSDT/SSMS to deploy SSIS projects/packages and configure/execute them just as you would with SSIS on premises
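The provisioning choices in the bullets above (node size, node count, catalog server, optional VNet) can be pictured as one settings object. The sketch below is an assumption-laden illustration: `AzureSsisIrConfig`, its field names, and the example values are hypothetical and do not reflect ADF's actual schema.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AzureSsisIrConfig:
    """Hypothetical settings object; field names are illustrative only."""
    node_size: str              # VM size label (placeholder values below)
    node_count: int             # number of nodes in the cluster
    catalog_server: str         # Azure SQL DB/MI endpoint hosting SSISDB
    vnet: Optional[str] = None  # VNet to join for on-prem data access

    def can_reach_on_prem(self) -> bool:
        # Per the slide, on-prem access requires joining a connected VNet.
        return self.vnet is not None


def scale_out(cfg: AzureSsisIrConfig, extra_nodes: int) -> AzureSsisIrConfig:
    """Scale out: add nodes to the cluster."""
    if extra_nodes < 1:
        raise ValueError("extra_nodes must be at least 1")
    return replace(cfg, node_count=cfg.node_count + extra_nodes)


def scale_up(cfg: AzureSsisIrConfig, node_size: str) -> AzureSsisIrConfig:
    """Scale up: switch to a different node size."""
    return replace(cfg, node_size=node_size)


if __name__ == "__main__":
    cfg = AzureSsisIrConfig("small", 2, "example-server.example.net")
    print(scale_out(cfg, 2))
    print(scale_up(cfg, "large"))
    print(replace(cfg, vnet="example-vnet").can_reach_on_prem())
```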
[Diagram: SSDT and SSMS are used to design/deploy and execute/manage packages, both on SSIS Server on premises and on SSIS PaaS w/ ADF compute resource in the cloud. ADF handles provisioning. Other labels: ISVs, HDI, ML.]
17. Visual Data Flow Authoring
• Transform Data, At Scale, in the Cloud, Zero-Code
• Cloud-first, scale-out ELT
• Code-free dataflow pipelines
• Serverless scale-out transformation execution engine
• Maximum Productivity for Data Engineers
• Does NOT require understanding of Spark / Scala / Python / Java
• Resilient Data Transformation Flows
• Built for big data scenarios with unstructured data requirements
• Operationalize with Data Factory scheduling, control flow and monitoring
18. Code-free Data Transformation At Scale
• Does not require understanding of Spark, Big Data Execution
Engines, Clusters, Scala, Python …
• Focus on building business logic and data transformation
• Data cleansing
• Aggregation
• Data conversions
• Data prep
• Data exploration
19. ADF Data Flow Workstream
Data Sources
• Explicit user action
• User places data source(s) on design surface, from toolbox
• Select explicit sources
Staging
• Implicit/Explicit
• Data Lake staging area as default
• User does not need to configure this manually
• Advanced feature to set staging area options
• File Formats / Types (Parquet, JSON, txt, CSV …)
Transformations
• Sort, Merge, Join, Lookup …
• Explicit user action
• User places transformations on design surface, from toolbox
• User must set properties for transformation steps and step connectors
Destination
• Explicit user action
• User chooses destination connector(s)
• User sets connector property options
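The four stages of this workstream (sources, staging, transformations, destination) can be sketched as a tiny in-memory pipeline. This is a plain-Python analogy under stated assumptions, not how ADF executes data flows: `stage`, `sort_by`, `lookup`, and `run_flow` are hypothetical helpers, and the sample rows are made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Transform = Callable[[List[Row]], List[Row]]


def stage(rows: Iterable[Row]) -> List[Row]:
    """Staging: materialize a copy of the source rows (implicit by default)."""
    return [dict(r) for r in rows]


def sort_by(key: str) -> Transform:
    """Sort transformation on a single column."""
    return lambda rows: sorted(rows, key=lambda r: r[key])


def lookup(reference: Dict[Any, Row], key: str) -> Transform:
    """Lookup transformation: merge matching reference columns into each row."""
    return lambda rows: [{**r, **reference.get(r[key], {})} for r in rows]


def run_flow(source: Iterable[Row], transforms: List[Transform], sink: List[Row]) -> int:
    """Source -> staging -> transformations -> destination; returns rows written."""
    rows = stage(source)
    for transform in transforms:
        rows = transform(rows)
    sink.extend(rows)
    return len(rows)


if __name__ == "__main__":
    orders = [{"id": 2, "cust": "b"}, {"id": 1, "cust": "a"}]
    customers = {"a": {"region": "EU"}, "b": {"region": "US"}}
    destination: List[Row] = []
    written = run_flow(orders, [sort_by("id"), lookup(customers, "cust")], destination)
    print(written, destination)
```

The user-facing steps are the explicit ones (choosing sources, transformations, and the destination), while `stage` runs without any configuration, matching the "implicit by default" note for staging.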
27. Data Engineer derives columns using template expression
based on name and type matching. No need to define static
field names.
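The idea of deriving columns by name and type matching, rather than by listing field names, can be illustrated with a short Python analogy. ADF data flows use their own expression language for this; the function below, `derive_by_pattern`, is a hypothetical helper written only to show the matching logic, and it inspects the first row to decide which columns match.

```python
import re
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def derive_by_pattern(
    rows: List[Row],
    name_pattern: str,
    col_type: type,
    expression: Callable[[Any], Any],
) -> List[Row]:
    """Apply `expression` to every column whose name matches `name_pattern`
    and whose value in the first row is an instance of `col_type`."""
    if not rows:
        return []
    name_rx = re.compile(name_pattern)
    matched = [
        col for col, value in rows[0].items()
        if name_rx.search(col) and isinstance(value, col_type)
    ]
    result: List[Row] = []
    for row in rows:
        new_row = dict(row)
        for col in matched:
            new_row[col] = expression(row[col])
        result.append(new_row)
    return result


if __name__ == "__main__":
    people = [{"first_name": " Ada ", "last_name": "Lovelace ", "age": 36}]
    # Trim every string column whose name ends in "_name"; no field is named explicitly.
    print(derive_by_pattern(people, r"_name$", str, str.strip))
```

Because the rule is expressed against names and types, a new column such as `middle_name` would be picked up without changing the rule.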
Editor's Notes
5
SSIS PaaS requires VNet to enable on-prem data access for now, but we may also reuse ADF Data Management Gateway (DMG) to do the same in the near future.