This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing, modeling and serving data on Azure. Finally, it discusses architectures like the lambda architecture and common data models.
3. Antonios Chatzipavlis
Data Solutions Consultant & Trainer
1988 Beginning of my professional career
1996 I started working with SQL Server 6.0
1998 Certified as MCSD (3rd in Greece)
1999 Became an MCT
2010 Microsoft MVP on Data Platform
Created www.sqlschool.gr
2012 Became MCT Regional Lead by Microsoft Learning
2013 Certified as MCSE: Data Platform and MCSE: Business Intelligence
2016 Certified as MCSE: Data Management & Analytics
2018 Certified as MCSA: Machine Learning
Recertified as MCSE: Data Management & Analytics
4. SQLschool.gr Group
A community for Greek professionals who use the Microsoft Data Platform
What we are doing
• Articles
• SQL Server in Greek
• SQL Nights
• Webcasts
• SQL Server News
• Downloads
• Resources
Follow us
fb/sqlschoolgr
fb/groups/sqlschool
@antoniosch
@sqlschool
yt/c/SqlschoolGr
Ask your question at help@sqlschool.gr
7. WHAT IS A DATA WAREHOUSE?
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
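The four properties in this definition can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a warehouse; the table name `customer_history`, the source systems `crm` and `billing`, and the sample rows are all hypothetical and do not come from the presentation.

```python
import sqlite3

def build_customer_history():
    """Create a tiny in-memory table that mimics warehouse-style history."""
    conn = sqlite3.connect(":memory:")
    # Subject-oriented: the table is organized around one subject (customers),
    # not around the application that produced the rows.
    conn.execute(
        "CREATE TABLE customer_history ("
        " customer_id INTEGER, city TEXT, source_system TEXT, valid_from TEXT)"
    )
    # Integrated: rows from two hypothetical sources share one schema.
    # Time-variant: every row carries the date from which it is valid.
    # Non-volatile: a change is appended as a new row, never updated in place.
    rows = [
        (1, "Athens", "crm", "2018-01-01"),
        (1, "Thessaloniki", "billing", "2018-06-01"),
    ]
    conn.executemany("INSERT INTO customer_history VALUES (?, ?, ?, ?)", rows)
    return conn

def city_as_of(conn, customer_id, as_of_date):
    """Return the city that was valid for the customer on the given ISO date."""
    row = conn.execute(
        "SELECT city FROM customer_history"
        " WHERE customer_id = ? AND valid_from <= ?"
        " ORDER BY valid_from DESC LIMIT 1",
        (customer_id, as_of_date),
    ).fetchone()
    return row[0] if row else None
```

Because history is only appended, a question such as "where was this customer in March 2018?" can still be answered after later changes arrive.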
17. ON-PREMISES VS. CLOUD DW
• Evaluating Time to Value
• Accounting for Storage and Computing Costs
• Sizing, Balancing and Tuning
• Considering Data Preparation and ETL Costs
• Cost of Specialized Business Analytic Tools
• Scaling and Elasticity
• Delays and Downtime
• Cost of Security Breaches
• Data Protection and Recovery
18. STEPS TO GETTING STARTED WITH CLOUD DW
• Evaluate your data warehousing needs.
• Migrate or start fresh.
• Establish success criteria.
• Evaluate cloud data warehouse solutions.
• Calculate your total cost of ownership.
• Set up a proof of concept (POC).
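The total-cost-of-ownership step above can be roughed out with simple arithmetic. The sketch below assumes a simplified model in which compute is billed per active hour and storage per TB per month; the function names and every rate in the usage note are made-up placeholders, not real prices for any service.

```python
def monthly_cost(compute_rate_per_hour, active_hours, storage_tb, storage_rate_per_tb):
    """Monthly cost under the assumed model.

    Compute is charged only for the hours it is running;
    storage is charged for the whole month.
    """
    return compute_rate_per_hour * active_hours + storage_tb * storage_rate_per_tb

def total_cost_of_ownership(months, **monthly_inputs):
    """Project the monthly cost over a planning horizon, assuming flat usage."""
    return months * monthly_cost(**monthly_inputs)
```

For example, with placeholder inputs of 2.0 per compute hour, 100 active hours, 10 TB stored and 20.0 per TB, `monthly_cost(2.0, 100, 10, 20.0)` gives 400.0. A real estimate would also need the other cost items listed on the on-premises vs. cloud slide, such as ETL, tooling and downtime.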
25. INGEST DATA
ADF
• PaaS
• Mapping Data Flow transforms data (ETL)
• Copy Data tool easily copies from source to destination
• Templates
• Any new project
• Converting SSIS packages
• Row-by-row ETL can be slower
• Data needs to be moved to Databricks - limited by compute size
• Mapping Data Flow takes time to start up
SSIS
• SSDT - Visual Studio
• Very popular product
• Used for on-prem ETL for many years
• Too big of an effort to migrate existing packages
• Skillset staying on-prem
• Change to IR in ADF
• Row-by-row ETL can be slower
• Data needs to be moved to the IR
• Limited by node size/number of SSIS IR
26. STORE DATA
ADLS Gen 2
• PaaS
• Best features of blob storage
• Not all features are available yet
• Some products do not support it yet
• 5 TB file size limit
Blob Storage
• PaaS
• Original storage
• Most popular
• Don't use for new projects
• Account limit 2 PB for US and Europe
• 4.75 TB file size limit
SQL Server 2019 Big Data Cluster
• IaaS
• Combines SQL Server database engine, Spark, HDFS (ADLS Gen2) into a unified data platform
• Deployed as containers on Kubernetes
• PolyBase
• Hybrid cloud
• Data virtualization
• AI Platform
27. PREP DATA
Azure Databricks
• PaaS
• Processing massive amounts of data
• Training & deploying models
• Manage workflows
• Spark & notebooks
• Integration with ADLS, SQL DW, PBI
• Writing code
• Steep learning curve
Azure HDInsight
• PaaS
• Deploys & provisions Apache Hadoop clusters
• No integration with SQL DW
• Always running and incurring cost
• Hortonworks merged with Cloudera
PolyBase & Stored Procedures in SQL DW
• IaaS
• T-SQL queries via external tables
• Tuning queries
• Increase storage space
Power BI Dataflow
• Power BI service
• Power Query
• Self-service data prep
• Individual solution
• Small workloads
• Don't use this to replace a DW or ADF
28. MODEL & SERVE DATA
Azure SQL DW
• PaaS
• Fully managed petabyte-scale cloud DW
• Can scale compute and storage independently
• Can be paused
• MPP
Azure Analysis Services
• PaaS
• Tabular model
• Fast queries
• High concurrency
• Semantic layer
• Vertical scale-out
• High availability
• Advanced time calculations
• Time to process the cube
Azure SQL Database
• PaaS
• Suitable for small DW
• Size limits/tier
• Optimized for OLTP
SQL Server in VM
• IaaS
• MDX models
Cosmos DB
• PaaS
• Globally distributed
• Multi-model database service
• Spark to Cosmos DB connector for DW aggregations
30. ETL vs ELT
Time - Load
• ETL: Uses a staging area and system, extra time to load data
• ELT: All in one system, load only once
Time - Transformation
• ETL: Need to wait, especially for big data sizes - as data grows, transformation time increases
• ELT: All in one system, speed is not dependent on data size
Time - Maintenance
• ETL: High maintenance - must choose which data to load and transform, and must do it again if data is deleted or the main data repository needs enhancing
• ELT: Low maintenance - all data is always available
Implementation complexity
• ETL: At an early stage, requires less space and the result is clean
• ELT: Requires in-depth knowledge of tools and expert design of the main large repository
Analysis & Processing style
• ETL: Based on multiple scripts to create the views - deleting a view means deleting data
• ELT: Creating ad hoc views - low cost for building and maintaining
Data limitation or restriction
• ETL: By presuming and choosing data a priori
• ELT: By HW (none) and data retention policy
DW Support
• ETL: Prevalent legacy model used for on-premises, relational, structured data
• ELT: Tailored for scalable cloud infrastructure, supporting structured and unstructured big data sources
Data Lake Support
• ETL: Not part of the approach
• ELT: Enables use of a data lake, with unstructured data supported
Usability
• ETL: Fixed tables, fixed timeline, used mainly by IT
• ELT: Ad hoc, agility, flexibility, usable by everyone from developer to citizen integrator
Cost-effective
• ETL: Not cost-effective for small and medium businesses
• ELT: Scalable and available to all business sizes using online SaaS solutions
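The difference in where the transformation runs can be shown with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for the target warehouse; the sample rows, the table names `raw_orders` and `customer_totals`, and the function names are hypothetical and are not taken from the presentation.

```python
import sqlite3

# Made-up sample rows: (customer, amount as text); one amount is invalid.
RAW_ORDERS = [("A", "10.50"), ("B", "n/a"), ("A", "4.50")]

def run_etl(rows):
    """ETL: clean and aggregate outside the target, then load only the result."""
    totals = {}
    for customer, amount in rows:
        try:
            value = float(amount)
        except ValueError:
            continue  # rejected rows never reach the warehouse
        totals[customer] = totals.get(customer, 0.0) + value
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO customer_totals VALUES (?, ?)", sorted(totals.items())
    )
    return conn

def run_elt(rows):
    """ELT: load the raw rows as-is, then transform inside the target with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    # The transformation is a view over the raw data, so it can be dropped
    # or redefined later without losing anything that was loaded.
    conn.execute(
        "CREATE VIEW customer_totals AS"
        " SELECT customer, SUM(CAST(amount AS REAL)) AS total"
        " FROM raw_orders WHERE amount GLOB '[0-9]*'"
        " GROUP BY customer"
    )
    return conn

def read_totals(conn):
    """Read the per-customer totals from either pipeline's output."""
    return conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
```

Both functions produce the same totals, but only the ELT variant still holds the rejected raw row in the target, which is the point the "Time - Maintenance" and "Analysis & Processing style" rows make.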
35. A community for Greek professionals who use the Microsoft Data Platform
Copyright © 2018 SQLschool.gr. All rights reserved. PRESENTER MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION