Data Warehouse Design and Best Practices

Ivo Andreev, System Architect
About me 
 Project Manager @ 
 12 years professional experience 
 .NET Web Development MCPD 
 SQL Server 2012 (MCSA) 
 Business Interests 
 Web Development, SOA, Integration 
 Security & Performance Optimization
 Horizon2020, Open BIM, GIS, Mapping 
 Contact me 
 ivelin.andreev@icb.bg 
 www.linkedin.com/in/ivelin 
 www.slideshare.net/ivoandreev 
About me 
 Senior Developer @ 
 .NET Web Development MCPD 
 Business Interests 
 Web Development, WCF, Integration 
 SQL Server – Query Optimization and Tuning 
 Data Warehousing 
 Contact me 
 georgi.mishev@icb.bg 
 www.linkedin.com/in/georgimishev
Agenda 
 Why Data Warehouse 
 Main DW Architectures 
 Dimensional Modeling 
 Patterns & Practices
 DW Maintenance 
 ETL Process 
 SSIS Demo
Lots of Data Everywhere 
 Can’t find data? 
 Data scattered over the network 
 Can’t get data? 
 Need an expert to get the data 
 Can’t understand data? 
 Data poorly documented 
 Can’t use data found? 
 Data needs to be transformed
Data Warehouse? 
Def: A central repository where data is organized, cleansed and stored in a standardized format.
 Integrated 
 Heterogeneous sources 
 Data cleansing and conversion ($, €, 元)
 Focus on subject
 e.g. Customer, Sale, Product
 Time variant 
 Timestamp every key 
 Historical data (10+ years)
Different Problems - Different Solutions 
Aspect | OLTP Database | Data Warehouse
Users | Customer | Knowledge worker
Design | Normalized, Data Integrity | Denormalized
Function | Daily operation | Decision making
Data | Current, Detailed | Historical, Aggregated
Usage | Real time | Ad-hoc
Access | Short R/W transactions | Complex R/O queries
Data accessed | Comparatively lower | Large Amounts
# Records | x100 | x1’000’000
# Users | x1’000 | x10
DB Size | x10 GB | x100 GB-TB
Different DW Architectures
B.Inmon Model 
Top-Down Approach 
 Warehouse (3NF) 
 Data Mart & OLAP (MD)
http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640h=368
R.Kimball Model 
Bottom-Up Approach 
 Data Marts (3NF or MD) 
 Warehouse & OLAP (MD)
http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640h=369
Data Vault (by Dan Linstedt) 
 Hubs 
 List of unique business keys 
 Links 
 Unique relationships between keys 
 Satellites 
 Hub and Link details and history
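A minimal sketch of how hubs, links and satellites could be laid out, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for the warehouse engine. All table and column names (hub_customer, link_purchase, sat_customer and so on) are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical names throughout; SQLite stands in for the warehouse engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: one row per unique business key
CREATE TABLE hub_customer (
    customer_hk INTEGER PRIMARY KEY,
    customer_bk TEXT NOT NULL UNIQUE
);
CREATE TABLE hub_product (
    product_hk INTEGER PRIMARY KEY,
    product_bk TEXT NOT NULL UNIQUE
);
-- Link: one row per unique relationship between hub keys
CREATE TABLE link_purchase (
    purchase_hk INTEGER PRIMARY KEY,
    customer_hk INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    product_hk  INTEGER NOT NULL REFERENCES hub_product(product_hk),
    UNIQUE (customer_hk, product_hk)
);
-- Satellite: descriptive details and their history, keyed by load date
CREATE TABLE sat_customer (
    customer_hk INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
    load_date   TEXT NOT NULL,
    city        TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES (1, 'C-100')")
conn.execute("INSERT INTO hub_product VALUES (1, 'P-200')")
conn.execute("INSERT INTO link_purchase VALUES (1, 1, 1)")
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?)",
    [(1, "2014-01-01", "Sofia"), (1, "2014-06-01", "Plovdiv")],
)

# Full attribute history for one business key: hub joined to its satellite.
history = conn.execute(
    """
    SELECT s.load_date, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    WHERE h.customer_bk = ?
    ORDER BY s.load_date
    """,
    ("C-100",),
).fetchall()
```

Here the hub holds only the business key, so a change of city adds a satellite row and leaves the hub and link untouched.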
It is irrelevant which camp you belong to…
as long as you understand why!
Making Your Choice
• Kimball (MD)
+ Start small, scale big
+ Faster ROI
+ Analytical tools
- Low reusability
• Inmon (3NF)
+ Structured
+ Easy to maintain
+ Easier data mining
- Time-consuming to build
• Data Vault
Backend Data Warehouse
+ Multiple sources; Full history; Incremental build
- Up-front work; Long-term payoff; Many joins
Dimensional modeling as de-facto standard
Dimensions 
Def: The object of BI interest 
 Keys 
 Surrogate key 
 Business key 
 Hierarchical attributes 
 Analysis and Drill Down 
 Member properties 
 Presentation labels 
 Auditing information (not for end users)
Slowly Changing Dimensions 
Def: Scheme for recording changes over time 
 Type 1 - Overwrite 
 Type 2 – Multiple Records
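The difference between the two types can be sketched in plain Python, assuming a dimension held as a list of dicts. The column names (surrogate_key, business_key, valid_from, valid_to) and the helper functions are hypothetical, not taken from the slides.

```python
from datetime import date


def apply_type1(dim_rows, business_key, changes):
    """Type 1: overwrite attributes in place; the old values are lost."""
    for row in dim_rows:
        if row["business_key"] == business_key:
            row.update(changes)


def apply_type2(dim_rows, business_key, changes, change_date):
    """Type 2: close the current row and append a new version of it."""
    current = next(
        r for r in dim_rows
        if r["business_key"] == business_key and r["valid_to"] is None
    )
    current["valid_to"] = change_date
    new_row = dict(current)
    new_row.update(changes)
    # The new version gets its own surrogate key; the business key is unchanged.
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in dim_rows) + 1
    new_row["valid_from"] = change_date
    new_row["valid_to"] = None
    dim_rows.append(new_row)
    return new_row


def new_dimension():
    """Hypothetical customer dimension with a single current row."""
    return [{"surrogate_key": 1, "business_key": "C-100", "city": "Sofia",
             "valid_from": date(2010, 1, 1), "valid_to": None}]


type1_rows = new_dimension()
apply_type1(type1_rows, "C-100", {"city": "Varna"})

type2_rows = new_dimension()
apply_type2(type2_rows, "C-100", {"city": "Varna"}, date(2014, 5, 1))
```

After the Type 1 change the dimension still has one row; after the Type 2 change it has two, and facts loaded before the change keep pointing at the old surrogate key.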
Facts 
Def: Measurement of a business process 
 Keys 
 FKs to all dimension tables (in the star)
 PK - Composite (usually) or Surrogate 
 Measures 
 Numeric columns that are of interest to the business
 Additive, Non-additive, Semi-additive 
 Factless facts 
 Auditing information (optional)
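A small sketch of additive versus semi-additive measures, assuming a hypothetical daily account snapshot fact with a deposit amount (additive) and a balance (semi-additive: it can be summed across accounts but not across dates). The data and names are made up for illustration.

```python
# Hypothetical fact rows: (date_key, account, deposit_amount, balance)
snapshot_fact = [
    (20140101, "A", 100, 100),
    (20140101, "B", 50, 50),
    (20140102, "A", 20, 120),
    (20140102, "B", 0, 50),
]

# Additive measure: summing over every dimension is meaningful.
total_deposits = sum(row[2] for row in snapshot_fact)

# Semi-additive measure: sum across accounts within one date only.
balance_by_day = {}
for date_key, _account, _deposit, balance in snapshot_fact:
    balance_by_day[date_key] = balance_by_day.get(date_key, 0) + balance

# Across time, take the last snapshot instead of summing.
closing_balance = balance_by_day[max(balance_by_day)]

# Summing the balance across dates as well double counts the money.
naive_balance_sum = sum(row[3] for row in snapshot_fact)
```

Non-additive measures such as ratios or unit prices cannot be summed over any dimension; a common approach is to store their additive components and compute the ratio at query time.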
Practices and Design Patterns
Data Warehouse Pitfalls 
 Admit it is not as it seems to be 
 You need education 
 Find what is of business value 
 Rather than focus on performance 
 Spend a lot of time in Extract-Transform-Load 
 Homogenize data from different sources 
 Find (and resolve) problems in source systems
Prepare your Sources 
 Data integrity 
 Avoid redundancy 
 Data quality 
 Master data source 
 Data validation 
 Auditing 
 CreatedDate / CreatedBy 
 ChangedDate / ChangedBy 
 Nightly jobs
Dimension Design 
 Business key with non-clustered index 
 Include date (if dimension has history) 
 Surrogate key 
 The smallest possible integer 
 Clustered index 
 FK constraints 
 Do not enforce (WITH NOCHECK) 
 Document the relation 
 Faster load 
 Data validation 
 Task for the Source system
Conformed Dimensions 
Def: Dimensions that have the same meaning and content
when referenced from multiple fact tables
 Date Dimension 
 Best candidate for partitioning
 Granularity 
 Do not store every hour, when reporting daily 
 Avoid surrogate keys 
 Saves lookup and joins 
 Integer representing the date (yyyyMMdd, or days since 1/1/1900)
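The two integer encodings of a date key mentioned above can be sketched as follows. The function names are hypothetical; only Python's standard datetime module is used.

```python
from datetime import date, timedelta

EPOCH = date(1900, 1, 1)


def date_to_key(d):
    """Encode a date as a yyyyMMdd integer, e.g. 15 March 2014 -> 20140315."""
    return d.year * 10000 + d.month * 100 + d.day


def key_to_date(key):
    """Decode a yyyyMMdd integer back into a date."""
    return date(key // 10000, (key // 100) % 100, key % 100)


def date_to_day_number(d):
    """Alternative encoding: whole days elapsed since 1/1/1900."""
    return (d - EPOCH).days


def day_number_to_date(n):
    """Decode a day number back into a date."""
    return EPOCH + timedelta(days=n)


sample = date(2014, 3, 15)
sample_key = date_to_key(sample)
sample_day_number = date_to_day_number(sample)
```

Because the key is derived from the date itself, the fact load needs no lookup against the date dimension, and a range filter on the key remains a plain integer comparison.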
Pre-join Hierarchies 
 Recursive relationships 
 Fast drill and report 
 Pre-computed aggregations 
Hierarchy Bridge 
 For each dimension row 
 1 association with self 
 1 row for each subordinate
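A sketch of how the bridge rows described above could be derived from a recursive parent-child relationship. The input format (a dict mapping each member to its parent) and the function names are assumptions made for illustration.

```python
def build_hierarchy_bridge(parent_of):
    """Return sorted (ancestor, descendant, depth) rows.

    parent_of maps each member id to its parent id (None for the root).
    Every member gets one row pairing it with itself (depth 0) plus one
    row for each of its ancestors, so each ancestor ends up with one row
    per subordinate.
    """
    bridge = []
    for member in parent_of:
        ancestor, depth = member, 0
        while ancestor is not None:
            bridge.append((ancestor, member, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return sorted(bridge)


# Hypothetical organization: 1 is the root, 2 and 3 report to 1, 4 reports to 2.
org = {1: None, 2: 1, 3: 1, 4: 2}
bridge_rows = build_hierarchy_bridge(org)


def subordinates(bridge, ancestor):
    """All members below the given ancestor, without any recursion."""
    return [d for a, d, depth in bridge if a == ancestor and depth > 0]
```

With the bridge in place, a roll-up over a whole subtree becomes a single join on the ancestor column instead of a recursive query.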
Determine the Facts 
The center of a Star schema 
 Identify subject areas 
 Identify key business events 
 Identify dimensions 
 Start from OLTP logical model 
 Identify historical requirements 
 Identify attributes
The Grain 
Def: The level of detail of a fact table 
 What is the business objective? 
 Fine grain - behaviour and frequency analysis 
 Coarse grain - overall and trend analysis 
 Aggregates 
 DO NOT summarize prematurely 
 DO NOT mix detail and summary 
 DO use “summary tables”
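A minimal sketch of the summary table advice, assuming a hypothetical fine-grain sales fact keyed by a yyyyMMdd date key. The detail rows are kept as loaded and the monthly aggregate is built into a separate structure, so detail and summary are never mixed.

```python
from collections import defaultdict

# Hypothetical fine-grain fact rows: (date_key, product, amount)
sales_fact = [
    (20140105, "P1", 10.0),
    (20140120, "P1", 5.0),
    (20140203, "P1", 7.5),
    (20140210, "P2", 2.5),
]


def build_monthly_summary(fact_rows):
    """Aggregate the detail rows to a coarser (month, product) grain."""
    summary = defaultdict(float)
    for date_key, product, amount in fact_rows:
        month_key = date_key // 100  # yyyyMMdd -> yyyyMM
        summary[(month_key, product)] += amount
    return dict(summary)


# The summary lives next to the detail fact, never inside it.
monthly_sales_summary = build_monthly_summary(sales_fact)
```

Since the summary is derived from the detail, it can be rebuilt at any time, and questions that need the fine grain remain answerable.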
C-3PO is fluent in over 6M forms of communication.
What about your customers?
Multinational DW 
 What parts need translation? 
 Where to store various language versions? 
 How to support future languages? 
 Dimensions 
 Add language attribute 
 Include text data in the dimension 
 Problem 1: The dimension key? 
 Replicate PK for every language 
Fact.DimId = Dim.Id AND Dim.Lang=[Lang] 
 Problem 2: Storage = [Dim] x [Lang] 
 Sub-dimension with language attributes 
TxtId Attr1 Attr2 LangId 
1 large Yes En 
2 small No En 
1 stor Ja No 
2 liten Nei No 
3 … … …
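The language sub-dimension lookup can be sketched with SQLite through Python's standard sqlite3 module, reusing the sample rows from the table above. The table names dim_text and fact_sale and the Quantity column are hypothetical; the join condition follows the Fact.DimId = Dim.Id AND Dim.Lang = [Lang] pattern.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for the warehouse engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_text (
    TxtId  INTEGER NOT NULL,
    Attr1  TEXT,
    Attr2  TEXT,
    LangId TEXT NOT NULL,
    PRIMARY KEY (TxtId, LangId)
);
CREATE TABLE fact_sale (
    TxtId    INTEGER NOT NULL,
    Quantity INTEGER NOT NULL
);
""")
conn.executemany(
    "INSERT INTO dim_text VALUES (?, ?, ?, ?)",
    [
        (1, "large", "Yes", "En"),
        (2, "small", "No", "En"),
        (1, "stor", "Ja", "No"),
        (2, "liten", "Nei", "No"),
    ],
)
conn.executemany(
    "INSERT INTO fact_sale VALUES (?, ?)",
    [(1, 5), (2, 3), (1, 2)],
)


def quantity_by_label(lang):
    """Total quantity per label, with labels shown in the requested language."""
    return conn.execute(
        """
        SELECT t.Attr1, SUM(f.Quantity)
        FROM fact_sale f
        JOIN dim_text t ON f.TxtId = t.TxtId AND t.LangId = ?
        GROUP BY t.TxtId, t.Attr1
        ORDER BY t.TxtId
        """,
        (lang,),
    ).fetchall()


english = quantity_by_label("En")
norwegian = quantity_by_label("No")
```

The fact rows store only the language-neutral TxtId, so supporting a future language means adding rows to the text table rather than changing the facts.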
Data warehouse maintenance
How Large is “Large” 
Is big really big?
Partitioning 
 Why 
 Faster index maintenance 
 Faster load 
 Faster queries 
 When 
 Tables 10GB+ 
 How 
 Do not partition dimension tables 
 Partition by date (most analyses are time-based)
 Eliminate partitions (WHERE [PartitionKey]=…) 
 Avoid split and merge of existing partitions 
 Can cause inefficient log generation
Columnstore Index 
 Non-clustered in SQL Server 2012
 Clustered in SQL Server 2014
 Pros 
 Better data compression 
 High performance on table scan 
 Clustered CSI Limitations 
 No other indexes allowed 
 Little advantage on seek operations 
 No XML, computed column or replication
Extract-Transform-Load 
 Extract data from OLTP 
 Data transformations 
 Data loads 
 DW maintenance
Efficient Load Process 
 Use simple recovery model during data load 
 Staging 
 Avoid indexing 
 Populate in parallel 
 Maintain DW 
 Disable indexes on load 
 Rebuild manually after load 
 Automatic statistics updates slow down SQL Server
To SSIS, or not to SSIS ? 
Pros 
 Minimal to no coding
 Extensive support of various data sources 
 Parallel execution of migration tasks 
 Better organization of the ETL process 
Cons 
 Another way of thinking 
 Hidden options 
 A T-SQL developer would do it much faster
 Auto-generated flows need optimization
 Sometimes simply does not work (e.g. Sort by GUID)
Takeaways 
 Books 
 The Data Warehouse Toolkit (3rd ed), Ralph Kimball 
 Implementing DW with Microsoft SQL Server 2012 
 Data Warehousing Fundamentals, Paulraj Ponniah 
 Articles 
 Best Practices in Data Warehouse (Hanover Research Council) 
 http://www.kimballgroup.com/category/design-tips/ 
 http://sqlmag.com/business-intelligence 
 Resources 
 http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
 http://www.databaseanswers.org/data_models/index.htm
Data Warehouse Design and Best Practices
1 de 38

Recomendados

Data Lakehouse, Data Mesh, and Data Fabric (r1) por
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
5.5K visualizações27 slides
Introduction SQL Analytics on Lakehouse Architecture por
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
5.8K visualizações52 slides
Building a modern data warehouse por
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
15.6K visualizações57 slides
Data Warehouse or Data Lake, Which Do I Choose? por
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
808 visualizações26 slides
Data Lakehouse, Data Mesh, and Data Fabric (r2) por
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
6.3K visualizações30 slides
Data Lake Architecture – Modern Strategies & Approaches por
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesDATAVERSITY
4.7K visualizações31 slides

Mais conteúdo relacionado

Mais procurados

Achieving Lakehouse Models with Spark 3.0 por
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
622 visualizações25 slides
Databricks + Snowflake: Catalyzing Data and AI Initiatives por
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
1.8K visualizações22 slides
Time to Talk about Data Mesh por
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
467 visualizações21 slides
Data Mesh Part 4 Monolith to Mesh por
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
1.9K visualizações39 slides
Azure data analytics platform - A reference architecture por
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Rajesh Kumar
489 visualizações24 slides
Why Data Virtualization? An Introduction por
Why Data Virtualization? An IntroductionWhy Data Virtualization? An Introduction
Why Data Virtualization? An IntroductionDenodo
2.4K visualizações35 slides

Mais procurados(20)

Achieving Lakehouse Models with Spark 3.0 por Databricks
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks622 visualizações
Databricks + Snowflake: Catalyzing Data and AI Initiatives por Databricks
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks1.8K visualizações
Time to Talk about Data Mesh por LibbySchulze
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
LibbySchulze467 visualizações
Data Mesh Part 4 Monolith to Mesh por Jeffrey T. Pollock
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock1.9K visualizações
Azure data analytics platform - A reference architecture por Rajesh Kumar
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
Rajesh Kumar489 visualizações
Why Data Virtualization? An Introduction por Denodo
Why Data Virtualization? An IntroductionWhy Data Virtualization? An Introduction
Why Data Virtualization? An Introduction
Denodo 2.4K visualizações
Big data architectures and the data lake por James Serra
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra54.1K visualizações
Data Lake Overview por James Serra
Data Lake OverviewData Lake Overview
Data Lake Overview
James Serra19.9K visualizações
Building the Data Lake with Azure Data Factory and Data Lake Analytics por Khalid Salama
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Khalid Salama13K visualizações
Building a Data Lake on AWS por Gary Stafford
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
Gary Stafford184 visualizações
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... por Databricks
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks1.5K visualizações
Data Architecture Best Practices for Advanced Analytics por DATAVERSITY
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
DATAVERSITY924 visualizações
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard por Paris Data Engineers !
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !1.3K visualizações
Business Data Lake Best Practices por Capgemini
Business Data Lake Best PracticesBusiness Data Lake Best Practices
Business Data Lake Best Practices
Capgemini4.2K visualizações
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A... por Cathrine Wilhelmsen
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen1K visualizações
Data Lake Architecture por DATAVERSITY
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
DATAVERSITY15.5K visualizações
Apache Iceberg Presentation for the St. Louis Big Data IDEA por Adam Doyle
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle615 visualizações
Databricks Platform.pptx por Alex Ivy
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy3.4K visualizações
Introduction To Data Vault - DAMA Oregon 2012 por Empowered Holdings, LLC
Introduction To Data Vault - DAMA Oregon 2012Introduction To Data Vault - DAMA Oregon 2012
Introduction To Data Vault - DAMA Oregon 2012
Empowered Holdings, LLC14.3K visualizações
Data Modelling is NOT just for RDBMS's por Christopher Bradley
Data Modelling is NOT just for RDBMS'sData Modelling is NOT just for RDBMS's
Data Modelling is NOT just for RDBMS's
Christopher Bradley3.8K visualizações

Similar a Data Warehouse Design and Best Practices

Building an Effective Data Warehouse Architecture por
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
138.5K visualizações34 slides
Data Vault Overview por
Data Vault OverviewData Vault Overview
Data Vault OverviewEmpowered Holdings, LLC
9K visualizações88 slides
Business Intelligence with SQL Server por
Business Intelligence with SQL ServerBusiness Intelligence with SQL Server
Business Intelligence with SQL ServerPeter Gfader
2.3K visualizações62 slides
ITReady DW Day2 por
ITReady DW Day2ITReady DW Day2
ITReady DW Day2Siwawong Wuttipongprasert
1.7K visualizações111 slides
CV | Sham Sunder | Data | Database | Business Intelligence | .Net por
CV | Sham Sunder | Data | Database | Business Intelligence | .NetCV | Sham Sunder | Data | Database | Business Intelligence | .Net
CV | Sham Sunder | Data | Database | Business Intelligence | .NetSham Sunder
34 visualizações2 slides
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys por
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysNEWYORKSYS-IT SOLUTIONS
2.3K visualizações31 slides

Similar a Data Warehouse Design and Best Practices(20)

Building an Effective Data Warehouse Architecture por James Serra
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
James Serra138.5K visualizações
Business Intelligence with SQL Server por Peter Gfader
Business Intelligence with SQL ServerBusiness Intelligence with SQL Server
Business Intelligence with SQL Server
Peter Gfader2.3K visualizações
CV | Sham Sunder | Data | Database | Business Intelligence | .Net por Sham Sunder
CV | Sham Sunder | Data | Database | Business Intelligence | .NetCV | Sham Sunder | Data | Database | Business Intelligence | .Net
CV | Sham Sunder | Data | Database | Business Intelligence | .Net
Sham Sunder34 visualizações
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys por NEWYORKSYS-IT SOLUTIONS
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysWhat is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
What is OLAP -Data Warehouse Concepts - IT Online Training @ Newyorksys
NEWYORKSYS-IT SOLUTIONS2.3K visualizações
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411 por Mark Tabladillo
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Mark Tabladillo575 visualizações
Modern data warehouse por Elena Lopez
Modern data warehouseModern data warehouse
Modern data warehouse
Elena Lopez30 visualizações
MinneBar 2013 - Scaling with Cassandra por Jeff Smoley
MinneBar 2013 - Scaling with CassandraMinneBar 2013 - Scaling with Cassandra
MinneBar 2013 - Scaling with Cassandra
Jeff Smoley1K visualizações
AnalysisServices por webuploader
AnalysisServicesAnalysisServices
AnalysisServices
webuploader311 visualizações
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks por Grega Kespret
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret1.5K visualizações
Building Analytic Apps for SaaS: “Analytics as a Service” por Amazon Web Services
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”
Amazon Web Services888 visualizações
Datawarehousing & DSS por Deepali Raut
Datawarehousing & DSSDatawarehousing & DSS
Datawarehousing & DSS
Deepali Raut6.7K visualizações
Big Data Analytics in the Cloud with Microsoft Azure por Mark Kromer
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer2K visualizações
Arquitectura de Datos en Azure por Elena Lopez
Arquitectura de Datos en AzureArquitectura de Datos en Azure
Arquitectura de Datos en Azure
Elena Lopez61 visualizações
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling por Kent Graziano
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingAgile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Kent Graziano12.6K visualizações
2014.11.14 Data Opportunities with Azure por Marco Parenzan
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure
Marco Parenzan432 visualizações
Overview of business intelligence por Ahsan Kabir
Overview of business intelligenceOverview of business intelligence
Overview of business intelligence
Ahsan Kabir300 visualizações
OLAP Cubes in Datawarehousing por Prithwis Mukerjee
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in Datawarehousing
Prithwis Mukerjee10.1K visualizações
Best Practices for Building and Deploying Data Pipelines in Apache Spark por Databricks
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks1.9K visualizações

Mais de Ivo Andreev

How do OpenAI GPT Models Work - Misconceptions and Tips for Developers por
How do OpenAI GPT Models Work - Misconceptions and Tips for DevelopersHow do OpenAI GPT Models Work - Misconceptions and Tips for Developers
How do OpenAI GPT Models Work - Misconceptions and Tips for DevelopersIvo Andreev
46 visualizações36 slides
OpenAI GPT in Depth - Questions and Misconceptions por
OpenAI GPT in Depth - Questions and MisconceptionsOpenAI GPT in Depth - Questions and Misconceptions
OpenAI GPT in Depth - Questions and MisconceptionsIvo Andreev
99 visualizações36 slides
Cutting Edge Computer Vision for Everyone por
Cutting Edge Computer Vision for EveryoneCutting Edge Computer Vision for Everyone
Cutting Edge Computer Vision for EveryoneIvo Andreev
17 visualizações39 slides
Collecting and Analysing Spaceborn Data por
Collecting and Analysing Spaceborn DataCollecting and Analysing Spaceborn Data
Collecting and Analysing Spaceborn DataIvo Andreev
32 visualizações42 slides
Collecting and Analysing Satellite Data with Azure Orbital por
Collecting and Analysing Satellite Data with Azure OrbitalCollecting and Analysing Satellite Data with Azure Orbital
Collecting and Analysing Satellite Data with Azure OrbitalIvo Andreev
93 visualizações42 slides
CosmosDB for IoT Scenarios por
CosmosDB for IoT ScenariosCosmosDB for IoT Scenarios
CosmosDB for IoT ScenariosIvo Andreev
107 visualizações13 slides

Mais de Ivo Andreev(20)

How do OpenAI GPT Models Work - Misconceptions and Tips for Developers por Ivo Andreev
How do OpenAI GPT Models Work - Misconceptions and Tips for DevelopersHow do OpenAI GPT Models Work - Misconceptions and Tips for Developers
How do OpenAI GPT Models Work - Misconceptions and Tips for Developers
Ivo Andreev46 visualizações
OpenAI GPT in Depth - Questions and Misconceptions por Ivo Andreev
OpenAI GPT in Depth - Questions and MisconceptionsOpenAI GPT in Depth - Questions and Misconceptions
OpenAI GPT in Depth - Questions and Misconceptions
Ivo Andreev99 visualizações
Cutting Edge Computer Vision for Everyone por Ivo Andreev
Cutting Edge Computer Vision for EveryoneCutting Edge Computer Vision for Everyone
Cutting Edge Computer Vision for Everyone
Ivo Andreev17 visualizações
Collecting and Analysing Spaceborn Data por Ivo Andreev
Collecting and Analysing Spaceborn DataCollecting and Analysing Spaceborn Data
Collecting and Analysing Spaceborn Data
Ivo Andreev32 visualizações
Collecting and Analysing Satellite Data with Azure Orbital por Ivo Andreev
Collecting and Analysing Satellite Data with Azure OrbitalCollecting and Analysing Satellite Data with Azure Orbital
Collecting and Analysing Satellite Data with Azure Orbital
Ivo Andreev93 visualizações
CosmosDB for IoT Scenarios por Ivo Andreev
CosmosDB for IoT ScenariosCosmosDB for IoT Scenarios
CosmosDB for IoT Scenarios
Ivo Andreev107 visualizações
Forecasting time series powerful and simple por Ivo Andreev
Forecasting time series powerful and simpleForecasting time series powerful and simple
Forecasting time series powerful and simple
Ivo Andreev183 visualizações
Constrained Optimization with Genetic Algorithms and Project Bonsai por Ivo Andreev
Constrained Optimization with Genetic Algorithms and Project BonsaiConstrained Optimization with Genetic Algorithms and Project Bonsai
Constrained Optimization with Genetic Algorithms and Project Bonsai
Ivo Andreev232 visualizações
Azure security guidelines for developers por Ivo Andreev
Azure security guidelines for developers Azure security guidelines for developers
Azure security guidelines for developers
Ivo Andreev72 visualizações
Autonomous Machines with Project Bonsai por Ivo Andreev
Autonomous Machines with Project BonsaiAutonomous Machines with Project Bonsai
Autonomous Machines with Project Bonsai
Ivo Andreev4.9K visualizações
Global azure virtual 2021 - Azure Lighthouse por Ivo Andreev
Global azure virtual 2021 - Azure LighthouseGlobal azure virtual 2021 - Azure Lighthouse
Global azure virtual 2021 - Azure Lighthouse
Ivo Andreev2.6K visualizações
Flux QL - Nexgen Management of Time Series Inspired by JS por Ivo Andreev
Flux QL - Nexgen Management of Time Series Inspired by JSFlux QL - Nexgen Management of Time Series Inspired by JS
Flux QL - Nexgen Management of Time Series Inspired by JS
Ivo Andreev2.5K visualizações
Azure architecture design patterns - proven solutions to common challenges por Ivo Andreev
Azure architecture design patterns - proven solutions to common challengesAzure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challenges
Ivo Andreev4.3K visualizações
Industrial IoT on Azure por Ivo Andreev
Industrial IoT on AzureIndustrial IoT on Azure
Industrial IoT on Azure
Ivo Andreev7.1K visualizações
The Power of Auto ML and How Does it Work por Ivo Andreev
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev2.8K visualizações
Flying a Drone with JavaScript and Computer Vision por Ivo Andreev
Flying a Drone with JavaScript and Computer VisionFlying a Drone with JavaScript and Computer Vision
Flying a Drone with JavaScript and Computer Vision
Ivo Andreev1.3K visualizações
ML with Power BI for Business and Pros por Ivo Andreev
ML with Power BI for Business and ProsML with Power BI for Business and Pros
ML with Power BI for Business and Pros
Ivo Andreev9.2K visualizações
Industrial IoT with Azure and Open Source por Ivo Andreev
Industrial IoT with Azure and Open SourceIndustrial IoT with Azure and Open Source
Industrial IoT with Azure and Open Source
Ivo Andreev12.1K visualizações
Machine Learning at Hand with Power BI por Ivo Andreev
Machine Learning at Hand with Power BIMachine Learning at Hand with Power BI
Machine Learning at Hand with Power BI
Ivo Andreev4.8K visualizações
Python Development in VS2019 por Ivo Andreev
Python Development in VS2019Python Development in VS2019
Python Development in VS2019
Ivo Andreev3.2K visualizações

Último

DGST Methodology Presentation.pdf por
DGST Methodology Presentation.pdfDGST Methodology Presentation.pdf
DGST Methodology Presentation.pdfmaddierlegum
8 visualizações9 slides
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between... por
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...DataScienceConferenc1
5 visualizações9 slides
AIMS-EREA.pdf por
AIMS-EREA.pdfAIMS-EREA.pdf
AIMS-EREA.pdfSudarson Roy Pratihar
8 visualizações18 slides
Construction Accidents & Injuries por
Construction Accidents & InjuriesConstruction Accidents & Injuries
Construction Accidents & InjuriesBisnar Chase Personal Injury Attorneys
7 visualizações5 slides
K-Drama Recommendation Using Python por
K-Drama Recommendation Using PythonK-Drama Recommendation Using Python
K-Drama Recommendation Using PythonFridaPutriassa
9 visualizações20 slides
Infomatica-MDM.pptx por
Infomatica-MDM.pptxInfomatica-MDM.pptx
Infomatica-MDM.pptxKapil Rangwani
13 visualizações16 slides

Último(20)

DGST Methodology Presentation.pdf por maddierlegum
DGST Methodology Presentation.pdfDGST Methodology Presentation.pdf
DGST Methodology Presentation.pdf
maddierlegum8 visualizações
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between... por DataScienceConferenc1
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
DataScienceConferenc15 visualizações
K-Drama Recommendation Using Python por FridaPutriassa
K-Drama Recommendation Using PythonK-Drama Recommendation Using Python
K-Drama Recommendation Using Python
FridaPutriassa9 visualizações
Infomatica-MDM.pptx por Kapil Rangwani
Infomatica-MDM.pptxInfomatica-MDM.pptx
Infomatica-MDM.pptx
Kapil Rangwani13 visualizações
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion por Bertram Ludäscher
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Bertram Ludäscher11 visualizações
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange por RNayak3
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS TriangeAnalytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange
RNayak35 visualizações
Listed Instruments Survey 2022.pptx por secretariat4
Listed Instruments Survey  2022.pptxListed Instruments Survey  2022.pptx
Listed Instruments Survey 2022.pptx
secretariat4148 visualizações
Employees attrition por MaryAlejandraDiaz
Employees attritionEmployees attrition
Employees attrition
MaryAlejandraDiaz8 visualizações
4_4_WP_4_06_ND_Model.pptx por d6fmc6kwd4
4_4_WP_4_06_ND_Model.pptx4_4_WP_4_06_ND_Model.pptx
4_4_WP_4_06_ND_Model.pptx
d6fmc6kwd47 visualizações
GDG Community Day 2023 - Interpretable ML in production por SARADINDU SENGUPTA
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in production
SARADINDU SENGUPTA7 visualizações
Custom Tag Manager Templates por Markus Baersch
Custom Tag Manager TemplatesCustom Tag Manager Templates
Custom Tag Manager Templates
Markus Baersch31 visualizações
DGIQ East 2023 AI Ethics SIG por Karen Lopez
DGIQ East 2023 AI Ethics SIGDGIQ East 2023 AI Ethics SIG
DGIQ East 2023 AI Ethics SIG
Karen Lopez6 visualizações
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language... por patiladiti752
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...
patiladiti7529 visualizações
Lack of communication among family.pptx por ahmed164023
Lack of communication among family.pptxLack of communication among family.pptx
Lack of communication among family.pptx
ahmed16402317 visualizações
Product Research sample.pdf por AllenSingson
Product Research sample.pdfProduct Research sample.pdf
Product Research sample.pdf
AllenSingson35 visualizações
Oral presentation (1).pdf por reemalmazroui8
Oral presentation (1).pdfOral presentation (1).pdf
Oral presentation (1).pdf
reemalmazroui86 visualizações

Data Warehouse Design and Best Practices

  • 1. Data Warehouse Design Best Practices
  • 2. About me Project Manager @ 12 years professional experience .NET Web Development MCPD SQL Server 2012 (MCSA) Business Interests Web Development, SOA, Integration Security Performance Optimization Horizon2020, Open BIM, GIS, Mapping Contact me ivelin.andreev@icb.bg www.linkedin.com/in/ivelin www.slideshare.net/ivoandreev 2 |
  • 3. About me: Senior Developer @; .NET Web Development MCPD. Business interests: Web Development, WCF, Integration; SQL Server – Query Optimization and Tuning; Data Warehousing. Contact me: georgi.mishev@icb.bg, www.linkedin.com/in/georgimishev
  • 5. Agenda: Why Data Warehouse; Main DW Architectures; Dimensional Modeling; Patterns & Practices; DW Maintenance; ETL Process; SSIS Demo
  • 6. Lots of Data Everywhere. Can’t find data? Data scattered over the network. Can’t get data? Need an expert to get the data. Can’t understand data? Data poorly documented. Can’t use data found? Data needs to be transformed.
  • 7. Data Warehouse? Def: Central repository where data are organized, cleansed and kept in a standardized format. Integrated: heterogeneous sources; data cleaning and conversion ($, €, 元). Focus on subject: e.g. Customer, Sale, Product. Time variant: timestamp every key; historical data (10+ years).
  • 8. Different Problems - Different Solutions
                       OLTP Database               Data Warehouse
      Users            Customer                    Knowledge worker
      Design           Normalized, Data Integrity  Denormalized
      Function         Daily operation             Decision making
      Data             Current, Detailed           Historical, Aggregated
      Usage            Real time                   Ad-hoc
      Access           Short R/W transactions      Complex R/O queries
      Data accessed    Comparatively lower         Large amounts
      # Records        x100                        x1’000’000
      # Users          x1’000                      x10
      DB Size          x10 GB                      x100GB-TB
  • 10. B. Inmon Model: Top-Down Approach. Warehouse (3NF); Data Mart & OLAP (MD). http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640&h=368
  • 11. R. Kimball Model: Bottom-Up Approach. Data Marts (3NF or MD); Warehouse & OLAP (MD). http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640&h=369
  • 12. Data Vault (by Dan Linstedt). Hubs: list of unique business keys. Links: unique relationships between keys. Satellites: hub and link details and history.
  • 13. It is irrelevant which camp you belong to… as long as you understand why!
  • 14. Making Your Choice. Kimball (MD): + Start small, scale big; + Faster ROI; + Analytical tools; - Low reusability. Inmon (3NF): + Structured; + Easy to maintain; + Easier data mining; - Time-consuming to build. Data Vault (backend data warehouse): + Multiple sources, full history, incremental build; - Up-front work, long-term payoff, many joins.
  • 15. Dimensional modeling as de-facto standard
  • 16. Dimensions. Def: The object of BI interest. Keys: surrogate key; business key. Hierarchical attributes: analysis and drill down. Member properties: presentation labels. Auditing information (not for end users).
  • 17. Slowly Changing Dimensions. Def: Scheme for recording changes over time. Type 1: Overwrite. Type 2: Multiple Records.
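
The two schemes on slide 17 can be illustrated with a minimal Python sketch. It assumes a dimension held as a list of dicts with hypothetical column names (surrogate_key, business_key, valid_from, valid_to) that do not appear in the slides; a real warehouse would express the same logic as UPDATE and INSERT statements.

```python
from datetime import date


def apply_type1(rows, business_key, new_attrs):
    """Type 1: overwrite the attributes in place, so no history is kept."""
    for row in rows:
        if row["business_key"] == business_key:
            row.update(new_attrs)
    return rows


def apply_type2(rows, business_key, new_attrs, change_date):
    """Type 2: close the current row and append a new version of it."""
    current = next(
        r for r in rows
        if r["business_key"] == business_key and r["valid_to"] is None
    )
    current["valid_to"] = change_date
    new_row = {
        **current,
        **new_attrs,
        # Hypothetical key generation: next integer after the current maximum.
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "valid_from": change_date,
        "valid_to": None,
    }
    rows.append(new_row)
    return rows


def make_dim():
    """Return a one-row customer dimension (illustrative data only)."""
    return [{
        "surrogate_key": 1, "business_key": "C-100", "city": "Sofia",
        "valid_from": date(2010, 1, 1), "valid_to": None,
    }]


type1 = apply_type1(make_dim(), "C-100", {"city": "Plovdiv"})
type2 = apply_type2(make_dim(), "C-100", {"city": "Plovdiv"}, date(2014, 6, 1))
```

After the Type 1 change the dimension still has one row and the old city is lost; after the Type 2 change it has two rows sharing the business key, and facts loaded before the change keep pointing at the older surrogate key.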
  • 18. Facts. Def: Measurement of a business process. Keys: FK from all dimension tables (in the star); PK is composite (usually) or surrogate. Measures: numeric columns that are of interest to the business; additive, non-additive, semi-additive. Factless facts. Auditing information (optional).
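
The difference between additive and semi-additive measures on slide 18 may be easier to see with numbers. The sketch below uses made-up fact rows: units sold can be summed across every dimension, while stock on hand (a common textbook example of a semi-additive measure, not one taken from the slides) can be summed across products but not across days.

```python
from collections import defaultdict

# Illustrative fact rows: (day, product, units_sold, stock_on_hand)
facts = [
    ("2014-01-01", "A", 3, 10),
    ("2014-01-01", "B", 1, 5),
    ("2014-01-02", "A", 2, 8),
    ("2014-01-02", "B", 4, 1),
]

# Additive: summing over all rows (all days, all products) is meaningful.
total_units_sold = sum(units for _, _, units, _ in facts)

# Semi-additive: stock can be summed across products within one day...
stock_per_day = defaultdict(int)
for day, _, _, stock in facts:
    stock_per_day[day] += stock

# ...but across time we take the last day's balance instead of a sum.
closing_stock = stock_per_day[max(stock_per_day)]
```

Summing stock_on_hand over all four rows would give 24, a number that corresponds to no real inventory level, which is why such measures need a different aggregation along the time dimension.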
  • 20. Data Warehouse Pitfalls. Admit it is not as it seems to be: you need education. Find what is of business value, rather than focusing on performance. Spend a lot of time in Extract-Transform-Load: homogenize data from different sources; find (and resolve) problems in source systems.
  • 21. Prepare your Sources. Data integrity: avoid redundancy. Data quality: master data source; data validation. Auditing: CreatedDate / CreatedBy; ChangedDate / ChangedBy. Nightly jobs.
  • 22. Dimension Design. Business key: non-clustered index; include date (if dimension has history). Surrogate key: the smallest possible integer; clustered index. FK constraints: do not enforce (WITH NOCHECK); document the relation; faster load. Data validation: task for the source system.
  • 23. Conformed Dimensions. Def: Having the same meaning and content when referenced from multiple fact tables. Date dimension: best candidate for partitioning. Granularity: do not store every hour when reporting daily. Avoid surrogate keys: saves lookups and joins; use an integer representing the date (yyyyMMdd, or days after 1/1/1900).
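
Both integer date encodings mentioned on slide 23 can be computed directly from the date, which is why no lookup against the date dimension is needed at load time. A small Python sketch, with hypothetical function names:

```python
from datetime import date

EPOCH = date(1900, 1, 1)


def date_key_yyyymmdd(d):
    """Encode a date as a readable integer, e.g. 2014-03-15 -> 20140315."""
    return d.year * 10000 + d.month * 100 + d.day


def date_key_days(d):
    """Encode a date as the number of days after 1900-01-01."""
    return (d - EPOCH).days


def from_yyyymmdd(key):
    """Decode a yyyyMMdd integer key back into a date."""
    year, rest = divmod(key, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)
```

The yyyyMMdd form stays human-readable in ad-hoc queries, while the day-count form keeps consecutive dates consecutive as integers; the slides do not state which of the two the authors prefer.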
  • 24. Pre-join Hierarchies. Recursive relationships: fast drill and report; pre-computed aggregations. Hierarchy bridge, for each dimension row: 1 association with self; 1 row for each subordinate.
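
A hierarchy bridge as described on slide 24 can be derived from a parent-child (recursive) relationship. The sketch below is one possible way to do it in Python, assuming the hierarchy is a dict from member to parent with no cycles; the member names and sales figures are invented for illustration.

```python
def build_bridge(parent_of):
    """Return (ancestor, descendant, depth) rows.

    Each member gets one row associating it with itself (depth 0) and one
    row for every ancestor above it, so each ancestor ends up with one row
    per subordinate.
    """
    rows = []
    for node in parent_of:
        ancestor, depth = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return rows


# Hypothetical geography hierarchy: member -> parent (None for the root).
parent_of = {
    "Europe": None,
    "Bulgaria": "Europe",
    "Sofia": "Bulgaria",
    "Varna": "Bulgaria",
}
bridge = build_bridge(parent_of)

# Facts are recorded at the leaf level only.
sales = {"Sofia": 120, "Varna": 80}


def rollup(member):
    """Total sales for a member and all of its subordinates, via the bridge."""
    return sum(sales.get(desc, 0) for anc, desc, _ in bridge if anc == member)
```

With the bridge in place a roll-up at any level is a single join and sum rather than a recursive query, which is the "fast drill and report" benefit the slide refers to.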
  • 25. Determine the Facts. The center of a star schema. Identify subject areas; identify key business events; identify dimensions (start from the OLTP logical model); identify historical requirements; identify attributes.
  • 26. The Grain. Def: The level of detail of a fact table. What is the business objective? Fine grain: behaviour and frequency analysis. Coarse grain: overall and trend analysis. Aggregates: DO NOT summarize prematurely; DO NOT mix detail and summary; DO use “summary tables”.
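
The advice on slide 26 to keep detail and summary apart can be sketched with an in-memory SQLite database from Python. The table and column names (FactSaleDetail, FactSaleDaily) are hypothetical; the point is only that the fine-grain table stays untouched and the coarser grain lives in its own summary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fine grain: one row per individual sale.
CREATE TABLE FactSaleDetail (SaleTime TEXT, ProductId INTEGER, Amount REAL);
INSERT INTO FactSaleDetail VALUES
    ('2014-03-01 09:15', 1, 10.0),
    ('2014-03-01 17:40', 1, 5.0),
    ('2014-03-02 11:05', 2, 8.0);

-- Coarse grain: one row per day and product, stored in a separate table.
CREATE TABLE FactSaleDaily AS
    SELECT date(SaleTime) AS SaleDate,
           ProductId,
           SUM(Amount)    AS Amount,
           COUNT(*)       AS SaleCount
    FROM FactSaleDetail
    GROUP BY date(SaleTime), ProductId;
""")

daily = conn.execute(
    "SELECT SaleDate, ProductId, Amount, SaleCount "
    "FROM FactSaleDaily ORDER BY SaleDate, ProductId"
).fetchall()
detail_count = conn.execute("SELECT COUNT(*) FROM FactSaleDetail").fetchone()[0]
```

Trend reports can read the small daily table, while behaviour and frequency analysis still has every individual sale available in the detail table.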
  • 27. C3-PO is fluent in 6M forms of communication. What about your customers?
  • 28. Multinational DW. What parts need translation? Where to store various language versions? How to support future languages? Dimensions: add language attribute; include text data in the dimension. Problem 1: The dimension key? Replicate PK for every language: Fact.DimId = Dim.Id AND Dim.Lang = [Lang]. Problem 2: Storage = [Dim] x [Lang]. Sub-dimension with language attributes:
      TxtId  Attr1  Attr2  LangId
      1      large  Yes    En
      2      small  No     En
      1      stor   Ja     No
      2      liten  Nei    No
      3      …      …      …
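
The language sub-dimension on slide 28 can be tried out with an in-memory SQLite database from Python. This is only a sketch: the table names (DimSize, DimSizeText, FactSale), the dimension ids and the sale amounts are assumptions, while the TxtId / Attr1 / Attr2 / LangId rows are the ones from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The dimension keeps one row per member and points at its texts.
CREATE TABLE DimSize (Id INTEGER PRIMARY KEY, TxtId INTEGER);
-- The sub-dimension holds one row per text and language.
CREATE TABLE DimSizeText (
    TxtId INTEGER, LangId TEXT, Attr1 TEXT, Attr2 TEXT,
    PRIMARY KEY (TxtId, LangId)
);
CREATE TABLE FactSale (DimId INTEGER, Amount REAL);

INSERT INTO DimSize VALUES (10, 1), (20, 2);
INSERT INTO DimSizeText VALUES
    (1, 'En', 'large', 'Yes'), (2, 'En', 'small', 'No'),
    (1, 'No', 'stor',  'Ja'),  (2, 'No', 'liten', 'Nei');
INSERT INTO FactSale VALUES (10, 5.0), (20, 7.5), (10, 2.5);
""")


def sales_by_label(lang):
    """Total sales per translated label for the requested language."""
    return conn.execute(
        """
        SELECT t.Attr1, SUM(f.Amount)
        FROM FactSale f
        JOIN DimSize d     ON f.DimId = d.Id
        JOIN DimSizeText t ON t.TxtId = d.TxtId AND t.LangId = ?
        GROUP BY t.Attr1
        ORDER BY t.Attr1
        """,
        (lang,),
    ).fetchall()
```

Because the fact table references only DimSize.Id, adding a future language means inserting rows into the text table; neither the facts nor the dimension keys change.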
  • 30. How Large is “Large” Is big really big?
  • 31. Partitioning. Why: faster index maintenance; faster load; faster queries. When: tables of 10 GB+. How: do not partition dimension tables; partition by date (most analyses are time-based); eliminate partitions (WHERE [PartitionKey]=…); avoid split and merge of existing partitions, which can cause inefficient log generation.
  • 32. Columnstore Index. Non-clustered in SQL 2012; clustered in SQL 2014. Pros: better data compression; high performance on table scan. Clustered CSI limitations: no other indexes allowed; little advantage on seek operations; no XML, computed column or replication.
  • 33. Extract-Transform-Load: extract data from OLTP; data transformations; data loads; DW maintenance.
  • 34. Efficient Load Process. Use the simple recovery model during data load. Staging: avoid indexing; populate in parallel. Maintain DW: disable indexes on load; rebuild manually after load; automatic statistics updates slow down SQL Server.
  • 35. To SSIS, or not to SSIS? Pros: minimal to no coding; extensive support of various data sources; parallel execution of migration tasks; better organization of the ETL process. Cons: another way of thinking; hidden options; a T-SQL developer would do it much faster; auto-generated flows need optimization; sometimes it simply does not work (e.g. sort by GUID).
  • 37. Takeaways
      Books:
        The Data Warehouse Toolkit (3rd ed), Ralph Kimball
        Implementing DW with Microsoft SQL Server 2012
        Data Warehousing Fundamentals, Paulraj Ponniah
      Articles:
        Best Practices in Data Warehouse (Hanover Research Council)
        http://www.kimballgroup.com/category/design-tips/
        http://sqlmag.com/business-intelligence
      Resources:
        http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
        http://www.databaseanswers.org/data_models/index.htm