SlideShare uma empresa Scribd logo
1 de 29
Datawarehousebest practices Dr.  Eduardo Castro, MSc ecastro@simsasys.com http://ecastrom.blogspot.com http://comunidadwindows.org http://tiny.cc/comwindows Facebook: ecastrom Twitter: edocastro
Sources This presentation is based on the following sources Datawarehouse Ravi RanJan Top 10 Best Practices for Building a Large Scale Relational Data Warehouse SQL CAT
Complexities of Creating a Data Warehouse Incomplete errors  Missing Fields Records or Fields That, by Design, are not Being Recorded Incorrect errors Wrong Calculations, Aggregations Duplicate Records Wrong Information Entered into Source System Source. Datawarehouse. Ravi RanJan
Data Warehouse Pitfalls You are going to spend much time extracting, cleaning, and loading data You are going to find problems with systems feeding the data warehouse You will find the need to store/validate data not being captured/validated by any existing system Large scale data warehousing can become an exercise in data homogenizing Source. Datawarehouse. Ravi RanJan
Data Warehouse Pitfalls… The time it takes to load the warehouse will expand to the amount of the time in the available window... and then some You are building a HIGH maintenance system You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer Source. Datawarehouse. Ravi RanJan
Best Practices Complete requirements and design Prototyping is key to business understanding Utilizing proper aggregations and detailed data Training is an on-going process Build data integrity checks into your system. Source. Datawarehouse. Ravi RanJan
Top 10 Best Practices for Building a Large Scale Relational Data Warehouse Building a large scale relational data warehouse is a complex task.  This section describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. Most large scale data warehouses use table and index partitioning, and therefore, many of the recommendations here involve partitioning.  Most of these tips are based on experiences building large data warehouses on SQL Server  Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Consider partitioning large fact tables  Consider partitioning fact tables that are 50 to 100GB or larger.  Partitioning can provide manageability and often performance benefits. Faster, more granular index maintenance. More flexible backup / restore options. Faster data loading and deleting Faster queries when restricted to a single partition.. Typically partition the fact table on the date key. Enables sliding window. Enables partition elimination. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Build clustered index on the date key of the fact table This supports efficient queries to populate cubes or retrieve a historical data slice. If you load data in a batch window for the clustered index on the fact table then use the options 	ALLOW_ROW_LOCKS = OFF and 	ALLOW_PAGE_LOCKS = OFF  This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Build clustered index on the date key of the fact table Build nonclustered indexes for each foreign key.  This helps ‘pinpoint queries' to extract rows based on a selective dimension predicate. Use filegroups for administration requirements such as backup / restore, partial database availability, etc. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Choose partition grain carefully Most customers use month, quarter, or year. For efficient deletes, you must delete one full partition at a time. It is faster to load a complete partition at a time. Daily partitions for daily loads may be an attractive option. However, keep in mind that a table can have a maximum of 1000 partitions. Partition grain affects query parallelism.  Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Choose partition grain carefully For SQL Server 2005: Queries touching a single partition can parallelize up to MAXDOP (maximum degree of parallelism).  Queries touching multiple partitions use one thread per partition up to MAXDOP. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Choose partition grain carefully For SQL Server 2008: Parallel threads up to MAXDOP are distributed proportionally to scan partitions, and multiple threads per partition may be used even when several partitions must be scanned. Avoid a partition design where only 2 or 3 partitions are touched by frequent queries, if you need MAXDOP parallelism (assuming MAXDOP =4 or larger).  Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Design dimension tables appropriately Use integer surrogate keys for all dimensions, other than the Date dimension.  Use the smallest possible integer for the dimension surrogate keys. This helps to keep fact table narrow. Use a meaningful date key of integer type derivable from the DATETIME data type (for example: 20060215). Don't use a surrogate Key for the Date dimension Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Design dimension tables appropriately Build a clustered index on the surrogate key for each dimension table Build a non-clustered index on the Business Key (potentially combined with a row-effective-date) to support surrogate key lookups during loads. Build nonclustered indexes on other frequently searched dimension columns. Avoid partitioning dimension tables. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Design dimension tables appropriately Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships; but don’t enforce them.  Ensure data integrity though Transform Lookups, or perform the data integrity checks at the source of the data. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Write effective queries for partition elimination Whenever possible, place a query predicate (WHERE condition) directly on the partitioning key (Date dimension key) of the fact table. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Use Sliding Window technique to maintain data Maintain a rolling time window for online access to the fact tables. Load newest data, unload oldest data. Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement. Avoid split or merge of populated partitions. Splitting or merging populated partitions can be extremely inefficient, as this may cause as much as 4 times more log generation, and also cause severe locking. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Use Sliding Window technique to maintain data Create the load staging table in the same filegroup as the partition you are loading. Create the unload staging table in the same filegroup as the partition you are deleteing. It is fastest to load newest full partition at one time, but only possible when partition size is equal to the data load frequency (for example, you have one partition per day, and you load data once per day). Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Use Sliding Window technique to maintain data If the partition size doesn't match the data load frequency, incrementally load the latest partition.  Various options for loading bulk data into a partitioned table are discussed in the whitepaper  http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx. Always unload one partition at a time. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Efficiently load the initial data Use SIMPLE or BULK LOGGED recovery model during the initial data load. Create the partitioned fact table with the Clustered index. Create non-indexed staging tables for each partition, and separate source data files for populating each partition. Populate the staging tables in parallel. Use multiple BULK INSERT, BCP or SSIS tasks. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Efficiently load the initial data Create as many load scripts to run in parallel as there are CPUs, if there is no IO bottleneck. If IO bandwidth is limited, use fewer scripts in parallel. Use 0 batch size in the load. Use 0 commit size in the load.  Use TABLOCK. Use BULK INSERT if the sources are flat files on the same server. Use BCP or SSIS if data is being pushed from remote machines. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Efficiently load the initial data Build a clustered index on each staging table, then create appropriate CHECK constraints. SWITCH all partitions into the partitioned table. Build nonclustered indexes on the partitioned table. Possible to load 1 TB in under an hour on a 64-CPU server with a SAN capable of 14 GB/Sec throughput (non-indexed table).  Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Efficiently delete old data Use partition switching whenever possible. To delete millions of rows from nonpartitioned, indexed tables Avoid DELETE FROM ...WHERE ... Huge locking and logging issues  Long rollback if the delete is canceled Usually faster to  INSERT the records to keep into a non-indexed table Create index(es) on the table Rename the new table to replace the original Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Efficiently delete old data As an alternative, ‘trickle' deletes using the following repeatedly in a loop	DELETE TOP (1000) ... ; 	COMMIT Another alternative is to update the row to mark as deleted, then delete later during non critical time.  Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Manage statistics manually Statistics on partitioned tables are maintained for the table as a whole. Manually update statistics on large fact tables after loading new data. Manually update statistics after rebuilding index on a partition. If you regularly update statistics after periodic loads, you may turn off autostats on that table. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Manage statistics manually This is important for optimizing queries that may need to read only the newest data. Updating statistics on small dimension tables after incremental loads may also help performance.  Use FULLSCAN option on update statistics on dimension tables for more accurate query plans. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Consider efficient backup strategies  Backing up the entire database may take significant amount of time for a very large database. For example, backing up a 2 TB database to a 10-spindle RAID-5 disk on a SAN may take 2 hours (at the rate 275 MB/sec). Snapshot backup using SAN technology is a very good option. Reduce the volume of data to backup regularly. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT
Consider efficient backup strategies  The filegroups for the historical partitions can be marked as READ ONLY. Perform a filegroup backup once when a filegroup becomes read-only. Perform regular backups only on the read / write filegroups.  Note that RESTOREs of the read-only filegroups cannot be performed in parallel. Source. Top 10 Best Practices for Building  Large Scale Relational Data Warehouse SQL CAT

Mais conteúdo relacionado

Mais procurados

How to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostHow to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostAtScale
 
Role of MySQL in Data Analytics, Warehousing
Role of MySQL in Data Analytics, WarehousingRole of MySQL in Data Analytics, Warehousing
Role of MySQL in Data Analytics, WarehousingVenu Anuganti
 
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...Roland Bouman
 
Designing high performance datawarehouse
Designing high performance datawarehouseDesigning high performance datawarehouse
Designing high performance datawarehouseUday Kothari
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL ServerMark Kromer
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseRob Winters
 
How To Buy Data Warehouse
How To Buy Data WarehouseHow To Buy Data Warehouse
How To Buy Data WarehouseEric Sun
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher Tamir Dresher
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTechWell
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data LakeCalum Miller
 
Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the adminTillmann Eitelberg
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Cloud Storage Spring Cleaning: A Treasure Hunt
Cloud Storage Spring Cleaning: A Treasure HuntCloud Storage Spring Cleaning: A Treasure Hunt
Cloud Storage Spring Cleaning: A Treasure HuntSteven Moy
 
Building a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveBuilding a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveGeekNightHyderabad
 
From Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data WarehouseFrom Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data WarehouseBui Ha
 
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingAgile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingKent Graziano
 

Mais procurados (20)

Data warehouse
Data warehouseData warehouse
Data warehouse
 
Data lake
Data lakeData lake
Data lake
 
How to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the CostHow to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
How to Optimize Sales Analytics Using 10x the Data at 1/10th the Cost
 
Role of MySQL in Data Analytics, Warehousing
Role of MySQL in Data Analytics, WarehousingRole of MySQL in Data Analytics, Warehousing
Role of MySQL in Data Analytics, Warehousing
 
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
 
Designing high performance datawarehouse
Designing high performance datawarehouseDesigning high performance datawarehouse
Designing high performance datawarehouse
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data Warehouse
 
How To Buy Data Warehouse
How To Buy Data WarehouseHow To Buy Data Warehouse
How To Buy Data Warehouse
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big Problems
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
 
Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the admin
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Cloud Storage Spring Cleaning: A Treasure Hunt
Cloud Storage Spring Cleaning: A Treasure HuntCloud Storage Spring Cleaning: A Treasure Hunt
Cloud Storage Spring Cleaning: A Treasure Hunt
 
Building a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveBuilding a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's Perspective
 
From Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data WarehouseFrom Hadoop to Enterprise Data Warehouse
From Hadoop to Enterprise Data Warehouse
 
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingAgile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
 

Semelhante a Data Warehouse Best Practices

Large scale sql server best practices
Large scale sql server   best practicesLarge scale sql server   best practices
Large scale sql server best practicesmprabhuram
 
Optimize access
Optimize accessOptimize access
Optimize accessAla Esmail
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151xlight
 
How to Fine-Tune Performance Using Amazon Redshift
How to Fine-Tune Performance Using Amazon RedshiftHow to Fine-Tune Performance Using Amazon Redshift
How to Fine-Tune Performance Using Amazon RedshiftAWS Germany
 
The thinking persons guide to data warehouse design
The thinking persons guide to data warehouse designThe thinking persons guide to data warehouse design
The thinking persons guide to data warehouse designCalpont
 
The High Performance DBA Optimizing Databases For High Performance
The High Performance DBA Optimizing Databases For High PerformanceThe High Performance DBA Optimizing Databases For High Performance
The High Performance DBA Optimizing Databases For High PerformanceEmbarcadero Technologies
 
Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008paulguerin
 
Tips for Database Performance
Tips for Database PerformanceTips for Database Performance
Tips for Database PerformanceKesavan Munuswamy
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018Dave Stokes
 
Sql interview question part 5
Sql interview question part 5Sql interview question part 5
Sql interview question part 5kaashiv1
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoDave Stokes
 
Perl and Elasticsearch
Perl and ElasticsearchPerl and Elasticsearch
Perl and ElasticsearchDean Hamstead
 
Crystal xcelsius best practices and workflows for building enterprise solut...
Crystal xcelsius   best practices and workflows for building enterprise solut...Crystal xcelsius   best practices and workflows for building enterprise solut...
Crystal xcelsius best practices and workflows for building enterprise solut...Yogeeswar Reddy
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersAdam Hutson
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftLars Kamp
 
Applyinga blockcentricapproach
Applyinga blockcentricapproachApplyinga blockcentricapproach
Applyinga blockcentricapproachoracle documents
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingAmir Reza Hashemi
 

Semelhante a Data Warehouse Best Practices (20)

Large scale sql server best practices
Large scale sql server   best practicesLarge scale sql server   best practices
Large scale sql server best practices
 
Optimize access
Optimize accessOptimize access
Optimize access
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151
 
How to Fine-Tune Performance Using Amazon Redshift
How to Fine-Tune Performance Using Amazon RedshiftHow to Fine-Tune Performance Using Amazon Redshift
How to Fine-Tune Performance Using Amazon Redshift
 
Managing SQLserver
Managing SQLserverManaging SQLserver
Managing SQLserver
 
The thinking persons guide to data warehouse design
The thinking persons guide to data warehouse designThe thinking persons guide to data warehouse design
The thinking persons guide to data warehouse design
 
The High Performance DBA Optimizing Databases For High Performance
The High Performance DBA Optimizing Databases For High PerformanceThe High Performance DBA Optimizing Databases For High Performance
The High Performance DBA Optimizing Databases For High Performance
 
Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008
 
Tips for Database Performance
Tips for Database PerformanceTips for Database Performance
Tips for Database Performance
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018
 
Ebook5
Ebook5Ebook5
Ebook5
 
Sql interview question part 5
Sql interview question part 5Sql interview question part 5
Sql interview question part 5
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
 
Mysql For Developers
Mysql For DevelopersMysql For Developers
Mysql For Developers
 
Perl and Elasticsearch
Perl and ElasticsearchPerl and Elasticsearch
Perl and Elasticsearch
 
Crystal xcelsius best practices and workflows for building enterprise solut...
Crystal xcelsius   best practices and workflows for building enterprise solut...Crystal xcelsius   best practices and workflows for building enterprise solut...
Crystal xcelsius best practices and workflows for building enterprise solut...
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for Programmers
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon Redshift
 
Applyinga blockcentricapproach
Applyinga blockcentricapproachApplyinga blockcentricapproach
Applyinga blockcentricapproach
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / Sharding
 

Mais de Eduardo Castro

Introducción a polybase en SQL Server
Introducción a polybase en SQL ServerIntroducción a polybase en SQL Server
Introducción a polybase en SQL ServerEduardo Castro
 
Creando tu primer ambiente de AI en Azure ML y SQL Server
Creando tu primer ambiente de AI en Azure ML y SQL ServerCreando tu primer ambiente de AI en Azure ML y SQL Server
Creando tu primer ambiente de AI en Azure ML y SQL ServerEduardo Castro
 
Seguridad en SQL Azure
Seguridad en SQL AzureSeguridad en SQL Azure
Seguridad en SQL AzureEduardo Castro
 
Azure Synapse Analytics MLflow
Azure Synapse Analytics MLflowAzure Synapse Analytics MLflow
Azure Synapse Analytics MLflowEduardo Castro
 
SQL Server 2019 con Windows Server 2022
SQL Server 2019 con Windows Server 2022SQL Server 2019 con Windows Server 2022
SQL Server 2019 con Windows Server 2022Eduardo Castro
 
Novedades en SQL Server 2022
Novedades en SQL Server 2022Novedades en SQL Server 2022
Novedades en SQL Server 2022Eduardo Castro
 
Introduccion a SQL Server 2022
Introduccion a SQL Server 2022Introduccion a SQL Server 2022
Introduccion a SQL Server 2022Eduardo Castro
 
Machine Learning con Azure Managed Instance
Machine Learning con Azure Managed InstanceMachine Learning con Azure Managed Instance
Machine Learning con Azure Managed InstanceEduardo Castro
 
Novedades en sql server 2022
Novedades en sql server 2022Novedades en sql server 2022
Novedades en sql server 2022Eduardo Castro
 
Sql server 2019 con windows server 2022
Sql server 2019 con windows server 2022Sql server 2019 con windows server 2022
Sql server 2019 con windows server 2022Eduardo Castro
 
Introduccion a databricks
Introduccion a databricksIntroduccion a databricks
Introduccion a databricksEduardo Castro
 
Pronosticos con sql server
Pronosticos con sql serverPronosticos con sql server
Pronosticos con sql serverEduardo Castro
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsEduardo Castro
 
Que hay de nuevo en el Azure Data Lake Storage Gen2
Que hay de nuevo en el Azure Data Lake Storage Gen2Que hay de nuevo en el Azure Data Lake Storage Gen2
Que hay de nuevo en el Azure Data Lake Storage Gen2Eduardo Castro
 
Introduccion a Azure Synapse Analytics
Introduccion a Azure Synapse AnalyticsIntroduccion a Azure Synapse Analytics
Introduccion a Azure Synapse AnalyticsEduardo Castro
 
Seguridad de SQL Database en Azure
Seguridad de SQL Database en AzureSeguridad de SQL Database en Azure
Seguridad de SQL Database en AzureEduardo Castro
 
Python dentro de SQL Server
Python dentro de SQL ServerPython dentro de SQL Server
Python dentro de SQL ServerEduardo Castro
 
Servicios Cognitivos de de Microsoft
Servicios Cognitivos de de Microsoft Servicios Cognitivos de de Microsoft
Servicios Cognitivos de de Microsoft Eduardo Castro
 
Script de paso a paso de configuración de Secure Enclaves
Script de paso a paso de configuración de Secure EnclavesScript de paso a paso de configuración de Secure Enclaves
Script de paso a paso de configuración de Secure EnclavesEduardo Castro
 
Introducción a conceptos de SQL Server Secure Enclaves
Introducción a conceptos de SQL Server Secure EnclavesIntroducción a conceptos de SQL Server Secure Enclaves
Introducción a conceptos de SQL Server Secure EnclavesEduardo Castro
 

Mais de Eduardo Castro (20)

Introducción a polybase en SQL Server
Introducción a polybase en SQL ServerIntroducción a polybase en SQL Server
Introducción a polybase en SQL Server
 
Creando tu primer ambiente de AI en Azure ML y SQL Server
Creando tu primer ambiente de AI en Azure ML y SQL ServerCreando tu primer ambiente de AI en Azure ML y SQL Server
Creando tu primer ambiente de AI en Azure ML y SQL Server
 
Seguridad en SQL Azure
Seguridad en SQL AzureSeguridad en SQL Azure
Seguridad en SQL Azure
 
Azure Synapse Analytics MLflow
Azure Synapse Analytics MLflowAzure Synapse Analytics MLflow
Azure Synapse Analytics MLflow
 
SQL Server 2019 con Windows Server 2022
SQL Server 2019 con Windows Server 2022SQL Server 2019 con Windows Server 2022
SQL Server 2019 con Windows Server 2022
 
Novedades en SQL Server 2022
Novedades en SQL Server 2022Novedades en SQL Server 2022
Novedades en SQL Server 2022
 
Introduccion a SQL Server 2022
Introduccion a SQL Server 2022Introduccion a SQL Server 2022
Introduccion a SQL Server 2022
 
Machine Learning con Azure Managed Instance
Machine Learning con Azure Managed InstanceMachine Learning con Azure Managed Instance
Machine Learning con Azure Managed Instance
 
Novedades en sql server 2022
Novedades en sql server 2022Novedades en sql server 2022
Novedades en sql server 2022
 
Sql server 2019 con windows server 2022
Sql server 2019 con windows server 2022Sql server 2019 con windows server 2022
Sql server 2019 con windows server 2022
 
Introduccion a databricks
Introduccion a databricksIntroduccion a databricks
Introduccion a databricks
 
Pronosticos con sql server
Pronosticos con sql serverPronosticos con sql server
Pronosticos con sql server
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
 
Que hay de nuevo en el Azure Data Lake Storage Gen2
Que hay de nuevo en el Azure Data Lake Storage Gen2Que hay de nuevo en el Azure Data Lake Storage Gen2
Que hay de nuevo en el Azure Data Lake Storage Gen2
 
Introduccion a Azure Synapse Analytics
Introduccion a Azure Synapse AnalyticsIntroduccion a Azure Synapse Analytics
Introduccion a Azure Synapse Analytics
 
Seguridad de SQL Database en Azure
Seguridad de SQL Database en AzureSeguridad de SQL Database en Azure
Seguridad de SQL Database en Azure
 
Python dentro de SQL Server
Python dentro de SQL ServerPython dentro de SQL Server
Python dentro de SQL Server
 
Servicios Cognitivos de de Microsoft
Servicios Cognitivos de de Microsoft Servicios Cognitivos de de Microsoft
Servicios Cognitivos de de Microsoft
 
Script de paso a paso de configuración de Secure Enclaves
Script de paso a paso de configuración de Secure EnclavesScript de paso a paso de configuración de Secure Enclaves
Script de paso a paso de configuración de Secure Enclaves
 
Introducción a conceptos de SQL Server Secure Enclaves
Introducción a conceptos de SQL Server Secure EnclavesIntroducción a conceptos de SQL Server Secure Enclaves
Introducción a conceptos de SQL Server Secure Enclaves
 

Último

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Último (20)

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Data Warehouse Best Practices

  • 1. Datawarehousebest practices Dr. Eduardo Castro, MSc ecastro@simsasys.com http://ecastrom.blogspot.com http://comunidadwindows.org http://tiny.cc/comwindows Facebook: ecastrom Twitter: edocastro
  • 2. Sources This presentation is based on the following sources Datawarehouse Ravi RanJan Top 10 Best Practices for Building a Large Scale Relational Data Warehouse SQL CAT
  • 3. Complexities of Creating a Data Warehouse Incomplete errors Missing Fields Records or Fields That, by Design, are not Being Recorded Incorrect errors Wrong Calculations, Aggregations Duplicate Records Wrong Information Entered into Source System Source. Datawarehouse. Ravi RanJan
  • 4. Data Warehouse Pitfalls You are going to spend much time extracting, cleaning, and loading data You are going to find problems with systems feeding the data warehouse You will find the need to store/validate data not being captured/validated by any existing system Large scale data warehousing can become an exercise in data homogenizing Source. Datawarehouse. Ravi RanJan
  • 5. Data Warehouse Pitfalls… The time it takes to load the warehouse will expand to the amount of the time in the available window... and then some You are building a HIGH maintenance system You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer Source. Datawarehouse. Ravi RanJan
  • 6. Best Practices Complete requirements and design Prototyping is key to business understanding Utilizing proper aggregations and detailed data Training is an on-going process Build data integrity checks into your system. Source. Datawarehouse. Ravi RanJan
  • 7. Top 10 Best Practices for Building a Large Scale Relational Data Warehouse Building a large scale relational data warehouse is a complex task. This section describes some design techniques that can help in architecting an efficient large scale relational data warehouse with SQL Server. Most large scale data warehouses use table and index partitioning, and therefore, many of the recommendations here involve partitioning. Most of these tips are based on experiences building large data warehouses on SQL Server Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 8. Consider partitioning large fact tables Consider partitioning fact tables that are 50 to 100GB or larger. Partitioning can provide manageability and often performance benefits. Faster, more granular index maintenance. More flexible backup / restore options. Faster data loading and deleting Faster queries when restricted to a single partition.. Typically partition the fact table on the date key. Enables sliding window. Enables partition elimination. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 9. Build clustered index on the date key of the fact table This supports efficient queries to populate cubes or retrieve a historical data slice. If you load data in a batch window for the clustered index on the fact table then use the options ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF This helps speed up table scan operations during query time and helps avoid excessive locking activity during large updates. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 10. Build clustered index on the date key of the fact table Build nonclustered indexes for each foreign key. This helps ‘pinpoint queries' to extract rows based on a selective dimension predicate. Use filegroups for administration requirements such as backup / restore, partial database availability, etc. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 11. Choose partition grain carefully Most customers use month, quarter, or year. For efficient deletes, you must delete one full partition at a time. It is faster to load a complete partition at a time. Daily partitions for daily loads may be an attractive option. However, keep in mind that a table can have a maximum of 1000 partitions. Partition grain affects query parallelism. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 12. Choose partition grain carefully For SQL Server 2005: Queries touching a single partition can parallelize up to MAXDOP (maximum degree of parallelism). Queries touching multiple partitions use one thread per partition up to MAXDOP. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 13. Choose partition grain carefully For SQL Server 2008: Parallel threads up to MAXDOP are distributed proportionally to scan partitions, and multiple threads per partition may be used even when several partitions must be scanned. Avoid a partition design where only 2 or 3 partitions are touched by frequent queries, if you need MAXDOP parallelism (assuming MAXDOP =4 or larger). Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 14. Design dimension tables appropriately Use integer surrogate keys for all dimensions, other than the Date dimension. Use the smallest possible integer for the dimension surrogate keys. This helps to keep fact table narrow. Use a meaningful date key of integer type derivable from the DATETIME data type (for example: 20060215). Don't use a surrogate Key for the Date dimension Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 15. Design dimension tables appropriately Build a clustered index on the surrogate key for each dimension table Build a non-clustered index on the Business Key (potentially combined with a row-effective-date) to support surrogate key lookups during loads. Build nonclustered indexes on other frequently searched dimension columns. Avoid partitioning dimension tables. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 16. Design dimension tables appropriately Avoid enforcing foreign key relationships between the fact and the dimension tables, to allow faster data loads. You can create foreign key constraints with NOCHECK to document the relationships; but don’t enforce them. Ensure data integrity though Transform Lookups, or perform the data integrity checks at the source of the data. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 17. Write effective queries for partition elimination Whenever possible, place a query predicate (WHERE condition) directly on the partitioning key (Date dimension key) of the fact table. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 18. Use Sliding Window technique to maintain data Maintain a rolling time window for online access to the fact tables. Load newest data, unload oldest data. Always keep empty partitions at both ends of the partition range to guarantee that the partition split (before loading new data) and partition merge (after unloading old data) do not incur any data movement. Avoid split or merge of populated partitions. Splitting or merging populated partitions can be extremely inefficient, as this may cause as much as 4 times more log generation, and also cause severe locking. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 19. Use Sliding Window technique to maintain data Create the load staging table in the same filegroup as the partition you are loading. Create the unload staging table in the same filegroup as the partition you are deleteing. It is fastest to load newest full partition at one time, but only possible when partition size is equal to the data load frequency (for example, you have one partition per day, and you load data once per day). Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 20. Use Sliding Window technique to maintain data If the partition size doesn't match the data load frequency, incrementally load the latest partition. Various options for loading bulk data into a partitioned table are discussed in the whitepaper http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx. Always unload one partition at a time. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 21. Efficiently load the initial data Use SIMPLE or BULK LOGGED recovery model during the initial data load. Create the partitioned fact table with the Clustered index. Create non-indexed staging tables for each partition, and separate source data files for populating each partition. Populate the staging tables in parallel. Use multiple BULK INSERT, BCP or SSIS tasks. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 22. Efficiently load the initial data Create as many load scripts to run in parallel as there are CPUs, if there is no IO bottleneck. If IO bandwidth is limited, use fewer scripts in parallel. Use 0 batch size in the load. Use 0 commit size in the load. Use TABLOCK. Use BULK INSERT if the sources are flat files on the same server. Use BCP or SSIS if data is being pushed from remote machines. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 23. Efficiently load the initial data Build a clustered index on each staging table, then create appropriate CHECK constraints. SWITCH all partitions into the partitioned table. Build nonclustered indexes on the partitioned table. Possible to load 1 TB in under an hour on a 64-CPU server with a SAN capable of 14 GB/Sec throughput (non-indexed table). Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 24. Efficiently delete old data Use partition switching whenever possible. To delete millions of rows from nonpartitioned, indexed tables Avoid DELETE FROM ...WHERE ... Huge locking and logging issues Long rollback if the delete is canceled Usually faster to INSERT the records to keep into a non-indexed table Create index(es) on the table Rename the new table to replace the original Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 25. Efficiently delete old data As an alternative, ‘trickle' deletes using the following repeatedly in a loop DELETE TOP (1000) ... ; COMMIT Another alternative is to update the row to mark as deleted, then delete later during non critical time. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 26. Manage statistics manually Statistics on partitioned tables are maintained for the table as a whole. Manually update statistics on large fact tables after loading new data. Manually update statistics after rebuilding index on a partition. If you regularly update statistics after periodic loads, you may turn off autostats on that table. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 27. Manage statistics manually This is important for optimizing queries that may need to read only the newest data. Updating statistics on small dimension tables after incremental loads may also help performance. Use FULLSCAN option on update statistics on dimension tables for more accurate query plans. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 28. Consider efficient backup strategies Backing up the entire database may take significant amount of time for a very large database. For example, backing up a 2 TB database to a 10-spindle RAID-5 disk on a SAN may take 2 hours (at the rate 275 MB/sec). Snapshot backup using SAN technology is a very good option. Reduce the volume of data to backup regularly. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT
  • 29. Consider efficient backup strategies The filegroups for the historical partitions can be marked as READ ONLY. Perform a filegroup backup once when a filegroup becomes read-only. Perform regular backups only on the read / write filegroups. Note that RESTOREs of the read-only filegroups cannot be performed in parallel. Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT