Data Warehouse Best Practices

DATAWAREHOUSE
BEST PRACTICES
Dr. Eduardo Castro, MSc
ecastro@simsasys.com

http://ecastrom.blogspot.com
http://comunidadwindows.org
http://tiny.cc/comwindows
Facebook: ecastrom
Twitter: edocastro

SOURCES

This presentation is based on the following sources

Datawarehouse
Ravi RanJan
Top 10 Best Practices for Building a Large Scale Relational Data Warehouse
SQL CAT

Complexities of Creating a Data Warehouse

• Incomplete errors
• Missing Fields
• Records or Fields That, by Design, are not
Being Recorded

• Incorrect errors
• Wrong Calculations, Aggregations
• Duplicate Records
• Wrong Information Entered into Source System

Source. Datawarehouse. Ravi RanJan

Data Warehouse Pitfalls
• You are going to spend much time extracting, cleaning,
and loading data
• You are going to find problems with systems feeding the
data warehouse
• You will find the need to store/validate data not being
captured/validated by any existing system
• Large scale data warehousing can become an exercise
in data homogenizing


Data Warehouse Pitfalls…

• The time it takes to load the warehouse will expand
to the amount of the time in the available window...
and then some
• You are building a HIGH maintenance system
• You will fail if you concentrate on resource
optimization to the neglect of project, data, and
customer management issues and an understanding
of what adds value to the customer


Best Practices
• Complete requirements and design

• Prototyping is key to business understanding

• Utilizing proper aggregations and detailed data

• Training is an on-going process

• Build data integrity checks into your system.


Top 10 Best Practices for Building a Large
Scale Relational Data Warehouse
• Building a large scale relational data warehouse is a
complex task.
• This section describes some design techniques that can
help in architecting an efficient large scale relational data
warehouse with SQL Server.
• Most large scale data warehouses use table and index
partitioning, and therefore, many of the recommendations
here involve partitioning.
• Most of these tips are based on experiences building
large data warehouses on SQL Server

Source. Top 10 Best Practices for Building Large Scale Relational Data Warehouse SQL CAT

Consider partitioning large fact tables
• Consider partitioning fact tables that are 50 to 100GB or
larger.
• Partitioning can provide manageability and often
performance benefits.
• Faster, more granular index maintenance.
• More flexible backup / restore options.
• Faster data loading and deleting
• Faster queries when restricted to a single partition..
• Typically partition the fact table on the date key.
• Enables sliding window.
• Enables partition elimination.


Build clustered index on the date key of
the fact table
• This supports efficient queries to populate cubes or
retrieve a historical data slice.

• If you load data in a batch window for the clustered index
on the fact table then use the options
ALLOW_ROW_LOCKS = OFF and
ALLOW_PAGE_LOCKS = OFF

• This helps speed up table scan operations during query
time and helps avoid excessive locking activity during
large updates.


Build clustered index on the date key of
the fact table
• Build nonclustered indexes for each foreign key.
• This helps ‘pinpoint queries' to extract rows based on a selective
dimension predicate.

• Use filegroups for administration requirements such as
backup / restore, partial database availability, etc.


Choose partition grain carefully
• Most customers use month, quarter, or year.
• For efficient deletes, you must delete one full partition at a
time.
• It is faster to load a complete partition at a time.
• Daily partitions for daily loads may be an attractive option.
• However, keep in mind that a table can have a maximum of 1000
partitions.
• Partition grain affects query parallelism.


• For SQL Server 2005:

• Queries touching a single partition can parallelize up to MAXDOP
(maximum degree of parallelism).

• Queries touching multiple partitions use one thread per partition up
to MAXDOP.



• For SQL Server 2008:

• Parallel threads up to MAXDOP are distributed proportionally to
scan partitions, and multiple threads per partition may be used
even when several partitions must be scanned.

• Avoid a partition design where only 2 or 3 partitions are touched by
frequent queries, if you need MAXDOP parallelism (assuming
MAXDOP =4 or larger).


Design dimension tables appropriately
• Use integer surrogate keys for all dimensions, other than
the Date dimension.

• Use the smallest possible integer for the dimension
surrogate keys. This helps to keep fact table narrow.

• Use a meaningful date key of integer type derivable from
the DATETIME data type (for example: 20060215).

• Don't use a surrogate Key for the Date dimension


• Build a clustered index on the surrogate key for each
dimension table

• Build a non-clustered index on the Business Key
(potentially combined with a row-effective-date) to support
surrogate key lookups during loads.

• Build nonclustered indexes on other frequently searched
dimension columns.

• Avoid partitioning dimension tables.


• Avoid enforcing foreign key relationships between the fact
and the dimension tables, to allow faster data loads.

• You can create foreign key constraints with NOCHECK to
document the relationships; but don’t enforce them.

• Ensure data integrity though Transform Lookups, or
perform the data integrity checks at the source of the
data.


Write effective queries for partition
elimination
• Whenever possible, place a query predicate (WHERE
condition) directly on the partitioning key (Date dimension
key) of the fact table.


Use Sliding Window technique to maintain
data
• Maintain a rolling time window for online access to the
fact tables. Load newest data, unload oldest data.
•
• Always keep empty partitions at both ends of the partition
range to guarantee that the partition split (before loading
new data) and partition merge (after unloading old data)
do not incur any data movement.

• Avoid split or merge of populated partitions. Splitting or
merging populated partitions can be extremely inefficient,
as this may cause as much as 4 times more log
generation, and also cause severe locking.

data
• Create the load staging table in the same filegroup as the
partition you are loading.

• Create the unload staging table in the same filegroup as
the partition you are deleteing.

• It is fastest to load newest full partition at one time, but
only possible when partition size is equal to the data load
frequency (for example, you have one partition per day,
and you load data once per day).


data
• If the partition size doesn't match the data load frequency,
incrementally load the latest partition.

• Various options for loading bulk data into a partitioned
table are discussed in the whitepaper
• http://www.microsoft.com/technet/prodtechnol/sql/be
stpractice/loading_bulk_data_partitioned_table.mspx.

• Always unload one partition at a time.


Efficiently load the initial data
• Use SIMPLE or BULK LOGGED recovery model during
the initial data load.

• Create the partitioned fact table with the Clustered index.

• Create non-indexed staging tables for each partition, and
separate source data files for populating each partition.

• Populate the staging tables in parallel.

• Use multiple BULK INSERT, BCP or SSIS tasks.


• Create as many load scripts to run in parallel as there are
CPUs, if there is no IO bottleneck. If IO bandwidth is
limited, use fewer scripts in parallel.

• Use 0 batch size in the load. Use 0 commit size in the
load.

• Use TABLOCK.

• Use BULK INSERT if the sources are flat files on the
same server. Use BCP or SSIS if data is being pushed
from remote machines.


• Build a clustered index on each staging table, then create
appropriate CHECK constraints.

• SWITCH all partitions into the partitioned table.

• Build nonclustered indexes on the partitioned table.

• Possible to load 1 TB in under an hour on a 64-CPU
server with a SAN capable of 14 GB/Sec throughput (non-
indexed table).


Efficiently delete old data
• Use partition switching whenever possible.

• To delete millions of rows from nonpartitioned, indexed
tables
• Avoid DELETE FROM ...WHERE ...
• Huge locking and logging issues
• Long rollback if the delete is canceled

• Usually faster to
• INSERT the records to keep into a non-indexed table
• Create index(es) on the table
• Rename the new table to replace the original


Efficiently delete old data
• As an alternative, ‘trickle' deletes using the following
repeatedly in a loop

DELETE TOP (1000) ... ;
COMMIT

• Another alternative is to update the row to mark as
deleted, then delete later during non critical time.


Manage statistics manually
• Statistics on partitioned tables are maintained for the table
as a whole.

• Manually update statistics on large fact tables after
loading new data.

• Manually update statistics after rebuilding index on a
partition.

• If you regularly update statistics after periodic loads, you
may turn off autostats on that table.


Manage statistics manually
• This is important for optimizing queries that may need to
read only the newest data.

• Updating statistics on small dimension tables after
incremental loads may also help performance.

• Use FULLSCAN option on update statistics on dimension
tables for more accurate query plans.


Consider efficient backup strategies
• Backing up the entire database may take significant
amount of time for a very large database.

• For example, backing up a 2 TB database to a 10-spindle RAID-5
disk on a SAN may take 2 hours (at the rate 275 MB/sec).

• Snapshot backup using SAN technology is a very good
option.

• Reduce the volume of data to backup regularly.


Consider efficient backup strategies
• The filegroups for the historical partitions can be marked
as READ ONLY.

• Perform a filegroup backup once when a filegroup
becomes read-only.

• Perform regular backups only on the read / write
filegroups.

• Note that RESTOREs of the read-only filegroups cannot
be performed in parallel.


Data Warehouse Best Practices

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Data Warehouse Best Practices

Semelhante a Data Warehouse Best Practices (20)

Mais de Eduardo Castro

Mais de Eduardo Castro (20)

Último

Último (17)

Data Warehouse Best Practices