5. Hardware
SQL Server Fast Track Data Warehouse
www.microsoft.com/sqlserver/2008/en/us/fasttrack.aspx
Pre-tested hardware configurations
Specific disk and filegroup layouts
Minimal indexing
To feed CPU at maximum capacity
6. Dimensions vs Facts
Dimension
Small (relatively)
Repeating data
Fact
Large
Numeric data + keys
Treat them differently
7. Dimensions in Relational Terms
Table structure
Keys
Indexes
Null handling
Managing change
Processing
Customer (attributes)
Full Name
Post Code
City
State
Country
Gender
Occupation
Marital Status
Email Address
Customer Geography (hierarchy levels)
1. Country
2. State
3. City
4. Post Code
5. Full Name
8. Star vs. Snowflake Schemas
Star
dbo.Customer
CustomerKey
FullName
PostCode
City
State
Country
Gender
Occupation
MaritalStatus
EmailAddress
OR
Snowflake
dbo.Customer
CustomerKey
GeographyKey
FullName
Gender
Occupation
MaritalStatus
EmailAddress
dbo.Geography
GeographyKey
PostCode
City
State
Country
NB: both are denormalized, one more than the other
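As a rough sketch, the two layouts on this slide could be declared as follows. The column data types are assumptions, and the star variant is named dbo.CustomerStar (a hypothetical name; the slide calls both variants dbo.Customer) only so that both can live in one script.

```sql
-- Star: one wide table; geography columns repeat on every customer row.
CREATE TABLE dbo.CustomerStar (        -- hypothetical name, see note above
    CustomerKey   int           NOT NULL PRIMARY KEY,
    FullName      nvarchar(100) NOT NULL,
    PostCode      nvarchar(10)  NOT NULL,
    City          nvarchar(50)  NOT NULL,
    [State]       nvarchar(50)  NOT NULL,
    Country       nvarchar(50)  NOT NULL,
    Gender        nchar(1)      NOT NULL,
    Occupation    nvarchar(50)  NOT NULL,
    MaritalStatus nchar(1)      NOT NULL,
    EmailAddress  nvarchar(100) NOT NULL
);

-- Snowflake: geography columns move to their own table,
-- and the customer row points at them through GeographyKey.
CREATE TABLE dbo.Geography (
    GeographyKey  int           NOT NULL PRIMARY KEY,
    PostCode      nvarchar(10)  NOT NULL,
    City          nvarchar(50)  NOT NULL,
    [State]       nvarchar(50)  NOT NULL,
    Country       nvarchar(50)  NOT NULL
);

CREATE TABLE dbo.Customer (
    CustomerKey   int           NOT NULL PRIMARY KEY,
    GeographyKey  int           NOT NULL
        REFERENCES dbo.Geography (GeographyKey),
    FullName      nvarchar(100) NOT NULL,
    Gender        nchar(1)      NOT NULL,
    Occupation    nvarchar(50)  NOT NULL,
    MaritalStatus nchar(1)      NOT NULL,
    EmailAddress  nvarchar(100) NOT NULL
);
```

Note that dbo.Geography still repeats City, State and Country for every post code, which is why the slide calls both layouts denormalized.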
9. Primary Keys
Use smallest possible integer as surrogate primary key
Primary key is a “row identifier”
Multiple row “versions” are possible
“None” and “Unknown” special values are useful
Do NOT use business/source system keys
Clustered primary key is OK for dimensions
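A minimal sketch of these guidelines, assuming the source system identifies customers by a string account number. The column CustomerAlternateKey, the table dbo.Occupation and all data types are hypothetical and chosen only for illustration.

```sql
-- Large dimension: int surrogate key, generated by the warehouse.
CREATE TABLE dbo.Customer (
    CustomerKey          int IDENTITY(1,1) NOT NULL,
    CustomerAlternateKey nvarchar(15)      NOT NULL, -- source system key, kept as an ordinary column
    FullName             nvarchar(100)     NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerKey)
);

-- Small dimension: a few hundred rows at most, so smallint is enough.
-- smallint (rather than tinyint) still leaves room for negative
-- special values such as -1 "Unknown".
CREATE TABLE dbo.Occupation (
    OccupationKey  smallint     NOT NULL,
    OccupationName nvarchar(50) NOT NULL,
    CONSTRAINT PK_Occupation PRIMARY KEY CLUSTERED (OccupationKey)
);
```

Because the surrogate key identifies a row rather than a customer, the same CustomerAlternateKey can appear on several rows, one per version.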
10. Dimension Indexes
Dimension processing queries of the form:
SELECT DISTINCT .... FROM ....
WHERE (filter) clauses never used
WHERE (join) clauses are used in snowflake dimensions
Non-processing queries may end up in SQL
ROLAP dimensions
Direct to SQL queries
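The statements below approximate the shape of these processing queries; they are not the exact text Analysis Services generates. They assume the snowflake dbo.Customer and dbo.Geography tables from the Star vs. Snowflake slide.

```sql
-- Attribute sourced from a single table: no WHERE clause at all,
-- so an index cannot narrow the scan.
SELECT DISTINCT g.City, g.[State]
FROM dbo.Geography AS g;

-- Attribute spanning both snowflake tables: the only WHERE clause
-- is the join between the tables, never a filter on values.
SELECT DISTINCT c.CustomerKey, c.FullName, g.PostCode
FROM dbo.Customer AS c, dbo.Geography AS g
WHERE c.GeographyKey = g.GeographyKey;
```

An index on the join column may help the second form; the first reads the whole table regardless.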
11. Null Handling in Dimensions
By default NULL converts to 0 or an empty string
NULL attribute keys can invoke special “Unknown Member” handling
Prefer to create a specific “Unknown” row
CustomerKey  FullName    City     Country
-1           Unknown     Unknown  Unknown
-2           None        None     None
1243         John Smith  London   United Kingdom
1244         Mary Jones  Glasgow  United Kingdom
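One possible way to seed the special rows and use them during a fact load is sketched below. The @StagedSales table variable, its columns and the lookup by name are hypothetical stand-ins for a real staging table and business key; data types are assumptions.

```sql
CREATE TABLE dbo.Customer (
    CustomerKey int           NOT NULL PRIMARY KEY CLUSTERED,
    FullName    nvarchar(100) NOT NULL,
    City        nvarchar(50)  NOT NULL,
    Country     nvarchar(50)  NOT NULL
);

-- Special rows first, then ordinary members (values from the slide).
INSERT INTO dbo.Customer (CustomerKey, FullName, City, Country) VALUES
    (-1,   N'Unknown',    N'Unknown', N'Unknown'),
    (-2,   N'None',       N'None',    N'None'),
    (1243, N'John Smith', N'London',  N'United Kingdom'),
    (1244, N'Mary Jones', N'Glasgow', N'United Kingdom');

-- Hypothetical staged fact rows; the second has no customer recorded.
DECLARE @StagedSales TABLE (
    CustomerName nvarchar(100) NULL,
    SalesAmount  money         NOT NULL
);
INSERT INTO @StagedSales (CustomerName, SalesAmount) VALUES
    (N'John Smith', 10.00),
    (NULL,           5.00);

-- Rows that find no matching customer get the "Unknown" key (-1)
-- instead of a NULL foreign key.
SELECT COALESCE(c.CustomerKey, -1) AS CustomerKey,
       s.SalesAmount
FROM @StagedSales AS s
LEFT JOIN dbo.Customer AS c
       ON c.FullName = s.CustomerName;
```

With this in place the fact table never carries a NULL customer key, so the default NULL conversion is not triggered.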
12. Dimension Attributes
Attributes have keys, names (and values)
Integer attribute keys are smaller and faster
Keys must be unique
Attribute      Key       Name        (Value)
Year           2009      CY 2009     2009
Month          4         April       4
Month of Year  20090400  April 2009  4
SELECT [Month] AS [Month],
[Month] + ' ' + [Year] AS [Month of Year]
FROM dbo.Time
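A unique integer key such as 20090400 can be calculated from a date column. The sketch below uses a table variable as a stand-in for dbo.Time, and the [Date] column and output column names are assumptions, since the slide does not show the table's structure.

```sql
-- Stand-in for dbo.Time with a single hypothetical date column.
DECLARE @Time TABLE ([Date] smalldatetime NOT NULL);
INSERT INTO @Time ([Date]) VALUES ('20090401'), ('20090415'), ('20090501');

SELECT DISTINCT
    -- Integer key, unique across years: April 2009 becomes 20090400.
    YEAR([Date]) * 10000 + MONTH([Date]) * 100        AS MonthOfYearKey,
    -- Display name; DATENAME returns the month name in the session language.
    DATENAME(month, [Date]) + ' '
        + CAST(YEAR([Date]) AS char(4))               AS MonthOfYearName,
    -- Numeric value of the month within its year.
    MONTH([Date])                                     AS MonthOfYearValue
FROM @Time;
```

A month name on its own repeats every year, so it cannot serve as the key for this attribute; the calculated integer can.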
13. Slowly Changing Dimensions
PK = row identifier
Multiple rows = multiple versions
Add effective dating columns
Which can be exposed as new dimensional attributes
dbo.Customer
CustomerKey
FullName
PostCode
City
State
Country
Gender
Occupation
MaritalStatus
EmailAddress
EffectiveFrom (smalldatetime)
EffectiveTo (smalldatetime)
CurrentFlag (tinyint)
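The sketch below shows one common way to record a new version of a row using these columns. It trims the table to a few columns, and the CustomerAlternateKey column, the sample values and the use of NULL for an open-ended EffectiveTo are assumptions rather than anything the slide prescribes.

```sql
CREATE TABLE dbo.Customer (
    CustomerKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    CustomerAlternateKey nvarchar(15)      NOT NULL, -- hypothetical source system key
    FullName             nvarchar(100)     NOT NULL,
    City                 nvarchar(50)      NOT NULL,
    EffectiveFrom        smalldatetime     NOT NULL,
    EffectiveTo          smalldatetime     NULL,     -- NULL = still current (assumption)
    CurrentFlag          tinyint           NOT NULL
);

-- First version of the customer.
INSERT INTO dbo.Customer
    (CustomerAlternateKey, FullName, City, EffectiveFrom, EffectiveTo, CurrentFlag)
VALUES (N'C-1243', N'John Smith', N'London', '20080101', NULL, 1);

-- The customer moves city: close the current row, add a new version.
DECLARE @ChangeDate smalldatetime;
SET @ChangeDate = '20090401';

BEGIN TRANSACTION;

UPDATE dbo.Customer
SET EffectiveTo = @ChangeDate,
    CurrentFlag = 0
WHERE CustomerAlternateKey = N'C-1243'
  AND CurrentFlag = 1;

INSERT INTO dbo.Customer
    (CustomerAlternateKey, FullName, City, EffectiveFrom, EffectiveTo, CurrentFlag)
VALUES (N'C-1243', N'John Smith', N'Glasgow', @ChangeDate, NULL, 1);

COMMIT TRANSACTION;
```

Both rows keep their own CustomerKey, so existing facts continue to point at the version that was current when they were loaded.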
14. Facts in Relational Terms
Keys
Indexing
Partitioning
Processing
Internet Sales (measures)
Sales Amount
Order Quantity
Tax Amount
Unit Price
Transaction Count
Consider Row and Page compression
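A hedged sketch of how compression might be evaluated and then applied. The table dbo.FactInternetSales and its columns are hypothetical; data compression in SQL Server 2008 is an Enterprise Edition feature.

```sql
-- Hypothetical fact table holding some of the measures on this slide.
CREATE TABLE dbo.FactInternetSales (
    OrderDateKey  int      NOT NULL,
    CustomerKey   int      NOT NULL,
    SalesAmount   money    NOT NULL,
    OrderQuantity smallint NOT NULL,
    TaxAmount     money    NOT NULL,
    UnitPrice     money    NOT NULL
);

-- Estimate the space saved by page compression before committing to it.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'FactInternetSales',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';

-- Apply page compression to every partition of the table.
ALTER TABLE dbo.FactInternetSales
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);
```

Substituting ROW for PAGE in both statements evaluates and applies row compression instead.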
15. Fact Keys and Indexes
Is a surrogate/primary key required?
Beware the clustered index/primary key
Prefer the date FK as the clustered index
Add NOCHECK to foreign keys
Indexes are usually not useful
Unless processing degenerate dimensions
Or servicing ROLAP/direct to SQL queries
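These points could be applied roughly as follows. Table, index and constraint names, as well as the column list, are hypothetical; the fact table deliberately has no surrogate primary key.

```sql
CREATE TABLE dbo.Customer (
    CustomerKey int NOT NULL PRIMARY KEY CLUSTERED
);

-- No surrogate key on the fact table; rows are identified by their keys.
CREATE TABLE dbo.FactInternetSales (
    OrderDateKey int   NOT NULL,   -- date foreign key, e.g. 20090401
    CustomerKey  int   NOT NULL,
    SalesAmount  money NOT NULL
);

-- Cluster on the date key so rows for one period sit together on disk.
CREATE CLUSTERED INDEX CIX_FactInternetSales_OrderDateKey
    ON dbo.FactInternetSales (OrderDateKey);

-- Declare the relationship without validating existing rows...
ALTER TABLE dbo.FactInternetSales WITH NOCHECK
    ADD CONSTRAINT FK_FactInternetSales_Customer
    FOREIGN KEY (CustomerKey) REFERENCES dbo.Customer (CustomerKey);

-- ...and switch off checking for rows loaded from now on.
ALTER TABLE dbo.FactInternetSales
    NOCHECK CONSTRAINT FK_FactInternetSales_Customer;
```

The constraint still documents the relationship for tools that read the schema, while loads avoid the per-row lookup cost.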
16. Fact Partitioning – Why?
Parallel processing
Only process most recent data
Multiple storage engine threads during query
Archive off data
Multiple aggregation strategies
NB: Partitions require Enterprise Edition
17. Fact Partitioning – Guidelines
Partition when fact tables are 50-100GB+
Ideal partition size 2M-20M rows
Less than 1000 partitions per measure group
This wins over partition size
Prefer to partition over time
Cannot aggregate higher than partition grain
Align AS and SQL partitions!
Calculated time keys become very useful
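A minimal sketch of monthly partitioning on an integer date key, assuming keys of the form yyyymmdd. All object names, the boundary values and the use of a single filegroup are illustrative assumptions.

```sql
-- One partition per month; each boundary value starts a new partition.
CREATE PARTITION FUNCTION pfOrderMonth (int)
AS RANGE RIGHT FOR VALUES (20090201, 20090301, 20090401);

-- Everything on PRIMARY for simplicity; real layouts usually spread filegroups.
CREATE PARTITION SCHEME psOrderMonth
AS PARTITION pfOrderMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactInternetSales (
    OrderDateKey int   NOT NULL,
    CustomerKey  int   NOT NULL,
    SalesAmount  money NOT NULL
) ON psOrderMonth (OrderDateKey);

CREATE CLUSTERED INDEX CIX_FactInternetSales_OrderDateKey
    ON dbo.FactInternetSales (OrderDateKey)
    ON psOrderMonth (OrderDateKey);

-- Source query for the matching Analysis Services partition (February 2009).
-- Its range lines up exactly with one SQL partition.
SELECT OrderDateKey, CustomerKey, SalesAmount
FROM dbo.FactInternetSales
WHERE OrderDateKey >= 20090201
  AND OrderDateKey <  20090301;
```

Because the key is a calculated integer, the boundaries for any month can be derived by arithmetic rather than looked up in the time dimension.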
18. Fact Storage
MOLAP, ROLAP or HOLAP
[Diagram: Source Data, Facts and Aggregations, each stored as Relational or Multidimensional]
19. Proactive Caching
Cube = “Cache”
Automatic invalidation of cube
Automatic rebuild of cube
[Diagram: query flow; labels: Query, SQL Query, Valid?]
20. Quick Storage Engine Tuning
Ensure attribute relations are implemented
Turn on query log
Run Usage Based Optimisation (UBO) wizard