Machine Learning Model Validation (Aijun Zhang 2024).pdf
L16 l17 Data Warehousing
1. Rushdi Shams, Dept of CSE, KUET
Database SystemsDatabase Systems
Data WarehousingData Warehousing
Version 1.0Version 1.0
1
2. Rushdi Shams, Dept of CSE, KUET
The Advent of Data WarehousingThe Advent of Data Warehousing
The existing database models were notThe existing database models were not
suitable to meet the requirements.suitable to meet the requirements.
The requirements can be categorized intoThe requirements can be categorized into
two-two-
1.1. Operational UseOperational Use
2.2. Decision Support UseDecision Support Use
2
3. Rushdi Shams, Dept of CSE, KUET
Operational UseOperational Use
Requires precise, accurate, andRequires precise, accurate, and instantinstant picture ofpicture of
databasedatabase
Day to day basis business-Day to day basis business-
1.1. Customer comesCustomer comes
2.2. Orders partsOrders parts
1.1. Search the partsSearch the parts
2.2. Book/purchase the partsBook/purchase the parts
3.3. Add datesAdd dates
1.1. Bank transactions on the purchase/ bookingBank transactions on the purchase/ booking
2.2. InvoiceInvoice
3
4. Rushdi Shams, Dept of CSE, KUET
Operational UseOperational Use
Customer-company direct interactionCustomer-company direct interaction
All the information are processedAll the information are processed
instantaneously (or almost instantaneously)instantaneously (or almost instantaneously)
4
5. Rushdi Shams, Dept of CSE, KUET
Decision Support UseDecision Support Use
Operational use magnifies the scope-Operational use magnifies the scope-
Which customer, where he lives, what is hisWhich customer, where he lives, what is his
phone number, which part he bought, howphone number, which part he bought, how
much he paid, what was the date, bla bla bla…much he paid, what was the date, bla bla bla…
Decision support use narrows the scope-Decision support use narrows the scope-
I need only the business related issues- whichI need only the business related issues- which
customer, which part he bought, how muchcustomer, which part he bought, how much
he paid and what was the datehe paid and what was the date
5
6. Rushdi Shams, Dept of CSE, KUET
Decision Support UseDecision Support Use
…… & the benefits are-& the benefits are-
In december, the company may need to stockIn december, the company may need to stock
DDR RAM more than HDDDDR RAM more than HDD
SATA HDDs are more sold than PATA HDDsSATA HDDs are more sold than PATA HDDs
Mr. X is our honourable customer who boughtMr. X is our honourable customer who bought
most of the RAMs and Mr. Y is our honourablemost of the RAMs and Mr. Y is our honourable
customer who bought most of the SATA HDDscustomer who bought most of the SATA HDDs
6
7. Rushdi Shams, Dept of CSE, KUET
And The War Begins…And The War Begins…
So, the conflict between lightspeedSo, the conflict between lightspeed
applications (OLTP) and slog futureapplications (OLTP) and slog future
predictions led an advent of datapredictions led an advent of data
warehousing.warehousing.
7
8. Rushdi Shams, Dept of CSE, KUET
Relational DatabasesRelational Databases
Too granular, too many little piecesToo granular, too many little pieces
Processing takes longer time for largerProcessing takes longer time for larger
transactions by joining those little piecestransactions by joining those little pieces
Very effective for Front End applications thatVery effective for Front End applications that
are accessed by too many people tooare accessed by too many people too
frequentlyfrequently
Requires less hardware specificationRequires less hardware specification
8
9. Rushdi Shams, Dept of CSE, KUET
Data warehousingData warehousing
Processes large amount of informationProcesses large amount of information
Too less users (basically the owners)Too less users (basically the owners)
Mainly for reporting and analysisMainly for reporting and analysis
Hardware requirements are hugeHardware requirements are huge
9
10. Rushdi Shams, Dept of CSE, KUET
The relation between themThe relation between them
Data Warehousing is simplest form ofData Warehousing is simplest form of
relational databaserelational database
Try to only add data and remove data…Try to only add data and remove data…
because most often changing requires hugebecause most often changing requires huge
data processingdata processing
And you often do mistake in Keys for just twoAnd you often do mistake in Keys for just two
records, in this case you are dealing withrecords, in this case you are dealing with
millions of records- so, think about datamillions of records- so, think about data
modificationsmodifications
10
11. Rushdi Shams, Dept of CSE, KUET
The relation between themThe relation between them
The one-many / many-many / many-oneThe one-many / many-many / many-one
relations and key constraints of relationalrelations and key constraints of relational
model is still present in data warehousingmodel is still present in data warehousing
11
12. Rushdi Shams, Dept of CSE, KUET
The Dimensional Data ModelThe Dimensional Data Model
So, if data warehouse needs a different dataSo, if data warehouse needs a different data
model rather than relational model, what thatmodel rather than relational model, what that
would be?would be?
The answer is dimensional data modelThe answer is dimensional data model
12
13. Rushdi Shams, Dept of CSE, KUET
The Dimensional Data ModelThe Dimensional Data Model
Contains-Contains-
1.1. FactsFacts
2.2. DimensionsDimensions
Fact table contains transactions. ForFact table contains transactions. For
example, invoices of all the customers forexample, invoices of all the customers for
the last 5 years.the last 5 years.
The dimension tables describe the fact table.The dimension tables describe the fact table.
13
14. Rushdi Shams, Dept of CSE, KUET
The Dimensional Data ModelThe Dimensional Data Model
Static Data
Dynamic Data
14
15. Rushdi Shams, Dept of CSE, KUET
The Star SchemaThe Star Schema
The most effective approach to model dataThe most effective approach to model data
using dimensional data model is theusing dimensional data model is the StarStar
SchemaSchema
15
17. Rushdi Shams, Dept of CSE, KUET
The Star Schema: Equivalent DiagramThe Star Schema: Equivalent Diagram
17
18. Rushdi Shams, Dept of CSE, KUET
The Star Schema: PropertiesThe Star Schema: Properties
So, a star schema contains a fact table- whichSo, a star schema contains a fact table- which
is robust as the time goes by, very dynamic,is robust as the time goes by, very dynamic,
changes all the timechanges all the time
A star schema contains dimension tables-A star schema contains dimension tables-
which are static, changes very little as thewhich are static, changes very little as the
time goes bytime goes by
Star schema aids queries to join a bulky factStar schema aids queries to join a bulky fact
table with dimension tables to be simple andtable with dimension tables to be simple and
not time complexnot time complex
18
19. Rushdi Shams, Dept of CSE, KUET
The Snowflake SchemaThe Snowflake Schema
Normalized star schemaNormalized star schema
Only the dimensions are normalizedOnly the dimensions are normalized
The result is a fact table connected directlyThe result is a fact table connected directly
with some dimension tables and somewith some dimension tables and some
dimension tables connected to otherdimension tables connected to other
dimension tablesdimension tables
19
20. Rushdi Shams, Dept of CSE, KUET
The Snowflake SchemaThe Snowflake Schema
Fact Table
Normalized
Dimension
Dimension
20
21. Rushdi Shams, Dept of CSE, KUET
The Snowflake Schema: EquivalentThe Snowflake Schema: Equivalent
ViewView
21
22. Rushdi Shams, Dept of CSE, KUET
The ProblemThe Problem
Not too many tables but too many layersNot too many tables but too many layers
The most used Relational algebra inThe most used Relational algebra in
dimensional database isdimensional database is JoinJoin
Too many tables in joins, too many overheads.Too many tables in joins, too many overheads.
There are not many tables hereThere are not many tables here
But too many layers, joining one tableBut too many layers, joining one table
requires joining other related tablesrequires joining other related tables
And if one of those tables (Fact) have trillionsAnd if one of those tables (Fact) have trillions
of data, you are dead!of data, you are dead!
22
23. Rushdi Shams, Dept of CSE, KUET
The ProblemThe Problem
If the SALE fact table has 1 million records, and all
dimensions contain 10 records each, a Cartesian
product would return 106
multiplied by 109
records.
That makes for 1015
records
23
24. Rushdi Shams, Dept of CSE, KUET
The SolutionThe Solution
Convert the snowflake schema into starConvert the snowflake schema into star
schema.schema.
24
26. Rushdi Shams, Dept of CSE, KUET
The SolutionThe Solution
a join occurs between one fact table and six
dimensional tables. That is a Cartesian product of 106
multiple by 106
, resulting in 1012
records returned.
26
27. Rushdi Shams, Dept of CSE, KUET
The DifferenceThe Difference
The difference between 1012
and 1015
is three
decimals.
Three decimals is not just three zeroes and
thus 1,000 records. The difference is actually
1,000,000,000,000,000 – 1,000,000,000,000 =
999,000,000,000,000.
27
28. Rushdi Shams, Dept of CSE, KUET
Types of Dimension TablesTypes of Dimension Tables
Dimension tables showed so far areDimension tables showed so far are
inadequateinadequate
Typically, there are some conventions forTypically, there are some conventions for
dimension tables.dimension tables.
Such as dates and locations are two commonSuch as dates and locations are two common
dimension tables in data warehouses.dimension tables in data warehouses.
Why?? Most businesses have two commonWhy?? Most businesses have two common
issues- date of a transaction, place ofissues- date of a transaction, place of
shipment/ deliveryshipment/ delivery
28
29. Rushdi Shams, Dept of CSE, KUET
Types of Dimension Tables: DatesTypes of Dimension Tables: Dates
29
30. Rushdi Shams, Dept of CSE, KUET
Types of Dimension Tables: DatesTypes of Dimension Tables: Dates
30
31. Rushdi Shams, Dept of CSE, KUET
Types of Dimension Tables: LocationsTypes of Dimension Tables: Locations
Locations, states, country, continent, etcLocations, states, country, continent, etc
31
32. Rushdi Shams, Dept of CSE, KUET
Let’s Create a DataLet’s Create a Data
Warehouse ModelWarehouse Model
32
33. Rushdi Shams, Dept of CSE, KUET
The Relational ModelThe Relational Model
33
34. Rushdi Shams, Dept of CSE, KUET
Step 1Step 1
Identify the Fact tableIdentify the Fact table
The Fact table contains (mostly) transactionsThe Fact table contains (mostly) transactions
that occur day-to-day basis/ that are relatedthat occur day-to-day basis/ that are related
with money/ anything that is the mainwith money/ anything that is the main
purpose of a businesspurpose of a business
34
35. Rushdi Shams, Dept of CSE, KUET
Step 1: Finding the Fact TableStep 1: Finding the Fact Table
35
36. Rushdi Shams, Dept of CSE, KUET
Step 1Step 1
So, our fact table would be (in this case)So, our fact table would be (in this case)
RoyaltyRoyalty
36
37. Rushdi Shams, Dept of CSE, KUET
Step 2: Find Dimension TablesStep 2: Find Dimension Tables
Find the tables that are static, not dynamic…Find the tables that are static, not dynamic…
dynamic one is the Fact table.dynamic one is the Fact table.
We will take a look at both the staticWe will take a look at both the static
(dimension) tables and dynamic (fact) tables(dimension) tables and dynamic (fact) tables
when we will finish step 3when we will finish step 3
37
38. Rushdi Shams, Dept of CSE, KUET
Step 3Step 3
Develop a snowflake schema with the fact andDevelop a snowflake schema with the fact and
dimension tablesdimension tables
38
41. Rushdi Shams, Dept of CSE, KUET
Step 4Step 4
Develop a star schema by denormalizing theDevelop a star schema by denormalizing the
snowflake schemasnowflake schema
41
42. Rushdi Shams, Dept of CSE, KUET
Step 4: Star SchemaStep 4: Star Schema
42
43. Rushdi Shams, Dept of CSE, KUET
Step 4: Star SchemaStep 4: Star Schema
43
44. Rushdi Shams, Dept of CSE, KUET
Surrogate Key: Important Key in DataSurrogate Key: Important Key in Data
WarehouseWarehouse
A customer is recognized in table 1 by customerA customer is recognized in table 1 by customer
namename
The same person in table 2 is recognized byThe same person in table 2 is recognized by
telephone numbertelephone number
The same person in table 3 is recognized by SSNThe same person in table 3 is recognized by SSN
numbernumber
If you have to make table 1, 2, 3 as dimension tables,If you have to make table 1, 2, 3 as dimension tables,
then the fact table will not be able to recognize thethen the fact table will not be able to recognize the
same person having 3 foreign keys from those tablessame person having 3 foreign keys from those tables
44
45. Rushdi Shams, Dept of CSE, KUET
Surrogate Key: Important Key in DataSurrogate Key: Important Key in Data
WarehouseWarehouse
45
46. Rushdi Shams, Dept of CSE, KUET
Understanding the Fact TableUnderstanding the Fact Table
Facts are numeric valuesFacts are numeric values
Facts are not the foreign key fieldsFacts are not the foreign key fields
The foreign keys are used to provide moreThe foreign keys are used to provide more
detail with a fact- which are the main focus ofdetail with a fact- which are the main focus of
the businessthe business
46
47. Rushdi Shams, Dept of CSE, KUET
ReferenceReference
Beginning Database Design by GavinBeginning Database Design by Gavin
Powell, Wrox Publications, 2005Powell, Wrox Publications, 2005
47