2. Introduction
• Who am I?
– Michael Wacey
– Partner with CSC since 1986
– Architected many large-scale data warehouses
• What are we going to discuss today?
– Motivation
– Tools
– Approach
3. Motivation
• Data Here, Data There, Data Everywhere
• Solutions
– Architecture – the SAP approach – very hard to sustain, and SAP cannot solve all
problems
– Data Integration – requires architecture on the boundaries and infrastructure, lots of
infrastructure
– Data Warehouse – Periodically collect the data and bring it all together for one or
more purposes – the best bet for the foreseeable future
• Every solution is trying to answer one question: how do we get this data to
fit together?
4. Motivation
• Making data fit together is difficult
– Individual countries report numbers in their local (possibly multiple) currencies, and
there is no agreed-upon set of conversion rates
– The Trust department would rather not share that data with finance
– The current policy administration system has serious data quality issues; a new
system is being built and is scheduled to go online in June 2011, but that date may
be in jeopardy
• We need a way to collect and analyze all this knowledge about the
data
5. Motivation
• A high level view:
[Figure: boxes for Accounting, Sales, and Marketing connected to a central Data Warehouse, along with Customer Profitability and Sales Forecasts]
• May help with scoping
• Each line could represent many files or feeds
• Each box could represent many applications
6. Motivation
• A detailed view:
BEGIN
  SELECT ml.sequence, al.sequence, m.msgkey INTO mseq, aseq, mkey
    FROM mqseries.levelcodes ml, mqseries.messages m, mqseries.appctl a, mqseries.levelcodes al
   WHERE m.msglevel = ml.levelcodekey
     AND m.msgcode = inmsgcode
     AND a.msglevel = al.levelcodekey
     AND a.appctlkey = 1;
  IF sql%ROWCOUNT = 1 THEN
    IF aseq <= mseq THEN
      SELECT statuscodekey INTO sck FROM mqseries.statuscodes WHERE statuscode = 'n';
      INSERT INTO mqseries.msglog (msglogkey, msgkey, msgdata, msgstatus, msgsqlcode, msgsqlerrm)
      VALUES (mqseries.msgseq.nextval, mkey, inmsgdata, sck, inmsgsqlcode, SUBSTR(inmsgsqlerrm, 1, 4000));
      IF incommit = true THEN
        COMMIT;
      END IF;
    END IF;
  ELSE
• Too much detail to plan, analyze, and understand
• As usual, we have a forest-and-trees problem
7. Motivation
• What to do?
– PowerPoint?
– Visio?
– ERwin?
• They all help, but none gives us the right picture
• We need a way to see the problem and the solution at the right
level of detail
8. Motivation
• What is a data warehouse?
• It includes:
– Sources of data
– Processing of data
– Storage of data – probably multiple times in different structures
– Analytics
• Except for Analytics, these are either static views of data or
dynamic processing of data
• ERwin DM is great for the static views of data; we just need a way to
capture the dynamic processing
9. Motivation
• I have used many techniques to capture the dynamic processing
• Spreadsheets to capture data mapping (who hasn't?)
• Process flow diagrams in PowerPoint and Visio
• UML Diagrams in the IBM and Sparx tools
• They all worked to an extent but were hard to maintain and did not
provide a leveling mechanism
10. Motivation
• Many years ago, I used Data Flow Diagrams to describe
systems under development
• They provided insight into the flow of data and leveling of those
processes
• So, I tried that – first in Visio and later in ERwin PM
• The rest of this talk is an approach to using ERwin DM and ERwin
PM together to model a Data Warehouse
• I have used this approach for the past five years and have found it very
successful
• It provides information to both the user community and
developers
11. The Tools
• ERwin Data Modeler
– Used to model databases
– Supports both Logical and Physical models
– If needed, I create conceptual models in PowerPoint or Visio
– Each model has to represent one type of database
– But data warehouses use many types – Flat Files, Oracle, SQL Server, Cubes, etc.
– I use a User Defined Property (UDP) to record the actual type of an Entity/Table
– For example, a table that represents a flat file would have that setting in a UDP
12. The Tools
• ERwin Process Modeler (ERwin PM)
– Previously called BPwin
– Supports several diagram types
– I have found only the Data Flow diagrams useful for the design of a data warehouse
– The other diagrams could be used in analysis to understand how the data warehouse
will be used
13. The Tools
• ERwin DM and ERwin PM
• There is a connection between the tools
• I have not used it extensively
14. The Tools
• Other Tools
– These are minor but needed
– PDF Viewer
– Microsoft Excel
– Microsoft Word
15. The Approach
• So, we have two tools to design a data warehouse
• ERwin DM will be used to design and document static data stores
• ERwin PM will be used to design the processing
• Let's take a look at an example and then discuss how it works
16. The Approach
• Start in ERwin PM
• Create a new model that is a data flow model
• First we will create a context model
• This will provide a view of the sources and uses of data
• On the left side, the sources of data are listed – using the external
entity symbol
– Sources can be Systems, Databases, People, etc.
• On the right hand side, the uses of data are listed – using the
external entity symbol
– Uses can be reports, cubes, analytics, data feeds, etc.
17. The Approach
[Figure: context data flow diagram A-0, "Customer Profitability"]
• Sources (external entities, left side):
– E1 Allocation Factors
– E2 Demand Deposit Accounts
– E3 Consumer Loans
– E4 Mortgages
– E5 Commercial Loans
– E6 Treasury
– E7 Trust Accounts
– E8 Organization
– E9 General Ledger
• Central process: A0 Customer Profitability
• Uses (external entities, right side):
– E11 Exception Report
– E12 Balancing Report
– E13 Commercial Customer Analytics
– E14 Retail Customer Analytics
18. The Approach
• The Context Diagram is a good start
• It sets the scope
• But it does not provide any details about what is going to be done
• This comes in the next diagram – The details of the central process
19. The Approach
[Figure: level-one data flow diagram A0, "Customer Profitability"]
• Activities A1 through A6, covering sourcing, the Customer Profitability calculation, exception output, Commercial BI (A4), Retail BI, and balancing
• Data stores D1 through D4, covering Customer Profitability Staging (D1), the Customer Profitability Data Warehouse (D2), Exceptions (D3), and balance values (D4)
• Inputs are the source data flows from the context diagram; outputs are Exception Report Data, Commercial Customer Data, Retail Customer Data, and Balancing Report Data
20. The Approach
• This level one diagram shows all the key components of the
solution.
• There is no magic formula for what should be included here
• There need to be at least some sourcing, processing, and
display/output activities
• In this case, there are one sourcing, one calculation, and four
output activities
• Each can be broken down into more details
• Let's look at the Commercial BI Activity
21. The Approach
[Figure: data flow diagram A4, "Commercial BI". Activities A4.1, A4.3, and A4.6 and data store D16 take Commercial BI Data and Commercial Balancing Data through the commercial cube to Commercial Profitability Reporting and Commercial Customer Data]
22. The Approach
• This decomposition can continue until you are comfortable
• I try to get to the point where one developer can implement it in
one module
• At this point, we will have a series of diagrams that show the flow
of data through the system
• The diagrams contain:
– Activities
– Data Stores (note that a single data store can be used on multiple diagrams)
– Data Flows
– External Entities
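The leveling described above can be pictured as a tree of activities. What follows is only a minimal sketch in Python, not something ERwin PM produces: the Activity class and its add_child and leaves methods are hypothetical names introduced for illustration. The sketch checks that each child number sits exactly one level below its parent and lists the activities that have not been decomposed further, which are the candidates for "one developer, one module".

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    """One activity box on a data flow diagram (hypothetical structure)."""
    number: str  # for example "A0", "A4", "A4.1"
    name: str
    children: list = field(default_factory=list)

    def add_child(self, number, name):
        """Attach a lower-level activity after checking its number."""
        # Children of the top-level process A0 are A1, A2, ...;
        # children of any other activity append ".n" to the parent number.
        prefix = "A" if self.number == "A0" else self.number + "."
        suffix = number[len(prefix):]
        if not number.startswith(prefix) or not suffix.isdigit():
            raise ValueError(f"{number} is not one level below {self.number}")
        child = Activity(number, name)
        self.children.append(child)
        return child

    def leaves(self):
        """Return the activities that have not been decomposed further."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


root = Activity("A0", "Customer Profitability")
commercial_bi = root.add_child("A4", "Commercial BI")
```

Calling root.leaves() at any point gives the current list of undecomposed activities, which can serve as a rough work inventory while the decomposition is still in progress.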
23. The Approach
• Each of the diagram elements, except for the Data Flows, can be
further modeled in ERwin DM
• This gives the developer a further level of detail of what is intended
• It also provides the physical names that will be used
• To maintain the mapping between the models, I use a naming
convention for ERwin DM Subject Areas
• The convention is:
– A01.01.01 – {Activity Name}
– D01 – {Data Store Name}
– E01 – {External Name}
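A convention like this is easy to check mechanically. The sketch below is an illustration only and does not use any ERwin API: the function name parse_subject_area and the KINDS table are assumptions, and the rule that only activities may have more than one level is inferred from the three patterns above.

```python
import re

# Prefix letters from the convention above (the mapping table is hypothetical).
KINDS = {"A": "activity", "D": "data store", "E": "external entity"}

# One prefix letter, one or more two-digit levels separated by periods,
# then a dash (en dash or hyphen) and the element name.
SUBJECT_AREA = re.compile(r"^([ADE])(\d{2}(?:\.\d{2})*)\s+[\u2013-]\s+(.+)$")


def parse_subject_area(name):
    """Split a subject area name into (kind, levels, element name)."""
    match = SUBJECT_AREA.match(name)
    if match is None:
        raise ValueError(f"not a valid subject area name: {name!r}")
    prefix, number, element = match.groups()
    levels = [int(part) for part in number.split(".")]
    # Assumption: only activities are hierarchical, so data stores and
    # external entities carry a single number.
    if prefix != "A" and len(levels) != 1:
        raise ValueError(f"{KINDS[prefix]} names take a single number: {name!r}")
    return KINDS[prefix], levels, element
```

Running a check like this over an exported list of subject area names would flag any that have drifted from the convention.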
24. The Approach
• Some examples for External Entities and Data Stores from the
model above:
– D01 – Customer Profitability Staging
– E05 – Commercial Loans
• Each of these subject areas should have the portion of the data
model relevant to it
• Note that these are just typical ER models
• They can represent more than just tables – for example, an external
entity could be a flat file
• Below is an example – the E05 – Commercial Loans external entity
26. The Approach
• Next we need to look at the activities
• Because activities have a hierarchical numbering system, we need
one for the subject areas
• We simply start with A and separate each level with a period
• For example, Combine Retail Loans from the model above is Activity 7 inside
Activity 2, so it is called A2.7 Combine Retail Loans in the model.
• The associated subject area will be:
– A02.07 – Combine Retail Loans
• The data model will show the input and output entities and how
they are processed
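The mapping from an ERwin PM activity number to its ERwin DM subject area name can be sketched in a few lines of Python. This is an illustration rather than a feature of either tool: the function name activity_subject_area is hypothetical, and padding each level to two digits is read off the A02.07 example above.

```python
import re


def activity_subject_area(activity_number, activity_name):
    """Turn an activity number such as 'A2.7' into a subject area name."""
    # Accept "A" followed by period-separated level numbers, e.g. A4 or A2.7.
    if re.fullmatch(r"A\d+(\.\d+)*", activity_number) is None:
        raise ValueError(f"not an activity number: {activity_number!r}")
    levels = activity_number[1:].split(".")
    # Pad every level to two digits, following the A02.07 example.
    padded = ".".join(f"{int(level):02d}" for level in levels)
    return f"A{padded} \u2013 {activity_name}"


print(activity_subject_area("A2.7", "Combine Retail Loans"))
```

A side effect of the zero padding is that subject areas sort in diagram order when listed alphabetically, for any level numbered below 100.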
28. The Approach
• With the diagrams from ERwin DM and ERwin PM, and the narrative in
ERwin PM, the developer has all the information they need to
implement a portion of the solution
• The diagrams and narratives are also accessible to technical users
• Twice, I have had the user community write papers to explain the
details of specific areas of the ERwin PM model
29. The Approach
• Notes
– Using ERwin DM we can quickly build detailed reports with diagrams and
descriptions
– The developers use these reports to track what they have to do
– The Project Managers use these reports as an inventory for project planning
– The ERwin PM reports are like a roadmap that ties everything together
– It takes some effort to keep everything synchronized but it is well worth it
30. The Approach
• In Summary
– A data warehouse is very much a store of data and a flow of data
– ERwin DM and ERwin PM can model both of these areas
– Use ERwin PM to decompose the solution
• There is no right or best decomposition
• Try it until it works
– Use ERwin DM to model the internals of External Entities, Data Stores, and Activities
• Tie the two models together through an appropriate naming convention
• Do not worry if the entities model more than tables
– The goal is to communicate with users and developers