DecisionLab.Net
business intelligence is business performance
DecisionLab http://www.decisionlab.net dupton@decisionlab.net direct 760.525.3268
http://blog.decisionlab.net Carlsbad, California, USA
Data Vault: Data Warehouse Design Goes Agile
Whitepaper
Data Vault:
Data Warehouse Design Goes Agile
by daniel upton
data warehouse modeler and architect
certified scrum master
DecisionLab.Net
business intelligence is business performance
dupton@decisionlab.net
http://www.linkedin.com/in/DanielUpton
Without my (the writer’s) explicit written permission in advance, the only permissible reproduction or copying of this written material is in the form of a
review or a brief reference to a specific concept herein, either of which must clearly specify this writing’s title, author (me), and this web address
http://www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault . For permission to reproduce or copy any of this material other than what is
specified above, just email me at the above address.
Open Question: When we begin considering a new Data Warehouse initiative, how clear is the
scope, really?
If we intend to design Data Marts, and we have no specified need for a data warehouse either to become a system of record or to support Master Data Management (MDM), then we may choose Dr. Ralph Kimball’s Data Warehouse Bus architecture, designing a library of conformed (standardized, re-usable) dimension and fact tables for deployment into a series of purpose-built data marts. Under these requirements, we may have no specific need for an Inmon-style third-normal-form (3nf) Enterprise Data Warehouse (EDW) in general, or for a Data Vault in particular. In other cases, however, because data warehouse data sometimes outlives its corresponding source data inside a soon-to-retire application database, a data warehouse may, like it or not, assume a system-of-record role for its data, as Bill Inmon reminds us. Because the Kimball Bus architecture’s tables are often not related via key fields, and in fact may not be populated at all until deployment from the Bus into a specific-needs Data Mart, Kimball adherents rarely assert a system-of-record role for their solutions.
But suppose we do determine that our required solution either needs to assume a system-of-record role, or perhaps must support Master Data Management. In that case, we may elect to design a fully functional EDW, rather than Kimball’s DW Bus, so that the EDW itself, and not just its dependent data marts, is a working, populated database. Now, knowing that the creation of a classic EDW, with its requirement for an up-front, enterprise-wide design, is a challenge given today’s expectations for rapid delivery, some may be curious about whether new design methodologies offer ways to accelerate EDW design. Data Vault, a data warehouse modeling method with a substantial following in Denmark and a growing base in the U.S., offers specific and important benefits.
In order to set expectations early about Data Vault, readers must understand that, somewhat unlike a traditional EDW, and utterly unlike a star schema, a Data Vault (not to be confused with Business Data Vault, which is not addressed in this article) cannot serve as an efficient presentation layer appropriate for direct queries. Rather, it is more like a historic enterprise data staging repository that, with additional downstream ETL, will support not only star schemas, reporting, and data mining, but also master data management, data quality, and other enterprise data initiatives.
Data Vault Benefits:
 Benefit #1: Allows loading of a history-tracking DW with little or none of the typical extraction, transformation and loading (ETL) transformations that, once they are finally figured out, would otherwise contain subjective interpretations of the data, and which purportedly enhance the data and prepare it for reporting or analytics.
o In my view, this is almost enough of a benefit all by itself. As such, in the introduction that follows, I will focus on proving this point.
o Agile Win: Confidently loading a DW without having to already know the fine details of business rules, requirements, and the resulting transformation requirements means that loading of historical and incremental data could be accomplished before the first target database design (3nf EDW or Data Mart) is complete. (A minimal change-detection sketch follows this list.)
 Benefit #2: Insofar as Data Vault prescribes a very generic downstream ‘de-constructing’ of OLTP tables, these de-constructing transformations can be automated, and so can the associated early-stage ETL into the Data Vault. Since, as you’ll soon see, Data Vault causes a substantial increase in the number of tables, this automation potential is a substantial benefit.
o Agile Win: Automated initial design and loading, anyone?
 Benefit #3: Due to Data Vault’s generic design logic, its use of surrogate keys (more on this soon), and its prescription to avoid subjective-interpretive transformations, it is reasonable to quickly load a Data Vault with just the needed subset of tables.
o Agile Win: More frequent releases. Quickly design for, and load, only the data needed for the next release. Use the same generic design to load other tables when those User Stories from the Product Backlog get placed into a Sprint.
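To make Benefit #1 concrete, here is a minimal sketch of purely mechanical change detection, the kind of logic that lets a load proceed without interpreting the data: the decision to record history compares fingerprints of non-key values, never business meaning. The function names are hypothetical, and the hash-compare technique is one common practice rather than anything Data Vault strictly prescribes.

```python
import hashlib

def change_hash(row: dict, non_key_cols: list) -> str:
    """Fingerprint of a row's non-key values. Loading decisions compare
    fingerprints only, so no business meaning is read into the data."""
    payload = "|".join(repr(row.get(c)) for c in non_key_cols)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def needs_new_history_row(incoming: dict, latest_stored_hash,
                          non_key_cols: list) -> bool:
    """True on first sight of a business key, or when any non-key
    attribute changed since the most recently stored row."""
    return (latest_stored_hash is None
            or change_hash(incoming, non_key_cols) != latest_stored_hash)

# Example: a client's email changes upstream; only the fingerprint differs.
old = change_hash({"Name": "Acme", "Email": "a@acme.com"}, ["Name", "Email"])
print(needs_new_history_row({"Name": "Acme", "Email": "b@acme.com"}, old,
                            ["Name", "Email"]))  # -> True
```

No business rule appears anywhere in this logic, which is exactly why it can run before the fine-grained requirements are settled.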
In the remainder of this article, I will provide a high-level introduction to Data Vault, with primary emphasis on how it achieves Benefit #1.
High-Level Introduction to Data Vault Methodology:
We begin with a simple OLTP database design for clients purchasing products from a company’s stores. For simplicity, I include only a minimum of fields. In the diagrams, ‘BK’ means business key, ‘FK’ means foreign key. Refer to Diagram A below.
As is common, this simple OLTP schema does not use surrogate keys. If a client gets a new email address, or a product gets a new name, or a city’s re-mapping of boundary lines suddenly places an existing store in a new city, new values would overwrite the old values, which would then be lost. Of course, in order to preserve history, history-tracking surrogate keys are commonly used by practitioners of both Bill Inmon’s classic third-normal-form (3nf) EDW design and Dr. Ralph Kimball’s Star Schema method, but both of these methods prescribe surrogate keys within the context of data transformations that also include subjective interpretation (herein simply ‘subjective transformation’) in order to cleanse or purportedly enhance the data for the purposes of integration, reporting, or analytics. Data Vault purists claim that any such subjective transformation of line-of-business data introduces inappropriate distortion to it, thereby disqualifying the Data Warehouse as system of record. Data Vault, importantly, provides a unique way to track historical changes in source data while eliminating most, or all, subjective transformations such as field renaming, selective data-quality filters, establishment of hierarchies, calculated fields, and target values. Analytics-driven, subjective transformations can still be applied, but they are applied downstream of the Data Vault EDW, as subsequent transformations for loads into data marts designed to analyze specific processes. Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will now describe. Before beginning, I recommend against too quickly comparing this method with others, like star-schema design, which serve different needs.
Diagram A: Simple OLTP schema (data source for a Data Vault)
Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. The diagram’s Client table is a good example. Hubs work according to the following simplified description:
Hub Tables:
 Define the granularity of an entity (e.g., product), and thus the granularity of non-key attributes (e.g., product description) within the entity.
 Contain a new surrogate primary key (PK), as well as the source table’s business key, which is demoted from its PK role.
Satellite Tables:
 Contain all non-key fields (attributes), plus a set of date-stamp fields.
 Contain, as a Foreign Key (FK), the Hub’s PK, plus the load date-time stamps.
 Have a defining, dependent entity relationship to one, and only one, parent table.
 Whether that parent table is a Hub or a Link, the Satellite holds the non-key fields from the parent table.
 Although on initial loads only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes (e.g., a client’s email address changes) upstream in the OLTP schema (often accomplished up there with a simple over-write), a new row is added only to the Satellite, and not the Hub, which is why many Satellite rows relate to one Hub row. In this fashion, historic changes within source tables are gracefully tracked in the EDW. (A minimal load sketch follows this list.)
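As an illustration of the Hub-and-Satellite mechanics just described, here is a minimal generic loader. It assumes a SQLite-style cursor and deliberately simplified, hypothetical table shapes (hub: sk, bk, load_date; satellite: sk, load_date, row_hash, plus the non-key columns); real Data Vault implementations differ in naming and in how they detect change, so treat this as a sketch, not the prescribed implementation.

```python
import hashlib
from datetime import datetime, timezone

def _row_hash(row, non_keys):
    payload = "|".join(repr(row[c]) for c in non_keys)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

def load_hub_and_satellite(cur, hub, sat, bk_col, non_keys, source_rows):
    """Generic loader: one Hub row per business key (the demoted source PK),
    and a new Satellite row only when the non-key attributes change."""
    now = datetime.now(timezone.utc).isoformat()
    for row in source_rows:
        # Hub: ensure exactly one row per business key.
        cur.execute(f"SELECT sk FROM {hub} WHERE bk = ?", (row[bk_col],))
        found = cur.fetchone()
        if found:
            sk = found[0]
        else:
            cur.execute(f"INSERT INTO {hub} (bk, load_date) VALUES (?, ?)",
                        (row[bk_col], now))
            sk = cur.lastrowid
        # Satellite: append-only history, driven by a fingerprint compare.
        cur.execute(f"SELECT row_hash FROM {sat} WHERE sk = ? "
                    f"ORDER BY load_date DESC LIMIT 1", (sk,))
        latest = cur.fetchone()
        h = _row_hash(row, non_keys)
        if latest is None or latest[0] != h:
            cols = ", ".join(non_keys)
            marks = ", ".join("?" for _ in non_keys)
            cur.execute(
                f"INSERT INTO {sat} (sk, load_date, row_hash, {cols}) "
                f"VALUES (?, ?, ?, {marks})",
                (sk, now, h, *(row[c] for c in non_keys)))
```

Nothing in this loader depends on what the columns mean, so the same code can load a Client table, a Product table, or any other source entity.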
Notice in Diagram B that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the tables, the new location of existing fields, and the various added date-time stamps.
Diagram B: Hubs and Satellites in a partially-designed Data Vault schema
Link Tables:
 Refer to Diagram C
 Relate exactly two Hub tables together.
 Contain, now as non-key values, the primary keys of the two Hubs, plus the Link’s own surrogate PK.
 As with an ordinary association table, a Link is a child to two other tables and, as such, is able to gracefully handle relative changes in cardinality between the two tables and, where necessary, can directly resolve many-to-many relationships that might otherwise cause a show-stopper error in the data-loading process.
 Unlike an ordinary association table, the Link table, with its own surrogate PK, is able to track historic changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that conformed to the initial cardinality between tables would share the same Link table surrogate key, but if an unexpected future source-data change causes a cardinality reversal (so that the one becomes the many, and vice versa), a new row, with a new surrogate key, is generated to capture the new relationship, while the original surrogate key preserves the historical one. Slick! (See the sketch after this list.)
 In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables, too. As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that, in coordination with its surrogate PK, an Order (perhaps for a long-running service) that, after the Order Date, gets re-credited to a different store can be efficiently tracked over time.
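Continuing the earlier sketch’s assumptions (SQLite-style cursor, simplified hypothetical column names), a minimal Link load might look like the following: each distinct pairing of Hub surrogate keys gets its own row, so a changed relationship produces a new row rather than an update, and the old pairing survives as history.

```python
from datetime import datetime, timezone

def load_link(cur, link, hub_a, hub_b, key_pairs):
    """Generic Link loader (names hypothetical): insert a row only for
    pairings of Hub surrogate keys not seen before. Existing rows are
    never updated, so historical relationships are preserved."""
    now = datetime.now(timezone.utc).isoformat()
    for sk_a, sk_b in key_pairs:
        cur.execute(
            f"SELECT sk FROM {link} "
            f"WHERE {hub_a}_sk = ? AND {hub_b}_sk = ?",
            (sk_a, sk_b))
        if cur.fetchone() is None:  # unseen pairing -> new Link row
            cur.execute(
                f"INSERT INTO {link} ({hub_a}_sk, {hub_b}_sk, load_date) "
                f"VALUES (?, ?, ?)", (sk_a, sk_b, now))
```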
Diagram C: Completed Data Vault schema (Link tables added)
Now we’ve added Link tables. After scanning Diagram C, go back and compare it with Diagram A and note the movement of the various non-key attributes. Undoubtedly, you will also notice, and may be concerned, that the source schema’s five tables just morphed into the Data Vault’s twelve. Importantly, note that Diagram A’s Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table. When you consider that an order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty of vital associated data), then it makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to relate the Orders_h and Products_h tables, of course, also needs a Details_l_s Satellite table to hold the vital non-key attributes, Quantity and Unit Price.
The Data Vault method does allow for some interpretation here. You might now be thinking, “Aha! So, we haven’t eliminated all subjective interpretation!” Perhaps not, but what I’ll describe here is a pretty small, generic interpretation. Either way, in this situation, it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite), rather than the Details_l Link. Added to that, if we used very simple Data Vault design-automation logic, which simply de-constructs all tables into Hub-and-Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then have to create not one, but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link table, to connect our tables, and these tables would contain no attributes of apparent value. Therefore, we choose the design that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, but that’s beyond the scope of this article.
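Although a detailed treatment is out of scope, a brief sketch can suggest how such design automation might work. Everything below is hypothetical: the rule shown (a source table that merely associates two others becomes a Link, anything else a Hub, with a Satellite for any non-key attributes) is a simplified version of the heuristic just discussed, and the generated DDL reuses the simplified column names from the earlier sketches.

```python
def deconstruct(table, business_key, fk_tables, non_keys):
    """Emit simplified Data Vault DDL for one source table. A table whose
    identity is just an association of exactly two other tables becomes a
    Link; anything else becomes a Hub. Non-key attributes go to a Satellite."""
    stamps = "load_date TEXT NOT NULL"
    if len(fk_tables) == 2 and business_key is None:
        parent = f"{table}_l"  # e.g. Details -> Details_l
        fk_cols = ", ".join(f"{fk}_sk INTEGER NOT NULL" for fk in fk_tables)
        ddl = [f"CREATE TABLE {parent} (sk INTEGER PRIMARY KEY, "
               f"{fk_cols}, {stamps});"]
    else:
        parent = f"{table}_h"  # e.g. Client -> Client_h
        ddl = [f"CREATE TABLE {parent} (sk INTEGER PRIMARY KEY, "
               f"bk TEXT NOT NULL, {stamps});"]
    if non_keys:
        attr_cols = ", ".join(f"{c} TEXT" for c in non_keys)
        ddl.append(f"CREATE TABLE {parent}_s (sk INTEGER NOT NULL "
                   f"REFERENCES {parent} (sk), {stamps}, row_hash TEXT, "
                   f"{attr_cols});")
    return ddl

# Diagram A's Details table becomes Details_l plus Details_l_s:
for stmt in deconstruct("Details", None, ["Order", "Product"],
                        ["Quantity", "UnitPrice"]):
    print(stmt)
```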
Conclusion:
Our discussion of Data Vault opened with the idea that an EDW should load and store historical data without applying any transformations that contain subjective interpretation of data or business rules, because those interpretations, even if appropriate for specific reporting or analytics, do modify line-of-business data, and therefore introduce distortions into operational data. Those interpretive transformations should occur downstream, during ETL into presentation-layer tables. Although Data Vault does, in fact, apply a specific set of generic ‘de-construction’ transformations, these transformations contain little or no subjective interpretation of business rules. They do, however, allow it to (1) apply an appropriate level of referential integrity to source data even where the source system may lack it now or in the future; (2) gracefully capture historical data changes, within and between tables, without endangering the success of the data load; and (3) support loading of data from a subset of source tables initially, and then load, or not load, other related source data tables much later without compromising the EDW’s referential integrity.
Lastly, and very importantly, (4) Data Vault design and the associated Data Vault loading ETL, which is largely generic from one data set to another, can be automated, and thus radically accelerated in development. Although the logic of this automation flows from the simplicity of Data Vault design, a detailed automation discussion is beyond the scope of this article.
In closing, if we can automatically design and load a Data Warehouse (albeit not its presentation layer), it frees up brain cells for the higher-order design of the presentation layer and the intensive, custom ETL to load it. As described here, all of this can be accomplished simultaneously.
________________________________________________
daniel upton
dupton@decisionlab.net
DecisionLab.Net
business intelligence is business performance
DecisionLab.Net
Range of Services:
_____________________________________________________
Business Intelligence Roadmapping, Feasibility Analysis
BI Project Estimation and Requirement Modelstorming
BI Staff Augmentation: Data Warehouse / Mart / Dashboard Design and Development
Daniel Upton
DecisionLab http://www.decisionlab.net dupton@decisionlab.net
Direct 760.525.3268 http://blog.decisionlab.net Carlsbad, California, USA