Learn to model data to be visible and accessible between NoSQL Big Data repositories and your RDBMS Data Warehouse. Learn how specific RDBMS Data Warehouse data modeling approaches establish flexible integration with NoSQL data sets that do not play by E.F. Codd’s rules.
Data Modeling for Integration of NoSQL with a Data Warehouse
DecisionLab.Net
business intelligence is business performance
DecisionLab http://www.decisionlab.net dupton@decisionlab.net direct 760.525.3268
Carlsbad, California, USA
by Daniel Upton
Opening Questions…
o Why model data?
o What role does visualization play in a data model?
o Why integrate RDBMS data warehouse with NoSQL data?
o What do I mean by “integrate data”?
o Why model data for an integration between an RDBMS Data Warehouse and NoSQL?
o Should data ever be moved between a Data Warehouse and NoSQL? If so, which way?
o Regardless of a decision to move data or not, at what stage in an RDBMS DW environment should we integrate with NoSQL?
o Staging, EDW, Star Schema, Extracts?
o How do Lean and Agile thinking influence our choice between these stages or methods?
o What does a useful model for data integration between NoSQL and RDBMS communicate?
o How well do various Data Modeling methods support integration with NoSQL?
o Does a data scientist need a star schema, or any single version of the truth structure, to obtain needed answers?
o What are some practical guidelines for when we actually need to accomplish this?
What exactly do I mean by “data integration”?
o Identify and “instantiate” joins at specific granularities between different data sets, according to specific topics common to both sets (customer, click, like, product, purchase, inventory, shipment), so that the joins can be exploited now and later.
o Actual data movement and ETL are optional.
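A join "at a specific granularity" can be sketched as follows. This is a minimal illustration, assuming hypothetical customer rows from the warehouse and hypothetical click events from a NoSQL store; the field names and values are invented for the example. The many click events are rolled up to one value per customer, so the join happens at the customer grain and no data is moved or stored.

```python
from collections import defaultdict

# Hypothetical relational rows (one row per customer), e.g. read from a DW table.
customers = [
    {"customer_id": 101, "customer_name": "Acme Co"},
    {"customer_id": 102, "customer_name": "Birch Ltd"},
]

# Hypothetical NoSQL click events (many events per customer).
clicks = [
    {"customer_id": 101, "page": "/pricing"},
    {"customer_id": 101, "page": "/signup"},
    {"customer_id": 103, "page": "/home"},
]


def join_at_customer_grain(customers, clicks):
    """Roll click events up to one count per customer, then equi-join on customer_id."""
    clicks_per_customer = defaultdict(int)
    for event in clicks:
        clicks_per_customer[event["customer_id"]] += 1
    return [
        {**row, "click_count": clicks_per_customer.get(row["customer_id"], 0)}
        for row in customers
    ]


joined = join_at_customer_grain(customers, clicks)
```

The common topic here is "customer"; choosing a different topic (product, shipment) changes only the key and the roll-up.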
Why model for Integration of RDBMS and NoSQL?
o Modeling is a very effective process for defining, visualizing, validating, and communicating data structures, even the more complex ones, in their current or desired state.
Quick Tips: For good ‘Model-Level Integration’ of RDBMS and NoSQL…
o Keep it simple and source-facing. Avoid complex data transformation.
o Model for simple equi-join relationships: ‘one to many’ or ‘one to one’.
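One way to check whether a candidate equi-join is really ‘one to one’ or ‘one to many’ is to count how often each key value repeats on each side. The helper below is a hypothetical sketch, not part of any modeling tool; it looks only at duplicate key values and ignores keys that fail to match.

```python
from collections import Counter


def classify_cardinality(left_keys, right_keys):
    """Classify an equi-join by how often each key value repeats on each side."""
    left_max = max(Counter(left_keys).values(), default=0)
    right_max = max(Counter(right_keys).values(), default=0)
    if left_max <= 1 and right_max <= 1:
        return "one to one"
    if left_max <= 1:
        return "one to many"
    if right_max <= 1:
        return "many to one"
    return "many to many"
```

For example, `classify_cardinality([1, 2, 3], [1, 1, 2])` reports a one to many relationship, which fits the tip above; a many to many result suggests the join key is not yet at the right grain.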
Future Modeling Technology:
o Forward- and reverse-engineering to / from a combined, integrated RDBMS and NoSQL information system.
What about integration without data movement… without ETL?
Tip: Between the DW and NoSQL, avoid data movement just for the sake of integration.
o The goal of data integration does not, by itself, justify moving or substantially transforming the data, because such movement and transformation add overhead.
_____________________
Regardless of a decision to move or not move the data, at what stage in a Data Warehouse environment should we integrate with NoSQL? Staging, EDW, Star Schema, Extracts?
o Staging: Holds increments of data that lack enforceable referential integrity, so it is inherently non-integrated and offers low integration potential.
Enterprise Data Warehouse (Inmon): Entity-Relational Model
o ~3rd Normal Form, Date-Stamped Composite Primary Keys, No Surrogates.
o Strategy to enforce a single version of the truth (SVOT), so each characteristic (attribute) of something (entity) exists in just one field and one table, with each instance as one record.
o Inherent, intentionally rigid interdependence between classic 3NF tables, based on foreign key constraints.
o Pristine data structure is often too rigidly normalized for model-level integration with NoSQL structures that play by different rules.
o Lean / Agile Score: Low. Rigid table structure with strong functional dependencies and specific cardinality baked into design. SVOT design requires data transformation from other sources to comply.
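The rigidity described above can be sketched with SQLite. This is a minimal illustration under assumptions: the `major` and `student` tables, their columns, and all values are hypothetical, and the date-stamped composite keys are one plausible reading of the slide. The primary key and foreign key together bake in "one major per student per date", so data that does not comply is rejected unless it is transformed first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE major (
    major_id   TEXT NOT NULL,
    load_date  TEXT NOT NULL,
    major_name TEXT NOT NULL,
    PRIMARY KEY (major_id, load_date)            -- date-stamped composite key, no surrogate
);
CREATE TABLE student (
    student_id      TEXT NOT NULL,
    load_date       TEXT NOT NULL,
    major_id        TEXT NOT NULL,
    major_load_date TEXT NOT NULL,
    PRIMARY KEY (student_id, load_date),
    FOREIGN KEY (major_id, major_load_date) REFERENCES major (major_id, load_date)
);
""")
conn.execute("INSERT INTO major VALUES ('985', '2015-08-01', 'Data Science')")
conn.execute("INSERT INTO major VALUES ('412', '2015-08-01', 'Statistics')")
conn.execute("INSERT INTO student VALUES ('1357', '2015-08-04', '985', '2015-08-01')")


def try_insert(sql):
    """Run an INSERT and report whether the schema's constraints accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# A second concurrent major for the same student violates the composite primary key.
second_major = try_insert(
    "INSERT INTO student VALUES ('1357', '2015-08-04', '412', '2015-08-01')"
)
# A major the warehouse does not know about violates the foreign key.
unknown_major = try_insert(
    "INSERT INTO student VALUES ('2468', '2015-08-04', '777', '2015-08-01')"
)
```

Both inserts are rejected. A later date-stamped row could record a change of major, but not two majors at once, which is the kind of cardinality rule that NoSQL data may not follow.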
Dimensional / Star Schema: Either as standalone DW Bus (Kimball), or downstream from EDW as data presentation layer.
o Intent is to present a SVOT for pre-defined analyses baked into the star schema.
o Rigid functional dependence between tables.
o Descriptive data is now in a de-normalized dimension table with foreign key relationships only to fact tables containing quantitative fields.
o Lean / Agile Score: Even lower. Even more rigid structure, with added surrogate keys, wherein dimensions relate only to existing RDBMS fact tables for pre-defined analyses. Unique IDs such as Department_Code_14 become non-unique (denormalized), thus weaker for new integrations.
o NoSQL integration requires new Star Schema tables.
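How a source identifier can lose its uniqueness inside a dimension is sketched below. The example assumes a history-tracking (type 2 style) dimension, which is only one common way this happens; the slide does not say which mechanism it has in mind. The table, columns, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_department (
    department_sk   INTEGER PRIMARY KEY,  -- surrogate key, meaningful only inside the star
    department_code TEXT NOT NULL,        -- source business key, no longer unique here
    department_name TEXT NOT NULL,
    valid_from      TEXT NOT NULL
);
INSERT INTO dim_department VALUES
    (1, '14', 'Data Science',               '2014-01-01'),
    (2, '14', 'Data Science and Analytics', '2015-07-01'),
    (3, '22', 'Statistics',                 '2014-01-01');
""")

# Count dimension rows per source code: any count above 1 means that an
# equi-join from outside data on department_code would fan out.
rows_per_code = dict(
    conn.execute(
        "SELECT department_code, COUNT(*) FROM dim_department GROUP BY department_code"
    )
)
```

Here code '14' maps to two rows, so a NoSQL data set keyed on the department code cannot join to the dimension one to one without extra rules about which version to pick.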
Data Vault Method:
o Summary of Hubs, Satellites, Links, Ensembles (Linstedt, Hultgren, Graziano).
o Align data records, via their business keys, across tables and across systems.
o Track changes to source data records while maintaining or enhancing actual referential integrity between related tables.
o Defer the following: (a) the renaming of source attributes per DW naming standards; (b) the selection of desired fields and records to present for reporting; and (c) any application of subjective business rules or an SVOT attempt, until immediately downstream of the model, that is, to a Star Schema or Semantic Layer.
o Use a pattern-based separation of keys, attributes, and relationships to accomplish the above while remaining transparently equivalent and auditable to source data.
o Lean / Agile Score: High. Each ensemble stands alone. Hubs, the sole integration point to other ensembles, have zero functional dependencies. Relationship cardinality between ensembles becomes an association, accepting any cardinality based on actual data, not pre-defined business rules. New data subject areas (ensembles) are easily added and introduce zero new functional dependencies on existing structure.
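The separation of keys (Hubs), attributes (Satellites), and relationships (Links) can be sketched in SQLite as follows. This is an illustrative layout under assumptions, not a reference implementation of the method: the table and column names are hypothetical, major '412' is invented, and deriving keys with an MD5 hash is an assumption borrowed from common practice.

```python
import hashlib
import sqlite3


def hash_key(*business_keys):
    """Deterministic key derived from one or more business keys (MD5 is an assumption)."""
    return hashlib.md5("|".join(business_keys).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hubs: business keys only.
CREATE TABLE hub_student (student_hk TEXT PRIMARY KEY, student_id TEXT UNIQUE NOT NULL,
                          load_date TEXT NOT NULL, record_source TEXT NOT NULL);
CREATE TABLE hub_major   (major_hk TEXT PRIMARY KEY, major_id TEXT UNIQUE NOT NULL,
                          load_date TEXT NOT NULL, record_source TEXT NOT NULL);
-- Satellite: descriptive attributes, date-stamped to track change.
CREATE TABLE sat_major   (major_hk TEXT NOT NULL REFERENCES hub_major (major_hk),
                          load_date TEXT NOT NULL, major_name TEXT,
                          PRIMARY KEY (major_hk, load_date));
-- Link: an association between hubs that accepts any cardinality.
CREATE TABLE link_student_major (
    link_hk       TEXT PRIMARY KEY,
    student_hk    TEXT NOT NULL REFERENCES hub_student (student_hk),
    major_hk      TEXT NOT NULL REFERENCES hub_major (major_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL);
""")

load_date, source = "2015-08-04", "social_survey"
conn.execute("INSERT INTO hub_student VALUES (?, ?, ?, ?)",
             (hash_key("1357"), "1357", load_date, source))
for major_id, major_name in [("985", "Data Science"), ("412", "Statistics")]:
    conn.execute("INSERT INTO hub_major VALUES (?, ?, ?, ?)",
                 (hash_key(major_id), major_id, load_date, source))
    conn.execute("INSERT INTO sat_major VALUES (?, ?, ?)",
                 (hash_key(major_id), load_date, major_name))
    conn.execute("INSERT INTO link_student_major VALUES (?, ?, ?, ?, ?)",
                 (hash_key("1357", major_id), hash_key("1357"), hash_key(major_id),
                  load_date, source))

majors_for_student = conn.execute(
    "SELECT COUNT(*) FROM link_student_major WHERE student_hk = ?",
    (hash_key("1357"),),
).fetchone()[0]
```

The same student links to two majors without any change to existing tables, because the cardinality lives in the data held by the Link rather than in a constraint.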
o Even further from the RDBMS model, the hierarchical data model of a document-based NoSQL store involves nested attributes (with or without unique identifiers).
o Example: JavaScript Object Notation (JSON) Document: Student Likes Major (same content)
{
  "MajorID": "985",   -- Top-level (parent) object with ID (Business Key)
  "MajorName": "Data Science",
  "Student_Likes_Major": {   -- Nested (child) object with IDs (Business Key…
    "Student_1_Likes_Major":   -- for reliable equi-joins on Student_ID)
      { "Student_ID": "1357",
        "Student_Name": "Hannah Shelby",
        "Student_Like_Major_As_Role": "2nd Major",
        "Date_Liked": "2015_0804",
        "Student_Like_NumDays_After_Survey_Posted_Social": "4" },
    "Student_2_Likes_Major":
      { "Student_ID": "2468",
        "Student_Name": "David Bookman",
        "Student_Like_Major_As_Role": "Minor",
        "Date_Liked": "2015_0801",
        "Student_Like_NumDays_After_Survey_Posted_Social": "1" },
    …
    "Student_N_Likes_Major":
    …},
  "Major_Academic_Counselor_Current": [   -- Nested Array (no IDs; no reliable equi-joins)
    { "Counselor": "Ms. Jenny Davis, M.Ed",
      "Counselor_Specialty_Name": "Career Prep" }
  ]
}
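To show why the nested objects with IDs support equi-joins, the sketch below parses a trimmed, strictly valid version of the document (annotations and elided entries left out) and flattens each nested like into a row that carries both business keys. The function name `flatten_likes` is hypothetical.

```python
import json

document = json.loads("""
{
  "MajorID": "985",
  "MajorName": "Data Science",
  "Student_Likes_Major": {
    "Student_1_Likes_Major": {
      "Student_ID": "1357", "Student_Name": "Hannah Shelby",
      "Student_Like_Major_As_Role": "2nd Major", "Date_Liked": "2015_0804"
    },
    "Student_2_Likes_Major": {
      "Student_ID": "2468", "Student_Name": "David Bookman",
      "Student_Like_Major_As_Role": "Minor", "Date_Liked": "2015_0801"
    }
  },
  "Major_Academic_Counselor_Current": [
    {"Counselor": "Ms. Jenny Davis, M.Ed", "Counselor_Specialty_Name": "Career Prep"}
  ]
}
""")


def flatten_likes(doc):
    """Return one row per nested like, carrying the parent and child business keys."""
    return [
        {"MajorID": doc["MajorID"], **like}
        for like in doc["Student_Likes_Major"].values()
    ]


like_rows = flatten_likes(document)
```

Each resulting row can be matched on `Student_ID` or `MajorID`. The counselor array is left alone on purpose: its entries carry names only, so there is no key to join on.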
Business Scenario:
o In a Social Network Survey, a (one) university student Likes multiple combinations of 1st Majors, 2nd Majors, and Minors, but the University has not officially allowed them, nor do core OLTP systems support them.
o The Registrar OLTP and legacy 3NF EDW use a business rule that allows only Many Students [enrolled in] One Major.
Objectives:
o Build a new RDBMS Data Warehouse / Business Intelligence Solution
o With little or no modification, “Production-alize” existing NoSQL data repositories from the Social Network (which uses Cassandra and/or a JSON Document Store), and then somehow integrate that data with the above planned DW / BI for integrated analytics combining students liking major-combinations with other analytically interesting data (e.g. actual major, academic standing, credits earned, GPA) in the registrar system.
Implementation Goals:
o Assumption: A (generic) virtualization API is available (Polybase, Talend, Informatica, etc.) in which we abstract out and then visually map fields between RDBMS and NoSQL data in existing structures. Once mapped, these data sets can also be queried and joined simultaneously, for real-time analytics, or used as a semantic layer with which to subsequently move data, either way based on business requirements as they unfold.
o No ETL, no new fields in existing tables, and no new RDBMS Tables.
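The idea of mapping fields and then joining at query time can be sketched as follows. This is a stand-in for a virtualization layer and does not use the API of any product named above; the registrar table, its columns, the major codes, and the GPA values are hypothetical. Nothing is copied into the RDBMS and no table is altered.

```python
import sqlite3

# Stand-in for the existing registrar table (hypothetical columns and values).
registrar = sqlite3.connect(":memory:")
registrar.row_factory = sqlite3.Row
registrar.executescript("""
CREATE TABLE student_enrollment (student_id TEXT PRIMARY KEY, major_id TEXT, gpa REAL);
INSERT INTO student_enrollment VALUES ('1357', '412', 3.7), ('2468', '230', 3.2);
""")

# Stand-in for rows read from the NoSQL store, left in their own field names.
likes = [
    {"Student_ID": "1357", "MajorID": "985", "Student_Like_Major_As_Role": "2nd Major"},
    {"Student_ID": "2468", "MajorID": "985", "Student_Like_Major_As_Role": "Minor"},
]

# The mapping is the whole integration model: NoSQL field name -> RDBMS column name.
FIELD_MAP = {"Student_ID": "student_id"}


def liked_vs_actual(likes, conn, field_map):
    """Join each like to the registrar row with the same student key, at query time."""
    nosql_key, rdbms_key = next(iter(field_map.items()))
    query = f"SELECT major_id, gpa FROM student_enrollment WHERE {rdbms_key} = ?"
    for like in likes:
        row = conn.execute(query, (like[nosql_key],)).fetchone()
        if row is not None:
            yield {**like, "actual_major_id": row["major_id"], "gpa": row["gpa"]}


result = list(liked_vs_actual(likes, registrar, FIELD_MAP))
```

Each result row pairs a liked major with the student's actual major and GPA. If requirements later call for moving data, the same mapping states which fields correspond.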
Recommendations:
1. Differentiate short-lived vs. long-lived NoSQL data structures: For integration with RDBMS, prefer long-lived, reasonably modeled NoSQL data sets.
2. Criteria for good ‘Model-Level Integration’ of RDBMS and NoSQL:
o Keep it simple. Model for simple ‘one to many’ or ‘one to one’ equi-join relationships.
o Excessive model-level data transformation is the killer of transparency in an integration data model.
3. For NoSQL integration target files / documents / tables, insist on the equivalence of…
o 1st Normal Form: In every record, each cell holds only one value. Higher normal forms are better.
o Identifier fields (e.g. integers), as key candidates, exist and correspond to each in-scope ‘name’ attribute.
o Clearly distinguish data warehouse from data presentation layer (e.g. Star Schema), and don’t over-burden the DW itself with an analytic-requirements-driven, highly-transformed (brittle) SVOT attempt.
o Save SVOT transforms and other business rules for downstream ETL into data presentation.
o Strive for generic, loosely-coupled integration without ETL at the EDW Level.
4. Minimize data movement between RDBMS and NoSQL in order to simplify integration and reduce overhead cost.
5. Design loosely-coupled Lean Data Warehouses, rather than tightly-dependent data warehouse / mart as all-at-once attempts at the elusive SVOT, thus drawing a sharp distinction between where the lobster is caught and cooked from where it is served with wine and song to your valued customers.
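The two checks in recommendation 3 (1st Normal Form equivalence, and an identifier for each ‘name’ attribute) can be sketched as simple tests on a NoSQL record. Both helpers are hypothetical, and the second assumes a naming convention in which a field ending in `_Name` should have a sibling ending in `_ID`.

```python
def fields_violating_1nf(record):
    """Return names of fields whose value is not a single scalar (nested object or array)."""
    return [name for name, value in record.items() if isinstance(value, (dict, list))]


def names_missing_identifier(record):
    """Return '..._Name' fields with no matching '..._ID' field (naming convention assumed)."""
    missing = []
    for name in record:
        if name.endswith("_Name"):
            id_field = name[: -len("_Name")] + "_ID"
            if id_field not in record:
                missing.append(name)
    return missing


# A flat record with an identifier for its name attribute passes both checks.
like = {"Student_ID": "1357", "Student_Name": "Hannah Shelby", "Date_Liked": "2015_0804"}
# A record with only names gives a join nothing reliable to match on.
counselor = {"Counselor": "Ms. Jenny Davis, M.Ed", "Counselor_Specialty_Name": "Career Prep"}
# A record holding a nested array has a cell with more than one value.
major = {"MajorID": "985", "Major_Academic_Counselor_Current": [counselor]}

like_report = (fields_violating_1nf(like), names_missing_identifier(like))
counselor_report = names_missing_identifier(counselor)
major_report = fields_violating_1nf(major)
```

Records that fail either check are weak integration targets until they are reshaped or given identifiers.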
DecisionLab.Net
Services:
Data Warehouse / Business Intelligence Envisioning, Assessment, and Roadmap
Expert DW-BI Staff Augmentation:
Data Warehouse / Mart / Analytics Architecture, Requirements, Models and Development
Slides available now at… slideshare.net/DanielUpton/
Daniel Upton dupton@decisionlab.net
Carlsbad, CA blog: http://www.decisionlab.net phone 760.525.3268