1. Physical Database Design for MPP and Columnar Databases
Geoffrey Clark
Principal at Lucidata, Inc.
September 2013
Copyright, Lucidata, 2013
2. Conceptual, Logical, Physical
• Conceptual links to Business Strategy.
– This is now becoming more quantitative
• Logical maps to the Business Semantics.
– Con-way example
• Physical maps to your Data Stores
– These will be more varied and heterogeneous in
the future, due to specialization.
3. HBR Business Strategy
The New Dynamics of Competition, Michael D. Ryall, Harvard Business Review, June 2013
Michael Porter’s Five Forces has dominated strategic and competitive analysis since 1979. This analysis has largely been conceptual in nature.
Quantitative analysis on structured data in context is changing the nature of business culture, and improving business decisions.
This drives the demand for data modeling and management.
4. Design and Evolution
• Hierarchies
– 14th Century Europe and the Financial Revolution
– Aggregations & Allocations
• Cards, Tapes – physical analog media
• Computer Science
– Moore’s Law
• Processor Speed Improvements
• Memory Improvements
• Media Improvements – Punch Cards, Tape, Disk, Memory
• Design for Context & the Future
– Character encoding - Internationalization
– Calendars – Gregorian, Fiscal, Lunar, ... Y2K?
• Files and Fields
– Separation of Data and Metadata
– Modern versions -> XML, JSON
• Joins!
– Data Sets – Super types, Sub types
– Associations describe Networks!
7. Separation of Church and State
• Operational uses
– Capture the data, hand-entered <- validation
– A Data Flow, such as Order to Cash cycle
– Con-way example of PRO(-gressive) numbers
• Analytical uses
– Desire for reports: reporting crashes the Operational cycle, causing a Cash flow problem.
– Banished from OLTP, go make an ODS
8. The Star Schema
The purpose of business computers is to sort data. A graphical
representation of sorted data is called a ‘Star Schema’.
– Michael Silves, Principal at Datamorphosis
• The right design at the right time, becomes default doctrine for DW
– Early RDBMS (Relational Data Base Management Systems)
• Low memory, slow disks, slow CPU
• Big Demand, with questions that spanned the datasets
• Performance issues over large datasets
– Interview Business people to get questions
• Pre-process the data, based on business questions
– Separation into Dimensions and Facts/Metrics
• Link to Business Semantics
• OLAP (On-Line Analytical Processing)
• Educate Users on Aggregation and Allocation
• Conformed Dimensions across Departments to give an Enterprise-wide view of the data.
• But as technology changes, problems emerge
– Ad-hoc questions require redesign & rework
– Business hierarchies where one concept is both a fact & a dimension, e.g. Shipment
– Fact tables become difficult to distribute for MPP ... e.g. Teradata prefers a normalized DW
• Example – transportation networks
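The separation into Dimensions and Facts/Metrics can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and are not taken from the presentation; the point is only that the fact table carries the metric and the dimension tables carry the attributes users aggregate by.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, fiscal_quarter TEXT);
CREATE TABLE fact_shipment (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    revenue REAL
);
INSERT INTO dim_customer VALUES (1, 'West'), (2, 'East');
INSERT INTO dim_date VALUES (20130701, 'Q3'), (20131001, 'Q4');
INSERT INTO fact_shipment VALUES
    (1, 20130701, 100.0), (1, 20131001, 50.0), (2, 20130701, 25.0);
""")

# Aggregate the fact (revenue) along a dimension attribute (region).
revenue_by_region = dict(conn.execute("""
    SELECT c.region, SUM(f.revenue)
    FROM fact_shipment f JOIN dim_customer c USING (customer_key)
    GROUP BY c.region
""").fetchall())
```

A question that was anticipated at design time (revenue by region) is a single join and group-by; a question that needs an attribute outside these dimensions is where the redesign and rework noted above would come in.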
9. Example – Multi-Modal Freight
• Shipments are agreements between a Carrier and a
Shipper to move goods between two places.
• Shipments can be split into “ProFreight” (which is
assigned a cost via activity-based costing).
• Shipments/ProFreight are composed of Freight
handling units.
• Freight can be “re-tendered” to another carrier, in
which case it is linked to the original and the new
Shipment.
• Freight moves between places on one or many “VFCs”
or Containers.
• Containers are moved between places on Trips.
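One way to picture the re-tendering rule is a normalized association table between Shipments and Freight. The sketch below uses Python's built-in sqlite3 module; the table names, identifiers, and carrier names are hypothetical, and only the Shipment-to-Freight relationship is modeled (ProFreight, Containers, and Trips are left out for brevity).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipment (shipment_id TEXT PRIMARY KEY, carrier TEXT, shipper TEXT);
CREATE TABLE freight (freight_id TEXT PRIMARY KEY);
-- Association table: one freight unit may belong to several shipments.
CREATE TABLE shipment_freight (
    shipment_id TEXT REFERENCES shipment(shipment_id),
    freight_id TEXT REFERENCES freight(freight_id),
    PRIMARY KEY (shipment_id, freight_id)
);
INSERT INTO shipment VALUES
    ('S1', 'CarrierA', 'ShipperX'), ('S2', 'CarrierB', 'ShipperX');
INSERT INTO freight VALUES ('F1'), ('F2');
-- F1 was re-tendered from S1 to S2, so it is linked to both shipments.
INSERT INTO shipment_freight VALUES ('S1', 'F1'), ('S2', 'F1'), ('S1', 'F2');
""")

shipments_for_f1 = [row[0] for row in conn.execute(
    "SELECT shipment_id FROM shipment_freight "
    "WHERE freight_id = 'F1' ORDER BY shipment_id"
)]
```

Because F1 appears under two Shipments, the relationship is many-to-many, which is the shape that a single-parent dimension hierarchy has trouble expressing.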
13. Dim Modeling Dogma
• “Our carefully normalized data model cannot
be translated into a star schema...”
– Dimensional modeling is necessary in order to
generate correct queries
– Any (normalized) data model can be transformed
into a dimensional model...
– ... and there exists an algorithm to do it
16. Bridge table
(remember, we tried this)
We tried this with hesmith. When selecting a main hierarchy, it has too much of a downside, and you don’t have a weight factor …
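The weight-factor problem can be shown with a small sketch in plain Python. The freight costs, identifiers, and weights below are hypothetical; the sketch only demonstrates that summing a measure across a many-to-many bridge double counts unless each bridge row carries an allocation weight.

```python
# Hypothetical data: cost per freight unit, and a bridge from freight to shipments.
freight_cost = {"F1": 100.0, "F2": 40.0}

# Bridge rows are (freight_id, shipment_id, weight); weights per freight sum to 1.0.
bridge = [("F1", "S1", 0.5), ("F1", "S2", 0.5), ("F2", "S1", 1.0)]

def total_cost(use_weights):
    """Sum freight cost across the bridge, with or without allocation weights."""
    return sum(
        freight_cost[freight_id] * (weight if use_weights else 1.0)
        for freight_id, _shipment_id, weight in bridge
    )

unweighted_total = total_cost(use_weights=False)  # F1 is counted under both shipments
weighted_total = total_cost(use_weights=True)     # F1 is split between its shipments
true_total = sum(freight_cost.values())
```

Only the weighted total matches the true total. When the business cannot supply a defensible weight, the bridge table cannot give correct roll-ups.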
22. Information Factory & MPP
• Normalized Base
– Integrate data once
• Source -> Normalized -> Denormalized -> OK
• Source -> Denormalized? -> Un-normalized -> ?
– Detect problems and fix them once!
• Does not preclude Data Marts
• Massive Parallel Processing
– Data distribution
• Optimizations – Broadcast, Co-location, Re-distribution
• Scalability, the quest for 1:1
• Normalized data - reduced IO, better match for
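Co-location, one of the distribution optimizations listed above, can be sketched in plain Python. The node count, the modulo placement rule, and the sample tables are assumptions made for illustration; a real MPP database uses its own hash function and storage layer.

```python
NODES = 4  # hypothetical cluster size

def node_for(key):
    # Simple modulo placement on an integer key; real systems apply a hash function.
    return key % NODES

def distribute(rows, key_index):
    """Place each row on the node chosen by its distribution key."""
    nodes = [[] for _ in range(NODES)]
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes

# Hypothetical tables: (shipment_id, carrier) and (freight_id, shipment_id).
shipments = [(sid, f"carrier{sid % 3}") for sid in range(1, 9)]
freight = [(fid, fid % 8 + 1) for fid in range(100, 116)]

shipments_by_node = distribute(shipments, 0)
freight_by_node = distribute(freight, 1)  # same key as shipments, so co-located

# Every freight row sits on the node holding its shipment: the join is node-local.
colocated = all(
    any(s[0] == f[1] for s in shipments_by_node[n])
    for n in range(NODES)
    for f in freight_by_node[n]
)
rows_per_node = [len(node) for node in shipments_by_node]
```

If freight were instead distributed on freight_id, the join would require either re-distributing one table on shipment_id or broadcasting the smaller table to every node. The even rows_per_node count is the kind of balance that scalability depends on.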
28. Cubes and In-memory BI
• Multi-Dimensional OLAP (MOLAP)
– Drag-and-Drop OLAP environment, analysts
become capable of self-service.
– Dealt with Ragged Hierarchies, common in
Financial data such as General Ledger (GL)
– Limited by memory size
– Pressure for more dimensionality floods cube size;
build times from relational sources exceed load
windows ...
• Relational OLAP (ROLAP)
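The way added dimensionality floods a MOLAP cube can be estimated with simple arithmetic: the number of potential cells is the product of the dimension cardinalities. The dimension names and sizes below are hypothetical, and sparse cubes store fewer cells than this upper bound.

```python
from math import prod

# Hypothetical dimension cardinalities for a cube.
cardinalities = {"customer": 10_000, "product": 500, "day": 365}

# Upper bound on cells: the product of all dimension sizes.
potential_cells = prod(cardinalities.values())

# Adding one more dimension with 2,000 members multiplies the bound by 2,000.
potential_cells_with_lane = potential_cells * 2_000
```

Each new dimension multiplies the bound rather than adding to it, which is why memory limits and load windows are reached quickly.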
29. But a network this size choked it
30. Columnar vs Row-wise
• Physically store data by Column vs Row
– Rather like Fifth Normal Form.
– If Semantically Organized, then Rapid Response to
user’s ad-hoc aggregation requests.
– Prefers batch loading, always loads once per
column, even if loading one row.
• Continues to Appear and Operate as a normal
Row-wise cousin.
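A toy model of the two layouts, in plain Python, may help. The rows, column names, and helper functions are hypothetical; actual columnar engines add compression, encoding, and on-disk block management that this sketch ignores.

```python
# Hypothetical shipment rows: (shipment_id, carrier, revenue).
rows = [
    ("S1", "CarrierA", 120.0),
    ("S2", "CarrierB", 80.0),
    ("S3", "CarrierA", 200.0),
]
COLUMN_NAMES = ("shipment_id", "carrier", "revenue")

# Columnar layout: one contiguous list per column instead of one tuple per row.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}

# An ad-hoc aggregate reads a single column and skips the others.
total_revenue = sum(columns["revenue"])

def insert_row(row):
    # Loading even one row touches every column once.
    for name, value in zip(COLUMN_NAMES, row):
        columns[name].append(value)

def fetch_row(position):
    # The store can still present data row-wise by stitching columns together.
    return tuple(columns[name][position] for name in COLUMN_NAMES)

insert_row(("S4", "CarrierB", 50.0))
```

The single-row insert writes to three lists, which is why batch loading suits this layout, while fetch_row shows how the store can keep appearing row-wise to its users.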
31. Columnar IO example
Compression becomes much more effective
Reading a Column is like reading a Row
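One reason compression becomes more effective is that the values in a single column tend to repeat, especially when sorted. The sketch below uses run-length encoding as a stand-in; the presentation does not say which compression scheme is meant, and the carrier values are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# A sorted carrier column: long runs of identical values.
carrier_column = ["CarrierA"] * 4 + ["CarrierB"] * 3
encoded_column = run_length_encode(carrier_column)

# The same values interleaved with other fields, as in a row-wise layout.
row_wise_stream = []
for i, carrier in enumerate(carrier_column):
    row_wise_stream.extend([f"S{i}", carrier])
encoded_rows = run_length_encode(row_wise_stream)
```

The column collapses from seven values to two runs, while the interleaved row-wise stream has no consecutive repeats and does not shrink at all.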
32. Design Pattern for Log Data
• Data Stewards for Master Data
• Data Stewards for Metadata
• Architects integrate data and metadata
• Architects organize data for analysis with physical in mind
• Architects identify levels for analysis, and distribution
• Columnar
• MPP
37. Hadoop (Cloudera & Hortonworks)
“Although it’s true that Hadoop can be valuable as an analytic silo, most
organizations will prefer to get the most business value out of Hadoop by
integrating it with—or into—their BI, DW, DI, and analytics technology
stacks.” – Philip Russom TDWI
http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx
38. Hadoop for Analytics?
Analytics performs best on Structured Data, for good reasons.
Maintain MPP strengths in the solution through Architecture.
39. Message from Hortonworks (Hadoop)
“Although it’s true that Hadoop can be valuable as an analytic silo, most
organizations will prefer to get the most business value out of Hadoop by
integrating it with—or into—their BI, DW, DI, and analytics technology
stacks.” – Philip Russom TDWI
http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx