This keynote looks at common forces and threats that cause suffering in a data warehouse. It shows examples of why the concepts are still relevant despite all the high-end technology, and offers suggestions for starting with architecture and metadata.
5. Data Warehouse Forces and Threats:
interactions that, when unopposed,
will change the motion of
technical, economic, social,
organisational and process structures
with a high potential of causing suffering
11. THREATS*
Mistakes in data
might lead not only to wrong business decisions,
but might also have
legal, financial or existential implications.
*serious and real
13. SUFFERING
▪︎ Bad consistency and no transparency
No definitions, too many definitions, obscure definitions. Vague opinions in production.
▪︎ Slow time-to-market
The time from a requirement, or from an observed change, to deployment in production is too long.
▪︎ Low performance
… despite having the best hardware, systems, algorithms.
(some of it)
14. How do we know that the data we are looking at
is the data we think we are looking at?
15. PERFORMANCE
We solved the “CPU starvation” problem!!!
Why are our [internal] clients getting data 48-72 hours later?
[Diagram: a stand-alone ETL process/script that chains many joins (⨝) and an aggregation (∑) over inputs labelled "quite big" and "quite a lot"; annotated "20-30x".]
27. STARTING WITH ARCHITECTURE
1. Pick one:
If in doubt, pick any that is known to work. Any separation of concerns is better than none.
2. Make it formal and documented.
Otherwise our effort will be dissolved and the content swampified.
3. Stick with it for a while and observe.
4. Adjust as necessary.
28. STARTING WITH METADATA
1. Pick a problem
2. Use a spreadsheet
Software at hand, no installation needed; universal, readable and editable by non-engineers.
3. Suffer through the spreadsheet-exchange drill phase
Mirror of our processes – seeing the genuine pain points will be useful later.
4. Use a functional approach to metadata composition and application
… from those spreadsheets. Example: relational algebra library in the language of our ecosystem.
99. (later) Move spreadsheets into a metadata repository
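A minimal sketch of step 4, assuming the spreadsheets are exported as CSV. The sheet names, column names and the three relational operators are hypothetical and only illustrate the idea of composing metadata with small pure functions; a real relational algebra library would replace them.

```python
import csv
import io
from typing import Callable

Row = dict[str, str]
Relation = list[Row]


def read_sheet(text: str) -> Relation:
    """Parse a spreadsheet exported as CSV text into a list of rows."""
    return list(csv.DictReader(io.StringIO(text)))


def select(rel: Relation, pred: Callable[[Row], bool]) -> Relation:
    """Keep only the rows for which the predicate holds."""
    return [row for row in rel if pred(row)]


def project(rel: Relation, cols: list[str]) -> Relation:
    """Keep only the given columns and drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(cols, key)))
    return out


def natural_join(left: Relation, right: Relation) -> Relation:
    """Join two relations on all column names they share."""
    if not left or not right:
        return []
    shared = [c for c in left[0] if c in right[0]]
    return [{**l, **r} for l in left for r in right
            if all(l[c] == r[c] for c in shared)]


# Hypothetical spreadsheet exports (assumed layout, not from the talk).
COLUMNS_SHEET = """table,column,description
customers,id,Customer identifier
customers,segment,
orders,amount,Order amount in EUR
orders,status,
"""

OWNERS_SHEET = """table,owner
customers,crm-team
orders,sales-team
"""

columns = read_sheet(COLUMNS_SHEET)
owners = read_sheet(OWNERS_SHEET)

# Compose: which owner has to document which column?
undocumented = select(columns, lambda r: r["description"] == "")
todo = project(natural_join(undocumented, owners), ["owner", "table", "column"])
for row in todo:
    print(row)
```

Because every operator returns a new relation, the same composition keeps working when the rows later come from a metadata repository instead of spreadsheets.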
29. “HELLO METADATA” PROBLEMS
▪︎ Data quality indicators¹
▪︎ Structural (model ↔ schema) consistency check¹
▪︎ Automation of common patterns
denormalisation, aggregation, pivot
▪︎ Automate “relationalization” of freely-structured data
JSON → relational
▪︎ Browsability
¹ non-invasive, non-destructive
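One of the problems above, JSON → relational, can be sketched as follows. This is an illustration under assumptions, not a method from the talk: nested objects are flattened into prefixed columns, lists become child tables, and the surrogate key `_id`, the `<parent>_id` foreign key and the table naming are all hypothetical conventions.

```python
import json
from typing import Any, Optional

Tables = dict[str, list[dict[str, Any]]]


def relationalize(obj: dict, table: str, tables: Tables,
                  parent: Optional[tuple[str, int]] = None) -> int:
    """Store one JSON object as a row of `table`; return its surrogate key."""
    rows = tables.setdefault(table, [])
    row: dict[str, Any] = {"_id": len(rows) + 1}  # surrogate key (assumption)
    if parent is not None:
        row[f"{parent[0]}_id"] = parent[1]        # foreign key to the parent row
    rows.append(row)
    _fill(row, obj, "", table, tables)
    return row["_id"]


def _fill(row: dict, obj: dict, prefix: str, table: str, tables: Tables) -> None:
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            # Nested object: flatten into prefixed columns of the same row.
            _fill(row, value, name + "_", table, tables)
        elif isinstance(value, list):
            # List: one child-table row per item, linked back to this row.
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                relationalize(child, f"{table}_{name}", tables, (table, row["_id"]))
        else:
            row[name] = value


raw = """
{"id": "A-1",
 "customer": {"name": "Ada", "city": "Berlin"},
 "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]}
"""
tables: Tables = {}
relationalize(json.loads(raw), "orders", tables)
for name, rows in tables.items():
    print(name, rows)
```

The input document is only read, never modified, which keeps the step non-destructive.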
30. Pipelines without metadata vs. pipelines with metadata
[Diagram: two pipelines, each a chain of steps "Doing Things To Data" → "Doing More Things To Data" → …; the pipeline with metadata shows a "metadata" flow in addition to the "data" flow.]
31. DATA QUALITY INDICATORS
[Diagram: a pipeline "Doing Things To Data" → "Doing More Things To Data" → … with data and metadata flows; data quality measurements and data quality indicators are attached to the pipeline.]
Indicator metadata: definition, computation, warning/error thresholds, ownership, affected business entity, …
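The indicator metadata listed above could be captured as plain records. The following is a sketch under assumptions: the `Indicator` class, the `null_ratio` helper, the threshold values and the sample rows are hypothetical and only show how definition, computation, thresholds, ownership and business entity might sit together.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Indicator:
    name: str
    definition: str                            # human-readable definition
    compute: Callable[[list[dict]], float]     # computation of the measurement
    warning: float                             # warning threshold
    error: float                               # error threshold
    owner: str
    business_entity: str

    def evaluate(self, rows: list[dict]) -> tuple[float, str]:
        """Measure the data (read-only) and classify the result."""
        value = self.compute(rows)
        if value >= self.error:
            return value, "error"
        if value >= self.warning:
            return value, "warning"
        return value, "ok"


def null_ratio(column: str) -> Callable[[list[dict]], float]:
    """Build a measurement: share of rows where `column` is missing."""
    def compute(rows: list[dict]) -> float:
        if not rows:
            return 0.0
        return sum(1 for r in rows if r.get(column) is None) / len(rows)
    return compute


# Hypothetical indicator and sample data.
missing_email = Indicator(
    name="customer_email_missing",
    definition="Share of customer rows without an e-mail address",
    compute=null_ratio("email"),
    warning=0.05,
    error=0.20,
    owner="crm-team",
    business_entity="Customer",
)

customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]

value, status = missing_email.evaluate(customers)
print(missing_email.name, value, status, "owner:", missing_email.owner)
```

The measurement only reads the rows, in line with the non-invasive, non-destructive requirement.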
32. COMMON PATTERNS
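[Diagram: two ways of producing aggregates (∑).
Automatically generated artefacts: metadata plus patterns (denormalize, aggregate, pivot) produce the ∑ artefacts; controlled growth.
Manually crafted artefacts: several separate "IS ∑" scripts, "probably the same, who knows?"; uncontrolled growth.]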
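A minimal sketch of the "aggregate" pattern generated from metadata, using SQLite only so the example runs anywhere. The spec layout, the table names and the `aggregate_sql` function are assumptions for illustration; identifiers are taken from trusted metadata and are not escaped.

```python
import sqlite3


def aggregate_sql(spec: dict) -> str:
    """Generate a CREATE TABLE ... AS SELECT statement from one metadata entry."""
    dims = ", ".join(spec["group_by"])
    measures = ", ".join(
        f"{fn}({col}) AS {alias}" for alias, (fn, col) in spec["measures"].items()
    )
    return (f"CREATE TABLE {spec['target']} AS "
            f"SELECT {dims}, {measures} FROM {spec['source']} GROUP BY {dims}")


# Hypothetical metadata: every aggregate artefact is one entry, not one script.
AGGREGATES = [
    {"target": "sales_by_region", "source": "sales", "group_by": ["region"],
     "measures": {"revenue": ("SUM", "amount"), "orders": ("COUNT", "*")}},
    {"target": "sales_by_region_year", "source": "sales",
     "group_by": ["region", "year"],
     "measures": {"revenue": ("SUM", "amount")}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("EU", 2010, 10.0), ("EU", 2011, 5.0), ("US", 2010, 7.0)])

for spec in AGGREGATES:
    conn.execute(aggregate_sql(spec))

by_region = conn.execute(
    "SELECT region, revenue, orders FROM sales_by_region ORDER BY region"
).fetchall()
print(by_region)
```

Adding an aggregate then means adding a metadata entry, so all artefacts are built by the same pattern and can be listed, regenerated or dropped together.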
33. VISUALISATION AND EXPLORATION
Browse-ability: How can we explore a metric? How can we drill down?
[Diagram: three layers: User Interface, Metadata, Physical Data.
Metadata tables: Cubes (id, name: Sales, Revenue, Visits, …); Dimensions (id, name: Date, Geography, …); Levels (id, dim, name, key, label) with the levels Region (region_code, region_name), Country (country_iso, country_name) and City (city_code, city_name).
Physical data: tables regions, countries, cities; annotated "which column?".
User interface: a generated drill-down path Europe → Germany → Berlin.
Caption: concept-to-user propagation.]
34. GET /cube/sales/aggregate?cut=date:2010&split=status:1&drilldown=date|region&page=10&page_size=100
→ SQL
Cubes 1.1 – SQL Query Construction
[Diagram: construction of a SQL query from metadata.
Input: cube, store, locale, parameters; metadata with the logical model and the physical mappings (mappings of base attributes, fact table, joins, naming convention, hierarchies).
Mapper: map attributes (all attributes → base attributes → mappings).
Create schema: make star (star or snowflake schema) from the database metadata, the fact table and the ⨝ joins; yields columns.
Compile attributes: collect and sort dependencies (topological sort) of base attributes and dependent attributes.
Create context: the SQL Query Context holds base columns and column expressions for attributes.
Query attributes (A, B, C?): column expressions feed SELECT and GROUP BY, the “star” join statement feeds FROM, conditions feed WHERE.
Output: SQL against the physical data store.]
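The attribute compilation step (collect and sort dependencies, then build column expressions) can be sketched as below. This is not the Cubes implementation: the mappings, the template syntax for derived attributes, the star join and the function names are all assumptions chosen to keep the example small and runnable on SQLite.

```python
import sqlite3
from graphlib import TopologicalSorter
from string import Formatter

# Hypothetical logical → physical mappings of base attributes.
BASE = {"amount": "f.amount", "discount": "f.discount",
        "region": "f.region", "year": "d.year"}
# Hypothetical derived attributes, written as templates over other attributes.
DERIVED = {"vat": "{net_amount} * 0.19", "net_amount": "{amount} - {discount}"}
STAR_JOIN = "fact_sales AS f JOIN dim_date AS d ON d.id = f.date_id"


def dependencies(template: str) -> set[str]:
    """Attribute names referenced by a derived-attribute template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def compile_attributes(base: dict, derived: dict) -> dict:
    """Resolve derived attributes to column expressions in dependency order."""
    graph = {name: dependencies(tpl) for name, tpl in derived.items()}
    columns = dict(base)
    for name in TopologicalSorter(graph).static_order():
        if name in derived:  # base attributes are already mapped
            columns[name] = "(" + derived[name].format(**columns) + ")"
    return columns


def aggregate_query(columns: dict, measure: str, drilldown: list, cut: dict):
    """Assemble SELECT / FROM / WHERE / GROUP BY from column expressions."""
    select = [f"{columns[d]} AS {d}" for d in drilldown]
    select.append(f"SUM({columns[measure]}) AS {measure}_sum")
    sql = f"SELECT {', '.join(select)} FROM {STAR_JOIN}"
    if cut:
        sql += " WHERE " + " AND ".join(f"{columns[a]} = ?" for a in cut)
    if drilldown:
        sql += " GROUP BY " + ", ".join(columns[d] for d in drilldown)
    return sql, list(cut.values())


columns = compile_attributes(BASE, DERIVED)
sql, params = aggregate_query(columns, "net_amount", ["year", "region"], {"year": 2010})
print(sql)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_date (id INTEGER, year INTEGER)")
conn.execute("CREATE TABLE fact_sales (date_id INTEGER, region TEXT, amount REAL, discount REAL)")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, 2010), (2, 2011)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, "EU", 100.0, 10.0), (1, "EU", 50.0, 0.0),
                  (1, "US", 20.0, 5.0), (2, "EU", 70.0, 0.0)])
result = sorted(conn.execute(sql, params).fetchall())
print(result)
```

The topological sort guarantees that `net_amount` is compiled before `vat`, regardless of the order in which the derived attributes were declared.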
35. TRANSPARENT REPRESENTATIONS
[Diagram: Physical data store(s) hold the 3rd normal form data (source of truth) and pre-aggregated tables (∑; derived and managed artefacts), both described by metadata in the metadata repository. A query server with a ∑ aggregator answers a request such as "past 12 months?"; ⨝s are expensive.
Alternative artefacts: a multi-dimensional data store.]
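One way to make such representations transparent to the user is to let the query server pick an artefact from metadata. The sketch below assumes that each representation is registered with the attributes it can answer and an approximate row count; the class, the table names and the sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    table: str
    attributes: frozenset   # attributes this artefact can group or filter by
    approx_rows: int        # rough size, maintained in the metadata


# Hypothetical registry: one source of truth plus managed, derived artefacts.
REPRESENTATIONS = [
    Representation("sales_3nf", frozenset({"day", "month", "region", "product"}), 1_000_000_000),
    Representation("agg_sales_month_region", frozenset({"month", "region"}), 50_000),
    Representation("agg_sales_month", frozenset({"month"}), 120),
]


def choose(representations: list, needed: set) -> Representation:
    """Pick the smallest artefact that covers all requested attributes."""
    candidates = [r for r in representations if needed <= r.attributes]
    if not candidates:
        raise LookupError(f"no representation covers {sorted(needed)}")
    return min(candidates, key=lambda r: r.approx_rows)


for request in ({"month"}, {"month", "region"}, {"product"}):
    print(sorted(request), "->", choose(REPRESENTATIONS, request).table)
```

The caller asks for attributes only; whether the answer comes from the normalised store or from a pre-aggregated table is decided by the metadata.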
37. SHIELD AGAINST FORCES AND THREATS
▪︎ Change
▪︎ Growth (structural)
▪︎ Complexity
▪︎ Threats
financial, legal, existential
38. Force/Threat        | Architecture                           | Metadata
    Change              | separation of concerns                 | abstraction, generalisation
    Growth (structural) | separation of concerns, modularity     | optimisation through better reasoning
    Complexity          | separation of concerns, destroy-ability | reduction of problem-space, coping with heterogeneity
    Threats             | transparency, separation of quality    | data accounting, verifiable data quality, provable consistency, source of truth
39. There is a path out of the suffering caused by the data warehouse
forces and threats:
the “shield” of architecture and metadata.