How to build a data warehouse - code.talks 2014
1. Martin Loetzsch
How to Build a Data Warehouse?
Project A Ventures, Berlin
http://project-a.com
http://twitter.com/martin-loetzsch
2. The “typical startup”
‣ Has data in
• application database
• Excel & csv files
• external tools
‣ Excel based reporting chains
• manual sql queries, CSVs
• copy & paste from external data sources
• difficult to debug and test
• sometimes cranky
‣ Everybody pulls their own numbers. # Orders?
-- count rows
SELECT count(*) FROM orders;

-- count everything except test orders
SELECT count(*) FROM orders
WHERE is_test IS NULL;

-- count everything that was once paid
SELECT count(*) FROM orders
JOIN order_history ON order_fk = order_id
WHERE status_id = 17;
‣ Does not have “big data”
‣ Will not have “big data” in the relevant future
If Excel works for your company, stick to it
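To see why everybody ends up with a different number of orders, here is a minimal sketch that runs the three queries from this slide against an in-memory SQLite database. Only the table and column names come from the slide; the rows are invented toy data, and reading status_id = 17 as "paid" follows the slide's comment.

```python
import sqlite3


def count_orders_three_ways():
    """Run the slide's three '# Orders' queries on hypothetical toy data."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, is_test INTEGER);
        CREATE TABLE order_history (order_fk INTEGER, status_id INTEGER);
        -- toy data (assumption): four orders, order 3 is a test order
        INSERT INTO orders VALUES (1, NULL), (2, NULL), (3, 1), (4, NULL);
        -- toy data (assumption): orders 1 and 4 reached status 17 once
        INSERT INTO order_history VALUES (1, 17), (2, 5), (4, 3), (4, 17);
    """)
    queries = {
        "all rows": "SELECT count(*) FROM orders",
        "without test orders":
            "SELECT count(*) FROM orders WHERE is_test IS NULL",
        "once paid":
            "SELECT count(*) FROM orders "
            "JOIN order_history ON order_fk = order_id "
            "WHERE status_id = 17",
    }
    result = {name: db.execute(sql).fetchone()[0]
              for name, sql in queries.items()}
    db.close()
    return result


if __name__ == "__main__":
    for name, number in count_orders_three_ways().items():
        print(f"{name}: {number}")
```

All three queries are defensible answers to "how many orders do we have?", which is exactly the problem a single agreed definition is meant to remove.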
3. Data driven growth requires integrated data
‣ Integrated data = Data Warehouse
Integrate data!
‣ Data in the Data Warehouse is
• the single point of truth
• cleaned up & validated
• easy to access
• embedded in the organisation
‣ Connect data from different domains
[Diagram: data sources (application databases, csv files, json files, apis) flow into the DWH (orders, users, products, stocks, prices, emails, clicks, …), which serves reporting, marketing, crm, search, pricing, …]
4. How to build a Data Warehouse?
‣ 1. Use a BI solution from one of the big vendors
• classic agency business
• takes forever in startup time
• usually too expensive
‣ 2. Use a cloud based DWH solution
• covers only 80% of your business questions
• usually not possible to extend
‣ 3. Build your own, it’s easy!
• with technology that existed in the 1990s
• simple ETL scripts running inside PostgreSQL
• open source Pentaho Mondrian as query processor
• own lightweight reporting frontend
• integrated in own shop system
‣ Keep it simple & pragmatic
‣ Don’t use big data technologies if you don’t have big data
Invest in own BI infrastructure
5. Basis of any Data Warehouse: fact tables
item_id  order_id  has_voucher  price  day    product
1        1         NULL         20     09-30  Cat
2        1         NULL         10     09-30  Dog
3        2         2            20     09-30  Cat
4        3         NULL         30     09-30  Cow
5        4         4            10     10-01  Dog
6        4         4            30     10-01  Cow
Works with Excel, SQL frontends, Elasticsearch, Mondrian & other BI front ends
‣ KPIs: aggregations on single columns
# Sold items: count(item_id)
# Orders: distinct-count(order_id)
# Orders with vouchers: distinct-count(has_voucher)
Revenue: sum(price)
Avg product price: avg(price)
‣ All time orders?
SELECT count(distinct order_id) FROM order_item;
‣ Revenue October 1st?
SELECT sum(price) FROM order_item WHERE day = '10-01';
‣ Sales by product?
SELECT product, count(item_id) FROM order_item GROUP BY product;
‣ Allowed query operations
• Aggregations (count, distinct-count, sum, avg)
• Filtering
• Grouping
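The KPI definitions on this slide can be checked directly against the six example rows. The following is a minimal sketch using SQLite from the Python standard library; it assumes that an empty has_voucher cell means NULL, which is what makes distinct-count(has_voucher) count only orders with vouchers. The function names are illustrative.

```python
import sqlite3

# Rows copied from the slide:
# item_id, order_id, has_voucher, price, day, product.
# Assumption: an empty has_voucher cell is NULL (None).
ORDER_ITEMS = [
    (1, 1, None, 20, "09-30", "Cat"),
    (2, 1, None, 10, "09-30", "Dog"),
    (3, 2, 2, 20, "09-30", "Cat"),
    (4, 3, None, 30, "09-30", "Cow"),
    (5, 4, 4, 10, "10-01", "Dog"),
    (6, 4, 4, 30, "10-01", "Cow"),
]


def load_fact_table():
    """Create the order_item fact table in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE order_item (item_id INTEGER, order_id INTEGER, "
        "has_voucher INTEGER, price INTEGER, day TEXT, product TEXT)"
    )
    db.executemany(
        "INSERT INTO order_item VALUES (?, ?, ?, ?, ?, ?)", ORDER_ITEMS)
    return db


def kpis(db, where="1 = 1"):
    """Compute all KPIs with the same aggregates, optionally filtered."""
    row = db.execute(
        "SELECT count(item_id), count(DISTINCT order_id), "
        "count(DISTINCT has_voucher), sum(price), avg(price) "
        f"FROM order_item WHERE {where}"
    ).fetchone()
    names = ("sold_items", "orders", "orders_with_vouchers",
             "revenue", "avg_product_price")
    return dict(zip(names, row))


def sales_by_product(db):
    """Grouping: number of sold items per product."""
    return dict(db.execute(
        "SELECT product, count(item_id) FROM order_item GROUP BY product"))


if __name__ == "__main__":
    db = load_fact_table()
    print(kpis(db))                    # all time
    print(kpis(db, "day = '10-01'"))   # filtered to October 1st
    print(sales_by_product(db))
```

Filtering and grouping only change which rows are aggregated; the aggregate behind each KPI stays the same.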
6. Dimensional modelling
‣ Move redundant categorical data to “dimension” tables
order_item (item_id, order_id, has_voucher, price, day_fk, product_fk)
day (day_id, day_name, month_id, month_name)
product (product_id, product_name)

order_item
item_id  order_id  has_voucher  price  day_fk  product_fk
1        1         NULL         20     930     1
2        1         NULL         10     930     2
3        2         2            20     930     1
4        3         NULL         30     930     3
5        4         4            10     1001    2
6        4         4            30     1001    3

day
day_id  day_name  month_id  month_name
930     09-30     9         Sep
1001    10-01     10        Oct

product
product_id  product_name
1           Cat
2           Dog
3           Cow
Key challenge: finding good keys
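As a rough illustration of this step, the sketch below splits the flat fact rows into a fact table plus day and product dimensions. The key scheme is inferred from the example values only (day_id = month * 100 + day of month, product ids in order of first appearance); the slides do not state it, and a real warehouse would at least include the year in the day key.

```python
import calendar

# Flat fact rows from the previous slide:
# item_id, order_id, has_voucher, price, day, product
FLAT_ROWS = [
    (1, 1, None, 20, "09-30", "Cat"),
    (2, 1, None, 10, "09-30", "Dog"),
    (3, 2, 2, 20, "09-30", "Cat"),
    (4, 3, None, 30, "09-30", "Cow"),
    (5, 4, 4, 10, "10-01", "Dog"),
    (6, 4, 4, 30, "10-01", "Cow"),
]


def to_star_schema(flat_rows):
    """Replace the day and product columns with foreign keys.

    Returns (facts, days, products), where days maps
    day_id -> (day_name, month_id, month_name) and products maps
    product_name -> product_id.
    """
    products = {}
    days = {}
    facts = []
    for item_id, order_id, has_voucher, price, day, product in flat_rows:
        # Assumed key scheme: ids are handed out in order of first appearance.
        product_id = products.setdefault(product, len(products) + 1)
        # Assumed key scheme: '09-30' -> 930, '10-01' -> 1001.
        month, day_of_month = (int(part) for part in day.split("-"))
        day_id = month * 100 + day_of_month
        days[day_id] = (day, month, calendar.month_abbr[month])
        facts.append(
            (item_id, order_id, has_voucher, price, day_id, product_id))
    return facts, days, products


if __name__ == "__main__":
    facts, days, products = to_star_schema(FLAT_ROWS)
    for fact in facts:
        print(fact)
    print(days)
    print(products)
```

The fact table now carries only numbers and keys, while names such as "Sep" or "Cat" live once in their dimension table.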
10. Data integration
‣ Visual ETL tools
• many data source connectors
• hard to debug
• slow to change
Optimize for change speed!
‣ Start with simple sql queries & batch scripts
cat create-tables.sql | psql dwh

cat load-order.sql \
  | mysql --skip-column-names source_db \
  | psql dwh --command="COPY tmp.order FROM STDIN WITH NULL AS 'NULL'"

cat /data/payment.csv \
  | python payment_filter.py \
  | psql dwh --command="COPY tmp.payment FROM STDIN"

cat transform-order.sql | psql dwh
‣ Later build something more robust
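The slides do not show payment_filter.py, so the following is only a hypothetical sketch of what such a filter could look like: it reads CSV from stdin, drops rows that do not have the expected number of columns, and writes PostgreSQL's default COPY text format (tab-separated, \N for NULL) to stdout. The column layout, the header row and the rule that empty cells become NULL are all assumptions.

```python
import csv
import sys


def to_copy_field(value):
    """Encode one cell for PostgreSQL's COPY text format."""
    if value == "":
        return r"\N"  # assumption: empty cells are loaded as NULL
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def filter_payments(lines, expected_columns=3):
    """Yield COPY-ready lines for well-formed rows, skipping the header.

    expected_columns is a hypothetical layout
    (payment_id, order_id, amount).
    """
    reader = csv.reader(lines)
    next(reader, None)  # assumption: the first row is a header
    for row in reader:
        if len(row) != expected_columns:
            continue  # drop malformed rows instead of failing the load
        yield "\t".join(to_copy_field(cell) for cell in row)


if __name__ == "__main__":
    for line in filter_payments(sys.stdin):
        sys.stdout.write(line + "\n")
```

Keeping the filter a plain stdin-to-stdout script is what lets it sit in the middle of the shell pipeline and be tested on its own with a small sample file.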
11. Data integration in Yves & Zed
‣ Jobs = processing steps with dependencies
• parallel execution with cost based scheduler
• robust, transparent, no black boxes
‣ Parallel jobs & incremental processing
‣ Extensive visualisations & monitoring tools
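The idea of jobs with dependencies can be sketched with the standard library's graphlib module (Python 3.9+). This is not the Yves & Zed scheduler and has no cost model; it only groups jobs into batches whose members could run in parallel. The job names are taken from the process file on the next slide, and only the dependencies visible there are used.

```python
from graphlib import TopologicalSorter

# job -> jobs it depends on. Only the three dependencies of cleanse-order
# that are visible on the slide are listed; the real process has more jobs.
JOBS = {
    "load-order": [],
    "cleanse-order": ["cleanse-member", "load-order-item", "load-product"],
}


def execution_batches(dependencies):
    """Group jobs into batches; jobs within a batch do not depend on
    each other and could therefore run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(JOBS), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

A cost based scheduler would additionally order the ready jobs, for example by their expected runtime, instead of alphabetically as done here.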
12. Plain text files
‣ Very git-friendly
<?xml version="1.0" encoding="UTF-8"?>
<process xmlns="http://project-a.com/dwh-process"
         id="operational-data" ..>

  <initial-job id="initialize-schemas">
    <description>Recreates schemas and writes configs</description>
    <commands>
      ..
    </commands>
  </initial-job>

  <!-- Orders -->
  <job id="load-order">
    <description>Loads orders into tmp.order</description>
    <commands>
      <execute-sql-file file-name="orders/create-order-tmp-table.sql" echo-queries="true"/>
      <load-from-mysql file-name="orders/load-order.sql"
                       target-table="tmp.order" database="app"
                       timezone="UTC"/>
      <execute-sql>SELECT tmp.index_tmp_order();</execute-sql>
    </commands>
  </job>

  <job id="cleanse-order">
    <description>Deletes test orders and other invalid orders</description>
    <dependencies>
      <dependency job="cleanse-member"/>
      <dependency job="load-order-item"/>
      <dependency job="load-product"/>
13. MDX = query language for multidimensional data
‣ Developed by Microsoft as part of Analysis Services
• http://en.wikipedia.org/wiki/MultiDimensional_eXpressions
Each KPI is always computed in the same way
SELECT
  TopCount([Product].[Product].Members, 2,
           [Measures].[Revenue])
  ON COLUMNS,
  [Measures].[Revenue]
  ON ROWS
FROM [Pet sales]
WHERE [Date].[Month].[Oct]

SELECT [Date].[Month].Members
  ON COLUMNS,
  CrossJoin({[Measures].[Sold items],
             [Measures].[# Orders],
             [Measures].[Revenue]},
            Descendants([Product].[All products]))
  ON ROWS
FROM [Pet sales]
14. Mondrian = engine for executing MDX
‣ Open source analytics processor
• http://mondrian.pentaho.com
• http://en.wikipedia.org/wiki/Mondrian_OLAP_server
• In Java
• Eclipse Public License
• Active community
• https://github.com/pentaho/mondrian/
‣ Part of Pentaho BI platform
[Book cover: “Open source business analytics”, William D. Back, Nicholas Goodman, Julian Hyde (Manning)]
15. Mondrian schema I
‣ The relation between fact tables and dimension tables is defined in an XML file
<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>

  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"
             type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
      <Level name="Day" column="day_id" nameColumn="day_name"
             type="Integer" levelType="TimeDays" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"
             type="Integer" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  ..
</Cube>
16. Mondrian schema II
‣ Measures are defined as aggregates on columns
<Cube name="Pet sales" defaultMeasure="# Orders">
  ..
  <Measure name="# Orders" column="order_id" datatype="Integer" aggregator="distinct-count" formatString="Standard"/>

  <Measure name="Revenue" column="price" datatype="Integer" aggregator="sum" formatString="Currency"/>

  <Measure name="Sold items" column="item_id" datatype="Integer" aggregator="count" formatString="Standard"/>

  <CalculatedMember name="Avg cart value" dimension="Measures">
    <Formula>[Measures].[Revenue] / [Measures].[# Orders]</Formula>
  </CalculatedMember>
</Cube>
Each KPI is always computed in the same way
‣ Mondrian = SQL query generator
MDX query:
SELECT [Date].[Month].Members
  ON COLUMNS,
  [Measures].[Avg cart value]
  ON ROWS
FROM [Pet sales]
➞ generated SQL:
SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"
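To make the step from the generated SQL to the calculated member concrete, here is a small sketch that runs that SQL against the example data in SQLite and then divides revenue by the number of orders per month. SQLite stands in for PostgreSQL, the dim schema is emulated by attaching a second in-memory database, and doing the division in Python is an assumption about where the formula is evaluated.

```python
import sqlite3

# The SQL shown on the slide, unchanged.
GENERATED_SQL = """
SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"
"""


def avg_cart_value_by_month():
    """Return {month_id: revenue / number of orders} for the example data."""
    db = sqlite3.connect(":memory:")
    # Emulate the "dim" schema with an attached in-memory database.
    db.execute("ATTACH DATABASE ':memory:' AS dim")
    db.executescript("""
        CREATE TABLE dim.day (day_id INTEGER PRIMARY KEY, day_name TEXT,
                              month_id INTEGER, month_name TEXT);
        CREATE TABLE dim.order_item (item_id INTEGER, order_id INTEGER,
                                     has_voucher INTEGER, price INTEGER,
                                     day_fk INTEGER, product_fk INTEGER);
        INSERT INTO dim.day VALUES
            (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
        INSERT INTO dim.order_item VALUES
            (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
            (3, 2, 2, 20, 930, 1), (4, 3, NULL, 30, 930, 3),
            (5, 4, 4, 10, 1001, 2), (6, 4, 4, 30, 1001, 3);
    """)
    rows = db.execute(GENERATED_SQL).fetchall()
    db.close()
    # [Measures].[Revenue] / [Measures].[# Orders], evaluated per month
    return {month_id: revenue / orders for month_id, orders, revenue in rows}


if __name__ == "__main__":
    for month_id, value in sorted(avg_cart_value_by_month().items()):
        print(month_id, round(value, 2))
```

September has three orders with a total revenue of 80 and October one order with a revenue of 40, which matches the cell values in the XMLA response later in the deck.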
17. Mondrian schema III
‣ Everything about KPIs & dimensions (business) and
tables & columns (IT) in one file
• consistent & explicit semantics
• transparency is easy
Always draw your Mondrian schema!
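Because business terms and table columns sit in the same file, a readable outline can be generated from it. The sketch below uses the Python standard library on an abbreviated copy of the schema from the previous slides (attributes that are not needed for the outline are dropped); the describe function and its output format are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the "Pet sales" schema shown on the previous slides.
SCHEMA = """
<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>
  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"/>
      <Level name="Day" column="day_id" nameColumn="day_name"/>
    </Hierarchy>
  </Dimension>
  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"/>
    </Hierarchy>
  </Dimension>
  <Measure name="# Orders" column="order_id" aggregator="distinct-count"/>
  <Measure name="Revenue" column="price" aggregator="sum"/>
  <Measure name="Sold items" column="item_id" aggregator="count"/>
</Cube>
"""


def describe(schema_xml):
    """Return a text outline of a cube: fact table, dimensions, measures."""
    cube = ET.fromstring(schema_xml)
    fact = cube.find("Table").get("name")
    lines = [f"cube {cube.get('name')!r} on fact table {fact}"]
    for dim in cube.findall("Dimension"):
        table = dim.find("Hierarchy/Table").get("name")
        levels = [lvl.get("name") for lvl in dim.findall("Hierarchy/Level")]
        lines.append(
            f"  dimension {dim.get('name')}: "
            f"{fact}.{dim.get('foreignKey')} -> {table} "
            f"({' > '.join(levels)})")
    for measure in cube.findall("Measure"):
        lines.append(
            f"  measure {measure.get('name')}: "
            f"{measure.get('aggregator')}({measure.get('column')})")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(SCHEMA)))
```

The same traversal could feed a diagram generator instead of plain text.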
18. Ad-hoc queries with Saiku Analytics
‣ Drag & drop reporting tool on top of Mondrian
• Open source (Apache 2.0)
• Talks to Mondrian via MDX
• http://meteorite.bi/saiku
Try it out immediately, it’s amazing: http://demo.analytical-labs.com/
19. Reports in Yves & Zed I
‣ Own lightweight reporting frontend
• Bootstrap / Google Charts
• lacks many features
• features are easy to implement
Numbers are random!
20. Reports in Yves & Zed II
‣ Dashboard-like interactive reports
• maintained by developers
• each table / chart is an MDX query
Numbers are random!
21. XMLA = XML for Analysis = MDX via SOAP
‣ Industry standard originally proposed by Microsoft
• http://en.wikipedia.org/wiki/XML_for_Analysis
• SOAP protocol to discover and query OLAP cubes
• Mondrian has an XMLA server
‣ Request
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command>
        <Statement>
          <![CDATA[
            SELECT [Date].[Month].Members
              ON COLUMNS,
              [Measures].[Avg cart value]
              ON ROWS
            FROM [Pet sales]
          ]]>
        </Statement>
      </Command>
      <Properties>
        <PropertyList>
          <Catalog>dwh</Catalog>
          <DataSourceInfo>Monsai</DataSourceInfo>
          <Format>Multidimensional</Format>
‣ Response
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header ../>
  <SOAP-ENV:Body>
    <cxmla:ExecuteResponse xmlns:cxmla="urn:schemas-microsoft-..">
      <cxmla:return>
        <root>
          <OlapInfo ../>
          <Axes>
            <Axis name="Axis0" ../>
            <Axis name="Axis1">
              <Tuples>
                <Tuple>
                  <Member Hierarchy="Measures" ..>
                </Tuple>
              </Tuples>
            </Axis>
            <Axis name="SlicerAxis" ../>
          </Axes>
          <CellData>
            <Cell CellOrdinal="0">
              <Value xsi:type="xsd:double">26.666666666666668</Value>
              <FmtValue>26,67 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
            <Cell CellOrdinal="1">
              <Value xsi:type="xsd:double">40</Value>
              <FmtValue>40,00 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
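A client for this protocol needs little more than string templating and an XML parser. The sketch below builds an Execute request like the one on this slide and pulls the cell values out of a response. It uses the standard SOAP 1.1 envelope namespace where the slide elides it, matches response elements by local name because the response namespaces are truncated on the slide, and assumes numeric cell values; the sample response is a shortened, hypothetical stand-in for a real one.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 (assumed)
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"     # from the slide


def build_execute_request(mdx, catalog="dwh", data_source="Monsai"):
    """Return the SOAP body for an XMLA Execute call as a string."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="{SOAP_NS}">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <Execute xmlns="{XMLA_NS}">
      <Command><Statement>{escape(mdx)}</Statement></Command>
      <Properties>
        <PropertyList>
          <Catalog>{escape(catalog)}</Catalog>
          <DataSourceInfo>{escape(data_source)}</DataSourceInfo>
          <Format>Multidimensional</Format>
        </PropertyList>
      </Properties>
    </Execute>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def cell_values(response_xml):
    """Return the <Value> of every <Cell> as floats, in document order."""
    root = ET.fromstring(response_xml)
    values = []
    for element in root.iter():
        if local_name(element.tag) != "Cell":
            continue
        for child in element:
            if local_name(child.tag) == "Value":
                values.append(float(child.text))
    return values


# Shortened, hypothetical response containing only the CellData part.
SAMPLE_RESPONSE = """<root><CellData>
  <Cell CellOrdinal="0"><Value>26.666666666666668</Value></Cell>
  <Cell CellOrdinal="1"><Value>40</Value></Cell>
</CellData></root>"""

if __name__ == "__main__":
    mdx = ("SELECT [Date].[Month].Members ON COLUMNS, "
           "[Measures].[Avg cart value] ON ROWS FROM [Pet sales]")
    print(build_execute_request(mdx))
    print(cell_values(SAMPLE_RESPONSE))
```

Sending the request is left out on purpose: the endpoint URL and authentication depend on how the XMLA server is deployed.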
22. Data Warehouse in Yves & Zed
[Architecture diagram: application databases, csv files, json files and apis feed the data integration, which fills the database. Reporting sends XMLA / MDX requests to monsai and receives Mondrian XMLA responses with the MDX results. monsai uses the Mondrian schema as its mapping, sends SQL to the database and receives the DB results.]
‣ monsai = Mondrian XMLA Server + Saiku in a single war file, https://github.com/project-a/monsai
23. What kind of people do you need to hire for this?
‣ The “typical BI expert”:
• studied something related to business and learnt VBA
programming through Excel
• relies on others to set up databases and tools
‣ Your ideal candidate
• has studied computer science
• masters the basic tools of software development and
computer science
• likes to learn new technologies
• understands how databases work
‣ Good profile example:
http://www.project-a.com/en/careers/jobs/?yid=332
For our "A-Team" we are looking to fill the following position as soon as possible
Data Engineer / Data Scientist (m/f)
Your tasks:
• You will help our business intelligence team to build data driven applications for our ventures: data warehouses, recommendation engines and CRM systems (developed in-house, based on open-source technologies)
• You will integrate, transform and index data from various data sources, develop meaningful data representations and visualisations, and provide aggregated data for third-party systems
• You will advance our software architecture and tool set to growing challenges and data amounts (performance, scaling, data quality)
• You will work in an agile software development process in close collaboration with a product management team
Your profile:
• You have a Master's degree in computer science or a comparable degree
• You have a genuine interest in data and algorithms and you are excited about solving difficult problems and strive for efficient and robust solutions
• You master at least these basic tools of computer science: object oriented programming in multiple languages, HTTP and current web technologies, the unix command line and basic server administration, version control systems, a basic understanding of the interplay between software and memory, hard discs and the CPU
• You have profound knowledge about the inner workings of database systems
• You are eager to delve into new technologies and programming languages (our current stack: Mac or Linux, PostgreSQL, Mondrian & MDX, PHP, Java, Python, Solr, ElasticSearch, R)
• You have a basic understanding of mathematics and machine learning
Your chance:
• You will join a highly professional and motivated team
• You will have the unique opportunity to witness the launch of a newly established company and you can contribute your own ideas to its development
Search for computer scientists, not business intelligence experts
24. Use a standard software engineering process!
‣ Product managers: what?
• Collection of business requirements
• KPI & report definitions
• QA & analysis
‣ Developers: how?
• Implementation, performance & stability
• Schema & process design
• Consistency checks
Any kind of Scrum / Kanban works, do it
[Diagram: graph of interdependent KPI definitions, including Net revenue, Gross revenue, Net voucher cost, Gross voucher cost, Tax amount, Net shipping revenue, Gross shipping revenue, …]
25. Thank you
Data integration is easy if you keep things simple!
http://www.project-a.com/