How to build a data warehouse - code.talks 2014

Martin Loetzsch
How to Build a Data Warehouse?
Project A Ventures, Berlin
http://project-a.com
http://twitter.com/martin-loetzsch

The “typical startup”
‣ Has data in
• application database
• Excel & csv files
• external tools
‣ Excel based reporting chains
• manual sql queries, CSVs
• copy & paste from external data sources
• difficult to debug and test
• sometimes cranky
‣ Everybody pulls their own numbers. # Orders?
-- count rows
SELECT count(*) FROM orders;

-- count everything except test orders
SELECT count(*) FROM orders
WHERE is_test IS NULL;

-- count everything that was once paid
SELECT count(*) FROM orders
JOIN order_history ON order_fk = order_id
WHERE status_id = 17;
‣ Does not have “big data”
‣ Will not have “big data” in the relevant future
If Excel works for your company, stick to it

Data driven growth requires integrated data
‣ Integrated data = Data Warehouse
Integrate data!
‣ Data in the Data Warehouse is
• the single point of truth
• cleaned up & validated
• easy to access
• embedded in the organisation
‣ Connect data from different domains
[Diagram: sources (application databases, csv files, json files, apis) → DWH (orders, users, products, stocks, prices, emails, clicks, …) → consumers (reporting, marketing, crm, search, pricing, …)]

How to build a Data Warehouse?
‣ 1. Use a BI solution from one of the big vendors
• classic agency business
• takes forever in startup time
• usually too expensive
‣ 2. Use a cloud based DWH solution
• covers only 80% of your business questions
• usually not possible to extend
‣ 3. Build your own, it’s easy!
• with technology that existed in the 1990s
• simple ETL scripts running inside PostgreSQL
• open source Pentaho Mondrian as query processor
• own lightweight reporting frontend
• integrated in own shop system
‣ Keep it simple & pragmatic
‣ Don’t use big data technologies if you don’t have big data
Invest in own BI infrastructure

Basis of any Data Warehouse: fact tables
item_id | order_id | has_voucher | price | day   | product
1       | 1        |             | 20    | 09-30 | Cat
2       | 1        |             | 10    | 09-30 | Dog
3       | 2        | 2           | 20    | 09-30 | Cat
4       | 3        |             | 30    | 09-30 | Cow
5       | 4        | 4           | 10    | 10-01 | Dog
6       | 4        | 4           | 30    | 10-01 | Cow
Works with Excel, SQL frontends, Elasticsearch, Mondrian & other BI front ends
‣ KPIs: aggregations on single columns
• # Sold items: count(item_id)
• # Orders: distinct-count(order_id)
• # Orders with vouchers: distinct-count(has_voucher)
• Revenue: sum(price)
• Avg product price: avg(price)
‣ All time orders?
SELECT count(DISTINCT order_id) FROM order_item;
‣ Revenue October 1st?
SELECT sum(price) FROM order_item WHERE day = '10-01';
‣ Sales by product?
SELECT product, count(item_id) FROM order_item GROUP BY product;
‣ Allowed query operations
• Aggregations (count, distinct-count, sum, avg)
• Filtering
• Grouping
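
The KPI definitions on this slide can be checked against the six example rows. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the warehouse database and assuming that has_voucher is NULL for orders without a voucher; the function names are made up for illustration.

```python
import sqlite3

# The six example rows of the fact table. has_voucher repeats the order id
# when a voucher was used and is NULL otherwise (assumed representation).
ROWS = [
    (1, 1, None, 20, "09-30", "Cat"),
    (2, 1, None, 10, "09-30", "Dog"),
    (3, 2, 2, 20, "09-30", "Cat"),
    (4, 3, None, 30, "09-30", "Cow"),
    (5, 4, 4, 10, "10-01", "Dog"),
    (6, 4, 4, 30, "10-01", "Cow"),
]


def build_fact_table():
    """Create an in-memory order_item fact table with the example rows."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE order_item (item_id INTEGER, order_id INTEGER,"
        " has_voucher INTEGER, price INTEGER, day TEXT, product TEXT)"
    )
    db.executemany("INSERT INTO order_item VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    return db


def kpis(db):
    """Each KPI is one aggregation on a single column."""
    row = db.execute(
        "SELECT count(item_id), count(DISTINCT order_id),"
        " count(DISTINCT has_voucher), sum(price), avg(price)"
        " FROM order_item"
    ).fetchone()
    names = ["sold_items", "orders", "orders_with_vouchers",
             "revenue", "avg_product_price"]
    return dict(zip(names, row))


def revenue_on(db, day):
    """Filtering: revenue of a single day."""
    query = "SELECT sum(price) FROM order_item WHERE day = ?"
    return db.execute(query, (day,)).fetchone()[0]


def sales_by_product(db):
    """Grouping: number of sold items per product."""
    query = "SELECT product, count(item_id) FROM order_item GROUP BY product"
    return dict(db.execute(query))
```

Because distinct-count ignores NULLs, counting distinct has_voucher values yields the number of orders with vouchers without any extra filter.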

Dimensional modelling
‣ Move redundant categorical data to “dimension” tables
[Schema: order_item (item_id, order_id, has_voucher, price, day_fk, product_fk) references day (day_id, day_name, month_id, month_name) and product (product_id, product_name)]

order_item
item_id | order_id | has_voucher | price | day_fk | product_fk
1       | 1        |             | 20    | 930    | 1
2       | 1        |             | 10    | 930    | 2
3       | 2        | 2           | 20    | 930    | 1
4       | 3        |             | 30    | 930    | 3
5       | 4        | 4           | 10    | 1001   | 2
6       | 4        | 4           | 30    | 1001   | 3

day
day_id | day_name | month_id | month_name
930    | 09-30    | 9        | Sep
1001   | 10-01    | 10       | Oct

product
product_id | product_name
1          | Cat
2          | Dog
3          | Cow

Key challenge: finding good keys
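
Once the categorical data lives in dimension tables, a KPI is computed by joining the fact table to a dimension and grouping by one of its columns. A minimal sketch, again assuming sqlite3 as a stand-in database; the function names are illustrative and not part of the talk.

```python
import sqlite3


def build_star_schema():
    """Create the example fact table and its two dimension tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE day (day_id INTEGER PRIMARY KEY, day_name TEXT,
                          month_id INTEGER, month_name TEXT);
        CREATE TABLE product (product_id INTEGER PRIMARY KEY,
                              product_name TEXT);
        CREATE TABLE order_item (item_id INTEGER PRIMARY KEY,
                                 order_id INTEGER, has_voucher INTEGER,
                                 price INTEGER, day_fk INTEGER,
                                 product_fk INTEGER);
        INSERT INTO day VALUES (930, '09-30', 9, 'Sep');
        INSERT INTO day VALUES (1001, '10-01', 10, 'Oct');
        INSERT INTO product VALUES (1, 'Cat');
        INSERT INTO product VALUES (2, 'Dog');
        INSERT INTO product VALUES (3, 'Cow');
        INSERT INTO order_item VALUES (1, 1, NULL, 20, 930, 1);
        INSERT INTO order_item VALUES (2, 1, NULL, 10, 930, 2);
        INSERT INTO order_item VALUES (3, 2, 2, 20, 930, 1);
        INSERT INTO order_item VALUES (4, 3, NULL, 30, 930, 3);
        INSERT INTO order_item VALUES (5, 4, 4, 10, 1001, 2);
        INSERT INTO order_item VALUES (6, 4, 4, 30, 1001, 3);
        """
    )
    return db


def revenue_by_month(db):
    """Join the fact table to the day dimension and group by month."""
    query = """
        SELECT day.month_name, sum(order_item.price)
        FROM order_item
        JOIN day ON order_item.day_fk = day.day_id
        GROUP BY day.month_id, day.month_name
        ORDER BY day.month_id
    """
    return db.execute(query).fetchall()
```

The integer keys (930, 1001) keep the fact table narrow, while the readable names live only once in the dimension table.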

Real life schemas I
‣ https://www.contorion.de/
early stage project
[Schema diagram, listed as table: columns]
order_item_status: order_item_status_id, order_item_status_sort_id, order_item_status_name
order_item_status_mapping: order_item_status_fk, order_item_status_partition_fk
sales_event: sales_event_id, order_item_fk, order_item_current_status_fk, order_timestamp, event_timestamp, hours_since_order, hours_since_last_event, hours_to_next_event, estimated_net_revenue
order_item_status_partition: order_item_status_partition_id, order_item_status_perspective_id, order_item_status_perspective_name, order_item_status_process_id, order_item_status_process_name, order_item_status_group_id, order_item_status_group_name
order_item: order_item_id, order_fk, merchant_fk, product_fk, category_fk, category_tree_fk, order_process_fk, processed_order_item_id, net_shipping_revenue, tax_amount_shipping, gross_voucher_value, net_voucher_value, gross_revenue_before_voucher, net_item_value, gross_item_value, tax_amount_before_voucher, tax_amount_voucher, gross_shipping_revenue, gross_shipping_revenue_before_voucher, net_purchase_cost, gross_purchase_cost, net_revenue_returned, net_revenue_canceled, net_payment_cost, net_return_cost_and_loss_and_fraud, net_shipping_and_fulfillment_cost, net_marketing_expenses
address: zip_code, first_name, last_name, city, country_fk, gender, account_disabled, company, phone, cell_phone
country: country_id, country_name
order: order_id, increment_id, order_type_fk, is_first_order_id, is_follow_up_order_id, is_second_order_id, is_second_or_subsequent_order_id, customer_fk, returning_customer_fk, order_rank_fk, items_per_order_fk, payment_method_fk, payment_provider_fk, zip_code_fk, order_rank_1st_fk
order_rank: order_rank_id, order_rank_name, order_rank_group_id, order_rank_group_name
customer: customer_id, increment_id, customer_name, email, number_of_orders, first_order_date, last_order_date, avg_days_between_orders, number_of_orders_with_vouchers, phone, company, gender_fk, customer_type_fk, customer_group_fk, customer_industry_fk
order_type: order_type_id, order_type_name
order_date: day_fk, hour_of_day_fk, day_of_week_fk, order_fk, order_date_perspective_fk
customer_date: day_fk, customer_fk, customer_date_perspective_fk
order_date_perspective: order_date_perspective_id, order_date_perspective_name
hour_of_day: hour_of_day_id, hour_of_day_name
day_of_week: day_of_week_id, day_of_week_name
day: day_id, day_reversed_id, day_name, year_id, year_reversed_id, iso_year_id, iso_year_reversed_id, quarter_id, quarter_name, month_id, month_reversed_id, month_name, week_id, week_reversed_id, week_name
customer_date_perspective: customer_date_perspective_id, customer_date_perspective_name
gender: gender_id, gender_name
customer_type: customer_type_id, customer_type_name
customer_group: customer_group_id, customer_group_name
sales_event_duration_perspective: sales_event_duration_perspective_id, sales_event_duration_perspective_name
sales_event_duration: sales_event_fk, duration_fk, sales_event_duration_perspective_fk
duration: duration_id, days, days_name, weeks, weeks_name, months, months_name, quarters, quarters_name, years, years_name
sales_event_date: sales_event_fk, day_fk, sales_event_date_perspective_fk
sales_event_date_perspective: sales_event_date_perspective_id, sales_event_date_perspective_name
product: product_id, sku, ean, sales_sku, merchant_sku, product_name
category: category_id, category_parent_fk, category_name
newsletter_event: newsletter_event_id, day_fk, campaign_fk, customer_increment_fk, sent, bounce, bounce_block, bounce_soft, bounce_hard, bounce_reason_fk, open, first_open, click, first_click, url_fk, complaint, subsequent_order_fk, first_order, gross_revenue_before_voucher, gross_voucher_value, net_voucher_value, tax_amount_before_voucher, net_purchase_cost
url: (no columns shown)
cost_per_campaign_and_day: day_fk, campaign_fk, number_of_clicks, imported_cost_mci, imported_cost_api, cost_of_clicks_directly_assigned, cost_of_clicks_campaigns_without_clicks, cost_of_clicks_unknown_campaign
campaign (newsletter): campaign_id, campaign_name, subject, sent_hour, sent_date
campaign (marketing): campaign_id, campaign_name, level_3_id, level_3_name, level_2_id, level_2_name, channel_fk
channel: channel_id, channel_name
Star schema, galaxy schema, nth normal form? Doesn’t matter, do what’s fastest.

Real life schemas II
‣ https://www.worldremit.com/
finished soon* project
* A Data Warehouse is never finished
[Schema diagram, listed as table: columns]
campaign_click_performance: campaign_click_fk, performance_attribution_model_fk, attribution_path_segment_fk, number_of_signups, number_of_activations, number_of_transactions, gross_revenue
campaign_click: campaign_click_id, visitor_id, day_fk, campaign_fk, user_fk, path_segment_fk, path_position_fk, reverse_path_position_fk, step_fk, next_step_fk, step_reverse_fk, next_step_reverse_fk, number_of_clicks, number_of_new_visitors, number_of_daily_visitors, number_of_monthly_visitors, duration_fk, time_to_end, marketing_cost
path_segment: path_segment_id, path_segment_name
performance_attribution_model: performance_attribution_model_id, performance_attribution_model_name
user_date: day_fk, user_fk, user_time_perspective_fk
cost_per_campaign_and_day: day_fk, campaign_fk, number_of_clicks, imported_cost_mci, imported_cost_api, cost_of_clicks_directly_assigned, cost_of_clicks_campaigns_without_clicks, cost_of_clicks_unknown_campaign
campaign: campaign_id, campaign_name, level_3_id, level_3_name, level_2_id, level_2_name, channel_fk, corridor_fk
conversion_path_transition: conversion_path_transition_id, conversion_path_transition_name, channel_with_position_id, channel_with_position_name
reverse_path_position: reverse_path_position_id, reverse_path_position_name
path_position: path_position_id, path_position_name
user: user_id, number_of_users, customer_id, number_of_customers, repeat_customer_id, gender_fk, age_fk, user_city_fk, user_state_fk, user_country_fk, most_freq_corridor_fk, total_transaction_range_fk, referral_source_fk, transaction_frequency_fk, has_sent_cash_id, has_sent_airtime_id, has_sent_cash_and_airtime_id, sent_amount_money_transfer, number_of_transactions_with_voucher, fees, fx_gain, sent_amount_airtime, voucher_cost_money_transfer, voucher_cost_airtime, days_between_signup_and_first_transaction, days_between_signup_and_second_transaction, days_between_first_and_second_transaction, days_between_second_and_third_transaction, average_days_between_transactions, days_since_last_transaction, days_since_last_login
day: day_id, day_name, year_id, iso_year_id, quarter_id, quarter_name, month_id, month_name, week_id, week_name, day_of_week_id, day_of_week_name, day_of_month_id, day_of_month_reversed_id, number_of_days_in_month
duration: duration_id, days, days_name, weeks, weeks_name, months, months_name, quarters, quarters_name, years, years_name, conversions, conversions_name
channel: channel_id, channel_name
corridor: corridor_id, corridor_name, sender_country_fk, sender_country_name, receiver_country_fk, receiver_country_name
campaign_cohort: user_fk, campaign_fk, channel_fk, day_fk, duration_fk, gross_revenue
age: age_id, age_name, age_group_id, age_group_name
user_city: user_city_id, user_city_name, user_state_id, user_state_name, user_country_fk, user_country_name
gender: gender_id, gender_name
referral_source: referral_source_id, referral_source_name
total_transaction_range: total_transaction_range_id, total_transaction_range_name
country: country_id, country_code, country_name
transaction_frequency: transaction_frequency_id, transaction_frequency_name
transaction: transaction_id, number_of_first_transactions, number_of_second_transactions, number_of_third_transactions, number_of_subsequent_transactions, number_of_transactions_with_voucher, number_of_first_transactions_with_voucher, number_of_money_transfer_transactions, number_of_airtime_transactions, number_of_on_hold_transactions, number_of_pending_transactions, number_of_paid_transactions, is_repeat_customer_id, transaction_status_fk, cancellation_status_fk, customer_fk, user_fk, sender_city_fk, receiver_city_fk, sender_currency_fk, receiver_currency_fk, correspondent_fk, voucher_fk, corridor_fk, payment_method_fk, receive_method_fk, transaction_rank_fk, bank_fk, sent_amount_range_fk, sent_amount_airtime, receive_amount_creation, receive_amount_payout, total_to_pay, fx_gain, fees, fx_rate_gbp_to_sent_amount, fx_rate_gbp_to_receive_amount_creation_date, fx_rate_gbp_to_receive_amount_payout_date, fx_rate_sent_to_receive
bank: bank_id, bank_name
transaction_rank: transaction_rank_id, transaction_rank_name, transaction_rank_group_id, transaction_rank_group_name
cancellation_status: cancellation_status_id, cancellation_status_name
correspondent: correspondent_id, correspondent_name
transaction_city: transaction_city_id, transaction_city_name, transaction_state_id, transaction_state_name, transaction_country_fk, transaction_country_name, transaction_country_code, transaction_capital_latitude, transaction_capital_longitude
currency: currency_id, currency_code, currency_name
voucher: voucher_id, voucher_name, voucher_type_id, voucher_type_name, voucher_percentage_id, voucher_percentage_name, voucher_receive_method_group_id, voucher_receive_method_group_name, voucher_start_date_fk, voucher_end_date_fk, voucher_duration_days_id, voucher_duration_days_name, voucher_duration_range_id, voucher_duration_range_name
payment_method: payment_method_id, payment_method_name, payment_method_group_id, payment_method_group_name
sent_amount_range: sent_amount_range_id, origin_currency_fk, sent_amount_range_name, range_lower_limit, range_upper_limit
receive_method: receive_method_id, receive_method_name, receive_method_group_id, receive_method_group_name, receive_service_id
transaction_status: transaction_status_id, transaction_status_name
foreign_exchange_rate: foreign_exchange_rate_id, day_fk, sender_currency_fk, receiver_currency_fk, foreign_exchange_rate, foreign_exchange_rate_without_markup
voucher_usage_fact: day_fk, voucher_fk, voucher_duration_days_id, voucher_duration_days_name, voucher_duration_range_id, voucher_duration_range_name, voucher_start_date_fk, voucher_end_date_fk, voucher_is_money_transfer_id, voucher_is_airtime_id, voucher_is_valid_id, voucher_is_used_id, voucher_receive_method_id, voucher_receive_method_name, number_of_customers, number_of_first_transactions, fees, fx_gain, sent_amount_airtime
transaction_event_date: transaction_event_fk, day_fk, transaction_event_time_perspective_fk
transaction_event: transaction_event_id, number_of_transaction_events, transaction_fk, previous_status_fk, current_status_fk, hours_since_transaction, hours_to_next_event, sent_amount_airtime, fx_gain, fees
transaction_event_time_perspective: transaction_event_time_perspective_id, transaction_event_time_perspective_name
transaction_date: day_fk, transaction_fk, transaction_time_perspective_fk
transaction_duration: duration_fk, transaction_fk, transaction_time_perspective_fk
transaction_time_perspective: transaction_time_perspective_id, transaction_time_perspective_name
transaction_event_duration: transaction_event_fk, duration_fk, transaction_event_duration_perspective_fk

Real life schemas III
‣ http://www.saatchiart.com/
exit August 2014
[Schema diagram, listed as table: columns]
order_item: order_item_id, processed_order_item_id, is_original_id, is_print_id, processed_product_id, order_fk, product_fk, price_range_fk, order_process_fk, option_fk, fulfillment_provider_fk, refund_reason_fk, gross_revenue_item, net_item_price, net_item_price_first_order, vat_amount, net_shipping_revenue_first_order, duties_amount, gross_revenue_item_option, net_option_price, net_option_price_first_order, net_payment_cost, net_option_cost, net_printing_cost, net_voucher_amount_saatchi_share, net_voucher_amount_artist_share, net_voucher_amount_saatchi_share_first_order, net_voucher_amount_artist_share_first_order, artist_commission, artist_royalties, estimated_net_revenue_after_vouchers, origin_country_iso2, origin_latitude, origin_longitude, destination_latitude, destination_longitude
artwork: artwork_id, artist_fk, showdown_fk, artwork_category_fk, artwork_subject_fk, artwork_is_curated_fk, artwork_is_user_collection_fk, artwork_is_admin_collection_fk, artwork_related_fk, artwork_sale_category_fk, artwork_for_sale_as_print_fk, artwork_for_sale_as_original_fk, date_uploaded_fk, artwork_in_showdown_fk, artwork_in_weekly_roundup_fk, artwork_is_visible, artwork_is_in_curated, artwork_is_in_user_collection, artwork_is_in_admin_collection, user_collections_per_artwork, admin_collections_per_artwork, url, title, styles, artist_name, artist_first_name, artist_last_name
option: option_id, option_name
artwork_for_sale_as_original: artwork_for_sale_as_original_id, artwork_for_sale_as_original_name
artwork_category: artwork_category_id, artwork_category_name
artwork_in_showdown: artwork_in_showdown_id, artwork_in_showdown_name
artwork_for_sale_as_print: artwork_for_sale_as_print_id, artwork_for_sale_as_print_name
artwork_in_weekly_roundup: artwork_in_weekly_roundup_id, artwork_in_weekly_roundup_name
artwork_is_admin_collection: artwork_is_admin_collection_id, artwork_is_admin_collection_name
artwork_is_curated: artwork_is_curated_id, artwork_is_curated_name
artwork_is_user_collection: artwork_is_user_collection_id, artwork_is_user_collection_name
artwork_related: artwork_related_id, artwork_related_name
artwork_sale_category: artwork_sale_category_id, artwork_sale_category_name
artwork_subject: artwork_subject_id, artwork_subject_name
round: round_id, showdown_id, showdown_round, showdown_title_sort_id, showdown_title
user: user_id, user_type_fk, user_status_fk, user_city_fk, artist_with_artwork_for_sale_id, artist_with_artwork_uploaded_id, user_name, user_first_name, user_last_name, email, number_of_weekly_roundup, number_of_showdown, number_of_artwork_comments, number_of_collection_comments, number_of_artworks_in_user_collections, number_of_user_likes, number_of_collection_favourites, number_of_user_logins, number_of_messages_sent, number_of_uploads, hours_to_first_upload, number_of_bought_items, number_of_originals_bought, number_of_prints_bought, number_of_orders_made, net_item_price_bought, net_item_revenue_bought, gross_revenue_after_vouchers_bought, net_revenue_after_vouchers_bought, net_voucher_cost_bought, number_of_sold_items, number_of_originals_sold, number_of_prints_sold, number_of_orders_sold, net_item_price_sold, net_item_revenue_sold, net_voucher_cost_sold
product: product_id, sku, artwork_fk, product_category_fk, substrate_fk
product_category: product_category_id, product_category_name, edition_type
substrate: substrate_id, substrate_name
collection_artwork_order_item: collection_artwork_order_item_id, collection_fk, artwork_fk, order_item_fk
collection: collection_id, collection_name, user_fk, collection_type_fk, collection_detailed_type_fk, date_created_fk, date_initiated_fk
artwork_style_mapping: artwork_fk, artwork_style_fk
artwork_style: artwork_style_id, artwork_style_name
artwork_in_collection: artwork_fk, collection_fk
sales_event_duration_perspective: sales_event_duration_perspective_id, sales_event_duration_perspective_name
sales_time_perspective: sales_time_perspective_id, sales_time_perspective_name
collection_artwork_order_item_date: collection_artwork_order_item_fk, day_fk, collection_artwork_order_item_time_perspective_fk
day: day_id, day_name, year_id, iso_year_id, quarter_id, quarter_name, month_id, month_name, week_id, week_name, day_of_the_month, number_of_days_in_month, iso_date
collection_artwork_order_item_time_perspective: collection_artwork_order_item_time_perspective_id, collection_artwork_order_item_time_perspective_name
collection_detailed_type: collection_detailed_type_id, collection_detailed_type_name
collection_type: collection_type_id, collection_type_name
campaign_click_date: campaign_click_fk, day_fk, online_marketing_time_perspective_fk
campaign_click: campaign_click_id, campaign_fk, search_phrase_fk, referrer_fk, user_fk, number_of_clicks, number_of_daily_visits, number_of_monthly_visits, number_of_new_visits, number_of_daily_visitors, number_of_monthly_visitors, subsequent_registration_fk, subsequent_confirmation_fk, subsequent_first_order_fk, subsequent_order_fk, direct_cost, cost_of_campaigns_without_clicks, unmatched_cost, visit_duration
online_marketing_time_perspective: online_marketing_time_perspective_id, online_marketing_time_perspective_name
email_event_date: email_event_fk, day_fk, email_time_perspective_fk
email_event: email_event_id, email_list_fk, email_campaign_fk, email_recipient_fk, subscribe, unsubscribe, email_unsubscribe_reason_fk, sent, bounce_soft, bounce_hard, open, first_open, click, first_click, subsequent_order, subsequent_first_order, items, net_item_price, net_option_price, net_voucher_amount_saatchi_share, net_voucher_amount_artist_share
email_time_perspective: email_time_perspective_id, email_time_perspective_name
transactional_mail: number_of_mails_sent, transactional_mail_type_fk, day_fk
transactional_mail_type: transactional_mail_type_id, transactional_mail_type_name
sales_event_date: sales_event_fk, day_fk, sales_event_date_perspective_fk
sales_event: sales_event_id, order_item_fk, order_item_current_status_fk, order_timestamp, event_timestamp, hours_since_order, hours_to_next_event, effected_net_revenue_after_vouchers, estimated_net_revenue_after_vouchers
sales_event_date_perspective: sales_event_date_perspective_id, sales_event_date_perspective_name
order_date: day_fk, order_fk, order_date_perspective_fk
order: order_id, order_increment_id, processed_order_id, is_first_order_id, is_second_order_id, is_second_or_subsequent_order_id, order_with_voucher_id, user_fk, returning_buyer_fk, hour_of_day_fk, voucher_fk, payment_method_fk, payment_provider_fk, shipping_city_fk, order_source_fk
order_date_perspective: order_date_perspective_id, order_date_perspective_name
sales_event_duration: sales_event_fk, duration_fk, sales_event_duration_perspective_fk
duration: duration_id, days, days_name, weeks, weeks_name, months, months_name, quarters, quarters_name, five_day_period, five_day_period_name
order_duration: order_fk, duration_fk, sales_time_perspective_fk
fulfillment_provider: fulfillment_provider_id, fulfillment_provider_name
order_item_status: order_item_status_id, order_item_status_sort_id, order_item_status_name
order_process: order_process_id, order_process_name, checkout_type_id, checkout_type, fulfillment_type_id, fulfillment_type
price_range: price_range_id, price_range_name
refund_reason: refund_reason_id, refund_reason_name, refund_code_id
hour_of_day: hour_of_day_id, hour_of_day_name
order_source: order_source_id, order_source_name
payment_method: payment_method_id, payment_method_name
payment_provider: payment_provider_id, payment_provider_name
shipping_city: shipping_city_id, shipping_city_name, shipping_country_id, shipping_country_name
voucher: voucher_id, voucher_name
order_item_status_partition: order_item_status_partition_id, order_item_status_perspective_id, order_item_status_perspective_name, order_item_status_group_id, order_item_status_group_name
order_item_refunds: order_item_refunds_id, order_item_fk, refund_code_id, refund_code, refund_desc, refund_amount, refund_date, refund_comment
order_item_status_mapping: (no columns shown)
email_campaign: email_campaign_id, email_campaign_name, email_list_fk
email_unsubscribe_reason: email_unsubscribe_reason_id, email_unsubscribe_reason_name
email_recipient: email_recipient_id, email, email_recipient_location_fk
email_list: email_list_id, email_list_name
email_recipient_location: email_recipient_location_id, country_id, country_name, region_id, region_name, latitude, longitude
user_city: user_city_id, user_city, user_country_id, user_country
user_status: user_status_id, user_status_name
user_type: user_type_id, user_type_name
user_event_date_registration: user_event_fk, day_fk, user_event_time_perspective_fk
user_event: user_event_id, user_fk, user_type_fk, user_event_date, registration_date, weekly_roundup, showdown, artwork_comment, collection_comment, artwork, user_likes, collection_favourite, user_login, message_sent, artwork_upload, artwork_for_sale_as_print, artwork_for_sale_as_original, artwork_for_sale_as_both_print_and_original, artwork_for_sale_as_either_print_or_original, signup, verified_signup, user_order, time_since_signup, time_since_last_order
user_event_date_event: user_event_fk, day_fk, user_event_time_perspective_fk
referrer: referrer_id, referrer_name, referrer_type_name
campaign: campaign_id, campaign_name, campaign_code, channel_id, channel_name, is_brand_id, is_brand_name, partner_or_adwords_account_id, partner_or_adwords_account_name, publication_or_adwords_campaign_id, publication_or_adwords_campaign_name, wmc_or_adwords_adgroup_id, wmc_or_adwords_adgroup_name
search_phrase: search_phrase_id, search_phrase_name, search_phrase_type_name
user_date: user_fk, day_fk, sales_time_perspective_fk
campaign_click_position: campaign_click_fk, conversion_type_fk
campaign_click_performance: campaign_click_fk, conversion_type_fk, number_of_registrations, number_of_leads, number_of_orders, number_of_received_orders, number_of_first_orders, number_of_orders_with_voucher, net_order_revenue
performance_attribution_model: performance_attribution_model_id, performance_attribution_model_name

Data integration
‣ Visual ETL tools
• many data source connectors
• hard to debug
• slow to change
Optimize for change speed!
‣ Start with simple SQL queries & batch scripts
cat create-tables.sql | psql dwh

cat load-order.sql \
  | mysql --skip-column-names source_db \
  | psql dwh --command="COPY tmp.order FROM STDIN WITH NULL AS 'NULL'"

cat /data/payment.csv \
  | python payment_filter.py \
  | psql dwh --command="COPY tmp.payment FROM STDIN"

cat transform-order.sql | psql dwh
‣ Later build something more robust
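
The talk does not show payment_filter.py itself. The following is a hypothetical sketch of what such a stdin-to-stdout filter could look like, assuming a comma-separated input whose first column is a numeric payment id; the column layout and the function names are assumptions made for illustration. It emits tab-separated lines with \N for empty fields, which is the default text format expected by PostgreSQL's COPY ... FROM STDIN.

```python
import csv
import sys


def filter_payments(lines):
    """Turn CSV lines into tab-separated rows suitable for COPY.

    Rows whose first field is not a numeric id (for example the header
    line) are dropped. Empty fields become \\N, the default NULL marker
    of PostgreSQL's text format. Fields containing tabs or backslashes
    would need extra escaping, which this sketch leaves out.
    """
    for row in csv.reader(lines):
        if len(row) < 2 or not row[0].strip().isdigit():
            continue  # header line or malformed row
        yield "\t".join(field.strip() or "\\N" for field in row)


def main():
    """Read CSV from stdin and write filtered rows to stdout."""
    for line in filter_payments(sys.stdin):
        sys.stdout.write(line + "\n")
```

Calling main() at the end of the file turns the module into the filter used in the pipeline above. Keeping the logic in a generator function makes it easy to test without a database.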

Data integration in Yves & Zed
‣ Jobs = processing steps with dependencies
• parallel execution with cost based scheduler
• robust, transparent, no black boxes
‣ Parallel jobs & incremental processing
‣ Extensive visualisations & monitoring tools
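
The scheduler used in Yves & Zed is not shown in the talk. The idea of jobs with dependencies and a cost-based execution order can be sketched as follows, assuming Python 3.9+ and its standard graphlib module; the dependency graph and the cost figures are made up for illustration, and only the job names come from the process definition on the next slide.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: job -> jobs that must finish first.
DEPENDENCIES = {
    "load-order": ["initialize-schemas"],
    "load-order-item": ["initialize-schemas"],
    "load-product": ["initialize-schemas"],
    "cleanse-member": ["initialize-schemas"],
    "cleanse-order": ["load-order", "cleanse-member",
                      "load-order-item", "load-product"],
}

# Hypothetical cost estimates, e.g. run times in seconds of the last run.
COSTS = {
    "load-order": 120,
    "load-order-item": 300,
    "load-product": 30,
    "cleanse-member": 60,
    "cleanse-order": 90,
}


def schedule(dependencies, costs):
    """Return batches of jobs whose dependencies are all satisfied.

    Jobs within a batch could run in parallel. They are ordered by
    descending cost so that the most expensive jobs start first.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(),
                       key=lambda job: (-costs.get(job, 0), job))
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

A real scheduler would start each job as soon as its own dependencies have finished instead of waiting for a whole batch; the batches only keep the sketch short.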

Plain text files
‣ Very git-friendly
<?xml version="1.0" encoding="UTF-8"?>
<process xmlns="http://project-a.com/dwh-process"
         id="operational-data" ..>

  <initial-job id="initialize-schemas">
    <description>Recreates schemas and writes configs</description>
    <commands>
      ..
    </commands>
  </initial-job>

  <job id="load-order">
    <description>Loads orders into tmp.order</description>
    <commands>
      <execute-sql-file file-name="orders/create-order-tmp-table.sql" echo-queries="true"/>
      <load-from-mysql file-name="orders/load-order.sql"
                       target-table="tmp.order" database="app"
                       timezone="UTC"/>
      <execute-sql>SELECT tmp.index_tmp_order();</execute-sql>
    </commands>
  </job>

  <job id="cleanse-order">
    <description>Deletes test orders and other invalid orders</description>
    <dependencies>
      <dependency job="cleanse-member"/>
      <dependency job="load-order-item"/>
      <dependency job="load-product"/>

MDX = query language for multidimensional data
‣ Developed by Microsoft as part of Analysis Services
• http://en.wikipedia.org/wiki/MultiDimensional_eXpressions
SELECT
  TopCount([Product].[Product].Members, 2, [Measures].[Revenue]) ON COLUMNS,
  [Measures].[Revenue] ON ROWS
FROM [Pet sales]
WHERE [Date].[Month].[Oct]

SELECT
  [Date].[Month].Members ON COLUMNS,
  CrossJoin({[Measures].[Sold items],
             [Measures].[# Orders],
             [Measures].[Revenue]},
            Descendants([Product].[All products])) ON ROWS
FROM [Pet sales]
Each KPI is always computed in the same way

Mondrian = engine for executing MDX
‣ Open source analytics processor
• http://mondrian.pentaho.com
• http://en.wikipedia.org/wiki/Mondrian_OLAP_server
• In Java
• Eclipse Public License
• Active community
• https://github.com/pentaho/mondrian/
‣ Part of Pentaho BI platform
[Book cover: Mondrian in Action, Open source business analytics, by William D. Back, Nicholas Goodman, Julian Hyde (Manning)]

Mondrian schema I
‣ The relation between fact tables and dimension tables is defined in an XML file
<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>

  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"
             type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
      <Level name="Day" column="day_id" nameColumn="day_name"
             type="Integer" levelType="TimeDays" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"
             type="Integer" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  ..
</Cube>

Mondrian schema II
‣ Measures are defined as aggregates on columns
<Cube name="Pet sales" defaultMeasure="# Orders">
  ..
  <Measure name="# Orders" column="order_id" datatype="Integer" aggregator="distinct-count" formatString="Standard"/>

  <Measure name="Revenue" column="price" datatype="Integer" aggregator="sum" formatString="Currency"/>

  <Measure name="Sold items" column="item_id" datatype="Integer" aggregator="count" formatString="Standard"/>

  <CalculatedMember name="Avg cart value" dimension="Measures">
    <Formula>[Measures].[Revenue] / [Measures].[# Orders]</Formula>
  </CalculatedMember>
</Cube>
Each KPI is always computed in the same way
‣ Mondrian = SQL query generator
SELECT ..
ON COLUMNS,
[Measures].[Avg cart value]
ON ROWS
FROM [Pet sales]
➞
SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"
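
How measure definitions turn into SQL can be illustrated with a toy generator. This is not Mondrian's implementation, only a sketch of the idea under simplifying assumptions: a single grouping level on the day dimension, sqlite3 as a stand-in database (so the "dim" schema prefix is dropped), and the calculated member evaluated in Python. All function names are made up for illustration.

```python
import sqlite3

# Aggregators as named in the schema file, mapped to SQL templates.
AGGREGATORS = {
    "count": "count({col})",
    "distinct-count": "count(DISTINCT {col})",
    "sum": "sum({col})",
}

# Measure name -> (aggregator, column), as in the <Measure> elements.
MEASURES = {
    "# Orders": ("distinct-count", "order_id"),
    "Revenue": ("sum", "price"),
    "Sold items": ("count", "item_id"),
}


def generate_sql(measure_names, level_column="month_id"):
    """Build one GROUP BY query that computes the requested measures."""
    selects = [f'"day"."{level_column}" AS "c0"']
    for index, name in enumerate(measure_names):
        aggregator, column = MEASURES[name]
        column_ref = f'"order_item"."{column}"'
        expression = AGGREGATORS[aggregator].format(col=column_ref)
        selects.append(f'{expression} AS "m{index}"')
    return (
        "SELECT " + ", ".join(selects)
        + ' FROM "day" AS "day", "order_item" AS "order_item"'
        + ' WHERE "order_item"."day_fk" = "day"."day_id"'
        + f' GROUP BY "day"."{level_column}"'
    )


def example_db():
    """The pet sales example data, reduced to the columns needed here."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE day (day_id INTEGER, month_id INTEGER)")
    db.execute("CREATE TABLE order_item (item_id INTEGER,"
               " order_id INTEGER, price INTEGER, day_fk INTEGER)")
    db.executemany("INSERT INTO day VALUES (?, ?)",
                   [(930, 9), (1001, 10)])
    db.executemany("INSERT INTO order_item VALUES (?, ?, ?, ?)",
                   [(1, 1, 20, 930), (2, 1, 10, 930), (3, 2, 20, 930),
                    (4, 3, 30, 930), (5, 4, 10, 1001), (6, 4, 30, 1001)])
    return db


def avg_cart_value_by_month(db):
    """Calculated member: [Revenue] / [# Orders], per month."""
    sql = generate_sql(["# Orders", "Revenue"])
    return {month: revenue / orders
            for month, orders, revenue in db.execute(sql)}
```

With the example data this gives the two cell values that also appear in the XMLA response later in the talk: about 26.67 for September and 40 for October.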

Mondrian schema III
‣ Everything about KPIs & dimensions (business) and tables & columns (IT) in one file
• consistent & explicit semantics
• transparency is easy
Always draw your Mondrian schema!
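
One way to draw a schema is to generate the drawing from the schema file itself. The sketch below is an assumption about how this could be done, not the tooling used in the talk: it reads a cut-down version of the "Pet sales" cube with Python's standard ElementTree module and emits Graphviz DOT text, with one edge per dimension from the fact table to the dimension table.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the cube definition shown earlier (levels omitted).
SCHEMA_XML = """
<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>
  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
    </Hierarchy>
  </Dimension>
  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products"
               primaryKey="product_id">
      <Table schema="dim" name="product"/>
    </Hierarchy>
  </Dimension>
</Cube>
"""


def schema_to_dot(schema_xml):
    """Return DOT text linking the fact table to its dimension tables."""
    cube = ET.fromstring(schema_xml)
    fact_table = cube.find("Table").get("name")
    lines = ["digraph schema {", f'  "{fact_table}" [shape=box];']
    for dimension in cube.findall("Dimension"):
        dim_table = dimension.find("Hierarchy/Table").get("name")
        foreign_key = dimension.get("foreignKey")
        lines.append(
            f'  "{fact_table}" -> "{dim_table}" [label="{foreign_key}"];'
        )
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be rendered with Graphviz, for example by saving it to a file and running dot -Tpng on it.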

Ad-hoc queries with Saiku Analytics
‣ Drag & drop reporting tool on top of Mondrian
• Open source (Apache 2.0)
• Talks to Mondrian via MDX
• http://meteorite.bi/saiku
Try it out immediately, it’s amazing: http://demo.analytical-labs.com/

Reports in Yves & Zed I
‣ Own lightweight reporting frontend
• Bootstrap / Google Charts
• lacks many features
• features are easy to implement
Numbers are random!

Reports in Yves & Zed II
‣ Dashboard-like interactive reports
• maintained by developers
• each table / chart is an MDX query
Numbers are random!

XMLA = XML for Analysis = MDX via SOAP
‣ Industry standard originally proposed by Microsoft
• http://en.wikipedia.org/wiki/XML_for_Analysis
• SOAP protocol to discover and query OLAP cubes
• Mondrian has an XMLA server
‣ Request
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command>
        <Statement>
          <![CDATA[
            SELECT ..
            ON COLUMNS,
            [Measures].[Avg cart value]
            ON ROWS
            FROM [Pet sales]
          ]]>
        </Statement>
      </Command>
      <Properties>
        <PropertyList>
          <Catalog>dwh</Catalog>
          <DataSourceInfo>Monsai</DataSourceInfo>
          <Format>Multidimensional</Format>
‣ Response
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header ../>
  <SOAP-ENV:Body>
    <cxmla:ExecuteResponse xmlns:cxmla="urn:schemas-microsoft-..">
      <cxmla:return>
        <root>
          <OlapInfo ../>
          <Axes>
            <Axis name="Axis0" ../>
            <Axis name="Axis1">
              <Tuples>
                <Tuple>
                  <Member Hierarchy="Measures" ..>
                </Tuple>
              </Tuples>
            </Axis>
            <Axis name="SlicerAxis" ../>
          </Axes>
          <CellData>
            <Cell CellOrdinal="0">
              <Value xsi:type="xsd:double">26.666666666666668</Value>
              <FmtValue>26,67 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
            <Cell CellOrdinal="1">
              <Value xsi:type="xsd:double">40</Value>
              <FmtValue>40,00 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
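
A client for this protocol needs little more than string building and XML parsing. The following sketch, using only Python's standard library, builds an Execute request like the one above and extracts the raw cell values from a response. The catalog and data source names are taken from the slide; the function names are made up, the statement is XML-escaped instead of wrapped in CDATA, and the actual HTTP POST to the XMLA endpoint (whose URL depends on the deployment) is left out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"


def build_execute_request(mdx, catalog="dwh", data_source="Monsai"):
    """Return the SOAP envelope for an XMLA Execute call."""
    return (
        f'<SOAP-ENV:Envelope xmlns:SOAP-ENV="{SOAP_NS}">'
        "<SOAP-ENV:Header/>"
        "<SOAP-ENV:Body>"
        f'<Execute xmlns="{XMLA_NS}">'
        f"<Command><Statement>{escape(mdx)}</Statement></Command>"
        "<Properties><PropertyList>"
        f"<Catalog>{escape(catalog)}</Catalog>"
        f"<DataSourceInfo>{escape(data_source)}</DataSourceInfo>"
        "<Format>Multidimensional</Format>"
        "</PropertyList></Properties>"
        "</Execute>"
        "</SOAP-ENV:Body>"
        "</SOAP-ENV:Envelope>"
    )


def cell_values(response_xml):
    """Collect the numeric <Value> of every cell, in document order.

    Element names are compared without their namespace so that the
    sketch does not depend on the exact namespaces of the response.
    """
    values = []
    for element in ET.fromstring(response_xml).iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Value" and element.text:
            values.append(float(element.text))
    return values
```

A reporting frontend would send the request body with Content-Type text/xml and map the returned values onto the axes of the response to fill a table or chart.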

Data Warehouse in Yves & Zed
[Architecture diagram: sources (application databases, csv files, json files, apis) → data integration (SQL) → database; reporting sends XMLA / MDX requests to monsai, which uses the Mondrian schema (mapping) to send SQL to the database, receives the DB results and returns the Mondrian XMLA response (MDX results) to reporting]
‣ monsai = Mondrian XMLA Server + Saiku in a single war file, https://github.com/project-a/monsai

What kind of people do you need to hire for this?
‣ The “typical BI expert”:
• studied something related to business and learnt VBA programming through Excel
• relies on others to set up databases and tools
‣ Your ideal candidate
• has studied computer science
• masters the basic tools of software development and computer science
• likes to learn new technologies
• understands how databases work
‣ Good profile example:
http://www.project-a.com/en/careers/jobs/?yid=332

For our "A-Team" we are looking to fill the following position as soon as possible:
Data Engineer / Data Scientist (m/f)
Your tasks:
• You will help our business intelligence team to build data driven applications for our ventures: data warehouses, recommendation engines and CRM systems (developed in-house, based on open-source technologies)
• You will integrate, transform and index data from various data sources, develop meaningful data representations and visualisations, and provide aggregated data for third-party systems
• You will advance our software architecture and tool set to growing challenges and data amounts (performance, scaling, data quality)
• You will work in an agile software development process in close collaboration with a product management team
Your profile:
• You have a Master's degree in computer science or a comparable degree
• You have a genuine interest in data and algorithms and you are excited about solving difficult problems and strive for efficient and robust solutions
• You master at least these basic tools of computer science: object oriented programming in multiple languages, HTTP and current web technologies, the unix command line and basic server administration, version control systems, a basic understanding of the interplay between software and memory, hard discs and the CPU
• You have profound knowledge about the inner workings of database systems
• You are eager to delve into new technologies and programming languages (our current stack: Mac or Linux, PostgreSQL, Mondrian & MDX, PHP, Java, Python, Solr, ElasticSearch, R)
• You have a basic understanding of mathematics and machine learning
Your chance:
• You will join a highly professional and motivated team
• You will have the unique opportunity to witness the launch of a newly established company and you can contribute your own ideas to its development

Search for computer scientists, not business intelligence experts

Use a standard software engineering process!
‣ Product managers: what?
• Collection of business requirements
• KPI & report definitions
• QA & analysis
‣ Developers: how?
• Implementation, performance & stability
• Schema & process design
• Consistency checks
Any kind of Scrum / Kanban works, do it
[Diagram: KPI dependency graph with the nodes Avg net revenue per buying member, % Contribution margin 1, Net revenue, Net voucher cost, Avg net voucher cost per order, Contribution margin 3a, Tax shipping amount, Tax amount, Gross revenue, Avg gross item price, Gross price to gross retail price ratio, Price to retail price ratio, Avg gross order value, % Gross voucher cost, Gross invoiced amount, Net invoiced amount, Gross retail price, Net price to net purchase price ratio, Net price to net retail price ratio, % Net discount, Avg gross invoiced amount, HGB net revenue margin, Avg gross voucher cost per buying member, Net item revenue, Tax item amount, Net purchase cost, Net retail price, Retail tax amount, Gross voucher cost, Net shipping revenue, Gross shipping revenue]
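
The talk lists consistency checks as a developer responsibility without showing one. A typical example is checking that every foreign key in a fact table points to an existing dimension row. The sketch below assumes sqlite3 as a stand-in database; the function names and the deliberately broken demo data are made up for illustration.

```python
import sqlite3


def orphaned_keys(db, fact_table, fk_column, dim_table, pk_column):
    """Return foreign key values of the fact table that have no
    matching row in the dimension table.

    Table and column names are inserted into the query text, so they
    must come from trusted configuration, never from user input.
    """
    query = f"""
        SELECT DISTINCT f.{fk_column}
        FROM {fact_table} AS f
        LEFT JOIN {dim_table} AS d ON f.{fk_column} = d.{pk_column}
        WHERE f.{fk_column} IS NOT NULL AND d.{pk_column} IS NULL
        ORDER BY 1
    """
    return [row[0] for row in db.execute(query)]


def demo_db():
    """Small example with one order item pointing to a missing day."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE day (day_id INTEGER)")
    db.execute("CREATE TABLE order_item (item_id INTEGER, day_fk INTEGER)")
    db.executemany("INSERT INTO day VALUES (?)", [(930,), (1001,)])
    db.executemany("INSERT INTO order_item VALUES (?, ?)",
                   [(1, 930), (2, 1001), (3, 1002)])
    return db
```

Run as the last job of a load, a non-empty result can abort the process before inconsistent numbers reach a report.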

Thank you
Data integration is easy if you keep things simple!
http://www.project-a.com/
