3. The Problem
… but eventually you:
• Want granularity smaller than GA exposes
• Want analysis GA doesn’t support
• Want to combine and analyse data from different sources
4. Goal: answer 80% of questions stemming from data in 20 minutes or less
5. The analytics chasm
• 2 min: ideal. Almost real-time; can be done during brainstorming without disrupting the flow.
• 20 min: squeeze it in somewhere in the day.
• Project: added to the roadmap (and often fails).
6. Levelling up
1. Acquire data (directly, or from 3rd party APIs)
2. Store it in a data warehouse
3. Transform it to a usable and unified shape
4. Perform analytics on it
7. Intermezzo: My perspective
• Core developer at Metabase, an open source BI/analytics
tool. 3rd largest BI tool in the world. 20k+ companies use
us daily, including N26, Revolut, Swisscom
• Built analytics department at GoOpti from the ground up
• Helped 20+ companies become data-driven
8. Levelling up
1. Acquire data (directly, or from 3rd party APIs)
2. Store it in a data warehouse
3. Transform it to a usable and unified shape
4. Perform analytics on it
9. Collecting requirements
1. Make a list of all the data sources you currently have, how much data is in them (number of entities), and at what rate the data grows
2. Collect user stories from all potential users:
As a ______ I’d like to _________, because _________
3. Match each user story with the data sources it needs
4. Rank user stories using PIE (probability, impact, effort)
5. Rank data sources by summing the PIE scores of all user stories that require them
6. Build data infrastructure to enable the high-value cluster
7. Repeat steps 1-6 as you iterate
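Steps 4 and 5 above can be sketched in a few lines. The stories, scores, and data-source names below are made up for illustration, and the scoring function (probability × impact ÷ effort) is one plausible reading of PIE, not a canonical formula:

```python
# Rank user stories by PIE, then rank data sources by summing the
# PIE scores of the stories that need them. All values are illustrative.

stories = [
    # (story, probability, impact, effort, data sources needed)
    ("As a PM I'd like to see activation by cohort", 8, 9, 3, {"product_db", "events"}),
    ("As a marketer I'd like to see CAC per channel", 6, 7, 5, {"ad_apis", "events"}),
    ("As support I'd like to see errors per account", 5, 6, 2, {"product_db", "error_tracker"}),
]

def pie(probability, impact, effort):
    # Higher probability and impact raise the score; higher effort lowers it.
    return probability * impact / effort

ranked_stories = sorted(stories, key=lambda s: pie(s[1], s[2], s[3]), reverse=True)

source_scores = {}
for _story, p, i, e, sources in stories:
    for src in sources:
        source_scores[src] = source_scores.get(src, 0) + pie(p, i, e)

ranked_sources = sorted(source_scores.items(), key=lambda kv: kv[1], reverse=True)
```

The top of `ranked_sources` is the high-value cluster: the data sources worth building infrastructure for first.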
10. A minimal data-collection plan
• Event stream
• Goal: be able to reconstruct any given session from data
• Timestamp, session, action, payload, context/result
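One possible shape for a record in that event stream, following the fields the slide lists (timestamp, session, action, payload, context/result); the field names and example values are illustrative:

```python
# A minimal event record: enough to reconstruct any session later.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Event:
    timestamp: datetime          # when it happened (store in UTC)
    session_id: str              # lets you stitch events back into a session
    action: str                  # what the user/device did
    payload: dict[str, Any]      # action-specific details
    context: dict[str, Any] = field(default_factory=dict)  # device, page, result…

e = Event(
    timestamp=datetime.now(timezone.utc),
    session_id="sess-42",
    action="button_click",
    payload={"button": "signup"},
    context={"page": "/pricing", "result": "ok"},
)
```

If you can replay a session from nothing but a list of such records, the plan's goal is met.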
12. Extract-Load-Transform
• Dump data somewhere as soon as possible so you don’t lose it.
• Databases are fast and powerful enough to do most
transforms there. In return you get:
• Observability
• Analysts become more self-sufficient (if they know SQL)
• For small-to-medium data sizes (< 1M data points/day) this is more performant and requires much less infrastructure
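ELT in miniature: load the raw payloads untouched first (so nothing is lost), then transform inside the database with SQL. SQLite stands in for Postgres here, and the table and column names are illustrative; `json_extract` requires SQLite's JSON1 extension, which modern builds include by default:

```python
# Extract-Load first, Transform later, all inside the database.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: dump raw JSON payloads as-is.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
for raw in ['{"user": "a", "amount": 10}',
            '{"user": "a", "amount": 5}',
            '{"user": "b", "amount": 7}']:
    conn.execute("INSERT INTO raw_events VALUES (?)", (raw,))

# Transform: shape the raw data with SQL, where analysts can see it.
conn.execute("""
    CREATE VIEW revenue_per_user AS
    SELECT json_extract(payload, '$.user') AS user,
           SUM(json_extract(payload, '$.amount')) AS revenue
    FROM raw_events
    GROUP BY 1
""")
rows = dict(conn.execute("SELECT user, revenue FROM revenue_per_user").fetchall())
```

Because the transform is a view over the raw table, it is observable (you can inspect every step) and recoverable (re-deriving it never touches the raw data).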
13. Good ELT is:
• Repeatable
• Observable
• Extensible
• Scalable
• Recoverable (don’t lose data, ever!)
15. Identify the principal axes of your data
• User, account, transaction, instance, product, event (log)…
• There will (and should) be some overlap
• Different axes will have different granularities
• Some should be ordered in time
16. Data warehouse topology
• Big fat denormalised tables, one for each principal axis
• Use views to tailor the representation to your tools and
analysis needs
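A miniature version of that topology: one wide, denormalised table per principal axis, plus a view that tailors it for a specific tool or analysis. SQLite stands in for the warehouse, and the columns are illustrative:

```python
# One big fat table per principal axis, with views on top.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Everything about a user lives in one row: no joins at query time.
    CREATE TABLE users (
        user_id        TEXT PRIMARY KEY,
        signup_date    TEXT,
        plan           TEXT,
        country        TEXT,
        total_revenue  REAL,
        n_sessions     INTEGER
    );
    INSERT INTO users VALUES
        ('u1', '2020-01-05', 'pro',  'DE', 240.0, 31),
        ('u2', '2020-02-10', 'free', 'SI',   0.0,  2);

    -- A view tailors the representation to one analysis need.
    CREATE VIEW active_paying_users AS
    SELECT user_id, country, total_revenue
    FROM users
    WHERE plan != 'free' AND n_sessions > 10;
""")
rows = conn.execute("SELECT user_id FROM active_paying_users").fetchall()
```

The wide table carries some redundancy by design; the views keep that redundancy out of sight of the people querying it.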
17. Which DB?
• Optimize for ease of ad-hoc querying
• Should be decently performant (waiting kills productivity)
but is unlikely to be the bottleneck
• Simple to deploy, connect to, and use
• Strong data validation/schemas, but should also handle
non-structured data (validation on load = data loss)
• Sane handling of timezones, date-time arithmetic, and numbers
18. My go-to stack
• Snowplow for event-like data
• Apache Airflow to manage the workflow
• (managed) Postgres for data warehouse (or Druid if only event data and a
lot of it)
• dbt for data transforms
• Metabase for analytics
• Fully open-source
• Extensible, performant
36. You can often encode
dynamic processes as
binary outcomes
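A small illustration of this idea: a session is a dynamic process (a sequence of events), but for many analyses it can be collapsed into a single bit, "did it convert?". Session IDs and event names below are made up:

```python
# Encode a dynamic process (an event sequence) as a binary outcome.
sessions = {
    "s1": ["view", "add_to_cart", "checkout", "purchase"],
    "s2": ["view", "view", "add_to_cart"],
    "s3": ["view", "purchase"],
}

def converted(events):
    # Collapse the whole process into one bit: did a purchase happen?
    return int("purchase" in events)

outcomes = {sid: converted(ev) for sid, ev in sessions.items()}
conversion_rate = sum(outcomes.values()) / len(outcomes)
```

Once the outcome is binary, the whole toolbox for proportions (conversion rates, significance tests, segment comparisons) applies directly.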
37. Signal or noise?
• Trends & relative change often tell more than absolute values
• Percentiles
• Intra- vs. inter-segment variance
• Significance tests
• Sample representativeness (it’s not just for A/B tests)
• Distribution similarity
• Have a reference point (and reference it often)
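One concrete "signal or noise?" check from the list above: a two-proportion z-test comparing conversion rates between two segments. The counts are made up, and the hand-rolled formula is a sketch; for real work prefer a statistics library:

```python
# Two-proportion z-test: is the gap between two conversion rates signal?
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Segment A: 120 conversions out of 1000; segment B: 90 out of 1000.
z = two_proportion_z(120, 1000, 90, 1000)
# |z| > 1.96 corresponds to significance at the 5% level (two-sided).
significant = abs(z) > 1.96
```

The same shape of check works for any of the binary outcomes discussed earlier.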
39. MESI
• Medical devices
• North-star metric: number of measurements/device
• Current data sources: GA, product database, Countly, Sentry, HubSpot, Odoo
40. MESI data acquisition
• Collect event stream from devices capturing all the interactions [Snowplow]
• Mirror product database into data warehouse [Airflow]
• Collect event stream from the website [Snowplow]
• Integrate Hubspot and Odoo via API [Airflow]
• Integrate Sentry via API [Airflow]
• (Retire Countly)
• (Add support data — Jira, Zendesk, …)
• (Add accounting/billing)
41. MESI data warehouse
• (managed) Postgres
• Principal axes: account, user, device event, user-journey event, device
42. MESI analytics
• Metabase
• User journey before conversion
• Device usage patterns
• UX friction points
• Onboarding
• Errors & support issues
• Segmentation
44. SalesGenomics
• eCommerce marketing agency focused on scale-ups
• Typical customer marketing budget 10k-100k/month
• Current data sources: GA, FB, Shopify
• 2-sided reporting: for clients, internal
45. SalesGenomics data
acquisition
• Custom event collector on websites (replacing GA
snippet) [Snowplow]
• Integrate Shopify, AdWords, FB ads [Airflow]
— OR —
• Use Segment/Stitch Data
49. Starting from 0
• Set up GA (remember the minimal data-collection plan)
• Connect Metabase to your product DB
• Collect data user stories from day 1
• Focus analytics on user journey, segmentation, costs, & UX