SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
ClickHouse for
Experimentation
Gleb Kanterov
@kanterov
2018-07-03
170M Monthly Active Users
75M Subscribers
35M Tracks
65 Markets
[1] https://investors.spotify.com
Quick Facts
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
Organization
1 Company 10 Organizations 30+ Tribes 150+ Squads
Move fast, break things
Ask for forgiveness, not for permission
AUTONOMY
Hadoop@Spotify
● On-Premise
● 2,500 nodes
● 100 PB Disk
● 100 TB RAM
● 100B+ events per day
● 20K+ jobs per day
Hadoop@Spotify
● Migration from On-Premise to GCP
● Moved 100 PB of data
● Our Hadoop cluster is dead
Hadoop@Spotify
What are
experiments,
and why
ClickHouse?
Randomized
Controlled
Experiment
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
All Khan Academy content is available for free at www.khanacademy.org
Randomized
Controlled
Experiment
An experiment where all subjects
involved in the experiment are treated
the same except for one deviation.
One variable is changed in order to
isolate the results.
All Khan Academy content is available for free at www.khanacademy.org
A/B Testing
A/B Testing is a randomized controlled experiment where one variable is tested.
E.g., hypothesis Our new recommendation algorithm increases content consumption.
How to verify?
1. Formulate hypothesis
2. Run A/B test
3. See if there is a statistically significant increase in consumption.
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
1. Event Delivery
Developers instrument
applications and services
using SDK.
Events are collected and
published to Pub/Sub.
Batch jobs read data from
Pub/Sub, deduplicate and
anonymize, and then store in
hourly partitions on GCS.
Exposing users to
experiments, and configuring
A/B variations on clients is
done by dedicates services.
Product
Owners
Data
Scientists
Granular Data
BigQuery
1
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
2. Data Pipelines and
Storage
Data gets transformed and
aggregated using Dataflow
batch jobs, and stored in
Bigtable, GCS and BigQuery.
Bigtable contains
pre-computed aggregated
experiment results.
BigQuery has granular data
used in ad-hoc analysis.
Product
Owners
Data
Scientists
Granular Data
BigQuery
2
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
3. Presentation
Users of Experimentation
platform see their experiment
results in web application.
Statistical tests and health
checks are performed
automatically.
Metrics for Experimentation Platform v1
Product
Owners
Data
Scientists
Granular Data
BigQuery
3
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
4. Ad-hoc Analytics
Data scientists do ad-hoc
exploration in Jupyter
notebooks using BigQuery.
Here they answer experiment
specific-questions, not
automatically supported by
experimentation system.
Product
Owners
Data
Scientists
Granular Data
BigQuery
4
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
What works well
Centralized team owning
100-s of core metrics.
Automatic experiment
analysis and planning.
Allows to conclude
experiments without manual
analysis. Autonomous feature
teams can move fast and
iterate on their product.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Aggregated Data
Cloud Bigtable
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Problems
Not every metric worths
centralization.
Centralized team became a
bottleneck for Feature
features.
As a result, too much
repetitive work goes into
notebooks and ad-hoc
queries.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Product TeamsEvent Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Storage
Granular Data
OLAP Database
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Compute Engine
Metrics for Experimentation Platform v1
Reasons
1. Experimentation isn’t only
about hypothesis testing, but
learning from experiments.
Aggregated data in Bigtable
wasn’t granular enough, and
didn't have enough
dimensions.
2. Can’t add a new metric
without involving a central
team.
What we want
Provide teams more granular
data out of the box, and give
a way to define a new metric.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Requirements
● Serve 100-s of QPS with sub-second latency
● We know in advance what are queries and data
● Maintain 10x metrics with the same cost
● Thousands of metrics
● Billions of rows per day in each of 100-s of tables
● Ready to be used out of the box
● Leverage existing infrastructure as much as feasible
● Hide unnecessary complexity from internal users
What about BigQuery?
● Supports Standard SQL
● Don’t have to optimize datasets in advance
● Works great for heavy queries with joins among multiple datasets
● Doesn’t need operations and machines running
● Good for interactive ad-hoc queries (~ minutes)
● Isn’t best for a high amount of low-latency queries you are aware in advance
Why ClickHouse?
● Build proof of concept using various OLAP storages (ClickHouse, Druid, Pinot, ...)
● ClickHouse has the most simple architecture
● Powerful SQL dialect close to Standard SQL
● A comprehensive set of built-in functions and aggregators
● Was ready to be used out of the box
● Superset integration is great
● Easy to query using clickhouse-jdbc and jooq
Event Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Product TeamsStorage
Granular Data
ClickHouse
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Metrics for Experimentation Platform v2
5. ClickHouse
Interactive queries on
granular data.
Reduce demand in notebooks
and BigQuery with
dashboards and exploration
in Superset.
Product
Owners
Data
Scientists
Granular Data
BigQuery
Superset
5
Event Delivery
Cloud
Pub/Sub
Cloud
Storage
Cloud
Dataproc
Data Pipelines
Cloud
Dataflow
Product TeamsStorage
Granular Data
ClickHouse
Granular Data
Cloud Storage
Ad-hoc Analytics
Presentation
Web Application
Metrics for Experimentation Platform v2
6. Metrics Catalog
Centralized place for teams to
define their own metrics.
20 minutes to define a metric. Product
Owners
Data
Scientists
Granular Data
BigQuery
Superset
Metrics Catalog
Metrics API
Metric
definitions
6
What we have built
● Own DSL to define metrics, and centralized metrics catalog
● Expressive and simple model that we can efficiently scale to 1000-s of metrics
● Generalize existing components to work with Metrics DSL
○ data preparation and ingestion into ClickHouse
○ denormalization with conformed dimensions
○ create dashboards, tables and charts in Superset
○ do statistical tests, and expose results through API
○ define ownership, tiering, and other attributes
○ integrates with the rest of infrastructure for alerting, monitoring,
data quality, anomaly detection, access control & etc
● Users don’t work with ClickHouse SQL, or need to know how it works
● API to query metrics and metadata
Ingestion to ClickHouse
● Move data from GCS to ClickHouse
● Use clickhouse-jdbc, custom code and RowBinary format
● Use daily partitioning, and ingest once a day
● 1 hour to ingest 5 TiB on test cluster using 9 n1-standard-32 with 8 NVMe SSD RAID0
● Don’t use materialized views in ClickHouse
● Offload most of computations to batch data pipelines due to scalability, experience and
tooling
● TODO try ClickHouse-Native-JDBC
● TODO pre-sort in data pipelines before ingesting
What is next
● Do lambda-style ingestion for subset of metrics with low-latency requirements
● Add more aggregations to DSL (e.g. 5 statistical moments)
● Add custom chart types to Superset
● Try ClickHouse for similar use cases within Spotify
Using ClickHouse for Experimentation

Mais conteúdo relacionado

Mais procurados

Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Altinity Ltd
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEODangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEOAltinity Ltd
 
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...HostedbyConfluent
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseAltinity Ltd
 
ClickHouse Introduction, by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction, by Alexander Zaitsev, Altinity CTOClickHouse Introduction, by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction, by Alexander Zaitsev, Altinity CTOAltinity Ltd
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Better than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseBetter than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseAltinity Ltd
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayAltinity Ltd
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Mayank Shrivastava
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaAltinity Ltd
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouseVianney FOUCAULT
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesAltinity Ltd
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howAltinity Ltd
 

Mais procurados (20)

Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEODangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO
 
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries...
 
High Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouseHigh Performance, High Reliability Data Loading on ClickHouse
High Performance, High Reliability Data Loading on ClickHouse
 
ClickHouse Introduction, by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction, by Alexander Zaitsev, Altinity CTOClickHouse Introduction, by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction, by Alexander Zaitsev, Altinity CTO
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Better than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseBetter than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouse
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
 
Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020Apache Pinot Meetup Sept02, 2020
Apache Pinot Meetup Sept02, 2020
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek Vavrusa
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
 

Semelhante a Using ClickHouse for Experimentation

Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle Cloud Platform - Japan
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...Márton Kodok
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperMárton Kodok
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analyticsMariaDB plc
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use CasesTatvic Analytics
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query BasicsIdo Green
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptxHarissh16
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Romit Mehta
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarRasel Rana
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentationMichael Young
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)Stratebi
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 

Semelhante a Using ClickHouse for Experimentation (20)

Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
 
Google BigQuery for Everyday Developer
Google BigQuery for Everyday DeveloperGoogle BigQuery for Everyday Developer
Google BigQuery for Everyday Developer
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptx
 
Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018Gimel at Teradata Analytics Universe 2018
Gimel at Teradata Analytics Universe 2018
 
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery Webinar
 
OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024OpenMetadata Community Meeting - 4th April, 2024
OpenMetadata Community Meeting - 4th April, 2024
 
Group 3 slide presentation
Group 3 slide presentationGroup 3 slide presentation
Group 3 slide presentation
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Big query
Big queryBig query
Big query
 
Big Trends in Big Data
Big Trends in Big DataBig Trends in Big Data
Big Trends in Big Data
 

Último

%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...masabamasaba
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationShrmpro
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...Nitya salvi
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Último (20)

%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Using ClickHouse for Experimentation

  • 2. 170M Monthly Active Users 75M Subscribers 35M Tracks 65 Markets [1] https://investors.spotify.com Quick Facts
  • 3. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads
  • 4. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things
  • 5. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things Ask for forgiveness, not for permission
  • 6. Organization 1 Company 10 Organizations 30+ Tribes 150+ Squads Move fast, break things Ask for forgiveness, not for permission AUTONOMY
  • 8. ● On-Premise ● 2,500 nodes ● 100 PB Disk ● 100 TB RAM ● 100B+ events per day ● 20K+ jobs per day Hadoop@Spotify
  • 9. ● Migration from On-Premise to GCP ● Moved 100 PB of data ● Our Hadoop cluster is dead Hadoop@Spotify
  • 12. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 13. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 14. Randomized Controlled Experiment All Khan Academy content is available for free at www.khanacademy.org
  • 15. Randomized Controlled Experiment An experiment where all subjects involved in the experiment are treated the same except for one deviation. One variable is changed in order to isolate the results. All Khan Academy content is available for free at www.khanacademy.org
  • 16. A/B Testing A/B Testing is a randomized controlled experiment where one variable is tested. E.g., hypothesis Our new recommendation algorithm increases content consumption. How to verify? 1. Formulate hypothesis 2. Run A/B test 3. See if there is a statistically significant increase in consumption.
  • 17. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Product Owners Data Scientists Granular Data BigQuery
  • 18. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 1. Event Delivery Developers instrument applications and services using SDK. Events are collected and published to Pub/Sub. Batch jobs read data from Pub/Sub, deduplicate and anonymize, and then store in hourly partitions on GCS. Exposing users to experiments, and configuring A/B variations on clients is done by dedicates services. Product Owners Data Scientists Granular Data BigQuery 1
  • 19. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 2. Data Pipelines and Storage Data gets transformed and aggregated using Dataflow batch jobs, and stored in Bigtable, GCS and BigQuery. Bigtable contains pre-computed aggregated experiment results. BigQuery has granular data used in ad-hoc analysis. Product Owners Data Scientists Granular Data BigQuery 2
  • 20. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine 3. Presentation Users of Experimentation platform see their experiment results in web application. Statistical tests and health checks are performed automatically. Metrics for Experimentation Platform v1 Product Owners Data Scientists Granular Data BigQuery 3
  • 21. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 4. Ad-hoc Analytics Data scientists do ad-hoc exploration in Jupyter notebooks using BigQuery. Here they answer experiment specific-questions, not automatically supported by experimentation system. Product Owners Data Scientists Granular Data BigQuery 4
  • 22. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 What works well Centralized team owning 100-s of core metrics. Automatic experiment analysis and planning. Allows to conclude experiments without manual analysis. Autonomous feature teams can move fast and iterate on their product. Product Owners Data Scientists Granular Data BigQuery
  • 23. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Aggregated Data Cloud Bigtable Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Problems Not every metric worths centralization. Centralized team became a bottleneck for Feature features. As a result, too much repetitive work goes into notebooks and ad-hoc queries. Product Owners Data Scientists Granular Data BigQuery
  • 24. Product TeamsEvent Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Storage Granular Data OLAP Database Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Compute Engine Metrics for Experimentation Platform v1 Reasons 1. Experimentation isn’t only about hypothesis testing, but learning from experiments. Aggregated data in Bigtable wasn’t granular enough, and didn't have enough dimensions. 2. Can’t add a new metric without involving a central team. What we want Provide teams more granular data out of the box, and give a way to define a new metric. Product Owners Data Scientists Granular Data BigQuery
  • 25. Requirements ● Serve 100-s of QPS with sub-second latency ● We know in advance what are queries and data ● Maintain 10x metrics with the same cost ● Thousands of metrics ● Billions of rows per day in each of 100-s of tables ● Ready to be used out of the box ● Leverage existing infrastructure as much as feasible ● Hide unnecessary complexity from internal users
  • 26. What about BigQuery? ● Supports Standard SQL ● Don’t have to optimize datasets in advance ● Works great for heavy queries with joins among multiple datasets ● Doesn’t need operations and machines running ● Good for interactive ad-hoc queries (~ minutes) ● Isn’t best for a high amount of low-latency queries you are aware in advance
  • 27. Why ClickHouse? ● Build proof of concept using various OLAP storages (ClickHouse, Druid, Pinot, ...) ● ClickHouse has the most simple architecture ● Powerful SQL dialect close to Standard SQL ● A comprehensive set of built-in functions and aggregators ● Was ready to be used out of the box ● Superset integration is great ● Easy to query using clickhouse-jdbc and jooq
  • 28. Event Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Product TeamsStorage Granular Data ClickHouse Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Metrics for Experimentation Platform v2 5. ClickHouse Interactive queries on granular data. Reduce demand in notebooks and BigQuery with dashboards and exploration in Superset. Product Owners Data Scientists Granular Data BigQuery Superset 5
  • 29. Event Delivery Cloud Pub/Sub Cloud Storage Cloud Dataproc Data Pipelines Cloud Dataflow Product TeamsStorage Granular Data ClickHouse Granular Data Cloud Storage Ad-hoc Analytics Presentation Web Application Metrics for Experimentation Platform v2 6. Metrics Catalog Centralized place for teams to define their own metrics. 20 minutes to define a metric. Product Owners Data Scientists Granular Data BigQuery Superset Metrics Catalog Metrics API Metric definitions 6
  • 30. What we have built ● Own DSL to define metrics, and centralized metrics catalog ● Expressive and simple model that we can efficiently scale to 1000-s of metrics ● Generalize existing components to work with Metrics DSL ○ data preparation and ingestion into ClickHouse ○ denormalization with conformed dimensions ○ create dashboards, tables and charts in Superset ○ do statistical tests, and expose results through API ○ define ownership, tiering, and other attributes ○ integrates with the rest of infrastructure for alerting, monitoring, data quality, anomaly detection, access control & etc ● Users don’t work with ClickHouse SQL, or need to know how it works ● API to query metrics and metadata
  • 31. Ingestion to ClickHouse ● Move data from GCS to ClickHouse ● Use clickhouse-jdbc, custom code and RowBinary format ● Use daily partitioning, and ingest once a day ● 1 hour to ingest 5 TiB on test cluster using 9 n1-standard-32 with 8 NVMe SSD RAID0 ● Don’t use materialized views in ClickHouse ● Offload most of computations to batch data pipelines due to scalability, experience and tooling ● TODO try ClickHouse-Native-JDBC ● TODO pre-sort in data pipelines before ingesting
  • 32. What is next ● Do lambda-style ingestion for subset of metrics with low-latency requirements ● Add more aggregations to DSL (e.g. 5 statistical moments) ● Add custom chart types to Superset ● Try ClickHouse for similar use cases within Spotify