How we evolved data pipeline at Celtra
and what we learned along the way
whoami
Grega Kespret
Director of Engineering, Analytics @ Celtra
San Francisco
@gregakespret
github.com/gregakespret
slideshare.net/gregak
Data Pipeline
Big Data Problems vs. Big Data Problems
This is not going to be a talk about Data Science
Creative Management Platform
Celtra in Numbers
700,000 Ads Built
70,000 Campaigns
5000 Brands
6G Analytics Events / Day
1TB Compressed Data / Day
300 Metrics
200 Dimensions
Growth
The analytics platform at Celtra has experienced tremendous growth over
the past few years in terms of size, complexity, number of users, and
variety of use cases.
[Growth charts: Business, Impressions, Data Volume, Data Complexity]
• INSERT INTO impressions in track.php => 1 row per impression
• SELECT creativeId, placementId, sdk, platform, COUNT(*) sessions
FROM impressions
WHERE campaignId = ...
GROUP BY creativeId, placementId, sdk, platform
• It was obvious this wouldn't scale, pretty much an anti-pattern
MySQL for everything
[Diagram: Trackers → Raw impressions → Read replica → REST API → Client-facing dashboards]
GET /api/analytics?
metrics=sessions,creativeLoads,sessionsWithInteraction
&dimensions=campaignId,campaignName
&filters.accountId=d4950f2c
&filters.utcDate.gte=2018-01-01
&filters.utcDate.lt=2018-04-01
&sort=-sessions&limit=25&format=json
REST API
[Diagram: Trackers → Raw impressions → Read replica → REST API → Client-facing dashboards]
BigQuery (private beta)
[Diagram: Trackers → Raw impressions → REST API → Client-facing dashboards]
Hive + Events + Cube aggregates
[Diagram: Trackers → Events (S3) → Hive ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards]
• Separated write and read paths
• Write events to S3, read and process with Hive, store to MySQL,
asynchronously
• Tried & tested, conservative solution
• Conceptually almost 1:1 to BigQuery (almost the same SQL)
• Immutable, append-only set of raw data
• Store on S3
s3://celtra-mab/events/2017-09-22/13-15/3fdc5604/part0000.events.gz
• Point in time facts about what happened
• Bread and butter of our analytics data
• JSON, one event per line
- server & client timestamp
- crc32 to detect invalid content
- index to detect missing events
- instantiation to deduplicate events
- accountId to partition the data
- name, implicitly defines the schema (very sparse)
Event data
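Put together, a single event line with the fields above might look roughly like this (illustrative only: the event name, the values, and the exact key spellings are assumptions; only the field list comes from the slide):

{"name": "adLoaded", "accountId": "d4950f2c", "index": 3, "instantiation": "c81e1a02", "serverTime": 1506085500, "clientTime": 1506085499, "crc32": "89a3f2d1"}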
• Replaced Hive with Spark (the new kid on the block)
• Spark 0.5 in production
• sessionization
Spark + Events + Cube aggregates
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards; Services, exploratory data analysis, reports, and adhoc queries also consume the data]
== Combining discrete events into sessions
• Complex relationships between events
• Patterns more interesting than sums and counts of events
• Easier to troubleshoot/debug with context
• Able to check for/enforce causality (if X happened, Y must also have happened)
• De-duplication possible (no skewed rates because of outliers)
• Later events reveal information about earlier arriving events (e.g. session duration, attribution, etc.)
Sessionization
[Diagram: Events → Sessions]
• Expressive computation layer: Provides nice functional abstractions over distributed collections
• Get full expressive power of Scala for ETL
• Complex ETL: de-duplicate, sessionize, clean, validate, emit facts
• Shuffle needed for sessionization (see the sketch below)
• Seamless integration with S3
• Speed of innovation
Spark
(we don't really use groupByKey any more)
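A minimal sketch of the sessionization shuffle, written against today's Dataset API rather than the Spark 0.5 the deck started with; field names, paths, and the Session shape are assumptions, and the real Analyzer also de-duplicates, validates, and emits facts:

import org.apache.spark.sql.SparkSession

case class Event(sessionId: String, name: String, clientTime: Long)
case class Session(sessionId: String, events: Seq[Event])

object SessionizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("sessionize-sketch").getOrCreate()
    import spark.implicits._

    // Read one day of JSON events (one event per line); the path is illustrative
    val events = spark.read.json("s3://bucket/events/2017-09-22/").as[Event]

    // The shuffle: bring all events of one session together,
    // order them by client time, and emit one Session per key
    val sessions = events
      .groupByKey(_.sessionId)
      .mapGroups((id, evs) => Session(id, evs.toSeq.sortBy(_.clientTime)))

    sessions.write.parquet("s3://bucket/sessions/2017-09-22/")
  }
}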
Simplified Data Model
[Data model diagram: Session (sdk, time, supplier, ...) → Unit view (unit, name, ...) → Page view (name, time on page, ...) → Interaction; related entities: Account, Campaign, Creative, Placement]
1. Denormalize
✓ Speed
(aggregations without joins)
✗ Expensive storage
2. Normalize
✗ Speed (joins)
✓ Cheap storage
Denormalize vs. normalize
• Idea: store sessions (unaggregated data after sessionization) instead of cubes
• Columnar MPP database
• Normalized logical schema, denormalized physical schema (projections)
• Use pre-join projections to move the join step into the load
Vertica
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions (Vertica, 8 VPC nodes) → REST API → Client-facing dashboards]
Hash Joins vs. Merge Joins
• Always use MERGE JOINS (as opposed to HASH JOINS)
• Great performance and fast POC => start with implementation
• Pre-production load testing => a lot of problems:
- Inserts do not scale (VER-25968)
- Merge-join is single-threaded (VER-27781)
- Non-transitivity of pre-join projections
- Remove column problem
- Cannot rebuild data
- ...
• Decision to abandon the project
Vertica
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions → REST API → Client-facing dashboards]
• Complex setup and configuration required
• Analyses not reproducible and repeatable
• No collaboration
• Moving data between different stages in troubleshooting/analysis lifecycle (e.g. Scala for aggregations, R for viz)
• Heterogeneity of the various components (Spark in production, something else for exploratory data analysis)
• Analytics team (3 people) bottleneck
Difficult to Analyze the Data Collected
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards; Services, exploratory data analysis, reports, and adhoc queries]
• Needed flexibility, not provided by precomputed aggregates (unique counting, order statistics, outliers, etc.)
• Needed answers to questions that existing data model did not support
• Visualizations
• For example:
• Analyzing effects of placement position on engagement rates
• Troubleshooting 95th percentile of ad loading time performance
Difficult to Analyze the Data Collected
[Diagram: same pipeline as above]
Difficult to Analyze the Data Collected
[Diagram: same pipeline, with exploratory data analysis now done in Databricks]
✗ Complex ETL repeated in adhoc queries (slow, error-prone)
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Keep cubes small (no geo, hourly, external dimensions)
✗ Recompute cube from events to add new breakdowns to existing metric (slow, not exactly deterministic)
Some of the problems
[Diagram: same pipeline]
Idea: Split ETL, Materialize Sessions
Part 1 (complex): deduplication, sessionization, cleaning, validation, external dependencies
Part 2 (simple): aggregating across different dimensions
[Diagram: Trackers → Raw Events (S3) → ETL (part 1) → Raw Sessions → ETL (part 2) → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards; exploratory data analysis (Databricks), reports, and adhoc queries]
Data Warehouse to store sessions
Requirements
• Fully managed service
• Columnar storage format
• Support for complex nested structures
• Schema evolution possible
• Data rewrites possible
• Scale compute resources separately from storage
Nice-to-Haves
• Transactions
• Partitioning
• Skipping
• Access control
• Appropriate for OLAP use case
Final Contenders for New Data Warehouse
• Too many small files problem => file stitching
• No consistency guarantees over a set of files on S3 => secondary index | convention
• Preferred one integrated layer vs. separate Query layer (Spark), Metadata layer (HCatalog), Storage format layer (Parquet), Data layer (S3)
Spark + HCatalog + Parquet + S3
We really wanted a database-like abstraction with
transactions, not a file format!
Operational tasks with Vertica:
• Replace failed node
• Refresh projection
• Restart database with one node down
• Remove dead node from DNS
• Ensure enough (at least 2x) disk space
available for rewrites
• Backup data
• Archive data
Why We Wanted a Managed Service
We did not want to
deal with these tasks
1. Denormalize
✓ Speed
(aggregations without joins)
✗ Expensive storage
2. Normalize
✗ Speed (joins)
✓ Cheap storage
3. Normalized logical schema, denormalized physical schema (Vertica use case)
✓ Speed (move the join step into the load)
✓ Cheap storage
4. Nested objects: pre-group the data on each grain
✓ Speed (a "join" between parent and child is essentially free)
✓ Cheap storage
Support for complex nested structures
{
"id": "s1523120401x263215af605420x35562961",
"creativeId": "f21d6f4f",
"actualDeviceType": "Phone",
"loaded": true,
"rendered": true,
"platform": "IOS",
...
"unitShows": [
{
"unit": { ... },
"unitVariantShows": [
{
"unitVariant": { ... },
"screenShows": [
{
"screen": { ... },
},
{
"screen": {... },
}
],
"hasInteraction": false
}
],
"screenDepth": "2",
"hasInteraction": false
}
]
}
Pre-group the data on each grain
Flat vs. Nested Queries
Task: find the top 10 pages on creative units with the most interactions on average.
Flat + normalized version (uses joins; requires a unique ID at every grain):
SELECT
creativeId,
us.name unitName,
ss.name pageName,
AVG(COUNT(*)) avgInteractions
FROM
sessions s
JOIN unitShows us ON us.id = s.id
JOIN screenShows ss ON ss.usid = us.id
JOIN interactions i ON i.ssid = ss.id
GROUP BY 1, 2, 3
ORDER BY avgInteractions DESC LIMIT 10
Nested version (the distributed join becomes a local join between parent and child):
SELECT
creativeId,
unitShows.value:name unitName,
screenShows.value:name pageName,
AVG(ARRAY_SIZE(screenShows.value:interactions)) avgInteractions
FROM
sessions,
LATERAL FLATTEN(json:unitShows) unitShows,
LATERAL FLATTEN(unitShows.value:screenShows) screenShows
GROUP BY 1, 2, 3
ORDER BY avgInteractions DESC LIMIT 10
Sessions: Path to production
[Diagram: Trackers → Raw Events (S3) → Analyzer (ETL) → Raw Sessions (S3) and Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards; Services, reports, adhoc queries]
1. Backfilling
- Materialized historical sessions of the last 2 years and inserted them into Snowflake
- Verified counts for each day versus cube aggregates, investigated discrepancies & fixed them
2. "Soft deploy", period of mirrored writes (see the sketch below)
- Add SnowflakeSink to Analyzer (./bin/analyzer --sinks mysql,snowflake)
- Sink to Cubes and Snowflake in one run (sequentially)
- Investigate potential problems of the Snowflake sink in production, compare the data, etc.
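A minimal sketch of the mirrored-writes wiring; only SnowflakeSink and the --sinks flag appear in the deck, so the trait, the other names, and the output type are assumptions:

trait Sink {
  def write(output: AnalyzerOutput): Unit
}

case class AnalyzerOutput() // stand-in for the real analyzer output (sessions, cube rows, ...)

class MysqlSink extends Sink {
  def write(output: AnalyzerOutput): Unit = { /* cube aggregates -> MySQL */ }
}

class SnowflakeSink extends Sink {
  def write(output: AnalyzerOutput): Unit = { /* raw sessions -> Snowflake */ }
}

object AnalyzerSketch {
  def main(args: Array[String]): Unit = {
    // ./bin/analyzer --sinks mysql,snowflake
    val names = args.dropWhile(_ != "--sinks").drop(1).headOption.getOrElse("mysql")
    val sinks: Seq[Sink] = names.split(",").toSeq.map {
      case "mysql"     => new MysqlSink
      case "snowflake" => new SnowflakeSink
      case other       => sys.error(s"unknown sink: $other")
    }
    val output = AnalyzerOutput()   // produced by the sessionization ETL
    sinks.foreach(_.write(output))  // sequential writes in one run
  }
}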
Sessions in production
[Diagram: as above, now with a separate Cube filler reading Raw Sessions and writing Cube Aggregates]
3. Split Analyzer into Event analyzer and Cube filler
- Add Cubefiller
- Switch pipeline to Analyzer -> Sessions, Cubefiller -> Cubes
- Refactor and clean up analyzer
Sessions in production
[Diagram: same pipeline: Analyzer → Raw Sessions, Cube filler → Cube Aggregates → Cubes (read replica) → REST API]
✓ Data processed once & consumed many times (from sessions)
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Keep cubes small (no geo, hourly, external dimensions)
✗ Recompute cube from events to add new breakdowns to existing metric (slow, not exactly deterministic)
Move Cubes to Snowflake
[Diagram: same pipeline]
1. Backfilling & Testing
- Export Cube aggregates data from MySQL and import into Snowflake
- Test performance of single queries and parallel queries for different use cases: Dashboard, Reporting BI through Report generator, ...
2. "Soft deploy", period of mirrored writes
- Add SnowflakeSink to Cubefiller (./bin/cube-filler --sinks
mysql,snowflake)
- Sink to MySQL Cubes and Snowflake Cubes in one run
- Investigate potential problems of snowflake sink in production,
compare the data, etc.
Move Cubes to Snowflake
[Diagram: same pipeline]
3. REST API can read from Snowflake Cube aggregates (SnowflakeAnalyzer)
4. Validating SnowflakeAnalyzer results on 1% of requests:

/**
 * Proxies requests to a main analyzer and a percent of requests to the
 * secondary analyzer and compares responses. When responses do not match,
 * it logs the context needed for troubleshooting.
 */
class ExperimentAnalyzer extends Analyzer

5. Gradually switch REST API to get data from Snowflake instead of MySQL
- Start with 10% of requests, increase to 100%
- This allows us to see how Snowflake performs under load
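A sketch of how such a proxy might be wired up; apart from the ExperimentAnalyzer and Analyzer names and the compare-and-log behavior quoted above, the types and parameters are assumptions:

import scala.util.Random

trait Analyzer {
  def analyze(request: String): String // hypothetical request/response types
}

// Proxy all traffic to the main analyzer; mirror a percentage of requests
// to the secondary analyzer and log the context when responses differ
class ExperimentAnalyzer(
    main: Analyzer,
    secondary: Analyzer,
    mirrorPercent: Double // e.g. 1.0 to mirror 1% of requests
) extends Analyzer {
  def analyze(request: String): String = {
    val response = main.analyze(request)
    if (Random.nextDouble() * 100 < mirrorPercent) {
      val shadow = secondary.analyze(request)
      if (shadow != response)
        println(s"[experiment] mismatch request=$request main=$response secondary=$shadow")
    }
    response
  }
}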
Check matches and response times
Cube aggregates in Snowflake
[Diagram: same pipeline, with Cube Aggregates served from Snowflake (MySQL read replica gone)]
✓ Data processed once & consumed many times (from sessions)
✓ Fast schema changes in cubes (e.g. adding / removing metric)
✓ Can support geographical, hourly, external dimensions
✗ Recompute cube from sessions to add new breakdowns to existing metric (slow, but deterministic)
Compute cube aggregates with SQL
[Diagram: same pipeline]
Part 2 (simple): aggregating across different dimensions
• Move computation to data, not data to computation
• Instead of exporting sessions to a Spark cluster only to compute aggregates, do it in the database with SELECT … FROM sessions GROUP BY ...
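A sketch of "move computation to data": run the aggregation inside Snowflake over JDBC instead of exporting sessions to Spark. The connection URL, table, and column names are assumptions; only the SELECT … FROM sessions GROUP BY … idea comes from the slide:

import java.sql.DriverManager

object SqlCubeFillerSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative Snowflake JDBC URL; credentials and driver setup omitted
    val conn = DriverManager.getConnection(
      "jdbc:snowflake://account.snowflakecomputing.com/?db=analytics")
    val stmt = conn.createStatement()
    // The whole "ETL part 2" becomes one GROUP BY executed where the data lives
    stmt.execute(
      """INSERT INTO cubeAggregates (utcDate, creativeId, sessions)
        |SELECT utcDate, creativeId, COUNT(*)
        |FROM sessions
        |GROUP BY 1, 2""".stripMargin)
    conn.close()
  }
}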
Compute cube aggregates with SQL
[Diagram: same pipeline]
1. Period of double writes: ./bin/cube-filler --logic sql,cubefiller --sink snowflakecubes
2. Compare differences, fix problems
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer (ETL) → Raw Sessions → Cube Aggregates (computed in-database) → REST API → Client-facing dashboards; Services, reports]
3. Remove Spark completely from cubefiller
• REST API can read directly from Raw sessions OR from Cube Aggregates, generating the same response
• Session schema: known, well defined (by the Session Scala model), and enforced
• Latest Session model: the authoritative source for the sessions schema
• Historical sessions conform to the latest Session model: we can de-serialize any historical session
• Readers should ignore fields not in the Session model: we do not guarantee to preserve this data
• Computing metrics and dimensions from the Session model is time-invariant: computed 1 year ago or today, the numbers must be the same
How we handle schema evolution
Schema Evolution
Change in Session model | Top level / scalar column | Nested / VARIANT column
Rename field | ALTER TABLE tbl RENAME COLUMN col1 TO col2; | data rewrite (!)
Remove field | ALTER TABLE tbl DROP COLUMN col; | batch together in next rewrite
Add field, no historical values | ALTER TABLE tbl ADD COLUMN col type; | no change necessary
Also considered views for VARIANT schema evolution: complex scenarios require a Javascript UDF => lose the benefits of columnar access. Not good for practical use.
• They are sometimes necessary
• We have the ability to do data rewrites
• Rewrite must maintain sort order for fast access (UPDATE breaks it!)
• Did rewrite of 150TB of (compressed) data in December
• Complex and time consuming, so we fully automate them
• Costly, so we batch multiple changes together
• Javascript UDFs are our default approach for rewrites of data in VARIANT
Data Rewrites
• Expressive power of Javascript (vs. SQL)
• Run on the whole VARIANT record
• (Almost) constant performance
• More readable and understandable
• For changing a single field, OBJECT_INSERT / OBJECT_DELETE are preferred
Inline Rewrites with Javascript UDFs
CREATE OR REPLACE FUNCTION transform("json" variant)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS '
// modify json
return json;
';
SELECT transform(json) FROM sessions;
• There are many ways to develop a data pipeline, none of them are perfect
• Make sure that load/scale testing is a required part of your go-to-production plan
• Rollups make things cheaper now, but at great expense later
• Good abstractions survive a long time (e.g. Analytics API)
• Evolve pipeline modularly
• Maintain data consistency before/after each migration
Stuff you have seen today...
Thank You.
P.S. We're hiring
Editor's Notes
1. BigQuery: got into the private beta through connections; almost no documentation, something about SQL. Migration took a couple of days (a big factor). Amazing response times, also for bigger campaigns. In time, response times got slower and slower; nobody knew why. Talked to Google; nobody mentioned they were doing full-table scans. At some point all campaign queries took 30s, a little later >60s => browser timeout and total breakage. The combination of popularity and retries (because of >60s) caused us to go over query quotas on some days, which was an even bigger disaster.
2. Hive: separated write and read paths; write events to S3, read and process with Hive, store to MySQL, asynchronously; a tried & tested, conservative solution; conceptually almost 1:1 to BigQuery (almost the same SQL).
3. Data warehouse evaluation criteria:
- Managed service
- Schema evolution: how fast are schema changes (adding a column, removing a column, adding a column with a default value, computing a new column from other data, changing a field type)? Do we need a data rewrite to change the schema? If so, how long does it take to migrate all data to the new schema? Estimated cost of a data rewrite?
- Storage and nested objects: support for complex nested data structures; columnar storage (are nested objects "columnarized"?); data storage format: is it open or closed?
- Workload management: scaling compute resources separately from storage