Charles Allen covers data processing, analytics, and insights systems at Snap. Strengths of Druid for these use cases are called out, as are differences among some of the processing systems used.
This is the slide collection from the second talk from:
https://www.meetup.com/druidio-la/events/254080924/
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
1. Analytics at Snap
Big Data processing, slicing, and dicing
Charles Allen
charles.allen@snap.com
https://www.linkedin.com/in/charles-allen-255bab2a/
Snapchat 2018
14. Getting insights into data
Natural pipeline development:
● Need: lack of data causes pain
● Source: find the data signal and a data processing SME
● Develop: work with the development team on the pipeline
● Deploy: to production!
● Maintain: fire and forget, or keep it live?
15. Common data consumption formats
● Scripting: high level of expertise; extremely dynamic; usually either one-off for a specific human, or scripted for machine consumption
● Reports: small qty of KPIs; big tables or worksheets; "Executive" summarization
● Dashboards: multiple KPIs; curated by an expert; some flexibility; often operational in nature or usage
18. Key architecture components for data flow control
● Kafka: stream buffer
● Pubsub: stream buffer
● Airflow: batch processing orchestration
● Storage: bundle storage
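Kafka and Pubsub fill the same architectural role here: a durable buffer that decouples event producers from consumers. A minimal pure-Python sketch of that decoupling, with the stdlib `queue` standing in for a real broker (names are illustrative, not Snap's actual code):

```python
import queue
import threading

# A stdlib queue stands in for the stream buffer (a Kafka or Pubsub topic).
buffer = queue.Queue(maxsize=1000)
SENTINEL = object()  # signals end-of-stream to the consumer

def producer():
    # Producers write events at their own pace; they never talk to
    # consumers directly, only to the buffer.
    for i in range(5):
        buffer.put({"event_id": i, "payload": f"event-{i}"})
    buffer.put(SENTINEL)

consumed = []

def consumer():
    # Consumers drain the buffer independently; if they fall behind,
    # events accumulate in the buffer instead of being dropped.
    while True:
        msg = buffer.get()
        if msg is SENTINEL:
            break
        consumed.append(msg["event_id"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # -> [0, 1, 2, 3, 4]
```

Real brokers add persistence, partitioning, and replay on top of this shape, but the decoupling is the point of the "stream buffer" box in the diagram.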
19. Key architecture components for business logic
● Dataflow: stream and batch processing
● Beam: pipeline business logic
● Python: popular language
● Java: popular language
● Spark: stream and batch processing
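Beam's pitch (with Dataflow or Spark as the runner) is that one piece of pipeline business logic serves both stream and batch. A pure-Python sketch of that idea, not actual Beam code; the field names and filtering rule are illustrative assumptions:

```python
from typing import Iterable, Iterator

def business_logic(events: Iterable[dict]) -> Iterator[dict]:
    """Pipeline business logic written once: drop events without a
    user and tag each survivor with a derived field."""
    for e in events:
        if e.get("user_id") is not None:
            yield {**e, "is_mobile": e.get("platform") in ("ios", "android")}

# Batch mode: the input is a bounded collection (e.g. a day's bundle).
batch_input = [
    {"user_id": 1, "platform": "ios"},
    {"user_id": None, "platform": "web"},  # dropped: no user
    {"user_id": 2, "platform": "web"},
]
batch_output = list(business_logic(batch_input))

# Stream mode: the same function consumes an unbounded source
# (modeled here as a generator) one event at a time.
def stream_source():
    yield {"user_id": 3, "platform": "android"}
    yield {"user_id": 4, "platform": "web"}

stream_output = list(business_logic(stream_source()))

print(len(batch_output), len(stream_output))  # -> 2 2
```

In Beam the same separation holds: the business logic lives in transforms, and the runner decides whether the input is bounded or unbounded.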
20. Key architecture components for data consumption
● BigQuery: bulk data warehousing
● Druid: exploratory data storage
● Superset: Druid-centric dashboarding
● Looker: general dashboarding
21. Core event log workflows
GDPR, SOX
● Bundle lands in GCS
● Airflow churns data between BigQuery and GCS
● Over 20k DAG runs a week
● Lots of access control
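The Airflow workflows above are DAGs: each step runs only after its upstream steps finish. A stdlib sketch of that ordering for a hypothetical bundle workflow (the task names are illustrative, not Snap's actual DAGs), using `graphlib` in place of a scheduler:

```python
from graphlib import TopologicalSorter

# Hypothetical event-log DAG: each task maps to the set of tasks
# that must complete before it may run.
dag = {
    "load_to_bigquery": {"bundle_lands_in_gcs"},
    "transform_in_bigquery": {"load_to_bigquery"},
    "export_to_gcs": {"transform_in_bigquery"},
    "update_access_controls": {"bundle_lands_in_gcs"},
}

# static_order() yields every task after all of its upstream
# dependencies -- the ordering an orchestrator like Airflow
# enforces for each of those 20k+ weekly DAG runs.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Airflow adds scheduling, retries, and backfills on top, but the dependency contract is the same.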
22. Druid vs BigQuery
Internal use cases for Druid vs BigQuery
Druid:
● Multi-cloud compatible
● Higher friction data load
● Lower friction data maintenance
● Gets more affordable with more usage
● You will track who has the most data
● Very fast
● Slice and dice
BigQuery:
● Fully managed and hosted, GCP-only
● Low friction data load
● High friction data maintenance
● Price punishment for using too much
● You will track who is causing cost spikes
● Often slow, but faster than Hadoop
● Joins
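"Slice and dice" here means aggregating events over arbitrary combinations of dimensions, the access pattern Druid's rollup model is built to serve quickly. A pure-Python sketch of that pattern on toy data (not Druid's API; the dimension and metric names are made up):

```python
from collections import defaultdict

# Toy event rows: two dimensions (country, platform) and one metric (views).
events = [
    {"country": "US", "platform": "ios",     "views": 10},
    {"country": "US", "platform": "android", "views": 7},
    {"country": "DE", "platform": "ios",     "views": 3},
    {"country": "US", "platform": "ios",     "views": 5},
]

def rollup(rows, dims):
    """Sum the `views` metric over the chosen dimensions -- the
    group-by a Druid query performs at scale over segments."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["views"]
    return dict(totals)

print(rollup(events, ["country"]))
# -> {('US',): 22, ('DE',): 3}
print(rollup(events, ["country", "platform"]))
# -> {('US', 'ios'): 15, ('US', 'android'): 7, ('DE', 'ios'): 3}
```

BigQuery answers the same queries via SQL over column scans, which is why it wins on joins and flexibility while Druid wins on latency for this repeated group-by shape.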