Charles Allen covers data processing, analytics, and insights systems at Snap. Strengths of Druid for these use cases are called out, as are differences among some of the processing systems used.
This is the slide collection from the second talk from:
https://www.meetup.com/druidio-la/events/254080924/
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
1. Analytics at Snap
Big Data processing, slicing, and dicing
Charles Allen
charles.allen@snap.com
https://www.linkedin.com/in/charles-allen-255bab2a/
Snapchat 2018
14. Getting insights into data
Natural pipeline development:
● Need: lack of data causes pain
● Source: find the data signal and a data processing SME
● Develop: work with the development team on the pipeline
● Deploy: to production!
● Maintain: fire and forget, or keep it live?
15. Common data consumption formats
● Scripting: high level of expertise; extremely dynamic; usually either one-off for a specific human, or scripted for machine consumption
● Reports: small qty of KPIs; big tables or worksheets; "Executive" summarization
● Dashboards: multiple KPIs; curated by an expert; some flexibility; often operational in nature or usage
18. Key architecture components for data flow control
● Kafka: stream buffer
● Pubsub: stream buffer
● Airflow: batch processing orchestration
● Storage: bundle storage
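Kafka and Pubsub fill the same architectural role here: a durable buffer that decouples event producers from consumers. A minimal pure-Python sketch of that decoupling, with the stdlib `queue` standing in for a real broker (names are illustrative, not Snap's actual code):

```python
import queue
import threading

# A stdlib queue stands in for the stream buffer (a Kafka or Pubsub topic).
buffer = queue.Queue(maxsize=1000)
SENTINEL = object()  # signals end-of-stream to the consumer

def producer():
    # Producers write events at their own pace; they never talk to
    # consumers directly, only to the buffer.
    for i in range(5):
        buffer.put({"event_id": i, "payload": f"event-{i}"})
    buffer.put(SENTINEL)

consumed = []

def consumer():
    # Consumers drain the buffer independently; if they fall behind,
    # events accumulate in the buffer instead of being dropped.
    while True:
        msg = buffer.get()
        if msg is SENTINEL:
            break
        consumed.append(msg["event_id"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # -> [0, 1, 2, 3, 4]
```

Real brokers add persistence, partitioning, and replay on top of this shape, but the decoupling is the point of the "stream buffer" box in the diagram.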
19. Key architecture components for business logic
● Dataflow: stream and batch processing
● Beam: pipeline business logic
● Python: popular language
● Java: popular language
● Spark: stream and batch processing
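Beam's pitch (with Dataflow or Spark as the runner) is that one piece of pipeline business logic serves both stream and batch. A pure-Python sketch of that idea, not actual Beam code; the field names and filtering rule are illustrative assumptions:

```python
from typing import Iterable, Iterator

def business_logic(events: Iterable[dict]) -> Iterator[dict]:
    """Pipeline business logic written once: drop events without a
    user and tag each survivor with a derived field."""
    for e in events:
        if e.get("user_id") is not None:
            yield {**e, "is_mobile": e.get("platform") in ("ios", "android")}

# Batch mode: the input is a bounded collection (e.g. a day's bundle).
batch_input = [
    {"user_id": 1, "platform": "ios"},
    {"user_id": None, "platform": "web"},  # dropped: no user
    {"user_id": 2, "platform": "web"},
]
batch_output = list(business_logic(batch_input))

# Stream mode: the same function consumes an unbounded source
# (modeled here as a generator) one event at a time.
def stream_source():
    yield {"user_id": 3, "platform": "android"}
    yield {"user_id": 4, "platform": "web"}

stream_output = list(business_logic(stream_source()))

print(len(batch_output), len(stream_output))  # -> 2 2
```

In Beam the same separation holds: the business logic lives in transforms, and the runner decides whether the input is bounded or unbounded.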
20. Key architecture components for data consumption
● BigQuery: bulk data warehousing
● Druid: exploratory data storage
● Superset: Druid-centric dashboarding
● Looker: general dashboarding
21. Core event log workflows
GDPR, SOX
● Bundle lands in GCS
● Airflow churns data between BigQuery and GCS
● Over 20k DAG runs a week
● Lots of access control
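The Airflow workflows above are DAGs: each step runs only after its upstream steps finish. A stdlib sketch of that ordering for a hypothetical bundle workflow (the task names are illustrative, not Snap's actual DAGs), using `graphlib` in place of a scheduler:

```python
from graphlib import TopologicalSorter

# Hypothetical event-log DAG: each task maps to the set of tasks
# that must complete before it may run.
dag = {
    "load_to_bigquery": {"bundle_lands_in_gcs"},
    "transform_in_bigquery": {"load_to_bigquery"},
    "export_to_gcs": {"transform_in_bigquery"},
    "update_access_controls": {"bundle_lands_in_gcs"},
}

# static_order() yields every task after all of its upstream
# dependencies -- the ordering an orchestrator like Airflow
# enforces for each of those 20k+ weekly DAG runs.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Airflow adds scheduling, retries, and backfills on top, but the dependency contract is the same.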
22. Druid vs BigQuery
Internal use cases for Druid vs BigQuery
Druid:
● Multi-cloud compatible
● Higher friction data load
● Lower friction data maintenance
● Gets more affordable with more usage
● You will track who has the most data
● Very fast
● Slice and dice
BigQuery:
● Fully managed and hosted, GCP-only
● Low friction data load
● High friction data maintenance
● Price punishment for using too much
● You will track who is causing cost spikes
● Often slow, but faster than Hadoop
● Joins
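"Slice and dice" here means aggregating events over arbitrary combinations of dimensions, the access pattern Druid's rollup model is built to serve quickly. A pure-Python sketch of that pattern on toy data (not Druid's API; the dimension and metric names are made up):

```python
from collections import defaultdict

# Toy event rows: two dimensions (country, platform) and one metric (views).
events = [
    {"country": "US", "platform": "ios",     "views": 10},
    {"country": "US", "platform": "android", "views": 7},
    {"country": "DE", "platform": "ios",     "views": 3},
    {"country": "US", "platform": "ios",     "views": 5},
]

def rollup(rows, dims):
    """Sum the `views` metric over the chosen dimensions -- the
    group-by a Druid query performs at scale over segments."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["views"]
    return dict(totals)

print(rollup(events, ["country"]))
# -> {('US',): 22, ('DE',): 3}
print(rollup(events, ["country", "platform"]))
# -> {('US', 'ios'): 15, ('US', 'android'): 7, ('DE', 'ios'): 3}
```

BigQuery answers the same queries via SQL over column scans, which is why it wins on joins and flexibility while Druid wins on latency for this repeated group-by shape.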