GoPro is a powerful global brand, thanks in large part to its innovative cameras and accessories that capture moments other cameras just miss: surfing in Maui, skiing in Tahoe, recording your child’s first steps. And today, the company is nearly as well known for its user-generated social and content networks.
Join us for this special webinar hosted by Tableau, Trifacta, and Cloudera—featuring GoPro. We’ll dive into GoPro’s data strategy and architecture, from ingest and processing to data prep and reporting, all on AWS.
When we got here a little over two years ago, all we did was sell cameras.
It was our job to assess the data landscape, understand the roadmap, and ultimately plan and implement an Enterprise Data Platform to support the company.
Here’s what we saw…
- Business was indeed growing, the product line was expanding in number and sophistication, BUT we were becoming more than a camera company.
- We had a growing ecosystem of software and services
- We had a rich-media side of the business that was growing across social and various media distribution channels
- We’re moving now into advanced capture
- And with drones, entirely new categories
- This all leads to the big data landscape that we have today.
So, we brought together a team of badasses from companies like LinkedIn, Apple, Oracle, and Splice Machine to tackle the problem
Thus formed the Data Science and Engineering team at GoPro
What does Data Science and Engineering look like at GoPro?
The team is broken into 4 areas:
Data Architecture and Data Operations
Data Engineering
DevOps
Project Management
Analytics is a separate organization within GoPro
There are a number of teams that are building domain specific data science expertise in addition to our team.
To set the tone a bit, we have to take a moment and talk about our corporate values
We take these values seriously and have applied them to what we are doing in Data Science and Engineering.
[Read the list, mention the fact that ass is in there twice.]
So the stage is set. We have our tasks. Build out a big data platform and haul ass!
Well, as you’ve seen already, the company has been hauling ass to deliver an amazing ecosystem for our cameras, the latest entrant of which is our GoPro Desktop Application.
The GoPro App for desktop is the easiest way to offload and enjoy your GoPro photos and videos. Automatically offload your footage and keep everything organized in one place, so you can find your best shots fast. Make quick edits and share your favorite photos and videos straight to Facebook and YouTube™, or use the bundled GoPro Studio app for more advanced editing, including GoPro templates, slow-motion effects and more.
Of course, with its release we were immediately interested in understanding popularity and feature-usage patterns.
Through our platform, and with the use of Tableau, our partners in the analytics organization were able to put together multiple views that exposed several KPIs, as well as some preliminary insights into the features that resonated most with our community
Unfortunately we can’t show you what those numbers are, but suffice it to say the reporting for the application came together quite quickly, and it continues to evolve rapidly as we iterate through views into the KPIs that resonate with our decision makers.
So the question is then: how did all this come together?
Magic. That’s how we did it. So much magic that we call our platform the philosopher’s stone. The benefit of that name is that it abbreviates to “TPS” so that we can write TPS reports. And pester people about cover sheets.
Joke about extreme big data engineering at GoPro…
A word about Data Sources:
IoT play
Logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.
Some raw and gzipped, some binary and JSON
Some streaming and some batch
Today, we have 3 clusters to isolate workloads
GREEN ARROW: Point to the clusters
We started with one cluster, ETL
Everything ran there
Ingest (Flume)
Batch (Framework)
ETL (Hive)
Analytical (Impala)
Lots of resource contention (I/O, memory, cores)
To alleviate the resource contention, we opted for 3 clusters to isolate the workloads.
Ingest cluster for near real-time streaming
Kafka, Spark Streaming (Cloudera Parcels)
Input: Logs, Output: JSON
Minutes cadence
Moving towards more real-time in seconds
Induction framework for scheduled batch ingestion
ETL cluster for heavy duty aggregation
Input: JSON flat files, Output: Aggregated Parquet files
Hive (Map/Reduce)
Hourly cadence
Secure Data Mart
Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers)
Input: Compressed Parquet files
Analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
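As a small illustration of that last point, here is a minimal sketch of how an analyst might query the Secure Data Mart from a Jupyter notebook. It assumes the impyla client and Kerberos (GSSAPI) authentication; the host, database, and table names are hypothetical.

```python
# Minimal sketch: an ad-hoc query against the Secure Data Mart from Jupyter.
# Assumes the impyla client and an existing Kerberos ticket (kinit).
# Host, port, database, and table names below are hypothetical.
from impala.dbapi import connect

conn = connect(
    host="datamart.example.internal",   # hypothetical Impala endpoint in the Secure Data Mart
    port=21050,
    auth_mechanism="GSSAPI",            # Kerberos authentication
)
cur = conn.cursor()
cur.execute("""
    SELECT event_date, feature_name, COUNT(*) AS events
    FROM desktop_app.feature_events
    WHERE event_date >= '2016-01-01'
    GROUP BY event_date, feature_name
""")
for row in cur.fetchall():
    print(row)
```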
With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future.
Kudu is one possible new technology that could help us to consolidate some of the clusters.
Let’s take a deeper dive into our streaming ingestion…
Logs are streamed from devices and software applications (desktop and mobile) to a web service endpoint
The endpoint is an elastic pool of Tomcat servers sitting behind an ELB in AWS
A custom servlet pushes logs into Kafka topics by environment (sketched below)
A series of Spark Streaming jobs process the logs from Kafka
The landing place in the ingestion cluster is HDFS, as JSON flat files
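The endpoint itself is the Java servlet mentioned above, but the topic-routing step it performs looks roughly like this Python sketch (kafka-python, the topic names, and the payload shape are illustrative assumptions):

```python
# Rough sketch of the endpoint's routing step: push each incoming log payload
# onto a Kafka topic chosen by environment. The real implementation is a custom
# Java servlet behind an ELB; kafka-python, the topic names, and the payload
# shape here are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092", "kafka3:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

TOPIC_BY_ENV = {
    "prod": "logs-prod",
    "staging": "logs-staging",
    "dev": "logs-dev",
}

def handle_log(payload):
    """Route one log event to the Kafka topic for its environment."""
    topic = TOPIC_BY_ENV.get(payload.get("env"), "logs-dev")
    producer.send(topic, payload)          # asynchronous write by default

handle_log({"env": "prod", "event": "app_open", "ts": "2016-06-01T12:00:00Z"})
producer.flush()                           # block until buffered messages are delivered
```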
Rationalization of tech stacks…
Why Kafka?
Unrivaled write throughput for a queue
Traditional queue throughput: 100K writes/sec on the biggest box you can buy
Kafka throughput: 1M writes/sec on 3-4 commodity servers
Strong ordering of messages (within a partition)
Distributed
Fault-tolerant through replication
Supports synchronous and asynchronous writes
Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks)
Why Spark Streaming?
Strong transactional semantics - "exactly once" processing
Leverage Spark technology for both data ingest and analytics
Horizontally scalable - High throughput for micro-batching
Large open source community
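To make the pairing concrete, here is a minimal PySpark Streaming sketch using the Kafka direct stream, so topic partitions map straight onto Spark partitions and checkpointing supports recovery; the broker list, topic, batch interval, and HDFS paths are assumptions.

```python
# Minimal sketch of a Kafka -> Spark Streaming job (Spark 1.x/2.x streaming API).
# With the direct stream, each Kafka topic partition maps to a Spark partition/task.
# Checkpointing enables driver recovery; end-to-end exactly-once output also needs
# idempotent writes. Broker list, topic, cadence, and HDFS paths are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ingest-logs")
ssc = StreamingContext(sc, 60)                        # minutes-level cadence
ssc.checkpoint("hdfs:///checkpoints/ingest-logs")     # recovery metadata

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["logs-prod"],
    kafkaParams={"metadata.broker.list": "kafka1:9092,kafka2:9092"},
)

def write_batch(time, rdd):
    """Write each micro-batch to HDFS as JSON flat files."""
    if not rdd.isEmpty():
        (rdd.map(lambda kv: kv[1])                    # drop the Kafka key, keep the JSON value
            .coalesce(4)                              # fewer, larger files for HDFS
            .saveAsTextFile("hdfs:///ingest/logs/" + time.strftime("%Y%m%d%H%M")))

stream.foreachRDD(write_batch)
ssc.start()
ssc.awaitTermination()
```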
As previously stated, logs are streamed from devices and software applications (desktop and mobile) to a web service endpoint
Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events
Vary significantly in size from < 1 KB to > 1 MB
Logs are redirected based on data category and routed to appropriate Kafka topic and respective Spark Streaming job
Logs move from Kafka topic to Kafka topic, with each topic having a Spark Streaming job that consumes the log, processes it, and writes it to another topic
Tree-like structure of jobs, with more generic logic towards the root of the tree and more specialized logic towards the leaf nodes
There are generic jobs/services and specialized jobs/services
Generic services include PII removal and hashing, IP-to-geo lookups, and batched writing to HDFS (simplified sketch below)
We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB)
Specialized services contain business logic
Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data)
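To give a flavor of the generic services mentioned above, here is the simplified sketch of the PII-hashing and IP-to-geo steps, written as plain functions a Spark Streaming job could map over each event; the field names, the salt handling, and the use of the geoip2 library are assumptions, not our production code.

```python
# Simplified sketch of two "generic" stages: PII removal/hashing and IP -> geo lookup.
# Field names, the salt, and the geoip2 library/database are assumptions; a Spark
# Streaming job would map a function like enrich_event over each micro-batch.
import hashlib
import geoip2.database   # assumed MaxMind GeoIP2 reader; any IP->geo lookup would do

PII_FIELDS = ("email", "user_name", "device_serial")   # hypothetical PII fields
SALT = "replace-with-managed-secret"

geo_reader = geoip2.database.Reader("/opt/geoip/GeoLite2-City.mmdb")

def hash_pii(event):
    """Replace raw PII values with salted SHA-256 hashes."""
    for field in PII_FIELDS:
        if event.get(field) is not None:
            digest = hashlib.sha256((SALT + str(event[field])).encode("utf-8")).hexdigest()
            event[field] = digest
    return event

def add_geo(event):
    """Attach coarse geo attributes derived from the client IP, then drop the IP."""
    ip = event.pop("client_ip", None)
    if ip:
        city = geo_reader.city(ip)
        event["geo_country"] = city.country.iso_code
        event["geo_city"] = city.city.name
    return event

def enrich_event(event):
    return add_geo(hash_pii(event))
```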
Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations
On the ETL cluster…
Here’s where we do our heavy lifting.
Almost entirely Hive MapReduce jobs
Some Impala to make the really big gnarly aggregations more performant
Previously, had a custom Java Map Reduce job for sessionization of events
This has been replaced with a Spark Streaming job on the ingestion cluster
In the future, want to push as much of the ETL processing back into the ingestion cluster for more real-time processing
We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.)
The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore.
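The aggregations themselves are written in Hive, but the shape of one hourly job is roughly the following PySpark equivalent (paths, columns, and the table name are assumptions):

```python
# Rough PySpark equivalent of an hourly ETL aggregation: read the JSON flat files
# produced by ingestion, aggregate, and append Parquet partitions to a managed table.
# In production this logic lives in Hive; paths, columns, and names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-agg").enableHiveSupport().getOrCreate()

events = spark.read.json("hdfs:///ingest/logs/2016060112*")   # one hour of JSON files

hourly = (events
          .withColumn("event_hour", F.substring("ts", 1, 13))  # e.g. '2016-06-01T12'
          .groupBy("event_hour", "feature_name")
          .agg(F.count(F.lit(1)).alias("events"),
               F.countDistinct("device_id").alias("devices")))

(hourly.write
       .mode("append")
       .format("parquet")
       .partitionBy("event_hour")
       .saveAsTable("analytics.feature_usage_hourly"))   # partitioned managed Hive table
```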
The Parquet files are then copied via distcp to the Secure Data Mart.
Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
The Secure Data Mart is protected with Apache Sentry.
Kerberos is used for authentication. Corporate standard
Active Directory stores the groups. Corporate standard
Access control is role based and the roles are assigned with Sentry.
Hue has a Sentry UI app to manage authorization.
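For context, the role-based grants behind that UI are ordinary SQL statements against HiveServer2 or Impala; a hedged sketch, with a hypothetical role, group, and database, looks like this:

```python
# Sketch of role-based access control with Sentry, issued as SQL through impyla.
# Role, group, and database names are hypothetical; in practice the same grants
# can be managed through the Sentry app in Hue.
from impala.dbapi import connect

conn = connect(host="datamart.example.internal", port=21050, auth_mechanism="GSSAPI")
cur = conn.cursor()

for stmt in (
    "CREATE ROLE desktop_app_analyst",
    "GRANT ROLE desktop_app_analyst TO GROUP analytics",            # AD/LDAP group
    "GRANT SELECT ON DATABASE desktop_app TO ROLE desktop_app_analyst",
):
    cur.execute(stmt)
```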
Hand off to Josh…
Josh: From our secure data mart we are able to leverage the ODBC connectivity that Tableau has to Cloudera to visualize data in Tableau.
Our governance structure in Tableau Server allows analysts to iterate quickly through views, test those views in the browser in a staging location, and then publish to a larger audience in a “production” folder for that business area.
Trifacta is also present in this layer and plays a role in our team’s effort to move quickly
Speak to Trifacta usage
Pulling it all together, our team has been successful in powering day-0 analytics that give the business a very broad range of flexibility
[riff more on our platform]