GoPro is a powerful global brand, thanks in large part to its innovative cameras and accessories that capture moments other cameras just miss: surfing in Maui, skiing in Tahoe, recording your child’s first steps. And today, the company is nearly as well known for its user-generated social and content networks.
Join us for this special webinar hosted by Tableau, Trifacta, and Cloudera—featuring GoPro. We’ll dive into GoPro’s data strategy and architecture, from ingest and processing to data prep and reporting, all on AWS.
When we got here a little over two years ago, all we did was sell cameras.
It was our job to assess the data landscape, understand the roadmap, and ultimately plan and implement an Enterprise Data Platform to support the company.
Here’s what we saw…
- Business was indeed growing, the product line was expanding in number and sophistication, BUT we were becoming more than a camera company.
- We had a growing ecosystem of software and services
- We had a rich-media side of the business that was growing across social and various media distribution channels
- We’re moving now into advanced capture
- And with drones, entirely new categories
- This all leads to the big data landscape that we have today.
So, we brought together a team of badasses from companies like LinkedIn, Apple, Oracle, and Splice Machine to tackle the problem
Thus formed the Data Science and Engineering team at GoPro
What does Data Science and Engineering look like at GoPro?
The team is broken into 4 areas:
Data Architecture and Data Operations
Data Engineering
DevOps
Project Management
Analytics is a separate organization within GoPro
There are a number of teams that are building domain specific data science expertise in addition to our team.
To set the tone a bit, we have to take a moment and talk about our corporate values
We take these values seriously and have applied them to what we are doing in Data Science and Engineering.
[Read the list, mention the fact that ass is in there twice.]
So the stage is set. We have our tasks. Build out a big data platform and haul ass!
Well, as you’ve seen already, the company has been hauling ass to deliver an amazing ecosystem for our cameras, the latest entrant of which is our GoPro Desktop Application.
The GoPro App for desktop is the easiest way to offload and enjoy your GoPro photos and videos. Automatically offload your footage and keep everything organized in one place, so you can find your best shots fast. Make quick edits and share your favorite photos and videos straight to Facebook and YouTube™, or use the bundled GoPro Studio app for more advanced editing, including GoPro templates, slow-motion effects and more.
Of course, with its release we were immediately interested in understanding popularity and feature-usage patterns.
Through our platform, and with the use of Tableau, our partners in the analytics organization were able to put together multiple views that exposed several KPIs, as well as some preliminary insights into the features that resonated most with our community
Unfortunately we can’t show you what those numbers are, but suffice it to say the reporting for the application came together quite quickly, and it continues to evolve rapidly as we iterate through views into the KPIs that resonate with our decision makers.
So the question is then: how did all this come together?
Magic. That’s how we did it. So much magic that we call our platform the philosopher’s stone. The benefit of that name is that it abbreviates to “TPS” so that we can write TPS reports. And pester people about cover sheets.
Joke about extreme big data engineering at GoPro…
A word about Data Sources:
IoT play
Logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.
Some raw and gzipped, some binary and JSON
Some streaming and some batch
Today, we have 3 clusters to isolate workloads
GREEN ARROW: Point to the clusters
We started with one cluster, ETL
Everything ran there
Ingest (Flume)
Batch (Framework)
ETL (Hive)
Analytical (Impala)
Lots of resource contention (I/O, memory, cores)
To alleviate the resource contention, we opted for 3 clusters to isolate the workloads.
Ingest cluster for near real-time streaming
Kafka, Spark Streaming (Cloudera Parcels)
Input: Logs, Output: JSON
Minutes cadence
Moving towards more real-time in seconds
Induction framework for scheduled batch ingestion
ETL cluster for heavy duty aggregation
Input: JSON flat files, Output: Aggregated Parquet files
Hive (Map/Reduce)
Hourly cadence
Secure Data Mart
Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers)
Input: Compressed Parquet files
Analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
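As a small illustration of that last point, here is a minimal sketch of how an analyst might query the Secure Data Mart from a Jupyter notebook. It assumes the impyla client and Kerberos (GSSAPI) authentication; the host, database, and table names are hypothetical.

```python
# Minimal sketch: an ad-hoc query against the Secure Data Mart from Jupyter.
# Assumes the impyla client and an existing Kerberos ticket (kinit).
# Host, port, database, and table names below are hypothetical.
from impala.dbapi import connect

conn = connect(
    host="datamart.example.internal",   # hypothetical Impala endpoint in the Secure Data Mart
    port=21050,
    auth_mechanism="GSSAPI",            # Kerberos authentication
)
cur = conn.cursor()
cur.execute("""
    SELECT event_date, feature_name, COUNT(*) AS events
    FROM desktop_app.feature_events
    WHERE event_date >= '2016-01-01'
    GROUP BY event_date, feature_name
""")
for row in cur.fetchall():
    print(row)
```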
With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future.
Kudu is one possible new technology that could help us to consolidate some of the clusters.
Let’s take a deeper dive into our streaming ingestion…
Logs are streamed from devices and software applications (desktop and mobile) to a web service endpoint
The endpoint is an elastic pool of Tomcat servers sitting behind an ELB in AWS
A custom servlet pushes logs into Kafka topics by environment (sketched below)
A series of Spark Streaming jobs process the logs from Kafka
The landing place in the ingestion cluster is HDFS, as JSON flat files
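The endpoint itself is the Java servlet mentioned above, but the topic-routing step it performs looks roughly like this Python sketch (kafka-python, the topic names, and the payload shape are illustrative assumptions):

```python
# Rough sketch of the endpoint's routing step: push each incoming log payload
# onto a Kafka topic chosen by environment. The real implementation is a custom
# Java servlet behind an ELB; kafka-python, the topic names, and the payload
# shape here are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092", "kafka3:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

TOPIC_BY_ENV = {
    "prod": "logs-prod",
    "staging": "logs-staging",
    "dev": "logs-dev",
}

def handle_log(payload):
    """Route one log event to the Kafka topic for its environment."""
    topic = TOPIC_BY_ENV.get(payload.get("env"), "logs-dev")
    producer.send(topic, payload)          # asynchronous write by default

handle_log({"env": "prod", "event": "app_open", "ts": "2016-06-01T12:00:00Z"})
producer.flush()                           # block until buffered messages are delivered
```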
Rationalization of tech stacks…
Why Kafka?
Unrivaled write throughput for a queue
Traditional queue throughput: 100K writes/sec on the biggest box you can buy
Kafka throughput: 1M writes/sec on 3-4 commodity servers
Strong ordering of messages (within a partition)
Distributed
Fault-tolerant through replication
Supports synchronous and asynchronous writes
Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks)
Why Spark Streaming?
Strong transactional semantics - "exactly once" processing
Leverage Spark technology for both data ingest and analytics
Horizontally scalable - High throughput for micro-batching
Large open source community
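To make the pairing concrete, here is a minimal PySpark Streaming sketch using the Kafka direct stream, so topic partitions map straight onto Spark partitions and checkpointing supports recovery; the broker list, topic, batch interval, and HDFS paths are assumptions.

```python
# Minimal sketch of a Kafka -> Spark Streaming job (Spark 1.x/2.x streaming API).
# With the direct stream, each Kafka topic partition maps to a Spark partition/task.
# Checkpointing enables driver recovery; end-to-end exactly-once output also needs
# idempotent writes. Broker list, topic, cadence, and HDFS paths are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ingest-logs")
ssc = StreamingContext(sc, 60)                        # minutes-level cadence
ssc.checkpoint("hdfs:///checkpoints/ingest-logs")     # recovery metadata

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["logs-prod"],
    kafkaParams={"metadata.broker.list": "kafka1:9092,kafka2:9092"},
)

def write_batch(time, rdd):
    """Write each micro-batch to HDFS as JSON flat files."""
    if not rdd.isEmpty():
        (rdd.map(lambda kv: kv[1])                    # drop the Kafka key, keep the JSON value
            .coalesce(4)                              # fewer, larger files for HDFS
            .saveAsTextFile("hdfs:///ingest/logs/" + time.strftime("%Y%m%d%H%M")))

stream.foreachRDD(write_batch)
ssc.start()
ssc.awaitTermination()
```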
As previously stated, logs are streamed from devices and software applications (desktop and mobile) to a web service endpoint
Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events
Vary significantly in size from < 1 KB to > 1 MB
Logs are redirected based on data category and routed to appropriate Kafka topic and respective Spark Streaming job
Logs move from Kafka topic to Kafka topic, with each topic having a Spark Streaming job that consumes the log, processes it, and writes it to another topic
Tree-like structure of jobs, with more generic logic towards the root of the tree and more specialized logic towards the leaf nodes
There are generic jobs/services and specialized jobs/services
Generic services include PII removal and hashing, IP-to-geo lookups, and batched writing to HDFS (simplified sketch below)
We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB)
Specialized services contain business logic
Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data)
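To give a flavor of the generic services mentioned above, here is the simplified sketch of the PII-hashing and IP-to-geo steps, written as plain functions a Spark Streaming job could map over each event; the field names, the salt handling, and the use of the geoip2 library are assumptions, not our production code.

```python
# Simplified sketch of two "generic" stages: PII removal/hashing and IP -> geo lookup.
# Field names, the salt, and the geoip2 library/database are assumptions; a Spark
# Streaming job would map a function like enrich_event over each micro-batch.
import hashlib
import geoip2.database   # assumed MaxMind GeoIP2 reader; any IP->geo lookup would do

PII_FIELDS = ("email", "user_name", "device_serial")   # hypothetical PII fields
SALT = "replace-with-managed-secret"

geo_reader = geoip2.database.Reader("/opt/geoip/GeoLite2-City.mmdb")

def hash_pii(event):
    """Replace raw PII values with salted SHA-256 hashes."""
    for field in PII_FIELDS:
        if event.get(field) is not None:
            digest = hashlib.sha256((SALT + str(event[field])).encode("utf-8")).hexdigest()
            event[field] = digest
    return event

def add_geo(event):
    """Attach coarse geo attributes derived from the client IP, then drop the IP."""
    ip = event.pop("client_ip", None)
    if ip:
        city = geo_reader.city(ip)
        event["geo_country"] = city.country.iso_code
        event["geo_city"] = city.city.name
    return event

def enrich_event(event):
    return add_geo(hash_pii(event))
```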
Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations
On the ETL cluster…
Here’s where we do our heavy lifting.
Almost entirely Hive MapReduce jobs
Some Impala to make the really big gnarly aggregations more performant
Previously, had a custom Java Map Reduce job for sessionization of events
This has been replaced with a Spark Streaming job on the ingestion cluster
In the future, want to push as much of the ETL processing back into the ingestion cluster for more real-time processing
We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.)
The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore.
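The aggregations themselves are written in Hive, but the shape of one hourly job is roughly the following PySpark equivalent (paths, columns, and the table name are assumptions):

```python
# Rough PySpark equivalent of an hourly ETL aggregation: read the JSON flat files
# produced by ingestion, aggregate, and append Parquet partitions to a managed table.
# In production this logic lives in Hive; paths, columns, and names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-agg").enableHiveSupport().getOrCreate()

events = spark.read.json("hdfs:///ingest/logs/2016060112*")   # one hour of JSON files

hourly = (events
          .withColumn("event_hour", F.substring("ts", 1, 13))  # e.g. '2016-06-01T12'
          .groupBy("event_hour", "feature_name")
          .agg(F.count(F.lit(1)).alias("events"),
               F.countDistinct("device_id").alias("devices")))

(hourly.write
       .mode("append")
       .format("parquet")
       .partitionBy("event_hour")
       .saveAsTable("analytics.feature_usage_hourly"))   # partitioned managed Hive table
```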
The Parquet files are then copied via distcp to the Secure Data Mart.
Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
The Secure Data Mart is protected with Apache Sentry.
Kerberos is used for authentication. Corporate standard
Active Directory stores the groups. Corporate standard
Access control is role based and the roles are assigned with Sentry.
Hue has a Sentry UI app to manage authorization.
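For context, the role-based grants behind that UI are ordinary SQL statements against HiveServer2 or Impala; a hedged sketch, with a hypothetical role, group, and database, looks like this:

```python
# Sketch of role-based access control with Sentry, issued as SQL through impyla.
# Role, group, and database names are hypothetical; in practice the same grants
# can be managed through the Sentry app in Hue.
from impala.dbapi import connect

conn = connect(host="datamart.example.internal", port=21050, auth_mechanism="GSSAPI")
cur = conn.cursor()

for stmt in (
    "CREATE ROLE desktop_app_analyst",
    "GRANT ROLE desktop_app_analyst TO GROUP analytics",            # AD/LDAP group
    "GRANT SELECT ON DATABASE desktop_app TO ROLE desktop_app_analyst",
):
    cur.execute(stmt)
```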
Hand off to Josh…
Josh: From our secure data mart we are able to leverage the ODBC connectivity that Tableau has to Cloudera to visualize data in Tableau.
Our governance structure in Tableau Server allows analysts to iterate quickly through views, test those views in the browser in a staging location, and then publish to a larger audience in a “production” folder for that business area.
Trifacta is also present in this layer and plays a role in our team’s effort to move quickly
Speak to Trifacta usage
Pulling it all together, our team has been successful in powering day-0 analytics that give the business a very broad range of flexibility
[riff more on our platform]