SlideShare uma empresa Scribd logo
1 de 39
Baixar para ler offline
#EntSAIS18
Exploring a Billion Activity Dataset from Athletes
with Apache Spark
Drew Robb, Infrastructure/Data at Strava
Strava Labs:
#EntSAIS18
1
#EntSAIS18 2
#EntSAIS18
Outline
1. Activity Stream Data “Lake”
2. Grade Adjusted Pace
3. Global Heatmap
a. Rasterize
b. Normalize
c. Recursion/Zoom out
d. Store+Serve
4. Clusterer
5. Final Thoughts
3
#EntSAIS18
- Activity Stream: Array of
measurement data over
time from a single
Run/Ride/etc
- Typically at 1/second rate
- Stream Types: Time,
Position (lat, lng),
elevation, heart rate,...
Data “Lake”
4
#EntSAIS18
- Billions of streams, each with thousands of points
- Canonical stream storage 1 file per stream in s3!
- Keyed by autoincrement id!!
- Over 10 billion total s3 files!
- Can’t actually bulk read-- too error prone, too expensive
- Need replica of data in a different format suitable for bulk
read with spark
Data “Lake”
5
#EntSAIS18
- Delta Encoding + Compression
- Parquet+S3 multi partition hive table
- ~100mb per s3 file (streams for ~30,000 activities)
- New data batch appended as new partitions
- Handle updates: Entire table periodically rewritten in
new s3 path, then update metastore
- Original data store: Cost: ~$5,000, time: 3 days
- Optimized for bulk read: Cost ~$5, time: 5 min
Data “Lake”
6
#EntSAIS18
- A synthetic running pace that reflects how fast
you would have run in the absence of hills
- Helps athletes compare performance of flat vs
hilly runs
- Requires precise understanding of how running
difficulty is affected by elevation gradient
Grade Adjusted Pace
7
#EntSAIS18
- Previous model: Derived from metabolic
measurements of a few athletes running on a
treadmill
- New model: Fit from actual running data drawn
from millions of runs in real world
- Find smooth data sample windows
- Only draw data from near maximum effort for
each athlete.
Grade Adjusted Pace
8
#EntSAIS18 9
#EntSAIS18 10
#EntSAIS18
Global Heatmap
- 700 million activities
- 1.4 trillion GPS points
- 10 billion miles
- 100,000 years of exercise
- 10TB input
- 500GB output
- 5 Layers by Activity Type
- https://strava.com/heatmap
11
#EntSAIS18 12
#EntSAIS18
Rasterize
13
#EntSAIS18
Web Mercator Tile Math
- Pixel coordinate system for Earth Surface.
- Each zoom level: 4x as many tiles at 2x resolution
- Each tile is 256x256 pixels
def fromPoint(lat, lng, zoom) = TilePixel(
zoom = zoom,
tileY = floor((lat+90 ) / (180 * 2^zoom)),
tileX = floor((lng+180) / (360 * 2^zoom)),
pixelY = floor((256 * (lat+90 ) / (180 * 2^zoom)) % 256),
pixelX = floor((256 * (lng+180) / (360 * 2^zoom)) % 256))
14
#EntSAIS18
Rasterize
- Convert vectorized lat/lng line data into
discrete pixel lines (Bresenham’s line
algorithm)
- Aggregate count of total passes of each
pixel
- Total pixel counts ~ 7 trillion!
- Total tile counts ~ 30 million
- Total memory load of tiles is 30TB, avoid
simultaneous allocation by having many
more tasks than cores
15
#EntSAIS18
Rasterize
Boundary case: adjacent points in different tiles
16
#EntSAIS18
Rasterize
Solution: Add virtual points on tile boundary
17
#EntSAIS18
Rasterize
- Add noise to
“uncorrect” GPS
road corrections
- Other filtering of
erroneous data
18
#EntSAIS18
Normalize
19
#EntSAIS18
Normalize
- Raw count of activity per pixel has domain [0, inf)
- To visualize with a colormap we need [0, 1]
- Tiles have dramatically different distributions of raw count values
- General and parameter free solution:
- Normalize using the Cumulative Distribution Function (CDF) of raw
count values within each tile
- Can approximate CDF with sampling (Sample 2^8 out of 2^16 values)
- Evaluate CDF with binary search in sample array
20
#EntSAIS18
Problem:
Normalization function is
different for every tile!
=> Artifacts along tile
boundaries
21
#EntSAIS18
Normalize
Solution:
- Averaging: Each tile’s CDF is calculated from data for
that tile + some radius of neighbor tiles
22
#EntSAIS18
Normalize
Solution:
- Averaging: Each tile’s CDF is calculated from data for
that tile + some radius of neighbor tiles
- Bilinear smoothing:
- Every pixel value is a weighted average of CDFs
from 4 nearest tiles
- Continuously varying pixel perfect normalization!
23
#EntSAIS18 24
#EntSAIS18 25
#EntSAIS18
Recursion /
Zoom Out
26
#EntSAIS18
Recursion / Zoom Out
27
#EntSAIS18
Recursion / Zoom Out
def finish(data, zoom) {
write(normalize(data), zoom)
if (zoom > 0) {
finish(zoomOut(data), zoom-1)
}
}
28
#EntSAIS18
Recursion / Zoom out
- Each iteration has ~1/4th as
much data
- Reduce number of tasks to
not waste time
- Exponential speedup of
stage runtime incredibly
satisfying
Level 16 written, 136928925282 bytes
Level 15 written, 53804156232 bytes
Level 14 written, 19795342663 bytes
Level 13 written, 6996622566 bytes
Level 12 written, 2474617135 bytes
Level 11 written, 897467724 bytes
Level 10 written, 331303516 bytes
Level 9 written, 122600962 bytes
Level 8 written, 44957727 bytes
Level 7 written, 16254495 bytes
Level 6 written, 5728247 bytes
Level 5 written, 1978297 bytes
Level 4 written, 673066 bytes
Level 3 written, 228565 bytes
Level 2 written, 78380 bytes
Level 1 written, 27417 bytes
29
#EntSAIS18
Store+Serve
30
#EntSAIS18
Store+Serve
Idea: Group spatially adjacent tiles to get similar file sizes.
Consider tile data as quad tree.
- Starting from all leafs, walk up tree accumulating total file size
at each node.
- When file size exceeds threshold, write all tiles as one file.
- Direct s3 write (no hadoopFS api)
- Repeat for each level of tile data.
31
#EntSAIS18
Tile Packing
Zoom 2:
-(1,0)
-(2,2)
Zoom 1:
-(0,2), (1,2), (0,3), (1,3)
-(3,2), (2,3), (3,3)
Zoom 0:
- Everything else
32
#EntSAIS18
Store+Serve
Reads:
- Reconstruct tree of tile groups in memory
- Traverse tree to find tile location
- Cache the tile group file-- nearby tiles also likely to be
served.
- Persisted data is 1 byte per pixel, colormap and PNG
applied at read time.
- Final rendered PNG also cached by cloudfront
33
#EntSAIS18
Clusterer
- Discover sets of
spatially similar activities
using clustering
- Compute patterns in
aggregated activity
properties
- Build catalog of routes
for every cluster (a few
million clusters)
34
#EntSAIS18
Clusterer
- Locality Sensitive Hashing (SuperMinHash O Ertl, 2017)
of quantized lat/lng shingles from activity streams
- MinHash pairwise similarity of activities: 10,000x
speedup over direct stream based methods
- Similarity join on collisions to efficiently find sample of
highly similar pairs
- Single linkage clustering globally on pairs
- Split locally and repeat, rehashing sample of activities in
each cluster
35
#EntSAIS18 36
#EntSAIS18
Spark Environment
- Spark on mesos on AWS
- Job submitted with Metronome/Marathon
- Isolated context with job-specific docker image for
driver+executors
- Mesos cluster running multiple spark contexts
- Scale up instances based on unscheduled tasks
- Prevent scale down on instances with active executors
- 100 i3.2xlarge machines (8cpu, 60gb ram, 1.7tb ssd)
- 5 hours for full heatmap build
37
#EntSAIS18
Spark Lessons
- Parameter tuning seems unavoidable in well utilized distributed systems
- With scala, managing application dependency conflicts can be hell (guava,
netty, logging frameworks…)
- Optimizing locally can be very misleading, must profile on real cluster
- Even with fantastic configuration management tools, lots of time wasted on
configuration
- Building from scratch may be worthwhile to fully understand performance
- Invest in generating representative sample data to use in prototypes
- Spilling to disk can use a surprising amount of IOPS
- Probably still a lot more optimizations possible
38
#EntSAIS18
Resources
Strava Global Heatmap: strava.com/heatmap
Strava Labs: labs.strava.com
Strava Engineering Blog: medium.com/strava-engineering
We are hiring Data Scientists: strava.com/careers
Strava Labs Clusterer*: labs.strava.com/clusterer
*Currently offline, will be back soon
39

Mais conteúdo relacionado

Mais procurados

Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
Databricks
 

Mais procurados (20)

Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 

Semelhante a Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark with Drew Robb

Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
Geetanjali G
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
Alexander Ulanov
 
EEDC - Apache Pig
EEDC - Apache PigEEDC - Apache Pig
EEDC - Apache Pig
javicid
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streaming
ShidrokhGoudarzi1
 

Semelhante a Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark with Drew Robb (20)

Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Artmosphere Demo
Artmosphere DemoArtmosphere Demo
Artmosphere Demo
 
Managing your black friday logs - Code Europe
Managing your black friday logs - Code EuropeManaging your black friday logs - Code Europe
Managing your black friday logs - Code Europe
 
Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
EEDC - Apache Pig
EEDC - Apache PigEEDC - Apache Pig
EEDC - Apache Pig
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata_spark_streaming.ppt
strata_spark_streaming.pptstrata_spark_streaming.ppt
strata_spark_streaming.ppt
 
strata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streamingstrata spark streaming strata spark streamingsrata spark streaming
strata spark streaming strata spark streamingsrata spark streaming
 

Mais de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Último (20)

如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark with Drew Robb

  • 1. #EntSAIS18 Exploring a Billion Activity Dataset from Athletes with Apache Spark Drew Robb, Infrastructure/Data at Strava Strava Labs: #EntSAIS18 1
  • 3. #EntSAIS18 Outline 1. Activity Stream Data “Lake” 2. Grade Adjusted Pace 3. Global Heatmap a. Rasterize b. Normalize c. Recursion/Zoom out d. Store+Serve 4. Clusterer 5. Final Thoughts 3
  • 4. #EntSAIS18 - Activity Stream: Array of measurement data over time from a single Run/Ride/etc - Typically at 1/second rate - Stream Types: Time, Position (lat, lng), elevation, heart rate,... Data “Lake” 4
  • 5. #EntSAIS18 - Billions of streams, each with thousands of points - Canonical stream storage 1 file per stream in s3! - Keyed by autoincrement id!! - Over 10 billion total s3 files! - Can’t actually bulk read-- too error prone, too expensive - Need replica of data in a different format suitable for bulk read with spark Data “Lake” 5
  • 6. #EntSAIS18 - Delta Encoding + Compression - Parquet+S3 multi partition hive table - ~100mb per s3 file (streams for ~30,000 activities) - New data batch appended as new partitions - Handle updates: Entire table periodically rewritten in new s3 path, then update metastore - Original data store: Cost: ~$5,000, time: 3 days - Optimized for bulk read: Cost ~$5, time: 5 min Data “Lake” 6
  • 7. #EntSAIS18 - A synthetic running pace that reflects how fast you would have run in the absence of hills - Helps athletes compare performance of flat vs hilly runs - Requires precise understanding of how running difficulty is affected by elevation gradient Grade Adjusted Pace 7
  • 8. #EntSAIS18 - Previous model: Derived from metabolic measurements of a few athletes running on a treadmill - New model: Fit from actual running data drawn from millions of runs in real world - Find smooth data sample windows - Only draw data from near maximum effort for each athlete. Grade Adjusted Pace 8
  • 11. #EntSAIS18 Global Heatmap - 700 million activities - 1.4 trillion GPS points - 10 billion miles - 100,000 years of exercise - 10TB input - 500GB output - 5 Layers by Activity Type - https://strava.com/heatmap 11
  • 14. #EntSAIS18 Web Mercator Tile Math - Pixel coordinate system for Earth Surface. - Each zoom level: 4x as many tiles at 2x resolution - Each tile is 256x256 pixels def fromPoint(lat, lng, zoom) = TilePixel( zoom = zoom, tileY = floor((lat+90 ) / (180 * 2^zoom)), tileX = floor((lng+180) / (360 * 2^zoom)), pixelY = floor((256 * (lat+90 ) / (180 * 2^zoom)) % 256), pixelX = floor((256 * (lng+180) / (360 * 2^zoom)) % 256)) 14
  • 15. #EntSAIS18 Rasterize - Convert vectorized lat/lng line data into discrete pixel lines (Bresenham’s line algorithm) - Aggregate count of total passes of each pixel - Total pixel counts ~ 7 trillion! - Total tile counts ~ 30 million - Total memory load of tiles is 30TB, avoid simultaneous allocation by having many more tasks than cores 15
  • 16. #EntSAIS18 Rasterize Boundary case: adjacent points in different tiles 16
  • 17. #EntSAIS18 Rasterize Solution: Add virtual points on tile boundary 17
  • 18. #EntSAIS18 Rasterize - Add noise to “uncorrect” GPS road corrections - Other filtering of erroneous data 18
  • 20. #EntSAIS18 Normalize - Raw count of activity per pixel has domain [0, inf) - To visualize with a colormap we need [0, 1] - Tiles have dramatically different distributions of raw count values - General and parameter free solution: - Normalize using the Cumulative Distribution Function (CDF) of raw count values within each tile - Can approximate CDF with sampling (Sample 2^8 out of 2^16 values) - Evaluate CDF with binary search in sample array 20
  • 21. #EntSAIS18 Problem: Normalization function is different for every tile! => Artifacts along tile boundaries 21
  • 22. #EntSAIS18 Normalize Solution: - Averaging: Each tile’s CDF is calculated from data for that tile + some radius of neighbor tiles 22
  • 23. #EntSAIS18 Normalize Solution: - Averaging: Each tile’s CDF is calculated from data for that tile + some radius of neighbor tiles - Bilinear smoothing: - Every pixel value is a weighted average of CDFs from 4 nearest tiles - Continuously varying pixel perfect normalization! 23
  • 28. #EntSAIS18 Recursion / Zoom Out def finish(data, zoom) { write(normalize(data), zoom) if (zoom > 0) { finish(zoomOut(data), zoom-1) } } 28
  • 29. #EntSAIS18 Recursion / Zoom out - Each iteration has ~1/4th as much data - Reduce number of tasks to not waste time - Exponential speedup of stage runtime incredibly satisfying Level 16 written, 136928925282 bytes Level 15 written, 53804156232 bytes Level 14 written, 19795342663 bytes Level 13 written, 6996622566 bytes Level 12 written, 2474617135 bytes Level 11 written, 897467724 bytes Level 10 written, 331303516 bytes Level 9 written, 122600962 bytes Level 8 written, 44957727 bytes Level 7 written, 16254495 bytes Level 6 written, 5728247 bytes Level 5 written, 1978297 bytes Level 4 written, 673066 bytes Level 3 written, 228565 bytes Level 2 written, 78380 bytes Level 1 written, 27417 bytes 29
  • 31. #EntSAIS18 Store+Serve Idea: Group spatially adjacent tiles to get similar file sizes. Consider tile data as quad tree. - Starting from all leafs, walk up tree accumulating total file size at each node. - When file size exceeds threshold, write all tiles as one file. - Direct s3 write (no hadoopFS api) - Repeat for each level of tile data. 31
  • 32. #EntSAIS18 Tile Packing Zoom 2: -(1,0) -(2,2) Zoom 1: -(0,2), (1,2), (0,3), (1,3) -(3,2), (2,3), (3,3) Zoom 0: - Everything else 32
  • 33. #EntSAIS18 Store+Serve Reads: - Reconstruct tree of tile groups in memory - Traverse tree to find tile location - Cache the tile group file-- nearby tiles also likely to be served. - Persisted data is 1 byte per pixel, colormap and PNG applied at read time. - Final rendered PNG also cached by cloudfront 33
  • 34. #EntSAIS18 Clusterer - Discover sets of spatially similar activities using clustering - Compute patterns in aggregated activity properties - Build catalog of routes for every cluster (a few million clusters) 34
  • 35. #EntSAIS18 Clusterer - Locality Sensitive Hashing (SuperMinHash O Ertl, 2017) of quantized lat/lng shingles from activity streams - MinHash pairwise similarity of activities: 10,000x speedup over direct stream based methods - Similarity join on collisions to efficiently find sample of highly similar pairs - Single linkage clustering globally on pairs - Split locally and repeat, rehashing sample of activities in each cluster 35
  • 37. #EntSAIS18 Spark Environment - Spark on mesos on AWS - Job submitted with Metronome/Marathon - Isolated context with job-specific docker image for driver+executors - Mesos cluster running multiple spark contexts - Scale up instances based on unscheduled tasks - Prevent scale down on instances with active executors - 100 i3.2xlarge machines (8cpu, 60gb ram, 1.7tb ssd) - 5 hours for full heatmap build 37
  • 38. #EntSAIS18 Spark Lessons - Parameter tuning seems unavoidable in well utilized distributed systems - With scala, managing application dependency conflicts can be hell (guava, netty, logging frameworks…) - Optimizing locally can be very misleading, must profile on real cluster - Even with fantastic configuration management tools, lots of time wasted on configuration - Building from scratch may be worthwhile to fully understand performance - Invest in generating representative sample data to use in prototypes - Spilling to disk can use a surprising amount of IOPS - Probably still a lot more optimizations possible 38
  • 39. #EntSAIS18 Resources Strava Global Heatmap: strava.com/heatmap Strava Labs: labs.strava.com Strava Engineering Blog: medium.com/strava-engineering We are hiring Data Scientists: strava.com/careers Strava Labs Clusterer*: labs.strava.com/clusterer *Currently offline, will be back soon 39