Large Scale Geospatial Indexing and Analysis on Apache Spark

The Source of Truth for Physical Places
Felix Cheung, VP Eng

About me
- VPE at SafeGraph
- ex-Uber - Data Platform teams
- Apache Software Foundation: Member; PMC member
for Apache Spark, Apache Zeppelin, Apache Superset,
Apache Incubator
- Mentor of Apache Sedona (incubating)

Agenda
- Intro to geospatial data
- Distributed processing
- Use cases
- Overall architecture

We power innovation through open access to geospatial data.
We believe data should be an open platform, not a trade secret.
SafeGraph is just a data company
- Fully remote
- Founded 2016
- Founders have deep experience with data and privacy
- Previous company was LiveRamp (NYSE: RAMP)
- Data scientists, data engineers and data business experts

Our Mission:
We power innovation through open access to geospatial data.
We believe data should be an open platform.
SafeGraph is just a data company

SafeGraph Patterns Provides a Powerful Window Into Consumer Behavior
- Accurate and aggregated foot-traffic data, derived from a panel of MM
anonymized devices
- 8+ MM Points-of-Interest
- Easy to use, download as CSVs
Please see the Places schema & summary statistics for a complete list of attributes and coverage.

SafeGraph Products: The source of truth for physical places
- Core Places: available for 8+ MM POI
- Geometry: available for 8+ MM POI
- Patterns: available for ~4.5MM POI
- Join on Placekey

Common Use Cases with SafeGraph Data
Industries:
- Retail & Real Estate
- Marketing & Advertising
- Mapping & GIS Software
- Financial Services & Investment Research
Use cases:
- Trade Area
- Site Selection
- Visit Attribution
- Location-Based Ads
- Geospatial Analytics
- GIS Services
- Private Equity Due Diligence
- Public Equities

What is geospatial data?
- Geospatial describes data that represents features or
objects on the Earth's surface.
- Records in a dataset have locational information tied
to them such as coordinates, address, city, or postal
code
- Often about what or who is where, e.g. demographics

Key challenges
- Earth’s surface area is 196.9 million mi²
- Computing “where is it” can be expensive
- Scaling such computation is a constant challenge
- Lack of a truth set (ground-truth data)
- “The real world”

Common toolsets and frameworks

Common toolsets and frameworks - Limits
- Single machine
- New approaches:
- Parallel execution
- GPU acceleration

Apache Sedona (incubating) intro
- Started as GeoSpark in 2015 at Arizona State University
- A cluster computing system for processing
large-scale spatial data, by extending Apache Spark
- Distributed execution

Apache Sedona (incubating) intro
- Core/RDD
- Spatial SQL - spatial query
- Complex geometries / trajectories
- Spatial Index
- Spatial Partitioning
- Coordinate Reference System
- High resolution map generation

Key advances
- Spatial SQL - spatial query
- Spatial Index
- Spatial Partitioning
- 2x-10x faster and 50% reduction in peak memory consumption
compared with other Spark-based geospatial systems

Spatial SQL
- Ease of Use
- Open Standards - SQL/MM Part 3 (Spatial),
OGC Simple Features for SQL
- Geometry data types: point, line, multiline, polygon…
- Relationships between geometry data types
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham'
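
To make the predicate concrete, here is a minimal pure-Python sketch of what a containment test such as ST_Contains has to decide for a point and a polygon. It uses ray casting, which is an assumption for illustration and not necessarily how Sedona evaluates the predicate. The names `polygon_contains`, `cities`, `superheroes` and `heroes_in`, and all coordinates, are hypothetical, and points exactly on the boundary are not treated specially.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def polygon_contains(polygon: List[Point], p: Point) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going right from p."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through p?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # odd number of crossings means inside
    return inside


# Hypothetical toy data standing in for the city and superhero tables.
cities: Dict[str, List[Point]] = {
    "Gotham": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "Metropolis": [(10, 0), (14, 0), (14, 4), (10, 4)],
}
superheroes: Dict[str, Point] = {
    "Batman": (1, 1),
    "Robin": (3, 3.5),
    "Superman": (11, 2),
}


def heroes_in(city_name: str) -> List[str]:
    """Brute-force equivalent of the join plus filter in the query above."""
    geom = cities[city_name]
    return sorted(n for n, pt in superheroes.items() if polygon_contains(geom, pt))
```

Running the test on every city/superhero pair is what a spatial join avoids at scale, which is where indexing and partitioning come in.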

Spatial Query Optimization
- Range Query
- Join Query
- KNN
- KNN Join
- Optimized Spatial Join Strategy
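
As a point of reference for these query types, the sketch below gives brute-force definitions of a range query and a KNN query over points. It is only a baseline that states what the queries return; the optimized strategies on this slide exist to avoid scanning every record. The function names and the `(xmin, ymin, xmax, ymax)` box layout are assumptions for illustration.

```python
import heapq
import math
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def range_query(points: List[Point], box: Box) -> List[Point]:
    """Return every point that falls inside the query box (inclusive)."""
    xmin, ymin, xmax, ymax = box
    return [p for p in points if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]


def knn(points: List[Point], query: Point, k: int) -> List[Point]:
    """Return the k points closest to the query point, nearest first."""
    return heapq.nsmallest(
        k, points, key=lambda p: math.hypot(p[0] - query[0], p[1] - query[1])
    )
```

A KNN join repeats the KNN query for every record of a second dataset, which is why the brute-force cost grows with the product of the two dataset sizes.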

Data format
- Geospatial formats: WKT, WKB, GeoJSON, Shapefile,
HDF…
- Geospatial geometries
POLYGON ((-97.019...
POINT (-88.331492 32.324142)
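
For readers new to WKT, a minimal sketch of reading the two geometry strings above with the standard library only. It handles just `POINT` and a `POLYGON` with a single outer ring, which is a simplifying assumption; real pipelines would use a geometry library. The function names are hypothetical. Note that WKT lists coordinates as x y, i.e. longitude before latitude.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]

_NUM = r"[^\s(),]+"  # one coordinate token


def parse_wkt_point(wkt: str) -> Point:
    """Parse 'POINT (x y)' into an (x, y) tuple."""
    m = re.fullmatch(
        rf"\s*POINT\s*\(\s*({_NUM})\s+({_NUM})\s*\)\s*", wkt, flags=re.IGNORECASE
    )
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(m.group(1)), float(m.group(2))


def parse_wkt_polygon(wkt: str) -> List[Point]:
    """Parse 'POLYGON ((x y, x y, ...))' with one outer ring into a vertex list."""
    m = re.fullmatch(
        r"\s*POLYGON\s*\(\s*\(([^()]*)\)\s*\)\s*", wkt, flags=re.IGNORECASE
    )
    if m is None:
        raise ValueError(f"not a single-ring WKT polygon: {wkt!r}")
    ring = []
    for pair in m.group(1).split(","):
        x, y = pair.split()
        ring.append((float(x), float(y)))
    return ring
```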

Spatial Indexes
- R-Tree, Quad-Tree
https://en.wikipedia.org/wiki/R-tree

Spatial Indexes
- R-Tree, Quad-Tree
- Local performance in spatial range query, area 1% - 16%
(Jia Yu, ApacheCon 2019)

Spatial Partitioning
- Partitioning - essential to distributed processing
- Strategy: by spatial proximity
- Step 1: random sample
- Step 2: build tree
- Step 3: leaf nodes -> global partitioning
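
The three steps can be sketched on a single machine as follows. This is a minimal illustration that assumes a KD-tree-style split on a random sample; it is not Sedona's implementation, and `Node`, `build_tree`, `assign_partition` and `partition_points` are hypothetical names. Each leaf of the tree becomes one partition id, so nearby points tend to land in the same partition.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class Node:
    """Tree node; a leaf carries a partition id, an inner node carries a split."""

    def __init__(self, axis: int = 0, split: float = 0.0,
                 left: Optional["Node"] = None, right: Optional["Node"] = None,
                 partition_id: Optional[int] = None) -> None:
        self.axis, self.split = axis, split
        self.left, self.right = left, right
        self.partition_id = partition_id


def build_tree(sample: List[Point], max_leaf_size: int,
               counter: List[int], depth: int = 0) -> Node:
    """Step 2: split the sample at the median, alternating x and y."""
    axis = depth % 2
    pts = sorted(sample, key=lambda p: p[axis])
    split = pts[len(pts) // 2][axis] if pts else 0.0
    left = [p for p in pts if p[axis] < split]
    right = [p for p in pts if p[axis] >= split]
    # Stop when the leaf is small enough, or when the split cannot separate points.
    if len(pts) <= max_leaf_size or not left:
        leaf = Node(partition_id=counter[0])
        counter[0] += 1
        return leaf
    return Node(axis, split,
                build_tree(left, max_leaf_size, counter, depth + 1),
                build_tree(right, max_leaf_size, counter, depth + 1))


def assign_partition(root: Node, p: Point) -> int:
    """Step 3: walk to the leaf covering p; the leaf id is the partition id."""
    node = root
    while node.partition_id is None:
        node = node.left if p[node.axis] < node.split else node.right
    return node.partition_id


def partition_points(points: List[Point], sample_size: int,
                     max_leaf_size: int, seed: int = 0) -> Tuple[List[int], int]:
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))  # Step 1
    counter = [0]
    root = build_tree(sample, max_leaf_size, counter)
    return [assign_partition(root, p) for p in points], counter[0]
```

Building the tree on a sample keeps the cost of planning small, while the resulting boundaries still follow the data density.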

Spatial Partitioning
- Uniform grids, Quad-Tree, KDB-Tree, R-Tree, Voronoi
diagram, Hilbert curve
Xie, Dong, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. "Simba: Efficient in-memory spatial analytics." In Proceedings of the 2016 International Conference on
Management of Data, pp. 1071-1085. ACM, 2016.

Spatial Partitioning + Indexing
- Distributed hierarchical spatial indexing
- Global index (driver) - same tree as in partitioning - bounding boxes
- Local index (one per executor partition)
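
A single-process sketch of the two-level idea: a global index of partition bounding boxes prunes partitions that cannot match, and a local index answers the query inside each remaining partition. The local index here is a list sorted by x with binary search, used as a simple stand-in for an R-Tree or Quad-Tree. All class and function names are hypothetical, and nothing here is distributed.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def bbox(points: List[Point]) -> Box:
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def intersects(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class LocalIndex:
    """Per-partition index: points sorted by x, searched with bisect."""

    def __init__(self, points: List[Point]) -> None:
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]

    def range_query(self, box: Box) -> List[Point]:
        xmin, ymin, xmax, ymax = box
        lo, hi = bisect_left(self.xs, xmin), bisect_right(self.xs, xmax)
        return [p for p in self.points[lo:hi] if ymin <= p[1] <= ymax]


class TwoLevelIndex:
    def __init__(self, partitions: List[List[Point]]) -> None:
        # Global index: one bounding box per (non-empty) partition.
        self.global_index = [bbox(pts) for pts in partitions]
        self.local_indexes = [LocalIndex(pts) for pts in partitions]

    def range_query(self, box: Box) -> Tuple[List[Point], int]:
        """Return matching points and how many partitions were actually searched."""
        hits: List[Point] = []
        searched = 0
        for pbox, local in zip(self.global_index, self.local_indexes):
            if intersects(pbox, box):  # prune partitions that cannot match
                searched += 1
                hits.extend(local.range_query(box))
        return sorted(hits), searched
```

In a cluster setting, the pruning step decides which partitions need any work at all, and the local lookups run in parallel on the executors that hold those partitions.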

What is H3?
- Geospatial indexing system, a multi-precision
hexagonal tiling of the sphere indexed with
hierarchical linear indexes
- Created at Uber, open-sourced
https://h3geo.org/

Why H3?
- Geospatial analysis can be done by bucketing locations
- Equidistant neighbors
- Traversal, neighboring, truncation
- Polyfill (region)
- Unidirectional edge
https://eng.uber.com/h3/

Why H3?
- Truncation
- h3ToParent
- kRing
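
To illustrate the neighbor traversal behind kRing, here is a sketch on a flat hexagonal grid with axial coordinates (q, r). This is an assumption made for illustration: it does not use the H3 library or its cell ids, and H3 itself tiles a sphere rather than a plane. It shows two properties the slides rely on: all six neighbors of a hexagon are at the same grid distance, and the set of cells within k steps has 1 + 3k(k + 1) members.

```python
from typing import List, Tuple

Hex = Tuple[int, int]  # axial coordinates (q, r) on a flat hex grid


def hex_distance(a: Hex, b: Hex) -> int:
    """Number of steps between two hexagons on the grid."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def k_ring(center: Hex, k: int) -> List[Hex]:
    """All hexagons within k steps of center, including center itself."""
    q0, r0 = center
    cells = []
    for dq in range(-k, k + 1):
        for dr in range(max(-k, -dq - k), min(k, -dq + k) + 1):
            cells.append((q0 + dq, r0 + dr))
    return cells


def neighbors(center: Hex) -> List[Hex]:
    """The six adjacent hexagons, each exactly one step away."""
    return [c for c in k_ring(center, 1) if c != center]
```

Bucketing locations into cells and then expanding with a ring turns a "what is near here" question into set lookups on cell ids instead of distance computations on raw coordinates.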

H3 - basis of Placekey
- Universal identifier for physical places
- e.g. handles address mismatches
https://www.placekey.io/

Use Case 1 - Visit Attribution
https://www.safegraph.com/visit-attribution

Use Case 1 - Visit Attribution
1. Clustering
2. Spatial Join
3. Prediction

Use Case 1 - Visit Attribution - Implementation

Use Case 2 - Geometry Overlap
- Geometry processing - detect overlapping polygons
- Auto QA - automatic analysis at scale
- Analyzing geospatial distributions
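
One common first step for overlap detection is a bounding-box filter that produces candidate pairs cheaply. The sketch below assumes that approach and is not a description of SafeGraph's pipeline. It sorts boxes by their minimum x and stops scanning as soon as later boxes can no longer reach the current one. Pairs it returns are only candidates: an exact polygon intersection test would still be needed to confirm a real overlap. The names and sample polygons are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def bbox(polygon: List[Point]) -> Box:
    xs, ys = [p[0] for p in polygon], [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def candidate_overlaps(polygons: Dict[str, List[Point]]) -> List[Tuple[str, str]]:
    """Return name pairs whose bounding boxes intersect (candidates only)."""
    boxes = sorted(((bbox(poly), name) for name, poly in polygons.items()),
                   key=lambda item: item[0][0])  # order by xmin
    pairs = []
    for i, (a, name_a) in enumerate(boxes):
        for b, name_b in boxes[i + 1:]:
            if b[0] > a[2]:
                break  # sorted by xmin, so no later box can overlap a in x
            if a[1] <= b[3] and b[1] <= a[3]:  # y ranges overlap as well
                pairs.append(tuple(sorted((name_a, name_b))))
    return sorted(pairs)
```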

Overall Architecture
Training
HITL Annotation
Auto QA
HITL QA

SafeGraph Blog
We are hiring!
safegraph.com/careers
