SafeGraph is a data company, and only a data company, that aims to be the source of truth for data on physical places. We focus on creating high-precision geospatial datasets about the places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.
In this talk, we will examine the challenges of geospatial processing at large scale. We will look at open-source frameworks such as Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structures, data formats, and open-source indexing systems such as H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLflow, and AWS. We will walk through examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by machine learning modeling, human-in-the-loop (HITL) annotation, and quality validation.
Large Scale Geospatial Indexing and Analysis on Apache Spark
1. The Source of Truth for Physical Places
Felix Cheung, VP Eng
2. About me
- VPE at SafeGraph
- ex-Uber - Data Platform teams
- Apache Software Foundation: Member, part of PMC for Apache Spark, Apache Zeppelin, Apache Superset, Apache Incubator
- Mentor of Apache Sedona (incubating)
3. Agenda
- Intro to geospatial data
- Distributed processing
- Use cases
- Overall architecture
5. We power innovation through open access to geospatial data.
We believe data should be an open platform, not a trade secret.
SafeGraph is just a data company
- Fully remote
- Founded 2016
- Founders have deep experience with data and privacy
- Previous company was LiveRamp (NYSE:RAMP)
- Data Scientists, Data Engineers and Data Business Experts
6. We power innovation through open access to geospatial data.
Our Mission: The Source of Truth for Physical Places
7. SafeGraph Patterns Provides a Powerful Window Into Consumer Behavior
- Accurate and aggregated foot-traffic data, derived from panel of MM anonymized devices
- 8+ MM Points-of-Interest
- Easy to use, download as CSVs
Please see the Places schema & summary statistics for a complete list of attributes and coverage.
8. SafeGraph Products: The source of truth for physical places
- Core Places: available for 8+ MM POI
- Geometry: available for 8+ MM POI
- Patterns: available for ~4.5MM POI
- Join on Placekey
9. Common Use Cases with SafeGraph Data
- Sectors: Retail & Real Estate; Marketing & Advertising; Mapping & GIS Software; Financial Services & Investment Research
- Use cases: Trade Area, Site Selection, Visit Attribution, Location-Based Ads, Geospatial Analytics, GIS Services, Private Equity Due Diligence, Public Equities
10. What is geospatial data?
- Geospatial data represents features or objects on the Earth's surface.
- Records in a dataset carry locational information, such as coordinates, address, city, or postal code.
- Often about what or who is where, for example demographics.
11. Key challenges
- Earth’s surface area is 196.9 million mi²
- Computing “where is it” can be expensive
- Scaling such computation is a constant challenge
- Lack of a ground-truth dataset
- “The real world”
14. Common toolsets and frameworks - Limits
- Single machine
- New approaches:
  - Parallel execution
  - GPU acceleration
15. Apache Sedona (incubating) intro
- Started as GeoSpark in 2015 at Arizona State University
- A cluster computing system for processing large-scale spatial data, by extending Apache Spark
- Distributed execution
18. Key advances
- Spatial SQL - spatial query
- Spatial Index
- Spatial Partitioning
- 2x-10x faster and 50% lower peak memory consumption than other Spark-based geospatial systems
19. Spatial SQL
- Ease of Use
- Open Standards: SQL/MM Part 3 (Spatial), OGC Simple Features for SQL
- Geometry data types: point, line, multiline, polygon…
- Relationships between geometry data types
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham'
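To make the containment predicate concrete, here is a minimal pure-Python sketch of what a point-in-polygon test like ST_Contains has to compute for each candidate pair. It uses the classic ray-casting method, which is an assumption chosen for illustration and not Sedona's implementation; the `cities` and `superheroes` dictionaries and the `heroes_in` helper are hypothetical stand-ins for the two tables in the query.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def contains(polygon: List[Point], p: Point) -> bool:
    """Ray-casting test: True if p is inside the polygon ring.

    Points exactly on the boundary are not handled consistently here.
    """
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through p?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rows standing in for the city and superhero tables.
cities: Dict[str, List[Point]] = {
    "Gotham": [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)],
}
superheroes: Dict[str, Point] = {
    "Batman": (1.0, 1.0),
    "Superman": (10.0, 10.0),
}


def heroes_in(city_name: str) -> List[str]:
    """Brute-force equivalent of the query: test every superhero against the city."""
    geom = cities[city_name]
    return [name for name, pt in superheroes.items() if contains(geom, pt)]
```

Every point is tested against the polygon here, which is the cost that spatial indexing and partitioning aim to reduce.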
23. Spatial Indexes
- R-Tree, Quad-Tree
- Local performance in spatial range queries (query area 1% - 16%)
Jia Yu, ApacheCon 2019
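As an illustration of how a tree index speeds up a range query, here is a minimal point Quad-Tree sketch in pure Python. It is not Sedona's index; the class name, the `capacity` and `max_depth` parameters, and the sample grid of points are all assumptions made for the example. The key idea is that a subtree whose box does not overlap the query window is skipped entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def in_box(p: Point, b: Box) -> bool:
    return b[0] <= p[0] <= b[2] and b[1] <= p[1] <= b[3]


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class QuadTree:
    box: Box
    capacity: int = 4      # max points in a leaf before it splits
    depth: int = 0
    max_depth: int = 16    # guards against endless splitting on duplicates
    points: List[Point] = field(default_factory=list)
    children: List["QuadTree"] = field(default_factory=list)

    def insert(self, p: Point) -> bool:
        if not in_box(p, self.box):
            return False
        if not self.children:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append(p)
                return True
            self._split()
        # any() stops at the first child that accepts the point
        return any(c.insert(p) for c in self.children)

    def _split(self) -> None:
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                     (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [
            QuadTree(q, self.capacity, self.depth + 1, self.max_depth)
            for q in quadrants
        ]
        for old in self.points:
            any(c.insert(old) for c in self.children)
        self.points = []

    def query(self, window: Box) -> List[Point]:
        if not overlaps(self.box, window):
            return []  # prune the whole subtree
        found = [p for p in self.points if in_box(p, window)]
        for c in self.children:
            found.extend(c.query(window))
        return found


# Usage: index a regular grid of points, then run one range query.
tree = QuadTree((0.0, 0.0, 100.0, 100.0))
pts = [(float(x), float(y)) for x in range(0, 100, 5) for y in range(0, 100, 5)]
for p in pts:
    tree.insert(p)
window = (10.0, 10.0, 20.0, 20.0)
hits = tree.query(window)
```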
24. Spatial Partitioning
- Partitioning - essential to distributed processing
- Strategy: by spatial proximity
- Step 1: random sample
- Step 2: build tree
- Step 3: leaf nodes -> global partitioning
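The three steps above can be sketched in a few lines of pure Python. This is a simplified illustration, not Sedona's code: it uses a KD-tree style median split as a stand-in for whichever tree is actually chosen, and the function names, sample size, and `max_leaf` threshold are assumptions.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def build_tree(sample: List[Point], max_leaf: int, leaves: List[int], depth: int = 0):
    """Step 2: recursively split the sample at the median, alternating x and y."""
    if len(sample) <= max_leaf:
        leaves.append(len(leaves))  # Step 3: each leaf becomes a partition id
        return ("leaf", leaves[-1])
    axis = depth % 2
    ordered = sorted(sample, key=lambda p: p[axis])
    mid = len(ordered) // 2
    split = ordered[mid][axis]
    left = build_tree(ordered[:mid], max_leaf, leaves, depth + 1)
    right = build_tree(ordered[mid:], max_leaf, leaves, depth + 1)
    return ("node", axis, split, left, right)


def partition_id(tree, p: Point) -> int:
    """Route a point down the tree to the leaf (partition) that owns it."""
    while tree[0] == "node":
        _, axis, split, left, right = tree
        tree = left if p[axis] < split else right
    return tree[1]


rng = random.Random(42)
points = [(rng.uniform(-180, 180), rng.uniform(-90, 90)) for _ in range(10_000)]

sample = rng.sample(points, 200)                        # Step 1: random sample
leaves: List[int] = []
tree = build_tree(sample, max_leaf=25, leaves=leaves)   # Steps 2 and 3

# Every point, not just the sample, is assigned to a partition.
partitions: Dict[int, List[Point]] = {}
for p in points:
    partitions.setdefault(partition_id(tree, p), []).append(p)
```

Because the tree is built from a sample, the full dataset never has to be sorted in one place, and nearby points tend to land in the same partition.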
25. Spatial Partitioning
- Uniform grids, Quad-Tree, KDB-Tree, R-Tree, Voronoi
diagram, Hilbert curve
Xie, Dong, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. "Simba: Efficient in-memory spatial analytics." In Proceedings of the 2016 International Conference on Management of Data, pp. 1071-1085. ACM, 2016.
26. Spatial Partitioning + Indexing
- Distributed spatial indexing
- Global index - same tree in partitioning - bounding boxes
- Local index
Driver
27. Spatial Partitioning + Indexing
- Distributed hierarchical spatial indexing
- Global index - same tree in partitioning - bounding boxes
- Local index
(Diagram: one driver and three executors)
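A rough sketch of the two-level idea follows, assuming the global index is a set of partition bounding boxes and each partition keeps its own local index. Everything here is hypothetical and runs in a single process: the `LocalIndex` class is a sorted-by-x list standing in for a per-partition R-Tree or Quad-Tree, and the flat dictionary of boxes stands in for the partitioning tree.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bbox(points: List[Point]) -> Box:
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class LocalIndex:
    """Stand-in for a per-partition index: points kept sorted by x."""

    def __init__(self, points: List[Point]) -> None:
        self.points = sorted(points)
        self.xs = [p[0] for p in self.points]

    def query(self, w: Box) -> List[Point]:
        lo = bisect_left(self.xs, w[0])    # first point with x >= min_x
        hi = bisect_right(self.xs, w[2])   # one past the last point with x <= max_x
        return [p for p in self.points[lo:hi] if w[1] <= p[1] <= w[3]]


# Hypothetical partitions, as produced by a spatial partitioner.
partitions: Dict[int, List[Point]] = {
    0: [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)],
    1: [(11.0, 1.0), (12.0, 4.0), (14.0, 2.0)],
    2: [(1.0, 11.0), (3.0, 14.0), (4.0, 12.0)],
}

# Global index: one bounding box per partition (small enough to keep centrally).
global_index = {pid: bbox(pts) for pid, pts in partitions.items()}
# Local indexes: one per partition (would live next to the data).
local_indexes = {pid: LocalIndex(pts) for pid, pts in partitions.items()}


def range_query(window: Box):
    # Global step: discard partitions whose bounding box misses the window.
    candidates = [pid for pid, box in global_index.items() if overlaps(box, window)]
    # Local step: only the surviving partitions search their own index.
    hits: List[Point] = []
    for pid in candidates:
        hits.extend(local_indexes[pid].query(window))
    return candidates, hits
```

For example, `range_query((0.0, 0.0, 5.0, 5.0))` touches only partition 0; the other two are pruned by the global step before any local work happens.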
28. What is H3?
- Geospatial indexing system, a multi-precision hexagonal tiling of the sphere indexed with hierarchical linear indexes
- Created at Uber, open-sourced
https://h3geo.org/
29. Why H3?
- Geospatial analysis can be done by bucketing locations
- Equidistant
- Traversal, neighboring, truncation
- Polyfill (region)
- Unidirectional edge
https://eng.uber.com/h3/
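The bucketing and equidistance points can be illustrated with a flat hexagonal grid in pure Python. This is deliberately not H3: H3 tiles the sphere and has its own index format and library, while this sketch uses planar axial coordinates with standard hex-grid formulas. The function names, `SIZE`, and the sample locations are assumptions for the example.

```python
import math
from collections import Counter
from typing import List, Tuple

Cell = Tuple[int, int]  # axial coordinates (q, r) of a pointy-top hexagon
SIZE = 1.0              # hexagon circumradius, in the same units as x and y


def to_cell(x: float, y: float, size: float = SIZE) -> Cell:
    """Bucket a planar location into the hexagon that contains it."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Round in cube coordinates (cx + cy + cz == 0), fixing the worst axis.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)


def center(cell: Cell, size: float = SIZE) -> Tuple[float, float]:
    q, r = cell
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def neighbors(cell: Cell) -> List[Cell]:
    q, r = cell
    return [(q + 1, r), (q - 1, r), (q, r + 1),
            (q, r - 1), (q + 1, r - 1), (q - 1, r + 1)]


# Bucketing: count locations per cell instead of comparing raw coordinates.
locations = [(0.1, 0.1), (0.2, -0.1), (5.0, 5.0)]
counts = Counter(to_cell(x, y) for x, y in locations)

# Equidistance: all six neighbor centers are the same distance away.
origin = center((0, 0))
distances = [math.dist(origin, center(n)) for n in neighbors((0, 0))]
```

With square grid cells, diagonal neighbors are farther away than edge neighbors; with hexagons every neighbor center is at the same distance, which simplifies traversal and smoothing over neighboring cells.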