2. Page2
Agenda
• Geospatial Analytics
• Geospatial Data Formats
• Challenges
• Magellan
• Spark SQL and Catalyst: An Intro
• How does Magellan use Spark SQL?
• Demo
• Q & A
3. Page3
Geospatial Context is useful
Where do people go on weekends?
Do usage patterns change with time?
Can we predict a user's drop-off point?
Can we predict where the next pick-up is likely to occur?
Identify crime hotspots
How do these hotspots evolve over time?
Predict the likelihood of crime in a given neighborhood
Predict climate at a fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: join the climate dataset with the crimes dataset
5. Page5
Obscure Data Formats
• ESRI Shapefile format
–SHP
–SHX
–DBF (dBase file format; very few legitimate parsers exist)
• GeoJSON
• Open source parsers exist (e.g., Esri's GIS Tools for Hadoop)
6. Page6
Why do we need a proper framework?
• No standardized way of dealing with data and metadata
• Esoteric Data Formats and Coordinate Systems
• No optimizations for efficient joins
• APIs are too low level
• Language integration simplifies exploratory analytics
• Commonly used algorithms can be made available at scale
–Map matching
–Geohash indices
–Markov models
7. Page7
What do we want to support?
• Parse geospatial data and metadata into Shapes + Metadata Map
• Python and Scala support
• Geometric Queries
–efficiently!
–simple and intuitive syntax!
• Scalable implementations of common algorithms
–Map Matching
–Geohash Indexing
–Spatial Joins
8. Page8
Where are we at?
• Magellan available on Github (https://github.com/harsha2010/magellan)
• Can parse and understand the most widely used formats
–GeoJSON, ESRI Shapefile
• All geometries supported
• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
• Broadcast join available for common scenarios
• Work in progress (targeted 1.0.4)
–Geohash Join optimization
–Map Matching algorithm using Markov Models
• Python and Scala support
• Please give it a try and give us feedback!
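The broadcast join mentioned above follows a simple idea: ship the small side (e.g., a table of region polygons) to every partition and probe it locally, so the large point table is never shuffled. A toy, Spark-free Python sketch with made-up data, using bounding boxes as a stand-in for full polygon tests:

```python
# Sketch of the broadcast-join idea: the small polygon table is shipped
# to every worker, and each point is tested against all candidates
# locally -- no shuffle of the large point table is needed.
# Bounding boxes stand in for real polygon geometry here.

neighborhoods = {              # small side: broadcast to every worker
    "downtown": (0, 0, 5, 5),  # (xmin, ymin, xmax, ymax)
    "harbor":   (5, 0, 9, 3),
}

def contains(bbox, point):
    xmin, ymin, xmax, ymax = bbox
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax

def broadcast_join(points, small_table):
    """Emulate the per-partition map over the big side."""
    out = []
    for pid, point in points:
        for name, bbox in small_table.items():
            if contains(bbox, point):
                out.append((pid, name))
    return out

points = [("a", (1, 1)), ("b", (7, 2)), ("c", (20, 20))]
print(broadcast_join(points, neighborhoods))
# [('a', 'downtown'), ('b', 'harbor')]
```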
15. Page15
Magellan
• Basic Abstraction in terms of Shape
–Point, PolyLine, Polygon
–Supports multiple implementations (currently uses the ESRI Java API)
• SQL Data Type = Shape
–Efficient: Construct once and use
• Operations supported as SQL operations
–within, intersects, contains etc.
• Allows efficient Join implementations using Catalyst
–Broadcast join already available
–Geohash based join algorithm in progress
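As an illustration of what a `within` predicate computes, here is a minimal, dependency-free point-in-polygon test using the classic ray-casting algorithm (a sketch only; Magellan itself delegates geometry to the ESRI Java API):

```python
def point_in_polygon(px, py, vertices):
    """Ray-casting point-in-polygon test.

    vertices: list of (x, y) tuples describing a closed ring.
    Casts a horizontal ray to the right of (px, py) and counts
    edge crossings; an odd count means the point is inside.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge straddle the ray's y-coordinate?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses y = py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True  (inside)
print(point_in_polygon(5, 2, square))  # False (outside)
```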
21. Page21
Why Spark?
• DataFrames
–Intuitive manipulation of distributed structured data
• Catalyst Optimizer
–Pushes predicates down to the data source, enabling optimized filters
• Memory Optimized Execution Engine
22. Page22
The Spark ecosystem
[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX built on top of Spark Core, the distributed compute engine]
• Speed, ease of use and fast prototyping
• Open source
• Powerful abstractions
• Python, R, Scala, Java support
23. Page23
Spark DataFrames are intuitive
[Diagram: an RDD of opaque records vs. a DataFrame with named, typed columns]
dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
27. Page27
Rows and Data Types
• Standard SQL Data Types
–Date, Int, Long, String, etc
• Complex Data Types
–Array, Map, Struct, etc
• Custom Data Types
• Row = Collection of Data Types
–Represents a single row
28. Page28
Expressions
• Literals
• Arithmetic Expressions
–maxOf, unaryMinus
• Predicates
–Not, and, in, case when
• Cast
• String Expressions
–substring, like, startsWith
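Catalyst represents each of these as a node in an expression tree that can be evaluated against a row. A toy Python sketch in that spirit (class names are illustrative, not Catalyst's actual classes):

```python
# Toy expression tree in the spirit of Catalyst. Class names are
# illustrative only; Catalyst's real nodes live in Scala.

class Literal:
    def __init__(self, value): self.value = value
    def eval(self, row): return self.value

class Attr:
    def __init__(self, name): self.name = name
    def eval(self, row): return row[self.name]

class GreaterThan:
    def __init__(self, left, right): self.left, self.right = left, right
    def eval(self, row): return self.left.eval(row) > self.right.eval(row)

class And:
    def __init__(self, left, right): self.left, self.right = left, right
    def eval(self, row): return self.left.eval(row) and self.right.eval(row)

class StartsWith:
    def __init__(self, attr, prefix): self.attr, self.prefix = attr, prefix
    def eval(self, row):
        return self.attr.eval(row).startswith(self.prefix.eval(row))

# age > 40 AND name startsWith "A"
expr = And(GreaterThan(Attr("age"), Literal(40)),
           StartsWith(Attr("name"), Literal("A")))
print(expr.eval({"age": 54, "name": "A Turing"}))  # True
print(expr.eval({"age": 43, "name": "B Jones"}))   # False
```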
30. Page30
Execution Engine
• Data Sources to read data into Data Frames
–Supports extending pushdowns to data store
• Optimized in memory layout
–ORC, Tungsten etc.
• Spark Strategies
–Convert logical plan -> physical plan
–Rules based on statistics
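The logical-to-physical conversion can be pictured as pattern-matching rules over the plan tree. A hypothetical Python sketch of one such strategy, collapsing a Filter over a Scan into a scan with the predicate pushed down (the names are made up, not Spark's actual classes):

```python
# Illustrative sketch of a rules-based planner: a logical Filter(Scan)
# plan is rewritten into a physical scan that pushes the predicate down
# to the data source. All class names here are invented.

class Scan:
    def __init__(self, rows): self.rows = rows

class Filter:
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child

class PushedScan:  # physical node: the source applies the filter itself
    def __init__(self, rows, predicate):
        self.rows, self.predicate = rows, predicate
    def execute(self):
        return [r for r in self.rows if self.predicate(r)]

def plan(logical):
    """One strategy: collapse Filter(Scan) into a pushed-down scan."""
    if isinstance(logical, Filter) and isinstance(logical.child, Scan):
        return PushedScan(logical.child.rows, logical.predicate)
    if isinstance(logical, Scan):
        return PushedScan(logical.rows, lambda r: True)
    raise NotImplementedError

rows = [{"dept": "Bio", "age": 48}, {"dept": "CS", "age": 54}]
physical = plan(Filter(lambda r: r["age"] > 50, Scan(rows)))
print(physical.execute())  # [{'dept': 'CS', 'age': 54}]
```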
31. Page31
How does Magellan use Catalyst?
• Custom Data Source
–Parses GeoJSON, ESRI Shapefile, etc. into (Shape, Metadata) pairs
–Returns a DataFrame with columns (point, polygon, polyline, metadata)
–Overrides PrunedFilteredScan
–Outputs Shape instances
• Custom Data Type
–Point, Polygon, Polyline instances of Shape
–Each Shape has a Python counterpart.
–Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
• Magellan Context
–Overrides Spark Planner allowing custom join implementations
• Python wrappers
36. Page36
• Adds Custom Python Data Types
–Point, PolyLine, Polygon wrappers around Scala Data Types
• Wraps coordinate transformations and expressions
• Custom Picklers and Unpicklers
–Serialize to and from Scala
37. Page37
Future Work
• Geohash Indices
• Spatial Join Optimization
• Map Matching Algorithms
• Improving pyspark bindings
41. Page41
Map Matching
• Given a sequence of points (e.g., representing a trip), what road path was taken?
• Challenges
–Errors in GPS measurements
–Errors in coordinate projections
–Time gaps between measurements: we cannot simply snap each point to the nearest road
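Map matching is commonly framed as a hidden Markov model: hidden states are road segments, observations are noisy GPS fixes, and transitions favor staying on connected roads; the Viterbi algorithm then recovers the most likely road sequence. A toy one-dimensional Python sketch with illustrative probabilities (not Magellan's implementation):

```python
import math

# Toy HMM map matcher: hidden states are road segments, observations
# are noisy GPS fixes. Roads are points on a 1-D line for brevity,
# and all probabilities below are made up for illustration.

roads = {"A": 0.0, "B": 5.0, "C": 10.0}      # segment "positions"
transitions = {                               # connected segments favored
    "A": {"A": 0.6, "B": 0.4, "C": 0.0},
    "B": {"A": 0.2, "B": 0.6, "C": 0.2},
    "C": {"A": 0.0, "B": 0.4, "C": 0.6},
}

def emission(road, gps):
    """Gaussian noise model around the segment position (sigma = 2)."""
    return math.exp(-((roads[road] - gps) ** 2) / (2 * 2.0 ** 2))

def viterbi(observations):
    """Most likely segment sequence for a sequence of GPS fixes."""
    prev = {r: emission(r, observations[0]) for r in roads}
    path = {r: [r] for r in roads}
    for gps in observations[1:]:
        cur, new_path = {}, {}
        for r in roads:
            best = max(prev, key=lambda p: prev[p] * transitions[p][r])
            cur[r] = prev[best] * transitions[best][r] * emission(r, gps)
            new_path[r] = path[best] + [r]
        prev, path = cur, new_path
    return path[max(prev, key=prev.get)]

print(viterbi([0.4, 4.1, 6.0, 9.7]))  # ['A', 'B', 'B', 'C']
```

Note how the transition model keeps the trajectory on segment B at the third fix instead of jumping straight to C, which is exactly what naive nearest-road snapping gets wrong.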