Apache con big data 2015 magellan

Magellan: Geospatial Analytics on Spark
Ram Sriharsha
Spark and Data Science Architect,
Hortonworks

Agenda
• Geospatial Analytics
• Geospatial Data Formats
• Challenges
• Magellan
• Spark SQL and Catalyst: An Intro
• How does Magellan use Spark SQL?
• Demo
• Q & A

Geospatial Context is useful
Where do people go on weekends?
Does usage pattern change with time?
Predict the drop off point of a user?
Predict the location where next pick up can be expected?
Identify crime hotspots
How do these hotspots evolve with time?
Predict the likelihood of crime occurring at a given neighborhood
Predict climate at fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: Join climate dataset with Crimes

Geospatial Data is pervasive

Obscure Data Formats
• ESRI Shapefile format
–SHP
–SHX
–DBX (Dbase File Format, very few legitimate parsers exist)
• GeoJSON
• Open Source Parsers exist (GIS 4 Hadoop)

Why do we need a proper framework?
• No standardized way of dealing with data and metadata
• Esoteric Data Formats and Coordinate Systems
• No optimizations for efficient joins
• APIs are too low level
• Language integration simplifies exploratory analytics
• Commonly used algorithms can be made available at scale
–Map matching
–Geohash indices
–Markov models

What do we want to support?
• Parse geospatial data and metadata into Shapes + Metadata Map
• Python and Scala support
• Geometric Queries
–efficiently!
–simple and intuitive syntax!
• Scalable implementations of common algorithms
–Map Matching
–Geohash Indexing
–Spatial Joins

Where are we at?
• Magellan available on Github (https://github.com/harsha2010/magellan)
• Can parse and understand most widely used formats
–GeoJSON, ESRIShapefile
• All geometries supported
• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
• Broadcast join available for common scenarios
• Work in progress (targeted 1.0.4)
–Geohash Join optimization
–Map Matching algorithm using Markov Models
• Python and Scala support
• Please give it a try and give us feedback!

Geospatial Data Structures

Operations
• intersection
• union
• Symmetrical difference
• Difference
• Convex hull

Queries
• contains (within)
• Covers (covered-by)
• intersects
• touches

Geospatial Queries
Is a triplet of points (A, B, C) Clockwise
or Counter Clockwise ordered?

Geospatial Queries
Ray Tracing Algorithm:
Draw a ray from point and count
the # of times it intersects polygon

Accelerating Spatial Queries
• Bounding Box
• Geohashing
• R-Tree (and other) indices

Magellan
• Basic Abstraction in terms of Shape
–Point, PolyLine, Polygon
–Supports multiple implementations (currently uses ESRI-Java-API)
• SQL Data Type = Shape
–Efficient: Construct once and use
• Operations supported as SQL operations
–within, intersects, contains etc.
• Allows efficient Join implementations using Catalyst
–Broadcast join already available
–Geohash based join algorithm in progress

Why spark?
• DataFrames
–Intuitive manipulation of distributed structured data
• Catalyst Optimizer
–Push predicates to Data Source, allows optimized filters
• Memory Optimized Execution Engine

The Spark ecosystem
Spark Core
Spark
SQL
Spark
Streaming
ML-Lib GraphX
Distributed Compute Engine
• Speed, ease of use and fast prototyping
• Open source
• Powerful abstractions
• Python, R , Scala, Java support

Spark DataFrames are intuitive
RDD
DataFrame
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61

Spark DataFrames are fast!

Spark SQL under the covers

Catalyst
• Rows, Data Types
• Expressions
• Operators
• Rules
• Optimization

Rows and Data Types
• Standard SQL Data Types
–Date, Int, Long, String, etc
• Complex Data Types
–Array, Map, Struct, etc
• Custom Data Types
• Row = Collection of Data Types
–Represents a single row

Expressions
• Literals
• Arithmetic Expressions
–maxOf, unaryMinus
• Predicates
–Not, and, in, case when
• Cast
• String Expressions
–substring, like, startsWith

Optimizations
• Constant Folding
• Predicate Pushdown
–Combine Filters
–Push Predicate Through Join
• Null Propagation
• Boolean Simplification

Execution Engine
• Data Sources to read data into Data Frames
–Supports extending pushdowns to data store
• Optimized in memory layout
–ORC, Tungsten etc.
• Spark Strategies
–Convert logical plan -> physical plan
–Rules based on statistics

How does Magellan use Catalyst?
• Custom Data Source
–Parses GeoJSON, ESRIShapefile etc into (Shape, Metadata) pairs
–Returns a DataFrame with columns (point, polygon, polyline, metadata)
–Overrides PrunedFilteredScan
–Outputs Shape instances
• Custom Data Type
–Point, Polygon, Polyline instances of Shape
–Each Shape has a python counterpart.
–Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
• Magellan Context
–Overrides Spark Planner allowing custom join implementations
• Python wrappers

Leveraging Data Sources

Leveraging Catalyst
Binary Expression

Leveraging Catalyst
Spatial Join + Predicate Pushdown

• Adds Custom Python Data Types
–Point, PolyLine, Polygon wrappers around Scala Data Types
• Wraps coordinate transformations and expressions
• Custom Picklers and Unpicklers
–Serialize to and from Scala

Future Work
• Geohash Indices
• Spatial Join Optimization
• Map Matching Algorithms
• Improving pyspark bindings

Map Matching
• Given sequence of points (representing a trip?) what was the road path taken?
• Challenges
–Error in GPS measurements
–Error in coordinate projections
–Time Gap between measurements, cannot just snap to nearest road

Apache con big data 2015 magellan

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Apache con big data 2015 magellan

Semelhante a Apache con big data 2015 magellan (20)

Último

Último (20)

Apache con big data 2015 magellan

Notas do Editor