SlideShare uma empresa Scribd logo
1 de 42
Page1
Magellan: Geospatial Analytics on Spark
Ram Sriharsha
Spark and Data Science Architect,
Hortonworks
Page2
Agenda
• Geospatial Analytics
• Geospatial Data Formats
• Challenges
• Magellan
• Spark SQL and Catalyst: An Intro
• How does Magellan use Spark SQL?
• Demo
• Q & A
Page3
Geospatial Context is useful
Where do people go on weekends?
Does usage pattern change with time?
Predict the drop off point of a user?
Predict the location where next pick up can be expected?
Identify crime hotspots
How do these hotspots evolve with time?
Predict the likelihood of crime occurring at a given neighborhood
Predict climate at fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: Join climate dataset with Crimes
Page4
Geospatial Data is pervasive
Page5
Obscure Data Formats
• ESRI Shapefile format
–SHP
–SHX
–DBX (Dbase File Format, very few legitimate parsers exist)
• GeoJSON
• Open Source Parsers exist (GIS 4 Hadoop)
Page6
Why do we need a proper framework?
• No standardized way of dealing with data and metadata
• Esoteric Data Formats and Coordinate Systems
• No optimizations for efficient joins
• APIs are too low level
• Language integration simplifies exploratory analytics
• Commonly used algorithms can be made available at scale
–Map matching
–Geohash indices
–Markov models
Page7
What do we want to support?
• Parse geospatial data and metadata into Shapes + Metadata Map
• Python and Scala support
• Geometric Queries
–efficiently!
–simple and intuitive syntax!
• Scalable implementations of common algorithms
–Map Matching
–Geohash Indexing
–Spatial Joins
Page8
Where are we at?
• Magellan available on Github (https://github.com/harsha2010/magellan)
• Can parse and understand most widely used formats
–GeoJSON, ESRIShapefile
• All geometries supported
• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
• Broadcast join available for common scenarios
• Work in progress (targeted 1.0.4)
–Geohash Join optimization
–Map Matching algorithm using Markov Models
• Python and Scala support
• Please give it a try and give us feedback!
Page9
Geospatial Data Structures
Page10
Operations
• intersection
• union
• Symmetrical difference
• Difference
• Convex hull
Page11
Queries
• contains (within)
• Covers (covered-by)
• intersects
• touches
Page12
Geospatial Queries
Is a triplet of points (A, B, C) Clockwise
or Counter Clockwise ordered?
Page13
Geospatial Queries
Ray Tracing Algorithm:
Draw a ray from point and count
the # of times it intersects polygon
Page14
Accelerating Spatial Queries
• Bounding Box
• Geohashing
• R-Tree (and other) indices
Page15
Magellan
• Basic Abstraction in terms of Shape
–Point, PolyLine, Polygon
–Supports multiple implementations (currently uses ESRI-Java-API)
• SQL Data Type = Shape
–Efficient: Construct once and use
• Operations supported as SQL operations
–within, intersects, contains etc.
• Allows efficient Join implementations using Catalyst
–Broadcast join already available
–Geohash based join algorithm in progress
Page16
load
Page17
query metadata
Page18
geometric query
Page19
Python API
Page20
Python API, join
Page21
Why spark?
• DataFrames
–Intuitive manipulation of distributed structured data
• Catalyst Optimizer
–Push predicates to Data Source, allows optimized filters
• Memory Optimized Execution Engine
Page22
The Spark ecosystem
Spark Core
Spark
SQL
Spark
Streaming
ML-Lib GraphX
Distributed Compute Engine
• Speed, ease of use and fast prototyping
• Open source
• Powerful abstractions
• Python, R , Scala, Java support
Page23
Spark DataFrames are intuitive
RDD
DataFrame
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61
Page24
Spark DataFrames are fast!
Page25
Spark SQL under the covers
Page26
Catalyst
• Rows, Data Types
• Expressions
• Operators
• Rules
• Optimization
Page27
Rows and Data Types
• Standard SQL Data Types
–Date, Int, Long, String, etc
• Complex Data Types
–Array, Map, Struct, etc
• Custom Data Types
• Row = Collection of Data Types
–Represents a single row
Page28
Expressions
• Literals
• Arithmetic Expressions
–maxOf, unaryMinus
• Predicates
–Not, and, in, case when
• Cast
• String Expressions
–substring, like, startsWith
Page29
Optimizations
• Constant Folding
• Predicate Pushdown
–Combine Filters
–Push Predicate Through Join
• Null Propagation
• Boolean Simplification
Page30
Execution Engine
• Data Sources to read data into Data Frames
–Supports extending pushdowns to data store
• Optimized in memory layout
–ORC, Tungsten etc.
• Spark Strategies
–Convert logical plan -> physical plan
–Rules based on statistics
Page31
How does Magellan use Catalyst?
• Custom Data Source
–Parses GeoJSON, ESRIShapefile etc into (Shape, Metadata) pairs
–Returns a DataFrame with columns (point, polygon, polyline, metadata)
–Overrides PrunedFilteredScan
–Outputs Shape instances
• Custom Data Type
–Point, Polygon, Polyline instances of Shape
–Each Shape has a python counterpart.
–Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
• Magellan Context
–Overrides Spark Planner allowing custom join implementations
• Python wrappers
Page32
Leveraging Data Sources
Page33
Leveraging Catalyst
Binary Expression
Page34
Leveraging Catalyst
Spatial Join + Predicate Pushdown
Page35
Python Bindings
Page36
• Adds Custom Python Data Types
–Point, PolyLine, Polygon wrappers around Scala Data Types
• Wraps coordinate transformations and expressions
• Custom Picklers and Unpicklers
–Serialize to and from Scala
Page37
Future Work
• Geohash Indices
• Spatial Join Optimization
• Map Matching Algorithms
• Improving pyspark bindings
Page38
geohashing
Page39
Base 32 encoding
Page40
geohashing
Page41
Map Matching
• Given sequence of points (representing a trip?) what was the road path taken?
• Challenges
–Error in GPS measurements
–Error in coordinate projections
–Time Gap between measurements, cannot just snap to nearest road
Page42
Demo
Uber use case?

Mais conteúdo relacionado

Mais procurados

Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Ram Sriharsha
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Spark Summit
 
Large-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkLarge-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkViet-Trung TRAN
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Viet-Trung TRAN
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Large Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkLarge Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkDatabricks
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks
 
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...Databricks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLMLconf
 
Data science on big data. Pragmatic approach
Data science on big data. Pragmatic approachData science on big data. Pragmatic approach
Data science on big data. Pragmatic approachPavel Mezentsev
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Spark Summit
 

Mais procurados (20)

Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Large-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkLarge-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on Spark
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
 
Apache Giraph
Apache GiraphApache Giraph
Apache Giraph
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Large Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkLarge Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache Spark
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL
 
Data science on big data. Pragmatic approach
Data science on big data. Pragmatic approachData science on big data. Pragmatic approach
Data science on big data. Pragmatic approach
 
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
Dynamic Community Detection for Large-scale e-Commerce data with Spark Stream...
 

Semelhante a Apache con big data 2015 magellan

Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaSpark Summit
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...Geoffrey Fox
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Efficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search EnginesEfficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search EnginesYen-Yu Chen
 
Evolution of the Graph Schema
Evolution of the Graph SchemaEvolution of the Graph Schema
Evolution of the Graph SchemaJoshua Shinavier
 
Project Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social NetworksProject Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social Networksamirhhz
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillMapR Technologies
 
Gis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsGis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsAhmad Jawwad
 
AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017Kisung Kim
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Databricks
 
Esta ld -exploring-spatio-temporal-linked-statistical-data
Esta ld -exploring-spatio-temporal-linked-statistical-dataEsta ld -exploring-spatio-temporal-linked-statistical-data
Esta ld -exploring-spatio-temporal-linked-statistical-datageoknow
 
ESTA-LD exploring spatio-temporal linked statistical data
ESTA-LD exploring spatio-temporal linked statistical dataESTA-LD exploring spatio-temporal linked statistical data
ESTA-LD exploring spatio-temporal linked statistical datageoknow
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Ontico
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 

Semelhante a Apache con big data 2015 magellan (20)

Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
Semantics-enhanced Geoscience Interoperability, Analytics, and ApplicationsSemantics-enhanced Geoscience Interoperability, Analytics, and Applications
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Efficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search EnginesEfficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search Engines
 
Evolution of the Graph Schema
Evolution of the Graph SchemaEvolution of the Graph Schema
Evolution of the Graph Schema
 
Project Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social NetworksProject Progress Report - Recommender Systems for Social Networks
Project Progress Report - Recommender Systems for Social Networks
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Gis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsGis capabilities on Big Data Systems
Gis capabilities on Big Data Systems
 
Day 6 - PostGIS
Day 6 - PostGISDay 6 - PostGIS
Day 6 - PostGIS
 
AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017
 
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...
 
Openstreetmap
OpenstreetmapOpenstreetmap
Openstreetmap
 
Esta ld -exploring-spatio-temporal-linked-statistical-data
Esta ld -exploring-spatio-temporal-linked-statistical-dataEsta ld -exploring-spatio-temporal-linked-statistical-data
Esta ld -exploring-spatio-temporal-linked-statistical-data
 
ESTA-LD exploring spatio-temporal linked statistical data
ESTA-LD exploring spatio-temporal linked statistical dataESTA-LD exploring spatio-temporal linked statistical data
ESTA-LD exploring spatio-temporal linked statistical data
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Data structures
Data structuresData structures
Data structures
 

Último

Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Último (20)

Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Apache con big data 2015 magellan

  • 1. Page1 Magellan: Geospatial Analytics on Spark Ram Sriharsha Spark and Data Science Architect, Hortonworks
  • 2. Page2 Agenda • Geospatial Analytics • Geospatial Data Formats • Challenges • Magellan • Spark SQL and Catalyst: An Intro • How does Magellan use Spark SQL? • Demo • Q & A
  • 3. Page3 Geospatial Context is useful Where do people go on weekends? Does usage pattern change with time? Predict the drop off point of a user? Predict the location where next pick up can be expected? Identify crime hotspots How do these hotspots evolve with time? Predict the likelihood of crime occurring at a given neighborhood Predict climate at fairly granular level Climate insurance: do I need to buy insurance for my crops? Climate as a factor in crime: Join climate dataset with Crimes
  • 5. Page5 Obscure Data Formats • ESRI Shapefile format –SHP –SHX –DBX (Dbase File Format, very few legitimate parsers exist) • GeoJSON • Open Source Parsers exist (GIS 4 Hadoop)
  • 6. Page6 Why do we need a proper framework? • No standardized way of dealing with data and metadata • Esoteric Data Formats and Coordinate Systems • No optimizations for efficient joins • APIs are too low level • Language integration simplifies exploratory analytics • Commonly used algorithms can be made available at scale –Map matching –Geohash indices –Markov models
  • 7. Page7 What do we want to support? • Parse geospatial data and metadata into Shapes + Metadata Map • Python and Scala support • Geometric Queries –efficiently! –simple and intuitive syntax! • Scalable implementations of common algorithms –Map Matching –Geohash Indexing –Spatial Joins
  • 8. Page8 Where are we at? • Magellan available on Github (https://github.com/harsha2010/magellan) • Can parse and understand most widely used formats –GeoJSON, ESRIShapefile • All geometries supported • 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan) • Broadcast join available for common scenarios • Work in progress (targeted 1.0.4) –Geohash Join optimization –Map Matching algorithm using Markov Models • Python and Scala support • Please give it a try and give us feedback!
  • 10. Page10 Operations • intersection • union • Symmetrical difference • Difference • Convex hull
  • 11. Page11 Queries • contains (within) • Covers (covered-by) • intersects • touches
  • 12. Page12 Geospatial Queries Is a triplet of points (A, B, C) Clockwise or Counter Clockwise ordered?
  • 13. Page13 Geospatial Queries Ray Tracing Algorithm: Draw a ray from point and count the # of times it intersects polygon
  • 14. Page14 Accelerating Spatial Queries • Bounding Box • Geohashing • R-Tree (and other) indices
  • 15. Page15 Magellan • Basic Abstraction in terms of Shape –Point, PolyLine, Polygon –Supports multiple implementations (currently uses ESRI-Java-API) • SQL Data Type = Shape –Efficient: Construct once and use • Operations supported as SQL operations –within, intersects, contains etc. • Allows efficient Join implementations using Catalyst –Broadcast join already available –Geohash based join algorithm in progress
  • 21. Page21 Why spark? • DataFrames –Intuitive manipulation of distributed structured data • Catalyst Optimizer –Push predicates to Data Source, allows optimized filters • Memory Optimized Execution Engine
  • 22. Page22 The Spark ecosystem Spark Core Spark SQL Spark Streaming ML-Lib GraphX Distributed Compute Engine • Speed, ease of use and fast prototyping • Open source • Powerful abstractions • Python, R , Scala, Java support
  • 23. Page23 Spark DataFrames are intuitive RDD DataFrame dept name age Bio H Smith 48 CS A Turing 54 Bio B Jones 43 Phys E Witten 61
  • 26. Page26 Catalyst • Rows, Data Types • Expressions • Operators • Rules • Optimization
  • 27. Page27 Rows and Data Types • Standard SQL Data Types –Date, Int, Long, String, etc • Complex Data Types –Array, Map, Struct, etc • Custom Data Types • Row = Collection of Data Types –Represents a single row
  • 28. Page28 Expressions • Literals • Arithmetic Expressions –maxOf, unaryMinus • Predicates –Not, and, in, case when • Cast • String Expressions –substring, like, startsWith
  • 29. Page29 Optimizations • Constant Folding • Predicate Pushdown –Combine Filters –Push Predicate Through Join • Null Propagation • Boolean Simplification
  • 30. Page30 Execution Engine • Data Sources to read data into Data Frames –Supports extending pushdowns to data store • Optimized in memory layout –ORC, Tungsten etc. • Spark Strategies –Convert logical plan -> physical plan –Rules based on statistics
  • 31. Page31 How does Magellan use Catalyst? • Custom Data Source –Parses GeoJSON, ESRIShapefile etc into (Shape, Metadata) pairs –Returns a DataFrame with columns (point, polygon, polyline, metadata) –Overrides PrunedFilteredScan –Outputs Shape instances • Custom Data Type –Point, Polygon, Polyline instances of Shape –Each Shape has a python counterpart. –Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back) • Magellan Context –Overrides Spark Planner allowing custom join implementations • Python wrappers
  • 36. Page36 • Adds Custom Python Data Types –Point, PolyLine, Polygon wrappers around Scala Data Types • Wraps coordinate transformations and expressions • Custom Picklers and Unpicklers –Serialize to and from Scala
  • 37. Page37 Future Work • Geohash Indices • Spatial Join Optimization • Map Matching Algorithms • Improving pyspark bindings
  • 41. Page41 Map Matching • Given sequence of points (representing a trip?) what was the road path taken? • Challenges –Error in GPS measurements –Error in coordinate projections –Time Gap between measurements, cannot just snap to nearest road

Notas do Editor

  1. Point in Polygon Convex Hull Does Shape X (intersect) Shape Y Contain cover touch Buffer Shape X by a certain amount to produce Y
  2. Explain each component at high level