SlideShare uma empresa Scribd logo
1 de 36
Baixar para ler offline
Magellan – Spark as a
Geospatial Analytics Engine
Ram Sriharsha
Hortonworks
Who Am I?
• Apache Spark PMC Member
• Hortonworks Architect, Spark + Data Science
• Prior to HWX, Principal Research Scientist @
Yahoo Labs (Large Scale Machine Learning)
– Login Risk Detection, Sponsored Search Click
Prediction, Advertising Effectiveness Models,
Online Learning, …
What is Geospatial Analytics?
How do pickup/ dropoff neighborhood
hotspots evolve with time?
Correct GPS errors with more
Accurate landmark measurements
Incorporate location in IR and search
advertising
Do we need one more library?
• Spatial Analytics at scale is challenging
– Simplicity + Scalability = Hard
• Ancient Data Formats
– metadata, indexing not handled well, inefficient storage
• Geospatial Analytics is not simply BI anymore
– Statistical + Machine Learning being leveraged in geospatial
• Now is the time to do it!
– Explosion of mobile data
– Finer granularity of data collection for geometries
– Analytics stretching the limits of traditional approaches
– Spark SQL + Catalyst + Tungsten makes extensible SQL engines easier
than ever before!
Introduction to Magellan
Introduction to Magellan
Introduction to Magellan
Shapefiles
*.shp
*.dbf
sqlContext.read.format(“magellan”).l
oad(${neighborhoods.path})
GeoJSON
*.json
sqlContext.read.format(“magellan”)
.option(“type”, “geojson”)
.load(${neighborhoods.path})
Introduction to Magellan
neighborhoods.filter(
point(-122.4111659, 37.8003388)
within
‘polygon
).show()
Shape literal
Boolean Expression
Introduction to Magellan
points.join(neighborhoods).
where(‘point within ‘polygon).
show()
Introduction to Magellan
neighborhoods.filter(
point(-122.4111659, 37.8003388).buffer(0.1)
intersects
‘polygon
).show()
‘point within ‘polygon
the join
• Inherits all join optimizations from Spark
SQL
– if neighborhoods table is small, Broadcast
Cartesian Join
– else Cartesian Join
Status
• Magellan 1.0.3 available as Spark Package.
• Scala
• Spark 1.4
• Spark Package: Magellan
• Github: https://github.com/harsha2010/magellan
• Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in-
spark/
• Notebook example: http://bit.ly/1GwLyrV
• Input Formats: ESRI Shapefile, GeoJSON, OSM-XML
• Please try it out and give feedback!
What is next?
• Magellan 1.0.4
– Spark 1.6
– Python
– Spatial Join
– Persistent Indices
– Better leverage Tungsten via codegen + memory
layout optimizations
– More operators: buffer, distance, area etc.
the join revisited
• What is the time complexity?
– m points, n polygons (assume average k edges)
– l partitions
– O(mn/l) computations of ‘point within ‘polygon
– O(ml) communication cost
– Each ‘point within ‘polygon costs O(k)
– Total cost = O(ml) + O(mnk/l)
– O(m√n√k) cost, with O(√n√k) partitions
Optimization?
• Do we need to send every point to every
partition?
• Do we need to compute ‘point in
‘neighborhood for each neighborhood
within a given partition?
2d indices
• Quad Tree
• R Tree
• Dimensional Reduction
– Hashing
– PCA
– Space Filling Curves
dimensional reduction
• What does a good dimensional reduction
look like?
– (Approximately) preserve nearness
– enable range queries
– No (little) collision
row order curve
snake order curve
z order curve
Binary Representation
Binary Representation
properties
• Locality: Two points differing in a small # of
bits are near each other
– converse not necessarily true!
• Containment
• Efficient construction
• Nice bounds on precision
geohash
• Z order curve with base 32 encoding
• Start with world boundary (-90,90) X (-180,
180) and recursively subdivide based on
precision
encode (-74.009, 40.7401)
• 40.7401 is in [0, 90) => first bit = 1
• 40.7401 is in [0, 45) => second bit = 0
• 40.7401 is in [22.5, 45) => third bit = 1
• …
• do same for longitude
answer = dr5rgb
decode dr5rgb
•Decode from Base 32 -> Binary
– 01100 10111 00101 01111 01010
•lat = 101110001111, long = 0100101111000
•Now decode binary -> decimal.
– latitude starts with 1 => between 0 - 90
– second bit = 0 => between 0 - 45
– third bit = 1 => between 22.5 - 45
– ...
An algorithm to scale join?
• Preprocess points
– For each point compute geohash of precision p covering point
• Preprocess neighborhoods
– For each neighborhood compute geohashes of precision p that
intersect neighborhood.
• Inner join on geohash
• Filter out edge cases
Implementation in Spark SQL
• Override Strategy to define SpatialJoinStrategy
– Logic to decide when to trigger this join
• Only trigger if geospatial queries
• Only trigger if join is complex: if n ~ O(1) then broadcast join is
good enough
– Override BinaryNode to handle the physical execution
plan ourselves
• Override execute(): RDD to execute join and return results
– Stitch it up using Experimental Strategies in
SQLContext
Persistent Indices
• Often, the geometry dataset does not
change… eg. neighborhoods
• Index the dataset once and for all?
Persistent Indices
• Use Magellan to generate spatial indices
– Think of geometry as document, list of geohashes as
words!
• Persist indices to Elastic Search
• Use Magellan Data Source to query indexed ES
data
• Pushdown geometric predicates where possible
– Predicate rewritten to IR query
Overall architecture
Elastic search
Shard Server
Spark Cluster
nbd.filter(
point(…)
within
‘polygon
)
curl –XGET ‘http://…’ –d ‘{
“query” : {
“filtered” : {
“filter” : {
“geohash” : [“dr5rgb”]
}
}
}
}’
THANK YOU.
Twitter: @halfabrane, Github: @harsha2010

Mais conteúdo relacionado

Mais procurados

Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Ram Sriharsha
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Spark Summit
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
Report_SmartSuggest
Report_SmartSuggestReport_SmartSuggest
Report_SmartSuggestJigar Shah
 
AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017Kisung Kim
 
Rabbit Order: Just-in-time Reordering for Fast Graph Analysis
Rabbit Order: Just-in-time Reordering for Fast Graph AnalysisRabbit Order: Just-in-time Reordering for Fast Graph Analysis
Rabbit Order: Just-in-time Reordering for Fast Graph AnalysisJunya Arai
 
SPARQL and RDF query optimization
SPARQL and RDF query optimizationSPARQL and RDF query optimization
SPARQL and RDF query optimizationKisung Kim
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Doug Needham
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Survey of Graph Indexing
Survey of Graph IndexingSurvey of Graph Indexing
Survey of Graph IndexingKisung Kim
 
Efficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search EnginesEfficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search EnginesYen-Yu Chen
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeNational Institute of Informatics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataAnsgar Scherp
 

Mais procurados (20)

Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Tutorial5
Tutorial5Tutorial5
Tutorial5
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
Report_SmartSuggest
Report_SmartSuggestReport_SmartSuggest
Report_SmartSuggest
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017AgensGraph Presentation at PGConf.us 2017
AgensGraph Presentation at PGConf.us 2017
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Rabbit Order: Just-in-time Reordering for Fast Graph Analysis
Rabbit Order: Just-in-time Reordering for Fast Graph AnalysisRabbit Order: Just-in-time Reordering for Fast Graph Analysis
Rabbit Order: Just-in-time Reordering for Fast Graph Analysis
 
SPARQL and RDF query optimization
SPARQL and RDF query optimizationSPARQL and RDF query optimization
SPARQL and RDF query optimization
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Survey of Graph Indexing
Survey of Graph IndexingSurvey of Graph Indexing
Survey of Graph Indexing
 
Efficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search EnginesEfficient Query Processing in Geographic Web Search Engines
Efficient Query Processing in Geographic Web Search Engines
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
Mining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open DataMining and Managing Large-scale Linked Open Data
Mining and Managing Large-scale Linked Open Data
 

Semelhante a Sparksummitny2016

Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer InsightMapR Technologies
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxIvo Andreev
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkSpark Summit
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarSpark Summit
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkSpark Summit
 
Computer Vision descriptors
Computer Vision descriptorsComputer Vision descriptors
Computer Vision descriptorsWael Badawy
 
L5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature EngineeringL5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature EngineeringMachine Learning Valencia
 
cnn.pptx
cnn.pptxcnn.pptx
cnn.pptxsghorai
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMark Lefevre, CQF
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Databricks
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaSpark Summit
 

Semelhante a Sparksummitny2016 (20)

Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer Insight
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
LSH
LSHLSH
LSH
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
 
Locality Sensitive Hashing By Spark
Locality Sensitive Hashing By SparkLocality Sensitive Hashing By Spark
Locality Sensitive Hashing By Spark
 
Computer Vision descriptors
Computer Vision descriptorsComputer Vision descriptors
Computer Vision descriptors
 
L5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature EngineeringL5. Data Transformation and Feature Engineering
L5. Data Transformation and Feature Engineering
 
cnn.pptx
cnn.pptxcnn.pptx
cnn.pptx
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
 
14 spatial analyst
14   spatial analyst14   spatial analyst
14 spatial analyst
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 

Último

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Último (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Sparksummitny2016

  • 1. Magellan – Spark as a Geospatial Analytics Engine Ram Sriharsha Hortonworks
  • 2. Who Am I? • Apache Spark PMC Member • Hortonworks Architect, Spark + Data Science • Prior to HWX, Principal Research Scientist @ Yahoo Labs (Large Scale Machine Learning) – Login Risk Detection, Sponsored Search Click Prediction, Advertising Effectiveness Models, Online Learning, …
  • 3. What is Geospatial Analytics? How do pickup/ dropoff neighborhood hotspots evolve with time? Correct GPS errors with more Accurate landmark measurements Incorporate location in IR and search advertising
  • 4. Do we need one more library? • Spatial Analytics at scale is challenging – Simplicity + Scalability = Hard • Ancient Data Formats – metadata, indexing not handled well, inefficient storage • Geospatial Analytics is not simply BI anymore – Statistical + Machine Learning being leveraged in geospatial • Now is the time to do it! – Explosion of mobile data – Finer granularity of data collection for geometries – Analytics stretching the limits of traditional approaches – Spark SQL + Catalyst + Tungsten makes extensible SQL engines easier than ever before!
  • 8. Introduction to Magellan neighborhoods.filter( point(-122.4111659, 37.8003388) within ‘polygon ).show() Shape literal Boolean Expression
  • 10. Introduction to Magellan neighborhoods.filter( point(-122.4111659, 37.8003388).buffer(0.1) intersects ‘polygon ).show()
  • 12. the join • Inherits all join optimizations from Spark SQL – if neighborhoods table is small, Broadcast Cartesian Join – else Cartesian Join
  • 13. Status • Magellan 1.0.3 available as Spark Package. • Scala • Spark 1.4 • Spark Package: Magellan • Github: https://github.com/harsha2010/magellan • Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in- spark/ • Notebook example: http://bit.ly/1GwLyrV • Input Formats: ESRI Shapefile, GeoJSON, OSM-XML • Please try it out and give feedback!
  • 14. What is next? • Magellan 1.0.4 – Spark 1.6 – Python – Spatial Join – Persistent Indices – Better leverage Tungsten via codegen + memory layout optimizations – More operators: buffer, distance, area etc.
  • 15. the join revisited • What is the time complexity? – m points, n polygons (assume average k edges) – l partitions – O(mn/l) computations of ‘point within ‘polygon – O(ml) communication cost – Each ‘point within ‘polygon costs O(k) – Total cost = O(ml) + O(mnk/l) – O(m√n√k) cost, with O(√n√k) partitions
  • 16.
  • 17. Optimization? • Do we need to send every point to every partition? • Do we need to compute ‘point in ‘neighborhood for each neighborhood within a given partition?
  • 18. 2d indices • Quad Tree • R Tree • Dimensional Reduction – Hashing – PCA – Space Filling Curves
  • 19. dimensional reduction • What does a good dimensional reduction look like? – (Approximately) preserve nearness – enable range queries – No (little) collision
  • 25. properties • Locality: Two points differing in a small # of bits are near each other – converse not necessarily true! • Containment • Efficient construction • Nice bounds on precision
  • 26. geohash • Z order curve with base 32 encoding • Start with world boundary (-90,90) X (-180, 180) and recursively subdivide based on precision
  • 27. encode (-74.009, 40.7401) • 40.7401 is in [0, 90) => first bit = 1 • 40.7401 is in [0, 45) => second bit = 0 • 40.7401 is in [22.5, 45) => third bit = 1 • … • do same for longitude answer = dr5rgb
  • 28. decode dr5rgb •Decode from Base 32 -> Binary – 01100 10111 00101 01111 01010 •lat = 101110001111, long = 0100101111000 •Now decode binary -> decimal. – latitude starts with 1 => between 0 - 90 – second bit = 0 => between 0 - 45 – third bit = 1 => between 22.5 - 45 – ...
  • 29. An algorithm to scale join? • Preprocess points – For each point compute geohash of precision p covering point • Preprocess neighborhoods – For each neighborhood compute geohashes of precision p that intersect neighborhood. • Inner join on geohash • Filter out edge cases
  • 30. Implementation in Spark SQL • Override Strategy to define SpatialJoinStrategy – Logic to decide when to trigger this join • Only trigger if geospatial queries • Only trigger if join is complex: if n ~ O(1) then broadcast join is good enough – Override BinaryNode to handle the physical execution plan ourselves • Override execute(): RDD to execute join and return results – Stitch it up using Experimental Strategies in SQLContext
  • 31.
  • 32. Persistent Indices • Often, the geometry dataset does not change… eg. neighborhoods • Index the dataset once and for all?
  • 33. Persistent Indices • Use Magellan to generate spatial indices – Think of geometry as document, list of geohashes as words! • Persist indices to Elastic Search • Use Magellan Data Source to query indexed ES data • Pushdown geometric predicates where possible – Predicate rewritten to IR query
  • 34. Overall architecture Elastic search Shard Server Spark Cluster nbd.filter( point(…) within ‘polygon ) curl –XGET ‘http://…’ –d ‘{ “query” : { “filtered” : { “filter” : { “geohash” : [“dr5rgb”] } } } }’
  • 35.
  • 36. THANK YOU. Twitter: @halfabrane, Github: @harsha2010