In this research, we propose a MapReduce algorithm for creating contiguity-based spatial weights. The algorithm can create spatial weights from very large spatial datasets efficiently by using computing resources organized in the Hadoop framework. It works in the MapReduce paradigm: mappers are distributed across computing clusters to find contiguous neighbors in parallel, and reducers then collect the results and generate the weights matrix. To test the performance of this algorithm, we design an experiment that creates contiguity-based weights matrices from artificial spatial data with up to roughly 19 million polygons using Amazon's Hadoop framework, Elastic MapReduce. The experiment demonstrates the scalability of this parallel algorithm, which utilizes large computing clusters to solve the problem of creating contiguity weights on big data.
1. A MapReduce Algorithm to Create Contiguity Weights for Spatial Analysis of Big Data
Xun Li, Wenwen Li, Luc Anselin, Sergio Rey, Julia Koschinsky
BIGSPATIAL 2014, Nov 4, 2014
2. Big Spatial Data Challenge
Cyber-Framework: CyberGIS, Spatial Hadoop
[Diagram: Big Spatial Data Domain]
• Components: Spatial Data Management, Spatial Analysis, Visualization, Spatial Process Modeling, Spatial Pattern Detection
• Computing resources: Computing Grids, Supercomputers, HPC, Cloud Computing Platform
3. Spatial Analysis on Big Data
[Diagram: the spatial analysis process]
• Steps: Spatial Data Preprocessing → Spatial Data Exploration → Spatial Model Specification → Spatial Model Estimation → Spatial Model Validation
• Examples: Spatial Clustering/Autocorrelation, Spatial Lag Model, Spatial Error Model, Spatial Statistics
• Spatial Weights: W
4. Spatial Weights
Spatial Weights
• Spatial weights are an essential component of spatial analysis wherever a representation of spatial structure is needed.
• Tobler: "Everything is related to everything else, but near things are more related to each other."
Create Spatial Weights (W)
• Extract spatial structure:
• Spatial neighboring information (contiguity-based weights)
• Spatial distance information (distance-based weights)
Example: contiguity-based weights (binary) and distance-based weights for objects A–E.

Contiguity-based Weights
    A  B  C  D  E
A   0  1  0  0  0
B   1  0  1  1  0
C   0  1  0  1  1
D   0  1  1  0  0
E   0  0  1  0  0

Distance-based Weights
    A    B    C    D    E
A   0    1.2  2.3  0.7  0.3
B   1.2  0    2.5  2.5  3.5
C   2.3  2.5  0    1.1  4.5
D   0.7  2.5  1.1  0    0.1
E   0.3  3.5  4.5  0.1  0

(A small code sketch of the contiguity example follows below.)
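As an illustration only (not taken from the slides), here is a minimal Python sketch of the contiguity example above, with hypothetical variable names; it builds the binary weights matrix W from the neighbor lists read off the table:

import numpy as np

# Hypothetical illustration: the contiguity example above as a neighbor
# dictionary and the equivalent binary weights matrix W.
neighbors = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D", "E"],
    "D": ["B", "C"],
    "E": ["C"],
}

ids = sorted(neighbors)                       # ["A", "B", "C", "D", "E"]
index = {pid: i for i, pid in enumerate(ids)}

W = np.zeros((len(ids), len(ids)), dtype=int)
for pid, nbrs in neighbors.items():
    for nbr in nbrs:
        W[index[pid], index[nbr]] = 1         # symmetric binary matrix

print(W)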
5. Contiguity Spatial Weights: how to find neighbors
Classic algorithms (a code sketch follows below):
• Brute-force search:
• Test A against B, C, D, E | B against C, D, E | C against D, E | D against E
• O(n²)
• Spatial index:
• Binning algorithm
• r-tree index
• O(n log n)
• Rook contiguity: neighbors share borders
• Queen contiguity: neighbors share borders or vertices
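A minimal, hedged sketch of the brute-force search described above (not the authors' code); it assumes polygons are given as lists of exactly matching (x, y) vertex tuples:

from itertools import combinations

# Hedged sketch of the brute-force O(n^2) search named above.  Each polygon is
# assumed to be a list of (x, y) vertex tuples, with consecutive vertices
# forming its edges; shared points must match exactly.

def edges(polygon):
    """Return the set of undirected edges of a closed polygon ring."""
    pts = polygon + [polygon[0]]
    return {frozenset((pts[i], pts[i + 1])) for i in range(len(polygon))}

def brute_force_contiguity(polygons, queen=True):
    """polygons: dict of id -> vertex list.  Return dict of id -> set of neighbor ids."""
    nbrs = {pid: set() for pid in polygons}
    for a, b in combinations(polygons, 2):              # test every pair once
        if queen:   # queen: any shared vertex makes two polygons neighbors
            touching = bool(set(polygons[a]) & set(polygons[b]))
        else:       # rook: a shared edge (border segment) is required
            touching = bool(edges(polygons[a]) & edges(polygons[b]))
        if touching:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs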
6. Parallelize Spatial Weights Creation for Big Data?
Split data with a buffer zone

    A  B  C  D  E
A   0  1  1  1  0
B   1  0  0  1  0
C   1  0  0  1  0
D   1  1  1  0  1
E   0  0  0  1  0
12. MapReduce Contiguity Weights Creation – Cont.
Other details:
• Input data (each line), e.g.:
  A, 1,2,3,4,5,6
• Output data, a *.gal file (two lines per record): the polygon ID with its neighbor count, then the list of neighbor IDs, e.g.:
  A 3
  B C D
(A hedged sketch of a mapper/reducer pair is given after the source-code link below.)
• Source code:
https://github.com/lixun910/mrweights
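The algorithm-detail slides are not included above, so the following is only a minimal sketch of how a Hadoop Streaming mapper/reducer pair for this task could look, not the authors' implementation from the repository linked above; it assumes the input line "A, 1,2,3,4,5,6" is a polygon ID followed by a flat x,y coordinate list, and that two polygons are neighbors when they share an exactly identical vertex:

#!/usr/bin/env python
# Hedged sketch only; the authors' implementation lives in the repository above.
# Idea: the mapper emits every vertex as a key with the polygon ID as the value;
# Hadoop Streaming sorts by key, so the reducer sees all polygons that share an
# identical vertex together, and any two of them are (queen) contiguous.
# A hypothetical streaming invocation could look like:
#   hadoop jar hadoop-streaming.jar -input polygons.txt -output pairs \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys
from itertools import groupby, combinations

def mapper(stream=sys.stdin, out=sys.stdout):
    """Emit 'x,y<TAB>polygon_id' for every vertex of every input polygon."""
    for line in stream:
        parts = [p.strip() for p in line.strip().split(",") if p.strip()]
        if len(parts) < 3:
            continue
        pid, coords = parts[0], parts[1:]
        for x, y in zip(coords[0::2], coords[1::2]):
            out.write("%s,%s\t%s\n" % (x, y, pid))

def reducer(stream=sys.stdin, out=sys.stdout):
    """Group by vertex and emit contiguous pairs 'id_a<TAB>id_b' in both directions.
    A second aggregation keyed on polygon ID (deduplicating pairs) would then
    write the two-line *.gal records shown above ('A 3' / 'B C D')."""
    keyed = (line.rstrip("\n").split("\t", 1) for line in stream if "\t" in line)
    for vertex, group in groupby(keyed, key=lambda kv: kv[0]):
        ids = sorted({pid for _, pid in group})
        for a, b in combinations(ids, 2):
            out.write("%s\t%s\n" % (a, b))
            out.write("%s\t%s\n" % (b, a))

if __name__ == "__main__":
    # Run as the mapper by default; pass "reduce" as an argument for the reducer.
    (reducer if "reduce" in sys.argv[1:] else mapper)()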
13. Experiments
Original data:
• Parcel data for the city of Chicago, United States
• 592,521 polygons
Artificial big data (a hypothetical sketch of the duplication step follows below):
• Duplicate the original data several times, side by side
• For example, the 4x dataset contains 2,370,084 polygons
• The largest test dataset is 32x the original data
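A hypothetical sketch of the "duplicate side by side" step (the actual tooling is not shown in the slides); the polygon representation and the function name are assumptions:

# Hypothetical sketch of building the artificial test data by tiling copies of
# the original polygons; polygons are assumed to be lists of (x, y) tuples.
def tile_side_by_side(polygons, copies):
    """Return `copies` shifted copies of `polygons`, laid out left to right."""
    xs = [x for poly in polygons for x, _ in poly]
    width = max(xs) - min(xs)
    tiled = []
    for k in range(copies):
        dx = k * width
        tiled.extend([[(x + dx, y) for x, y in poly] for poly in polygons])
    return tiled

# e.g. a "4x" dataset: 4 * 592,521 = 2,370,084 polygons
# big_data = tile_side_by_side(chicago_parcels, 4)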
14. Experiment
Test systems
• Desktop computer
• 2.93 GHz 8-core CPU, 16 GB memory, 100 GB hard disk, 64-bit operating system
• Hadoop system
• Amazon Elastic MapReduce (EMR)
• 1 to 18 nodes of the "C3 Extra Large" (c3.xlarge) instance type (7.5 GB memory, 4 vCPUs with 14 EC2 Compute Units (4 x 3.5 ECU), 80 GB storage (2 x 40 GB SSD), 64-bit operating system, and moderate network performance of 500 Mbps)
15. Experiment
Code / Application
• Desktop version (Python)
• Not parallelized
• Hadoop version (Python)
• Executed via the Hadoop Streaming pipeline
16. Experiment-1
PC vs. Hadoop
• Data: the 1x, 2x, 4x, 8x, 16x, and 32x datasets
• Hadoop setup: 6 nodes of c3.xlarge
17. Experiment-2
Hadoop with different numbers of nodes on the 32x data
• Hadoop setup: 6, 12, 14, and 18 nodes of c3.xlarge
18. Integration into a Weights Creation Web Service
HPC pool & Hadoop
Threshold to trigger Hadoop weights creation: 2 million polygons (a hypothetical dispatch sketch follows below)
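A hypothetical sketch of the dispatch rule above; the function names are placeholders and not part of the actual web service:

# Hypothetical routing logic for the weights creation web service.
HADOOP_THRESHOLD = 2_000_000  # polygons, per the slide

def run_hadoop_weights_job(dataset):
    return "submitted to the Hadoop (EMR) weights creation job"   # placeholder

def run_local_weights(dataset):
    return "run on the local / HPC-pool weights creation path"    # placeholder

def create_weights(dataset, n_polygons):
    """Route a request by size: large datasets go to Hadoop, smaller ones stay local."""
    if n_polygons >= HADOOP_THRESHOLD:
        return run_hadoop_weights_job(dataset)
    return run_local_weights(dataset)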
19. Issues
• This algorithm will not work when spatial neighbors do not share points or edges (it requires the shared points to be exactly the same)
• This algorithm cannot generate distance-based weights
• Potential solution:
• Use a MapReduce r-tree (SpatialHadoop)
20. Conclusion
• Contribution: a MapReduce algorithm to create a contiguity weights matrix for big spatial data
• Ongoing work: use an existing MapReduce r-tree to solve the issues noted above
Hot topic
Much research has focused on creating a cyber-framework.
Computing resources include computing grids, supercomputers, HPC, cloud computing platforms, etc.
The big spatial data domain has five important components.
Spatial analysis gives scientists the ability to analyze big data statistically.
It is a process of spatial data preprocessing, exploration, model specification, estimation, and validation.
Spatial weights are an essential part of spatial analysis, since they represent the geographic structure of spatial objects.
For example,..
However, current data structures and algorithms are based on single desktop computer architectures.
Some research has tried to parallelize spatial analysis;
however, it is still not capable of dealing with big data.
And no one addresses creating spatial weights, which is the first step in solving this problem.
Spatial Weights
Create Spatial Weights
What is W? W is most often represented as a matrix, called the weights matrix.
Each cell value represents the spatial relationship between objects i and j.
If a cell value is zero, then the two objects have no spatial relationship in this weights matrix.
A contiguity weights matrix is a binary matrix: a value of 1 means two objects are contiguous, i.e., they are neighbors.
A distance weights matrix uses the actual distance between two objects.
An r-tree works by grouping nearby objects using their bounding boxes at different hierarchical levels for fast search.
For each spatial object, it takes O(log n) time to find candidate neighbors.
An r-tree has a faster search time than the binning algorithm, but it takes longer to build the r-tree index.
So the binning algorithm is more practical than the r-tree (a sketch of the binning idea follows below).
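A minimal sketch of the binning idea described in this note, assuming polygons are lists of (x, y) tuples and a user-chosen cell size; it only produces candidate pairs, which still need the exact contiguity test:

from collections import defaultdict
from itertools import combinations

# Hedged sketch of the binning index: polygons are hashed into grid cells by
# their bounding boxes, and only polygons sharing a cell are tested for
# contiguity, avoiding the all-pairs O(n^2) comparison.

def bbox(poly):
    xs, ys = zip(*poly)
    return min(xs), min(ys), max(xs), max(ys)

def candidate_pairs(polygons, cell_size):
    """polygons: dict of id -> vertex list.  Yield unique candidate ID pairs."""
    bins = defaultdict(set)
    for pid, poly in polygons.items():
        x0, y0, x1, y1 = bbox(poly)
        for cx in range(int(x0 // cell_size), int(x1 // cell_size) + 1):
            for cy in range(int(y0 // cell_size), int(y1 // cell_size) + 1):
                bins[(cx, cy)].add(pid)
    seen = set()
    for ids in bins.values():
        for pair in combinations(sorted(ids), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair   # each pair still needs the exact shared-point test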
However, finding a buffer zone takes extra time, and since the geometries have irregular shapes, it is often hard to find a proper buffer zone.
Another solution, which we are exploring now, is to use a MapReduce r-tree; we discuss it later.
HDFS: Hadoop Distributed File System
Since Hadoop spends extra time distributing the program and communicating with the running nodes,
it is actually slower than running the same program on the desktop computer for datasets smaller than 4 times the raw data (about 2 million polygons).
However, the bigger the data, the better the performance this algorithm achieves on the Hadoop system.
For example, for the 8x data, the algorithm on Hadoop took 167 seconds to complete,
much faster than on the desktop computer (482.67 seconds).
The PC can’t handle 16x data 8 million.
The running time increases roughly linearly with data size, which means this algorithm can be scaled up as the data grows.
The best performance across all tests was creating the contiguity weights file for the 32x data in 163 seconds using 18 computing nodes in Hadoop.
The running time also does not decline linearly with an increasing number of computing nodes.
This is reasonable, since a larger number of computing nodes spends extra time communicating inside the Hadoop system.
Web Processing Service (WPS)
We demonstrate the capability and efficiency of this algorithm by generating the weights file for big spatial data using Amazon's Hadoop system.