MONGODB FOR MULTI-DIMENSION
SPATIAL INDEXING
DECEMBER 2012
@nknize
+Nicholas Knize
Thermopylae Sciences & Technology – Who are we?
• Mixed Government (70%) and Commercial (30%) contracting
company w/ ~150 employees
• Core customers:
– SOUTHCOM, Intel & Security Command, Army Intel Sector, DOI
– LVMS, Select Energy Oil & Gas, OSU, Cleveland Cavaliers, and STL Rams
• #1 Google Enterprise partner for Federal and partner w/
imagery providers (GeoEye / Digital Globe)
• FOSS4G contributor and 10gen Enterprise partner
WHO ARE THESE GUYS?
ACCOMPLISHING THE IMPOSSIBLE
ENTERPRISE
PARTNER
“The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one
location…this capability allows for unprecedented situational awareness and information sharing”
-Gen. Doug Fraser
TST PRODUCTS
COMMERCIAL CUSTOMERS
Commercial Examples
• Cleveland Cavaliers
• USGIF
• Las Vegas Motor Speedway
• Baltimore Grand Prix
iSpatial framework serves millions of mobile devices
1. iSpatial provides web-based interface for Multi-INT visualization and collaborations
2. Map/Reduce provides spatial statistic processing (spatial regression) and heuristics
3. Modified MongoDB provides storing and indexing multi-dimension spatial data at scale
TST ARCHITECTURE
iSpatial – UI/Visualization
Hadoop M/R – Processing / Analysis
MongoDB – Spatial Data Management @ Scale
What the…..HOW MUCH DATA?!?
• “Swimming in sensors drowning in data”
– What size data tsunami are we talking about?
• “Fix and Finish are meaningless until FIND is accomplished”
– A “Big Data” Spatial Search Problem
THAT’S A LOT OF DATA….
Sensor Type | Resolution | Data Bandwidth | TB/Hr
FMV | 640 x 480 (Std Def); 1920 x 1080 (HD) | HD: 16bit x 3 bands @ 30fps ~1Gbps | ~0.45 TB
WAMI | Constant Hawk = 96 Mpx; Gorgon Stare = 460 Mpx; Argus = 1.8 Gpx | GS @ 16bit x 3 bands @ 2fps ~15.3Gbps; Argus @ 16bit x 3 bands @ 12fps ~345.6Gbps | GS ~6.89 TB; Argus ~155 TB
Satellite | NITF / JP2 resolutions: 32K x 32K; 432K x 216K | 32K x 32K @ 8bit x 3 bands @ 1frame/5mins ~27Gbps | ~12.15 TB
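The TB/Hr column follows from the bandwidth column by plain unit conversion. A minimal sketch, assuming decimal units (1 TB = 1000 GB) and a sustained rate for the full hour; the function name `gbps_to_tb_per_hour` is illustrative and not from the deck:

```python
def gbps_to_tb_per_hour(gbps: float) -> float:
    """Convert a sustained rate in gigabits/second to decimal terabytes/hour."""
    gigabits_per_hour = gbps * 3600               # 3600 seconds in an hour
    gigabytes_per_hour = gigabits_per_hour / 8    # 8 bits per byte
    return gigabytes_per_hour / 1000              # 1000 GB per TB (assumed decimal units)


# Bandwidth figures as quoted on the slide.
rates_gbps = {
    "FMV (HD)": 1.0,
    "Gorgon Stare": 15.3,
    "Argus": 345.6,
    "Satellite": 27.0,
}

for sensor, gbps in rates_gbps.items():
    print(f"{sensor}: {gbps_to_tb_per_hour(gbps):.2f} TB/hr")
```

Under these assumptions the conversion matches the slide's hourly volumes to within rounding.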
• Horizontally scalable – Large volume / elastic
• Vertically scalable – Heterogeneous data types (“Data Stack”)
• Smartly Distributed – Reduce the distance bits must travel
• Fault Tolerant – Replication Strategy and Consistency model
• High Availability – Node recovery
• Fast – Reads or writes (can’t always have both)
BIG DATA STORAGE CHARACTERISTICS
Desired Data Store Characteristics for ‘Big Data’
• Cassandra
– Nice Bring Your Own Index (BYOI) design
– … but Java, Java, Java… Memory management can be a maintenance issue
– Adding new nodes can be a pain (Token Changes, nodetool)
– Key-Value store…good for simple data models
• HBase
– Nice BigTable model
– Key-Value store…good for simple data models
– Lots of Java JNI (primarily based on std:hashmap of std:hashmap)
• CouchDB
– Provides some GeoSpatial functionality (Currently being rewritten)
– HEAVILY dependent on Map-Reduce model (complicated design)
– Erlang based – poor multi-threaded heap management
NOSQL OPTIONS
Subset of Evaluated NoSQL Options
Why MongoDB for Thermopylae?
• Documents based on JSON – A GEOJSON match made in heaven! (OGC)
• C++ - No Garbage Collection Overhead! Efficient memory management
design reduces disk swapping and paging
• Disk storage is memory mapped, enabling fast swapping when necessary
• Built in auto-failover with replica sets and fast recovery with journaling
• Tunable Consistency – Consistency defined at application layer
• Schema Flexible – SQL-friendly properties enable easy porting
• Provides initial spatial indexing support – but limited to points!
WHY TST <3’S MONGODB
MONGODB SPATIAL INDEXER
... The Spatial Indexer wasn’t quite right
• MongoDB (like nearly all relational DBs) uses a b-Tree
– Data structure that keeps data sorted and supports search in logarithmic time
– Great for indexing numerical and text documents (1D attribute data)
– Cannot store multi-dimension (>2D) data – NOT COMPLEX GEOMETRY FRIENDLY
DIMENSIONALITY REDUCTION
How does MongoDB solve the dimensionality problem?
• Space Filling (Z) Curve
– A continuous line that intersects every point in a two-dimensional plane
• Use Geohash to represent lat/lon values
– Interleave the bits of a lat/lon pair
– Base32 encode the result
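The two Geohash steps above (interleave the bits, then Base32 encode) can be sketched as follows. This is the publicly documented Geohash scheme written out for illustration, not MongoDB's internal code; the function name `geohash_encode` is an assumption of this sketch:

```python
# Geohash alphabet: digits plus lowercase letters, omitting a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 11) -> str:
    """Interleave longitude/latitude bisection bits and Base32 encode them."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates 5 bits per output character
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 6))
```

Because each extra character only refines the same cell, a shorter hash is always a prefix of a longer one for the same point. That prefix property is what lets a sorted 1D index such as a b-Tree serve range scans over the hash.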
GEOHASH BTREE ISSUES
• Neighbors aren’t so close!
– Neighboring points on the Geoid may end up on opposite ends of the plane
– Impacts search efficiency
• What about Geometry?
– Doesn’t support > 2D
– Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document
Issues with the Geohash b-Tree approach
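The "neighbors aren't so close" issue can be seen with a plain Z-order (Morton) key on a small integer grid. A minimal sketch, assuming x occupies the even bit positions and y the odd ones; `morton_2d` is an illustrative name, not a MongoDB function:

```python
def morton_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Cells (3, 0) and (4, 0) touch each other on the grid...
left, right = morton_2d(3, 0), morton_2d(4, 0)
print(left, right)   # ...but their keys are not consecutive in sort order.

# Diagonal neighbors across the center of an 8 x 8 grid are even further apart.
print(morton_2d(3, 3), morton_2d(4, 4))
```

A range scan over the sorted keys between two adjacent cells therefore has to walk past keys for cells that are nowhere near the query, which is the search-efficiency cost noted above.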
Sort Order and Multi-Dimension…a nightmare
(3D / 4D Hilbert Scanning Order)
GEO-SHARDING ALTERNATIVE
Mongo Multi-location Document Clipping Issues
($within search doesn’t always work w/ multi-location)
[Figure: Cases 1 to 4 of a Multi-Location Document (aka. Polygon) against a Search Polygon; two cases are marked Success and two Fail]
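One way such a miss can arise: if a polygon is indexed only as its vertex points, a search area that lies inside the polygon contains none of those points. The sketch below is a simplified model of that situation, not MongoDB's actual $within implementation; all function names are hypothetical:

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, Point]   # ((min_x, min_y), (max_x, max_y))


def point_in_box(pt: Point, box: Box) -> bool:
    (min_x, min_y), (max_x, max_y) = box
    return min_x <= pt[0] <= max_x and min_y <= pt[1] <= max_y


def any_vertex_in_box(polygon: List[Point], box: Box) -> bool:
    """Point-only test: the document matches if any stored vertex is inside."""
    return any(point_in_box(p, box) for p in polygon)


def polygon_mbr(polygon: List[Point]) -> Box:
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys)), (max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    """Bounding-box test: true when the two rectangles share any area or edge."""
    return (a[0][0] <= b[1][0] and b[0][0] <= a[1][0]
            and a[0][1] <= b[1][1] and b[0][1] <= a[1][1])


search_box = ((0.0, 0.0), (2.0, 2.0))
# A large square that completely covers the search box.
polygon = [(-1.0, -1.0), (3.0, -1.0), (3.0, 3.0), (-1.0, 3.0)]

print(any_vertex_in_box(polygon, search_box))              # vertex-only check
print(boxes_intersect(polygon_mbr(polygon), search_box))   # bounding-box check
```

The bounding-box check is only a coarse filter (it can report candidates whose exact geometry does not intersect), but it does not lose polygons whose vertices all fall outside the search area.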
MULTI-LOCATION CLIPPING
• Constrain the system to single point searches
– Multi-dimension support will be exponentially complex (won’t scale)
• Interpolate points along the edge of the shape
– Multi-dimension support will be exponentially complex (won’t scale)
• Customize the spatial indexer
– Selected approach
SOLUTIONS TO GEOHASH PROBLEM
Potential Solutions
CUSTOM TUNED SPATIAL INDEXER
Thermopylae Custom Tuned MongoDB for Geo
TST Leverages Kriegel’s 1996 Research in R* Trees
• R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
• Each node bounds its children. A node can hold many objects (max: m, min: ceil(m/2)).
• Splits and merges are optimized by minimizing overlaps.
• The leaves point to the actual objects (probably stored on disk).
• Height balanced – search is O(log n) when overlap is low (worst case O(n)).
Spatial Indexing at Scale with R-Trees
RTREE THEORY
Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension)
Index represented as: <I, DiskLoc> where:
I = (I0, I1, …, In-1) : n = number of dimensions
Each Ii is an interval of the form [min,max] describing the MBR range along dimension i
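The <I, DiskLoc> entry described above can be modeled as one [min,max] interval per dimension plus a record pointer. A minimal sketch; the class names `MBR` and `IndexEntry` and the integer `disk_loc` are stand-ins chosen for illustration, not the types used in the modified MongoDB:

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple

Interval = Tuple[float, float]   # (min, max) along one dimension


@dataclass(frozen=True)
class MBR:
    intervals: Tuple[Interval, ...]   # one interval per dimension

    def intersects(self, other: "MBR") -> bool:
        """Two boxes intersect only if their intervals overlap in every dimension."""
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for (a_lo, a_hi), (b_lo, b_hi) in zip(self.intervals, other.intervals))

    def union(self, other: "MBR") -> "MBR":
        """Smallest box enclosing both boxes."""
        return MBR(tuple((min(a_lo, b_lo), max(a_hi, b_hi))
                         for (a_lo, a_hi), (b_lo, b_hi) in zip(self.intervals, other.intervals)))

    def volume(self) -> float:
        """Area in 2D, volume in 3D, hyper-volume beyond that."""
        return prod(hi - lo for lo, hi in self.intervals)


@dataclass(frozen=True)
class IndexEntry:
    mbr: MBR
    disk_loc: int   # hypothetical stand-in for a DiskLoc record pointer


a = MBR(((0, 2), (0, 2), (0, 1)))
b = MBR(((1, 3), (1, 3), (0.5, 2)))
entry = IndexEntry(mbr=a.union(b), disk_loc=42)
print(a.intersects(b), entry.mbr.intervals, a.volume())
```

The same code works unchanged for 2, 3, 4, or more dimensions because every operation is applied per interval.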
R*-Tree Spatial Index Example
• Sample insertion result for 4th order tree
• Objectives:
1. Minimize area
2. Minimize overlaps
3. Minimize margins
4. Maximize inner node utilization
[Figure: tree diagram with internal nodes m, n, o, p over leaf entries a through l]
R*-TREE INDEX OBJECTIVES
Insert
• Similar to insertion into a B+-tree, but may insert into any leaf; a leaf splits when its capacity is exceeded.
– Which leaf to insert into?
– How to split a node?
R*-TREE INSERT EXAMPLE
Insert—Leaf Selection
• Follow a path from root to leaf.
• At each node move into subtree whose MBR area
increases least with addition of new rectangle.
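The leaf-selection rule above (descend into the subtree whose MBR area grows least) can be sketched for one level of the tree. This is a simplified 2D illustration with made-up rectangles for nodes m, n, o, and p; the tie-break on smaller area is an assumption of the sketch, and a full R*-tree also weighs overlap when choosing:

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def enlarge(r: Rect, new: Rect) -> Rect:
    """Smallest rectangle covering both r and new."""
    return (min(r[0], new[0]), min(r[1], new[1]), max(r[2], new[2]), max(r[3], new[3]))


def choose_subtree(children: Dict[str, Rect], new_rect: Rect) -> str:
    """Pick the child whose MBR needs the least area enlargement."""
    def cost(name: str):
        mbr = children[name]
        growth = area(enlarge(mbr, new_rect)) - area(mbr)
        return (growth, area(mbr))   # assumed tie-break: prefer the smaller MBR
    return min(children, key=cost)


children = {
    "m": (0, 0, 4, 4),
    "n": (5, 0, 9, 4),
    "o": (0, 5, 4, 9),
    "p": (5, 5, 9, 9),
}

print(choose_subtree(children, (6, 6, 7, 7)))      # already inside one child
print(choose_subtree(children, (4.5, 1, 5, 2)))    # falls in a gap between children
```

Repeating this choice at each level from the root down yields the leaf that receives the new rectangle.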
Insert—Leaf Selection
• Insert into m.
• Insert into n.
• Insert into o.
• Insert into p.
[Figure: new rectangle x tried against the MBRs of m, n, o, and p in turn, with the tree diagram of internal nodes m, n, o, p over leaf entries a through l]
Query
• Start at root
• Find all overlapping MBRs
• Search subtrees recursively
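The three query steps above map directly onto a short recursive function. A minimal 2D sketch with a hand-built tree; the `Node` class and its `payload` field are illustrative and do not reflect the on-disk bucket layout:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    payload: Optional[str] = None   # set only on leaf entries


def search(node: Node, query: Rect, hits: Optional[List[str]] = None) -> List[str]:
    """Start at the root, keep only overlapping MBRs, recurse into their subtrees."""
    if hits is None:
        hits = []
    if not overlaps(node.mbr, query):
        return hits                      # prune this whole subtree
    if node.payload is not None:
        hits.append(node.payload)        # leaf entry whose MBR overlaps the query
    for child in node.children:
        search(child, query, hits)
    return hits


root = Node((0, 0, 10, 10), [
    Node((0, 0, 4, 4), [Node((0, 0, 1, 1), payload="a"),
                        Node((3, 3, 4, 4), payload="b")]),
    Node((6, 6, 10, 10), [Node((7, 7, 8, 8), payload="c")]),
])

print(search(root, (0.5, 0.5, 2, 2)))
print(search(root, (3.5, 3.5, 7.5, 7.5)))
```

When sibling MBRs overlap heavily, the pruning step rejects fewer subtrees and the search visits more of the tree.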
Query
• Search m.
[Figure: query rectangle x drawn over nodes m, n, o, p, with the tree diagram of internal nodes over leaf entries a through l]
R*-Tree Leverages B-Tree Base Data Structures (buckets)
R*-TREE MONGODB IMPLEMENTATION
Spatial Index
Architecture, Organization, & Performance
MBRKeyNode(s)
BucketHeader
MBRHeader
…
Dimensions | Num Buckets | Tree Height | Read Time
3 | 3,448,276 | 3 | 190 ms
5 | 5,076,143 | 3 | 275 ms
100 | 90,909,091 | 8 | ~4.9 sec
1B Polygon Read Performance (worst case O(n))
SPATIAL INDEX ARCH & ORG
Geo-Sharding – (in work)
Scalable Distributed R* Tree (SD-r*Tree)
“Balanced” binary tree, with nodes distributed on a set of servers:
• Each internal node has exactly two children
• Each leaf node stores a subset of the indexed dataset
• At each node, the heights of the subtrees differ by at most one
• mongos “routing” node maintains binary tree
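The properties listed above can be sketched as a small binary routing tree that forwards a query rectangle only to the data nodes whose coverage it overlaps. This is an illustration under assumed coverage rectangles; the class names and the tree shape (r1 over d0 and r2, r2 over d1 and d2) are hypothetical and say nothing about how the in-work geo-sharding is actually implemented:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Rect = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


@dataclass
class DataNode:            # leaf: holds a subset of the indexed dataset (a chunk)
    name: str
    mbr: Rect


@dataclass
class RoutingNode:         # internal: exactly two children, coverage = union of both
    name: str
    left: "Tree"
    right: "Tree"

    @property
    def mbr(self) -> Rect:
        return union(self.left.mbr, self.right.mbr)


Tree = Union[DataNode, RoutingNode]


def height(node: Tree) -> int:
    if isinstance(node, DataNode):
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node: Tree) -> bool:
    """Subtree heights differ by at most one at every internal node."""
    if isinstance(node, DataNode):
        return True
    return (abs(height(node.left) - height(node.right)) <= 1
            and is_balanced(node.left) and is_balanced(node.right))


def route(node: Tree, query: Rect) -> List[str]:
    """Return the names of data nodes whose coverage overlaps the query."""
    if not overlaps(node.mbr, query):
        return []
    if isinstance(node, DataNode):
        return [node.name]
    return route(node.left, query) + route(node.right, query)


d0 = DataNode("d0", (0, 0, 5, 5))
d1 = DataNode("d1", (6, 0, 10, 5))
d2 = DataNode("d2", (0, 6, 10, 10))
r1 = RoutingNode("r1", d0, RoutingNode("r2", d1, d2))

print(is_balanced(r1), route(r1, (1, 1, 2, 2)), route(r1, (4, 4, 7, 7)))
```

A query that touches only one region is sent to a single data node, while a query straddling several regions fans out to each of them.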
GEO-SHARDING
[Figure: data nodes d0, d1, d2 and coverage nodes r1, r2 shown with their spatial coverage over objects a through e as the tree grows]
SD-r*Tree Data Structure Illustration
• di = Data Node (Chunk)
• ri = Coverage Node
Leveraged work from Litwin, Mouza, Rigaux 2007
SD-r*Tree DATA STRUCTURE
SD-r*Tree Structure Distribution
[Figure: the tree of coverage nodes r1, r2 and data nodes d0, d1, d2 distributed across GeoShard 1, GeoShard 2, and GeoShard 3, with mongos routing]
SD-r*TREE STRUCTURE DISTRIBUTION
Beyond 4-Dimensions - X-Tree
(Berchtold, Keim, Kriegel – 1996)
Normal Internal Nodes Supernodes Data Nodes
• Avoid MBR overlaps – the more overlap, the closer reads get to the worst case O(n)
• Avoid node splits (main cause for high overlap)
• Introduce new node structure: Supernodes – Large Directory nodes of variable size
BEYOND 4-DIMENSIONS
X-TREE PERFORMANCE
X-Tree Performance Results
(Berchtold, Keim, Kriegel – 1996)
T-Sciences Custom Tuned Spatial Indexer
• Optimized Spatial Search – Finds intersecting MBRs and recurses into those nodes
• Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to guide search
– 28% reduction in number of nodes touched
• Optimized Deletes – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full
• Low maintenance – Leverages MongoDB’s automatic data compaction and partitioning
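The "Hilbert Value of MBR centroid" mentioned above can be computed with the standard 2D Hilbert-curve mapping. A minimal sketch, assuming a square grid whose side n is a power of two; how the custom indexer quantizes coordinates or handles more than two dimensions is not stated in the slides, so `centroid_key` and its `extent` parameter are illustrative only:

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


def hilbert_index(n: int, x: int, y: int) -> int:
    """Map grid cell (x, y) on an n x n grid (n a power of two) to its Hilbert distance."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_key(mbr: Rect, n: int, extent: Rect) -> int:
    """Hilbert value of the MBR centroid after snapping it to an n x n grid over extent."""
    cx = (mbr[0] + mbr[2]) / 2
    cy = (mbr[1] + mbr[3]) / 2
    gx = min(n - 1, int((cx - extent[0]) / (extent[2] - extent[0]) * n))
    gy = min(n - 1, int((cy - extent[1]) / (extent[3] - extent[1]) * n))
    return hilbert_index(n, gx, gy)


# Walk the curve on a 4 x 4 grid: every step moves to a touching cell.
order = sorted(((hilbert_index(4, x, y), (x, y)) for x in range(4) for y in range(4)))
print([cell for _, cell in order])
print(centroid_key((0, 0, 1, 1), 8, (0, 0, 8, 8)))
```

Unlike the Z-order key, consecutive Hilbert values always belong to adjacent cells, so sorting inserts by this key tends to group spatially close MBRs together.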
CONCLUSION
Example: Mosaicked Video with KLV Footprints
• Rip through KLV Metadata
• Index frame footprints and annotations as MBRs into X(R*)-Tree
• Leverage Geo-Sharding for spatially relevant scale
Example Use Case – OSINT (Foursquare Data)
• Sample Foursquare data set mashed with Government Intel Data (poly reports)
• 100 million Geo Document test (3D points and polys)
• 4 server replica set
• ~350ms query response
• ~300% improvement over PostGIS
EXAMPLE
Community Support
• Thermopylae plans to open source
– http://github.com/thermopylae
• TST working with 10gen to offer as a spatial extension
• Active developer collaboration
– IRC: #mongodb freenode.net
FIND US
THANK YOU
Questions?
Nicholas Knize
nknize@t-sciences.com
Backup
Key Customers - Government
• US Dept of State Bureau of Diplomatic Security
– Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.
• US Army Intelligence Security Command
– Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.
• US Southern Command
– Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
– Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)
GOVERNMENT CUSTOMERS
• Expose and manage Multi-INT enterprise data in a geo-temporal user defined environment
• Provide a flexible and scalable spatial data infrastructure (SDI) for Multi-INT data access and analysis
• Spatially referenced data visualization on 3D globe & 2D maps
• Access real/near real-time data feeds from forward deployed devices
• Enable real-time information sharing and mission collaboration
ISPATIAL OVERVIEW
High Dimensional Indexing using MongoDB (MongoSV 2012)

  • 1. MONGODB FOR MULTI-DIMENSION SPATIAL INDEXING DECEMBER 2012 @nknize +Nicholas Knize
  • 2. Thermopylae Sciences & Technology – Who are we? • Mixed Government (70%) and Commercial (30%) contracting company w/ ~150 employees • Core customers: – SOUTHCOM, Intel & Security Command, Army Intel Sector, DOI – LVMS, Select Energy Oil & Gas, OSU, Cleveland Cavaliers, and STL Rams • #1 Google Enterprise partner for Federal and partner w/ imagery providers (GeoEye / Digital Globe) • FOSS4G contributor and 10gen Enterprise partner WHO ARE THESE GUYS? ACCOMPLISHING THE IMPOSSIBLE ENTERPRISE PARTNER
  • 3. “The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing” -Gen. Doug Frasier TST PRODUCTS ACCOMPLISHING THE IMPOSSIBLE
  • 4. COMMERCIAL CUSTOMERS ACCOMPLISHING THE IMPOSSIBLE Commercial Examples Cleveland Cavaliers USGIF Las Vegas Motor Speedway Baltimore Grand Prix iSpatial framework serves millions of mobile devices
  • 5. 1. iSpatial provides web-based interface for Multi-INT visualization and collaborations 2. Map/Reduce provides spatial statistic processing (spatial regression) and heuristics 3. Modified MongoDB provides storing and indexing multi-dimension spatial data at scale TST ARCHITECTURE ACCOMPLISHING THE IMPOSSIBLE iSpatial – UI/Visualization Hadoop M/R – Processing / Analysis MongoDB – Spatial Data Management @ Scale 1 2 3
  • 6. What the…..HOW MUCH DATA?!? • “Swimming in sensors drowning in data” – What size data tsunami are we talking about? • “Fix and Finish are meaningless until FIND is accomplished” – A “Big Data” Spatial Search Problem THAT’S A LOT OF DATA…. ACCOMPLISHING THE IMPOSSIBLE Sensor Type Resolution Data Bandwidth TB/Hr FMV 640 x 480 (Std Def) 1920 x 1080 (HD) HD: 16bit x 3 bands @ 30fps ~1Gbps ~0.45 TB WAMI Constant Hawk = 96 Mpx Gorgon Stare = 460 Mpx Argus = 1.8 Gpx GS @ 16bit x 3 bands @ 2fps ~15.3Gps Argus @ 16bit x 3 bands @ 12fps ~345.6Gps ~6.89 TB ~155 TB Satellite NITF / JP2 resolutions 32K x 32K 432K x 216K 32K x 32K @ 8bit x 3 bands @ 1frame/5mins ~27Gps ~12.15 TB
  • 7. • Horizontally scalable – Large volume / elastic • Vertically scalable – Heterogeneous data types (“Data Stack”) • Smartly Distributed – Reduce the distance bits must travel • Fault Tolerant – Replication Strategy and Consistency model • High Availability – Node recovery • Fast – Reads or writes (can’t always have both) BIG DATA STORAGE CHARACTERISTICS ACCOMPLISHING THE IMPOSSIBLE Desired Data Store Characteristic for ‘Big Data’
  • 8. • Cassandra – Nice Bring Your Own Index (BYOI) design – … but Java, Java, Java… Memory management can be a maintenance issue – Adding new nodes can be a pain (Token Changes, nodetool) – Key-Value store…good for simple data models • Hbase – Nice BigTable model – Key-Value store…good for simple data models – Lots of Java JNI (primarily based on std:hashmap of std:hashmap) • CouchDB – Provides some GeoSpatial functionality (Currently being rewritten) – HEAVILY dependent on Map-Reduce model (complicated design) – Erlang based – poor multi-threaded heap management NOSQL OPTIONS ACCOMPLISHING THE IMPOSSIBLE Subset of Evaluated NoSQL Options
  • 9. WHY TST <3’S MONGODB
    Why MongoDB for Thermopylae?
    • Documents based on JSON
      – A GeoJSON match made in heaven! (OGC)
    • C++: no garbage collection overhead! Efficient memory management design reduces disk swapping and paging
    • Disk storage is memory mapped, enabling fast swapping when necessary
    • Built-in auto-failover with replica sets and fast recovery with journaling
    • Tunable consistency
      – Consistency defined at the application layer
    • Schema flexible
      – Friendly properties of SQL enable an easy port
    • Provided initial spatial indexing support
      – Limited to points!
  • 10. MONGODB SPATIAL INDEXER
    … The Spatial Indexer wasn’t quite right
    • MongoDB (like nearly all relational DBs) uses a B-Tree
      – Data structure that keeps data sorted, with logarithmic-time search and update
      – Great for indexing numerical and text documents (1D attribute data)
      – Cannot store multi-dimension (>2D) data
      – NOT COMPLEX GEOMETRY FRIENDLY
  • 11. DIMENSIONALITY REDUCTION
    How does MongoDB solve the dimensionality problem?
    • Space-filling (Z) curve
      – A single line that passes through every point (cell) in a two-dimensional plane
    • Use Geohash to represent lat/lon values
      – Interleave the bits of a lat/lon pair
      – Base32 encode the result
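The two Geohash steps named on this slide (interleave the bits, then Base32 encode) can be sketched in a few lines. This follows the public Geohash scheme, with longitude taking the first bit; it is an illustration of the idea and not MongoDB's internal indexing code, and the name `geohash_encode` is an assumption.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a lat/lon pair by interleaving bisection bits, longitude first."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True
    while len(bits) < precision * 5:          # 5 bits per Base32 character
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            bits.append(1)
            rng[0] = mid                      # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid                      # keep the lower half
        use_lon = not use_lon                 # alternate dimensions (the Z curve)
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(_BASE32[index])
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744))
print(geohash_encode(0.0001, 0.0001), geohash_encode(-0.0001, -0.0001))
```

The second line of output encodes two points a few metres apart on either side of the equator and prime meridian. Their hashes share no common prefix, which is the kind of locality break that a sorted 1D index cannot hide.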
  • 12. GEOHASH BTREE ISSUES
    Issues with the Geohash B-Tree approach
    • Neighbors aren’t so close!
      – Neighboring points on the geoid may end up on opposite ends of the plane
      – Impacts search efficiency
    • What about geometry?
      – Doesn’t support > 2D
      – Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document
  • 13. GEO-SHARDING ALTERNATIVE
    Sort Order and Multi-Dimension… a nightmare (3D / 4D Hilbert Scanning Order)
  • 14. MULTI-LOCATION CLIPPING
    Mongo Multi-Location Document Clipping Issues ($within search doesn’t always work w/ multi-location)
    [Figure: Cases 1 to 4, each showing a Multi-Location Document (aka Polygon) against a Search Polygon; two cases are marked “Success!” and two “Fail!”]
  • 15. SOLUTIONS TO GEOHASH PROBLEM
    Potential Solutions
    • Constrain the system to single point searches
      – Multi-dimension support will be exponentially complex (won’t scale)
    • Interpolate points along the edge of the shape
      – Multi-dimension support will be exponentially complex (won’t scale)
    • Customize the spatial indexer
      – Selected approach
  • 16. CUSTOM TUNED SPATIAL INDEXER
    Thermopylae Custom Tuned MongoDB for Geo
    TST leverages Kriegel’s 1996 research in R*-Trees
    • R-Trees organize data of any dimension by representing each object as a minimum bounding box.
    • Each node bounds its children. A node can hold many objects (max: m, min: ceil(m/2)).
    • Splits and merges are optimized by minimizing overlaps.
    • The leaves point to the actual objects (probably stored on disk).
    • Height balanced: tree height, and so a low-overlap search, is O(log n).
  • 17. RTREE THEORY
    Spatial Indexing at Scale with R-Trees
    Spatial data is represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), or hexadecants (4-dimension).
    Index represented as <I, DiskLoc>, where:
      I = (I0, I1, …, In-1), n = number of dimensions
      Each Ii is a closed interval [min, max] describing the MBR range along dimension i
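The `<I, DiskLoc>` entry described above maps naturally onto a small data structure: one `[min, max]` interval per dimension plus a pointer to the stored object. The sketch below is a hypothetical Python rendering; the class names `MBR` and `IndexEntry`, and the use of a plain string for `disk_loc`, are assumptions and do not reflect MongoDB's C++ types.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple

Interval = Tuple[float, float]  # closed range [min, max] along one dimension


@dataclass(frozen=True)
class MBR:
    """Minimum bounding rectangle/box: one interval per dimension."""
    intervals: Tuple[Interval, ...]

    @classmethod
    def from_points(cls, points: Iterable[Sequence[float]]) -> "MBR":
        pts = list(points)
        dims = len(pts[0])
        return cls(tuple(
            (min(p[d] for p in pts), max(p[d] for p in pts)) for d in range(dims)
        ))

    def intersects(self, other: "MBR") -> bool:
        # Boxes overlap only if their intervals overlap in every dimension.
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for (a_lo, a_hi), (b_lo, b_hi)
                   in zip(self.intervals, other.intervals))

    def union(self, other: "MBR") -> "MBR":
        # Smallest box that covers both inputs.
        return MBR(tuple((min(a_lo, b_lo), max(a_hi, b_hi))
                         for (a_lo, a_hi), (b_lo, b_hi)
                         in zip(self.intervals, other.intervals)))


@dataclass(frozen=True)
class IndexEntry:
    """Index entry <I, DiskLoc>; disk_loc is a hypothetical stand-in string."""
    mbr: MBR
    disk_loc: str


# A 3-dimension footprint (x, y, time) reduced to its bounding box.
footprint = MBR.from_points([(0.0, 0.0, 10.0), (2.0, 1.0, 12.0), (1.0, 3.0, 11.0)])
entry = IndexEntry(footprint, "file0:offset128")
query = MBR(((1.5, 4.0), (2.0, 5.0), (0.0, 10.0)))
print(entry.mbr.intervals, entry.mbr.intersects(query))
```

Because every operation loops over `intervals`, the same code handles 2, 3, 4, or more dimensions without change, which is the property the slide is pointing at.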
  • 18. R*-TREE INDEX OBJECTIVES
    R*-Tree Spatial Index Example
    • Sample insertion result for 4th order tree
    • Objectives:
      1. Minimize area
      2. Minimize overlaps
      3. Minimize margins
      4. Maximize inner node utilization
    [Figure: example tree with entries labeled a through p]
  • 19. R*-TREE INSERT EXAMPLE
    Insert
    • Similar to insertion into a B+-tree, but may insert into any leaf; a leaf splits when its capacity is exceeded.
      – Which leaf to insert into?
      – How to split a node?
  • 20. Insert: Leaf Selection
    • Follow a path from root to leaf.
    • At each node, move into the subtree whose MBR area increases least with the addition of the new rectangle.
    [Figure: nodes m, n, o, p]
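The leaf-selection rule on this slide (descend into the child whose MBR grows least) can be sketched as below. This follows the area-enlargement criterion exactly as the slide states it, with ties broken by smaller area; a full R*-Tree also weighs overlap at the leaf level, which is omitted here. The names `area`, `enlargement`, and `choose_subtree` and the sample boxes are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (min, max) pair per dimension


def area(box: Box) -> float:
    """Area (2-D) or volume (n-D) of a bounding box."""
    result = 1.0
    for lo, hi in box:
        result *= (hi - lo)
    return result


def union(a: Box, b: Box) -> Box:
    """Smallest box that covers both a and b."""
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))


def enlargement(node_box: Box, new_box: Box) -> float:
    """Extra area the node's MBR would need in order to absorb new_box."""
    return area(union(node_box, new_box)) - area(node_box)


def choose_subtree(children: Sequence[Box], new_box: Box) -> int:
    """Index of the child needing least enlargement; ties go to the smaller box."""
    return min(range(len(children)),
               key=lambda i: (enlargement(children[i], new_box), area(children[i])))


# Made-up MBRs for three child nodes, plus a new rectangle to insert.
m = ((0.0, 4.0), (0.0, 4.0))
n = ((5.0, 9.0), (0.0, 4.0))
o = ((0.0, 4.0), (5.0, 9.0))
new_rect = ((3.0, 5.0), (1.0, 2.0))
print([enlargement(child, new_rect) for child in (m, n, o)])
print(choose_subtree([m, n, o], new_rect))
```

Applying `choose_subtree` at each level from the root down yields the insertion path; the split question from the previous slide only arises once the chosen leaf is full.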
  • 25. Query
    • Start at root
    • Find all overlapping MBRs
    • Search subtrees recursively
    [Figure: query rectangle x over the example tree with nodes m, n, o, p and entries a through l]
  • 26. Query
    • Search m.
    [Figure: query rectangle x descending into node m of the example tree]
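The query procedure from these two slides (start at the root, find overlapping MBRs, recurse) can be written as a short recursive function. The sketch below is a minimal illustration with a hypothetical `Node` type; the node names m and n echo the slides, but the coordinates are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (min, max) pair per dimension


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes overlap in every dimension."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


@dataclass
class Node:
    box: Box                                                       # MBR of everything below
    children: List["Node"] = field(default_factory=list)          # set on internal nodes
    entries: List[Tuple[Box, str]] = field(default_factory=list)  # set on leaves


def search(node: Node, query: Box) -> List[str]:
    """Return ids of all leaf entries whose MBR overlaps the query box."""
    if not overlaps(node.box, query):
        return []                              # prune this whole subtree
    hits = [oid for box, oid in node.entries if overlaps(box, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


# Two leaves under one root.
m = Node(((0.0, 4.0), (0.0, 4.0)),
         entries=[(((0.0, 1.0), (0.0, 1.0)), "a"), (((3.0, 4.0), (3.0, 4.0)), "b")])
n = Node(((5.0, 9.0), (0.0, 4.0)),
         entries=[(((5.0, 6.0), (0.0, 1.0)), "c"), (((8.0, 9.0), (3.0, 4.0)), "d")])
root = Node(((0.0, 9.0), (0.0, 4.0)), children=[m, n])

print(search(root, ((0.5, 2.0), (0.5, 2.0))))
```

In this example the query overlaps m but not n, so n's entries are never inspected. When node MBRs overlap heavily, more subtrees pass the test and the search drifts toward visiting every node.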
  • 27. R*-TREE MONGODB IMPLEMENTATION
    R*-Tree leverages B-Tree base data structures (buckets)
  • 28. SPATIAL INDEX ARCH & ORG
    Spatial Index Architecture, Organization, & Performance
    [Figure: bucket layout showing BucketHeader, MBRHeader, and MBRKeyNode(s)]
    1B Polygon Read Performance (worst case O(n))
    Dimensions | Num Buckets | Tree Height | Read Time
    3 | 3,448,276 | 3 | 190 ms
    5 | 5,076,143 | 3 | 275 ms
    100 | 90,909,091 | 8 | ~4.9 sec
  • 29. GEO-SHARDING
    Geo-Sharding (in work)
    Scalable Distributed R*-Tree (SD-r*Tree): a “balanced” binary tree, with nodes distributed on a set of servers:
    • Each internal node has exactly two children
    • Each leaf node stores a subset of the indexed dataset
    • At each node, the heights of the subtrees differ by at most one
    • The mongos “routing” node maintains the binary tree
  • 30. SD-r*Tree DATA STRUCTURE
    SD-r*Tree Data Structure Illustration
    • di = Data Node (Chunk)
    • ri = Coverage Node
    Leveraged work from Litwin, Mouza, Rigaux 2007
    [Figure: data nodes d0, d1, d2 and coverage nodes r1, r2, shown with the spatial coverage of entries a through e]
  • 31. SD-r*TREE STRUCTURE DISTRIBUTION
    SD-r*Tree Structure Distribution
    [Figure: data nodes d0, d1, d2 and coverage nodes r1, r2 distributed across GeoShard 1, GeoShard 2, and GeoShard 3 behind mongos]
  • 32. BEYOND 4-DIMENSIONS
    Beyond 4-Dimensions: X-Tree (Berchtold, Keim, Kriegel – 1996)
    • Avoid MBR overlaps
      – More overlap pushes reads toward the worst case of O(n)
    • Avoid node splits (main cause of high overlap)
    • Introduce a new node structure: Supernodes
      – Large directory nodes of variable size
    [Figure: X-Tree with normal internal nodes, supernodes, and data nodes]
  • 33. X-TREE PERFORMANCE
    X-Tree Performance Results (Berchtold, Keim, Kriegel – 1996)
  • 34. CONCLUSION
    T-Sciences Custom Tuned Spatial Indexer
    • Optimized spatial search
      – Finds intersecting MBRs and recurses into those nodes
    • Optimized spatial inserts
      – Uses the Hilbert value of the MBR centroid to guide search
      – 28% reduction in number of nodes touched
    • Optimized deletes
      – Leverages the R* split/merge approach for rebalancing the tree when nodes become over/under-full
    • Low maintenance
      – Leverages MongoDB’s automatic data compaction and partitioning
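The deck does not show how the Hilbert value of an MBR centroid is computed or how it steers inserts, so the following is only a sketch of the first half: mapping a centroid to a position on a 2D Hilbert curve using the widely published bit-manipulation algorithm. The function names, the `world` extent parameter, and the restriction to two dimensions are all assumptions; the deck itself refers to 3D / 4D orderings.

```python
from typing import Tuple

Box = Tuple[Tuple[float, float], ...]  # one (min, max) pair per dimension


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order square grid."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve keeps a consistent orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def centroid_key(box: Box, world: Box, order: int = 16) -> int:
    """Hilbert key of a 2-D MBR's centroid, quantized onto the world extent."""
    n = 1 << order
    cells = []
    for (lo, hi), (wlo, whi) in zip(box, world):
        centre = (lo + hi) / 2.0
        cell = int((centre - wlo) / (whi - wlo) * n)
        cells.append(min(n - 1, max(0, cell)))   # clamp onto the grid
    return hilbert_index(order, cells[0], cells[1])


world = ((-180.0, 180.0), (-90.0, 90.0))
boxes = [((10.0, 11.0), (50.0, 51.0)),
         ((-70.0, -69.0), (-30.0, -29.0)),
         ((10.5, 11.5), (50.5, 51.5))]
print(sorted(boxes, key=lambda b: centroid_key(b, world)))
```

Sorting by this key tends to place spatially close MBRs next to each other, which is one plausible reason such a key reduces the number of nodes touched during an insert.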
  • 35. Example: Mosaicked Video with KLV Footprints
    • Rip through KLV metadata
    • Index frame footprints and annotations as MBRs into the X(R*)-Tree
    • Leverage Geo-Sharding for spatially relevant scale
  • 36. EXAMPLE
    Example Use Case: OSINT (Foursquare Data)
    • Sample Foursquare data set mashed with Government Intel Data (poly reports)
    • 100 million Geo Document test (3D points and polys)
    • 4 server replica set
    • ~350 ms query response
    • ~300% improvement over PostGIS
  • 37. FIND US
    Community Support
    • Thermopylae plans to open source
      – http://github.com/thermopylae
    • TST working with 10gen to offer as a spatial extension
    • Active developer collaboration
      – IRC: #mongodb freenode.net
  • 40. GOVERNMENT CUSTOMERS
    Key Customers: Government
    • US Dept of State Bureau of Diplomatic Security
      – Build and support a 30 TB Google Earth Globe, with multiple terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.
    • US Army Intelligence Security Command
      – Provide expertise in managing technology integration; prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.
    • US Southern Command
      – Coordinate intelligence management systems’ spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
      – Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)
  • 41. COMMERCIAL CUSTOMERS
    Key Customers: Commercial
    • Cleveland Cavaliers
    • USGIF
    • Las Vegas Motor Speedway
    • Baltimore Grand Prix
    iSpatial framework serves millions of mobile devices
  • 42. ISPATIAL OVERVIEW
    • Expose and manage Multi-INT enterprise data in a geo-temporal user defined environment
    • Provide a flexible and scalable spatial data infrastructure (SDI) for Multi-INT data access and analysis
    • Spatially referenced data visualization on 3D globe & 2D maps
    • Access real/near real-time data feeds from forward deployed devices
    • Enable real-time information sharing and mission collaboration

Editor's Notes

  1. Screenshot of UDOP… blow-out of key features (sharing, presentation builder, etc.)