Presented by David Smiley, Software Systems Engineer, Lead, MITRE
Lucene’s former spatial contrib is gone and in its place is an entirely new spatial module developed by several well-known names in the Lucene/Solr spatial community. The heart of this module is an approach in which spatial geometries are indexed using edge-ngram tokenized geohashes searched with a prefix-tree/trie recursive algorithm. It sounds cool and it is! In this presentation, you’ll see how it works, why it’s fast, and what new things you can do with it. Key features are support for multi-valued fields, and indexing shapes with area -- even polygons, and support for various spatial predicates like “Within”. You’ll see a live demonstration and a visual representation of geohash indexed shapes. Finally, the session will conclude with a look at the future direction of the module.
3. About David Smiley
• Working at MITRE, for 13 years
• web development, Java, search
• 3 Solr apps, 1 Endeca
• Published 1st book on Solr; then 2nd edition (2009, 2011)
• Apache Lucene / Solr committer/PMC member (2012)
• Specializing in spatial
• Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
• Taught Solr classes at MITRE (2010, 2011, 2012)
• Solr search consultant within MITRE and its sponsors, and privately
6. What is Spatial Search?
Popular features:
• Spatial filter query
• Spatial distance sorting
• Spatial distance relevancy (i.e. spatial query score)
NOT “geocoding” – resolve “Boston” to its latitude and longitude
Typical use-case:
1. Index a location for each Lucene document given a latitude & longitude
2. Then search for matching documents by a circle (point-radius) or bounding box
3. Then sort results by distance
7. History of Spatial for Lucene & Solr
• 2007: Local-Lucene
• by Patrick O’Leary (AOL)
• 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0
• Local-Lucene graduates to an official Lucene contrib module
• 2009-12: Spatial Search Plugin (SSP) for Solr
• by Chris Male (JTeam -> Orange11, ElasticSearch)
• 2010-10: SOLR-2155 a geohash prefix tree filter
• by David Smiley (MITRE)
• 2011-01: Lucene Spatial Playground (LSP)
• by Ryan McKinley (Voyager GIS), David, and Chris
• 2011-03: Solr 3.1 new spatial features
• by Grant Ingersoll and Yonik Seeley (LucidWorks)
• 2012-03: LSP -> Lucene 4 spatial module + Spatial4j + SSP
• replaces former Lucene spatial contrib module
8. Lucene Spatial Committers
• David Smiley
• Works for MITRE
• Boston area
• Ryan McKinley
• Works for Voyager GIS
• Silicon Valley
• Chris Male
• Formerly at Elastic Search
• New Zealand
10. Lines of Code for Spatial Components
• Spatial4j: 43%
• Lucene spatial: 35%
• Solr adapters: 6%
• Misc: 16%
Total: 4,781 Non-Comment Source Statements (without javadocs or tests) as of 2012-09
11. CarrotSearch Labs’ RandomizedTesting
• http://labs.carrotsearch.com/randomizedtesting.html
• Provides plumbing for repeatable randomized JUnit tests
• All the spatial test code uses it extensively
Randomized testing more generally is a certain
philosophy / approach on how to test
• A typical hard-coded test will only catch some regressions
• A randomized test will catch just about anything
eventually, especially nasty edge cases
• Although it’s hard to read / write / maintain these tests
• Randomized testing helped find bugs related to…
• Computing the bounding box of a circle
• Computing the relationship of a circle to a rectangle that has all 4 of
its corners inside it
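To make the idea concrete, here is a minimal sketch of a repeatable randomized test. It does not use the CarrotSearch library; it assumes plain java.util.Random with a logged seed, and the Cartesian circleBBox helper is hypothetical, standing in for the code under test.

```java
import java.util.Random;

public class RandomizedBBoxSketch {
    /** Hypothetical code under test: Cartesian bounding box {minX, minY, maxX, maxY}. */
    static double[] circleBBox(double cx, double cy, double r) {
        return new double[] {cx - r, cy - r, cx + r, cy + r};
    }

    /** Runs one randomized round; the same seed always replays the same inputs. */
    public static void checkWithSeed(long seed, int iterations) {
        Random rnd = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            double cx = rnd.nextDouble() * 200 - 100;
            double cy = rnd.nextDouble() * 200 - 100;
            double r = rnd.nextDouble() * 50;
            double[] box = circleBBox(cx, cy, r);
            // Pick a random point inside the circle using polar coordinates.
            double angle = rnd.nextDouble() * 2 * Math.PI;
            double dist = rnd.nextDouble() * r;
            double px = cx + dist * Math.cos(angle);
            double py = cy + dist * Math.sin(angle);
            // Invariant: every point inside the circle is inside its bounding box.
            boolean inBox = px >= box[0] && px <= box[2] && py >= box[1] && py <= box[3];
            if (!inBox) {
                throw new AssertionError("seed=" + seed + " iteration=" + i);
            }
        }
    }

    public static void main(String[] args) {
        long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        System.out.println("seed=" + seed); // log the seed so a failure can be replayed
        checkWithSeed(seed, 10_000);
    }
}
```

Passing a previously logged seed as the first argument replays the exact failing inputs, which is the "repeatable" part of the approach.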
13. Spatial4j: It’s all about the shapes
https://github.com/spatial4j/spatial4j (spatial4j.com redirect)
• Shapes
• A “Shape” abstraction with multiple implementations
• Geodetic (sphere) & Cartesian/2D implementations
• Computes intersection relationship with other shapes
• Also…
• Distance and area math utilities, Geohash utilities
• Parsing Well Known Text (WKT) formatted shapes
• ASL licensed project independent of Apache on GitHub
• Requires JTS (LGPL licensed) for polygons & WKT*
• JTS is “JTS Topology Suite”
• * WKT parsing soon to be implemented directly by Spatial4j
• Ported to .NET as Spatial4n and used by RavenDB
• by Itamar Syn-Hershko
14. The case for Spatial4j’s existence
• Just for shapes? How much code could there be?
• You’d be surprised. Determining the relationship between a lat-lon
rectangle and a geodetic circle (Within, Contains, Intersects, Disjoint)
is non-trivial, and that’s just one shape.
• Lots of non-trivial test code goes with it.
• Why isn’t it a part of Lucene spatial?
• Parts of Spatial4j depend on JTS, an LGPL licensed library. The
Lucene PMC voted not to introduce this compile-time dependency.
• Spatial4j is independently useful.
• Does this duplicate other open-source code that could be used?
• Spatial4j needs to be ASL licensed to be a dependency of Lucene.
• Still… I haven’t found existing code that does what Spatial4j does.
• Can’t only the JTS dependent parts be external to Lucene?
15. The Shape interface
(may become an abstract class in the next version)
interface Shape {
  Point getCenter();
  Rectangle getBoundingBox();
  boolean hasArea();
  double getArea();
  SpatialRelation relate(Shape other); // must support Point & Rectangle
}
• enum SpatialRelation
• DISJOINT, INTERSECTS, WITHIN, CONTAINS
• Note: simpler set than the “DE-9IM” spatial standard
• no “equals” or “touches”
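As an illustration of the four-valued relation, here is a minimal sketch for two axis-aligned Cartesian rectangles. The Rect and Relation types are hypothetical stand-ins, not Spatial4j's classes, and the handling of equal rectangles (reported as WITHIN, since there is no "equals") is this sketch's own choice.

```java
public class RelateSketch {
    public enum Relation { DISJOINT, INTERSECTS, WITHIN, CONTAINS }

    /** Hypothetical Cartesian rectangle without dateline wrap. */
    public static final class Rect {
        final double minX, maxX, minY, maxY;

        public Rect(double minX, double maxX, double minY, double maxY) {
            this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
        }

        /** Relation of this rectangle to the other one. */
        public Relation relate(Rect o) {
            // No overlap on at least one axis.
            if (o.minX > maxX || o.maxX < minX || o.minY > maxY || o.maxY < minY) {
                return Relation.DISJOINT;
            }
            // This rectangle lies entirely inside the other.
            if (minX >= o.minX && maxX <= o.maxX && minY >= o.minY && maxY <= o.maxY) {
                return Relation.WITHIN;
            }
            // The other rectangle lies entirely inside this one.
            if (o.minX >= minX && o.maxX <= maxX && o.minY >= minY && o.maxY <= maxY) {
                return Relation.CONTAINS;
            }
            return Relation.INTERSECTS;
        }
    }

    public static void main(String[] args) {
        Rect big = new Rect(0, 10, 0, 10);
        Rect small = new Rect(2, 4, 2, 4);
        System.out.println(big.relate(small));                    // CONTAINS
        System.out.println(small.relate(big));                    // WITHIN
        System.out.println(big.relate(new Rect(8, 12, 8, 12)));   // INTERSECTS
        System.out.println(big.relate(new Rect(20, 30, 0, 10)));  // DISJOINT
    }
}
```

Geodetic shapes and dateline wrap make the real implementations considerably more involved than this flat-plane case.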
16. Spatial4j shapes
Shape                         | Cartesian | Cartesian with dateline wrap | Geodetic
Point                         | Y         | Y                            | Y
Line & LineString (w/ buffer) | Y         | N                            | N
Rectangle                     | Y         | Y                            | Y
Circle                        | Y         | N                            | Y
ShapeCollection               | Y         | Y                            | Y
JTS Geometry (incl. polygons) | Y         | Y                            | N
• Cartesian (AKA Euclidean): a flat plane
• Dateline wrap assumes the plane circles back on itself
• Geodetic: a spherical mathematical model
17. Well Known Text (WKT)
(see Wikipedia)
• A popular standard for representing shapes as strings
• Requires JTS’s WKT parser, but Spatial4j has its own in progress
• Extensions are TBD for Rectangles and Circles
• Limited support for EMPTY and “Z” and “M” dimensions (future)
• Some Examples:
• POINT (3 -2)
• LINESTRING(30 10, 10 30, …
• POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))
• MULTIPOLYGON (((…
• …
• Deprecated (may move to Solr):
• -90, -180
• -180 -90 180 90
• CIRCLE(4.56,1.23 d=0.071)
• TBD / Pending:
• ENVELOPE(-180,180,90,-90)
• BOX2D(-180 -90, 180 90)
18. Spatial4j code sample
SpatialContext ctx = SpatialContext.GEO;
// makeRectangle(minX, maxX, minY, maxY)
Rectangle r = ctx.makeRectangle(-71, -70, 42, 43);
// makeCircle(x, y, radius in degrees)
Circle c = ctx.makeCircle(-72, 42, 1);
SpatialRelation rel = r.relate(c);
System.out.println(rel);
boolean intersects = rel.intersects();
// Polygons and WKT parsing need the JTS-backed context
ctx = JtsSpatialContext.GEO;
Shape s = ctx.readShape("POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))");
double distanceDegrees = ctx.getDistCalc().distance(
    ctx.makePoint(2, 2), ctx.makePoint(3, 3));
Note: distances (including circle radius) are in “Degrees”, not radians or KM
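Because distances are expressed in degrees, a conversion to and from kilometers is often needed. The sketch below shows the arithmetic on a spherical model; the class and method names are hypothetical, and the mean Earth radius constant is an assumed value rather than one taken from these slides.

```java
public class DegreesSketch {
    // Assumed mean Earth radius in kilometers (spherical model).
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    /** Arc length on the sphere (km) converted to the angle it spans (degrees). */
    public static double kmToDegrees(double km) {
        return Math.toDegrees(km / EARTH_MEAN_RADIUS_KM);
    }

    /** Angle (degrees) converted to the arc length it spans on the sphere (km). */
    public static double degreesToKm(double degrees) {
        return Math.toRadians(degrees) * EARTH_MEAN_RADIUS_KM;
    }

    public static void main(String[] args) {
        System.out.println(kmToDegrees(200)); // roughly 1.8 degrees
        System.out.println(degreesToKm(1));   // roughly 111.2 km
    }
}
```

Under this model one degree of arc is about 111.2 km, so a 200 km circle radius is a little under 1.8 degrees.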
19. Spatial4j Future
• Built-in WKT support (no JTS dependency)
• Extensible to user-defined shapes
• API improvements
• Shape argument validation via WKT but not via ctx.makeShape(…)
• ShapeCollection visitor design pattern
• Refactor to remove need for isGeo()
• LineString dateline & geodetic support
• Projection / Datum support
21. Lucene 4 Spatial Module
• There isn’t one best way to implement spatial indexing for
all use-cases
• Index just points, or other shapes too? Which?
• Multiple shapes per field?
• Query by Intersection? Contains? Within? Equals? Disjoint? …
• Distance sorting? Query boost by distance?
• Or more exotic shape relevancy like overlap percentage?
• Tradeoff shape precision for speed?
• Multiple SpatialStrategy implementations:
• RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy
• PointVectorStrategy
• BBoxStrategy (currently in trunk, not 4x)
• JtsGeoStrategy (in Spatial Solr Sandbox)
22. Strategy: PointVector
• Similar to Solr’s PointType / LatLonType
• X & Y trie double fields; caching via FieldCache
• Characteristics
• Indexes points (only)
• Single-valued field (no multi)
• Query by rectangle or circle (only)
• Circle uses FieldCache (requires memory)
• Circle does bbox pre-filter for performance
• Relations: Intersects, Within (only)
• Exact precision for x & y coordinates and query shape
• Distance sort
• Uses FieldCache (requires memory)
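The bbox pre-filter idea can be sketched in memory as follows. This is a simplified Cartesian illustration under assumed names, not the strategy's actual code: a cheap bounding-box rejection runs first, and the exact distance check only runs on the survivors.

```java
import java.util.ArrayList;
import java.util.List;

public class CircleFilterSketch {
    /** Returns the points (as {x, y} arrays) that lie within the circle. */
    public static List<double[]> withinCircle(List<double[]> points,
                                              double cx, double cy, double r) {
        double minX = cx - r, maxX = cx + r, minY = cy - r, maxY = cy + r;
        List<double[]> hits = new ArrayList<>();
        for (double[] p : points) {
            // Step 1: cheap bounding-box rejection (two range checks).
            if (p[0] < minX || p[0] > maxX || p[1] < minY || p[1] > maxY) {
                continue;
            }
            // Step 2: exact distance check, only for points inside the box.
            double dx = p[0] - cx, dy = p[1] - cy;
            if (dx * dx + dy * dy <= r * r) {
                hits.add(p);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        pts.add(new double[] {0, 0});     // inside the circle
        pts.add(new double[] {0.9, 0.9}); // inside the box, outside the circle
        pts.add(new double[] {5, 5});     // outside the box
        System.out.println(withinCircle(pts, 0, 0, 1).size()); // 1
    }
}
```

In an index the first step corresponds to range queries on the x and y fields, while the second step needs the per-document coordinates in memory, which is where the FieldCache cost comes from.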
23. Strategy: BBox
• Implemented with 4 doubles & 1 boolean
• Ported from ESRI GeoPortal (Open Source)
• Characteristics:
• Indexes rectangles (only)
• Single-valued field (no multi)
• Query by rectangle (only)
• Supports all relations: Intersects, Within, Contains, …
• Distance sort from box center
• Uses FieldCache (requires memory)
• Area overlap sorting
• Sort results by percentage overlap between query and indexed boxes
• Uses FieldCache (requires memory)
• Note: FieldCache needs are somewhat high
24. Strategy: JtsGeoStrategy
• Stores a JTS geometry in Lucene 4’s DocValues
• Stores WKB (WKT in binary format)
• Full vector geometry is retained for search
• DocValues is mostly a better FieldCache
• Faster loading into memory
• Can be disk resident or memory
• Multi-valued
• Characteristics:
• Indexes any shape, including Multi… varieties
• Query by any shape
• Uses DocValues (memory use optional)
• Supports all relations: intersect, within, contains, …
• Could easily also support JTS’s exotic DE-9IM based relations
• Exact precision to the vector geometry
• No sorting
• Experimental / immature status
More of a proof-of-concept for now
26. Strategy: RecursivePrefixTree
• Grid / Tile / Trie / Prefix-Tree based
• With recursive descent algorithms
• Or TermQueryPrefixTree alternative
• Choose Geohash (geo only) or Quad tree
• The most mature strategy to date
• Highly tested
• The current evolution of SOLR-2155
27. Strategy: RecursivePrefixTree
• Characteristics:
• Indexes all shapes
• Variable precision of shape edges
• Highly precise shapes other than Point won’t scale
• LineString possibly not precise enough for your needs
• Multi-valued field support
• Query by any shape
• Variable precision for query shape
• Highest precision usually scales
• All Relations: Intersects, Within, Contains, Disjoint
• Distance sort (w/ multi-value support)
• Warning: immature, won’t scale
• Uses significant amounts of memory
• Fast scalable spatial filtering; no caches needed
new in Lucene 4.3
How many search / NoSQL systems have these capabilities?
28. Geohashes
• What is a Geohash?
• A lat/lon geocode system
• Has a hierarchical spatial structure
• Gradual precision degradation
• In the public domain
http://en.wikipedia.org/wiki/Geohash
• Example: (Boston) DRT2Y
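For readers who want to see the hierarchy in action, here is a minimal sketch of the standard public-domain geohash encoding (alternating longitude and latitude bits, 5 bits per base-32 character). It is not Spatial4j's GeohashUtils, the class name is hypothetical, and it emits the usual lower-case alphabet whereas the slide writes hashes in upper case.

```java
public class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a latitude/longitude pair into a geohash of the given length. */
    public static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder(length);
        boolean lonBit = true; // bits alternate, starting with longitude
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { value = (value << 1) | 1; minLon = mid; }
                else            { value = value << 1;       maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { value = (value << 1) | 1; minLat = mid; }
                else            { value = value << 1;       maxLat = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // A point in the Boston area; each shorter hash is a prefix of the longer one.
        for (int len = 1; len <= 5; len++) {
            System.out.println(encode(42.35, -71.08, len)); // d, dr, drt, drt2, drt2y
        }
    }
}
```

Truncating a hash yields the enclosing, coarser cell, which is the "gradual precision degradation" that makes prefix-based indexing possible.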
36. Demo
• Spatial Solr Playground
• Demo KML grid generation from geometries
• A sample point with quad tree indexes to these tokens:
• A, AD, ADB, ADBA
• A sample circle with quad tree indexes to these tokens:
• A, AB, ABA, ABAB+, ABAC+, ABAD+, ABB, ABBA+,
ABBB+, ABBC+, ABBD+, ABC, ABCA+, ABCB+, ABCC+,
ABCD+, ABD+, AD, ADA, ADAA+, ADAB+, ADAC+, ADAD+,
ADB+, ADC, ADCA+, ADCB+, ADCD+, ADD, ADDA+,
ADDB+, ADDC+, ADDD+, B, BA, BAA, BAAC+, BAAD+,
BAC, BACA+, BACB+, BACC+, BACD+, BC, BCA, BCAA+,
BCAB+, BCAC+, BCC, BCCA+, BCCC+, C, CB, CBB,
CBBA+
• Tokens with a ‘+’ are actually indexed with and without the ‘+’
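The point case above can be sketched as a simple quadrant descent, emitting one token per level. The class name and the quadrant lettering (A top-left, B top-right, C bottom-left, D bottom-right) are assumptions for illustration; the '+' leaf marker and non-point shapes are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class QuadTokensSketch {
    /** Returns the quad tree tokens for a point, one per level, coarsest first. */
    public static List<String> pointTokens(double x, double y, int maxLevels) {
        double minX = -180, maxX = 180, minY = -90, maxY = 90;
        StringBuilder token = new StringBuilder();
        List<String> tokens = new ArrayList<>();
        for (int level = 0; level < maxLevels; level++) {
            double midX = (minX + maxX) / 2;
            double midY = (minY + maxY) / 2;
            boolean right = x >= midX;
            boolean top = y >= midY;
            // Assumed lettering: A top-left, B top-right, C bottom-left, D bottom-right.
            token.append(top ? (right ? 'B' : 'A') : (right ? 'D' : 'C'));
            tokens.add(token.toString());
            // Shrink the cell to the chosen quadrant and descend.
            if (right) { minX = midX; } else { maxX = midX; }
            if (top)   { minY = midY; } else { maxY = midY; }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(pointTokens(10, -10, 3)); // [D, DA, DAA]
    }
}
```

Every token is a prefix of the next, so a query cell at any level matches the point by a single term lookup.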
38. Lucene Spatial example code
ctx = SpatialContext.GEO;
strategy = new RecursivePrefixTreeStrategy(
    new GeohashPrefixTree(ctx, 11), "myGeoField");
… // make indexWriter and a Document
for (Field f : strategy.createIndexableFields(shape))
  doc.add(f);
indexWriter.addDocument(doc);
…
filter = strategy.makeFilter(
    new SpatialArgs(SpatialOperation.Intersects,
        ctx.makeCircle(-80.0, 33.0,
            DistanceUtils.dist2Degrees(200,
                DistanceUtils.EARTH_MEAN_RADIUS_KM))));
indexSearcher.search(userKeywordQuery, filter, 10);
See SpatialExample.java in Lucene spatial tests for more
39. Future
• Possible de-emphasis of SpatialStrategy abstraction
• Better options for distance sorting of PrefixTree strategies
• Better PrefixTree encoding than both geohash & quad tree
• Google Summer of Code 2013 -- TBD
• Performance improvements to spatial Intersects
RecursivePrefixTree Filter
• Remove the need to double-index leaf-nodes (with and
without ‘+’)
• Exact geometry search by blending benefits of PrefixTree
and JtsGeoStrategy
• A Single-dimensional PrefixTree (for numeric range index)
41. Solr 3 Spatial: LatLonType & friends
• Solr 3 was Solr’s first release to include spatial support
• Not based on Lucene’s old spatial contrib module
• Similar to PointVectorStrategy (formerly TwoDoublesStrategy) but more optimized
• Single-valued only, fast distance sorting, can choose floats (save
memory)
• Fields:
• LatLonType (Geodetic)
• PointType (Cartesian)
• Query parsers (spatial filters):
• {!geofilt} (circle) “pt” and “sfield” and “d” params
• {!bbox} (bounding box of a circle)
• Distance function:
• geodist() and some esoteric others
NOT completely superseded by Solr 4 spatial fields
42. Solr 4 Spatial
• See http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
<fieldType name="location_rpt"
  class="solr.SpatialRecursivePrefixTreeFieldType"
  spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
  distErrPct="0.025"
  maxDistErr="0.000009"
  units="degrees" />
• spatialContextFactory: if you don’t need JTS (polygons), don’t set this
• distErrPct: non-point shapes are approximated to the grid with up to 2.5% of radius
• maxDistErr: max precision (1m) as measured in degrees
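To show how an error tolerance in degrees maps to grid depth, the sketch below picks the shortest geohash length whose cell is no wider or taller than the tolerance. It is an approximation under stated assumptions: the class and method names are hypothetical, "percent of radius" is taken at face value, and Lucene's own level selection may differ in detail.

```java
public class GeohashLevelSketch {
    /** Shortest geohash length whose cell width and height are both <= maxErrDegrees. */
    public static int levelForError(double maxErrDegrees, int maxLevels) {
        for (int level = 1; level <= maxLevels; level++) {
            int bits = 5 * level;         // 5 bits per geohash character
            int lonBits = (bits + 1) / 2; // longitude gets the extra bit on odd totals
            int latBits = bits / 2;
            double width = 360.0 / (1L << lonBits);
            double height = 180.0 / (1L << latBits);
            if (width <= maxErrDegrees && height <= maxErrDegrees) {
                return level;
            }
        }
        return maxLevels;
    }

    public static void main(String[] args) {
        // maxDistErr from the field type: about 1 m expressed in degrees.
        System.out.println(levelForError(0.000009, 24)); // 11
        // distErrPct=0.025 applied to a query circle with a 1-degree radius.
        System.out.println(levelForError(0.025 * 1.0, 24)); // 6
    }
}
```

Under these assumptions the 1 m setting needs 11 geohash characters, while a 1-degree circle at 2.5% error only needs 6, which is why loosening distErrPct reduces the number of indexed and queried cells so sharply.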
43. Indexing
• Point: Latitude, Longitude (i.e. Y, X)
<field name="geo">43.17614, -90.57341</field>
• Point: X Y
<field name="geo">-90.57341 43.17614</field>
• Rect: minX minY maxX maxY
<field name="geo">-74.093 41.042 -69.347 44.558</field>
• Circle: point then d=radius (in degrees)
• will be deprecated
<field name="geo">Circle(4.56,1.23 d=0.0710)</field>
• WKT (preferred; it’s a standard)
<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
44. Filter (search)
• Using Solr 3’s bbox or geofilt query parsers
• Distance radius ‘d’ is interpreted as kilometers, just like LatLonType
• Limited to bbox and bbox of a circle
fq={!geofilt}&sfield=geo&pt=45.15,-93.85&d=5
• Range query style (bounding box)
• Handles dateline wrap
fq=geo:[-90,-180 TO 90,180]
• Field query style
• Unique to Lucene 4 spatial; see SpatialArgsParser
fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
• Predicates: Intersects, IsDisjointTo, IsWithin,
Contains, …
• distErrPct (& distErr) optional; override field type’s default
SOLR-4242: A
better spatial
query parser
45. Distance Sort & Relevancy Boost
• geodist() is for Solr 3 LatLonType only
sort=geodist(lltField,45.15,-93.85) desc
• Solr 4 spatial queries can return the distance as the score
q={!geofilt sfield=geo pt=45.15,-93.85 d=5 score=distance}&sort=score asc&fl=*,score
• Without a filter
sort=query($sortsq) asc&sortsq={!geofilt filter=false score=distance sfield=geo pt=45.15,-93.85 d=0}
• Relevancy boost
defType=edismax&boost=query($mysq)&mysq={!geofilt filter=false score=recipDistance pt=45.15,-98.85 d=5}
46. Distance Faceting
• sfield=geo (the field)
• pt=45.15,-93.85 (point of reference)
• Within 10km
• facet.query={!geofilt d=10}
• Within 50km
• facet.query={!geofilt d=50}
• Within 100km
• facet.query={!geofilt d=100}
47. Future
• A more Solr-friendly spatial query parser SOLR-4242
• Retrofit geodist() to support the SpatialStrategies?
• Expose more tunables
• A grid based heat-map faceting component
• Idea: a multi-strategy spatial field encompassing
• A PrefixTree field for points
• A PrefixTree field for non-points
• A TwoDoubles field for good distance sorting / relevancy
• Knows whether it’s single vs. multi-valued
• A FieldType for multi-value numeric ranges
50. Heatmap / Grid faceting
1. Geohash each point to multiple lengths and index each length into its own field
• geohash_1:D, geohash_2:DR, geohash_3:DRT, geohash_4:DRT2
2. Search with a rectangle (bbox) filter, and…
3. Facet on the geohash field with the desired resolution
• facet.field=geohash_4&facet.limit=10000
• Lots of tuning / customization options
• Projected / quad tree
• facet.prefix may help
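The indexing and faceting steps of this recipe can be sketched in memory as follows. The class and method names are hypothetical, and the facet counting here only simulates what a Solr facet.field request on one of the geohash_N fields would return.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GeohashFacetSketch {
    /** Step 1: one field per prefix length, e.g. geohash_3 -> "DRT". */
    public static Map<String, String> prefixFields(String geohash) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int len = 1; len <= geohash.length(); len++) {
            fields.put("geohash_" + len, geohash.substring(0, len));
        }
        return fields;
    }

    /** Step 3 (simulated): count points per grid cell at the chosen resolution. */
    public static Map<String, Integer> facetCounts(List<String> geohashes, int len) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String g : geohashes) {
            if (g.length() >= len) {
                counts.merge(g.substring(0, len), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(prefixFields("DRT2"));
        // {geohash_1=D, geohash_2=DR, geohash_3=DRT, geohash_4=DRT2}
        System.out.println(facetCounts(java.util.Arrays.asList("DRT2Y", "DRT2Z", "DRT3A"), 4));
        // {DRT2=2, DRT3=1}
    }
}
```

Each facet value is a grid cell and its count is the heat for that cell, so choosing a shorter prefix length gives a coarser heatmap with fewer cells.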
51. Plotting many points on a map
• Why not ask Solr for rows=1000 ?
• It’s slow
• With a variable number of points per doc, this could yield 1 distinct point or 1M
• Instead facet on a geohash with facet.limit=1000
• Fast
• Guaranteed <= 1000 points
• But might need lots of memory
• Or result-grouping on a geohash
But do you really want to plot 1000+ points on a map?
52. Filter by indexed distance constraints
• Imagine a dating site where both potential parties have a
maximum distance they’re willing to travel
• Q: For the current user, who is not “too far” for you but is
also not “too far” for them?
• A: Index each user’s location as a point in one field and
as a circle in another. Query by the current user’s circle to
the indexed point field as well as the current user’s point
to the indexed circle field.
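The two-sided check can be sketched as below. It uses flat Cartesian distances for brevity, and the User type and method names are hypothetical; in the index the two conditions correspond to the two spatial queries described above.

```java
public class MutualDistanceSketch {
    /** Hypothetical user: a location plus the maximum distance they will travel. */
    public static final class User {
        final double x, y, maxTravel;

        public User(double x, double y, double maxTravel) {
            this.x = x; this.y = y; this.maxTravel = maxTravel;
        }
    }

    static double distance(User a, User b) {
        return Math.hypot(a.x - b.x, a.y - b.y);
    }

    /**
     * True when the other user's point is inside the current user's circle
     * AND the current user's point is inside the other user's circle.
     */
    public static boolean mutualMatch(User current, User other) {
        double d = distance(current, other);
        boolean notTooFarForMe = d <= current.maxTravel;  // my circle vs. their point
        boolean notTooFarForThem = d <= other.maxTravel;  // my point vs. their circle
        return notTooFarForMe && notTooFarForThem;
    }

    public static void main(String[] args) {
        User me = new User(0, 0, 10);
        System.out.println(mutualMatch(me, new User(3, 4, 5))); // true: distance is 5
        System.out.println(mutualMatch(me, new User(3, 4, 4))); // false: too far for them
    }
}
```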
53. Multi-valued durations
• What if your documents needed a variable number of time (or other numerical value) durations?
• This approach won’t work:
<field name="start" type="tdate" multiValued="true"/>
<field name="end" type="tdate" multiValued="true"/>
• Solr (without Solr 4 spatial fields) can’t do it!
• You need to think differently to solve this…
http://wiki.apache.org/solr/SpatialForTimeDurations
• Example use-cases
• Searching for hotel-room vacancies
• Searching for movie show-times
• (next slides) Each document is a person with a variable number of
“shifts” that they are working…
56. … some config & search details
• Configuration
<fieldType name="days_of_year"
class="solr.SpatialRecursivePrefixTreeFieldType"
geo="false" units="degrees"
worldBounds="0 0 365 365"
distErrPct="0" maxDistErr="1"/>
• Sample search: Find shifts that have any overlap with the 19th through the 23rd day
daysOfYear:Intersects(0 18.5 23.5 365)
• Caveat: Won’t scale to the full precision of a java Long (timestamp)
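The sample query can be read as a point-in-rectangle test, assuming each shift is indexed as the 2D point (x = start day, y = end day), which is what the query rectangle implies. The sketch below reproduces that logic in memory; the class and method names are hypothetical.

```java
public class DurationPointSketch {
    /**
     * A duration [start, end] is treated as the point (x=start, y=end).
     * It overlaps the query range when that point falls inside the rectangle
     * minX=0, minY=qStart, maxX=qEnd, maxY=worldMax.
     */
    public static boolean overlaps(double start, double end,
                                   double qStart, double qEnd, double worldMax) {
        double minX = 0, minY = qStart, maxX = qEnd, maxY = worldMax;
        boolean startsBeforeQueryEnds = start >= minX && start <= maxX;
        boolean endsAfterQueryStarts = end >= minY && end <= maxY;
        return startsBeforeQueryEnds && endsAfterQueryStarts;
    }

    public static void main(String[] args) {
        // Same rectangle as the sample search: Intersects(0 18.5 23.5 365)
        System.out.println(overlaps(20, 21, 18.5, 23.5, 365)); // true: inside the range
        System.out.println(overlaps(10, 18, 18.5, 23.5, 365)); // false: ends too early
        System.out.println(overlaps(24, 30, 18.5, 23.5, 365)); // false: starts too late
        System.out.println(overlaps(1, 300, 18.5, 23.5, 365)); // true: spans the range
    }
}
```

Because each shift is a separate point in a multi-valued spatial field, the start and end of one shift are never mixed with those of another, which is the problem the two multi-valued date fields could not solve.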