This document summarizes a presentation about using Apache Lucene and Solr for local and geographic search. The presentation covered the basics of geographic search with Lucene and Solr, including spatial data types, integration with text, and application needs like efficient distance calculations and filtering. It provided examples of features in Lucene 2.9 and Solr 1.4 for spatial search, such as Cartesian tiers. The presentation also discussed how Solr powers local search at YP.com, including their custom relevance model and scalable search architecture.
2. Agenda
Grant Ingersoll, Lucid Imagination
Introduction
Basics of geo-spatial search
Tools available in Lucene and Solr
Ryan McKinley, Voyager GIS
Spatial search in Action:
Sameer Maggon, AT&T Interactive
How Solr powers local search at YP.com
Lucid Imagination, Inc.
3. Introductions
Grant Ingersoll
Lucene/Solr committer
Co-author of upcoming “Taming Text”
Ryan McKinley
Lucene/Solr committer
Co-founder of Voyager GIS
Sameer Maggon
Search Eng. Team lead at AT&T Interactive
Active user of Lucene since 2001
4. Use Cases
Asset Management
“Dude, where’s my map?”
Social Networking
Find all friends near me
Targeted, local search results and ads
“restaurants in Austin Texas”
“Starbucks, 55313”
Business Intelligence
Restrict doc set for analysis by location
5. Spatial Search Concepts
Spatial Data Types
Points (latitude/longitude)
Lines
Shapes
Maps and overlays
Streets, POI
http://www.openstreetmap.org/?lat=44.9744&lon=-93.2484&zoom=14&layers=B000FTFT
Integration with unstructured text
Metadata, descriptions, user reviews, etc.
6. Application Needs
Query Parsing
Efficient distance calculations
Euclidean, Great Circle (Haversine), Vincenty’s
Filtering
Bounding Box
Sort by Distance
Relevance Enhancement
Faceting
Advanced: shape intersections, routes
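As an illustration of the distance calculations listed above, here is a minimal sketch of the Great Circle (Haversine) formula in plain Java. It is not the Lucene or Solr implementation; the class name, method name, and the mean Earth radius of 6371 km are assumptions made for this example.

```java
final class GreatCircle {
    // Mean Earth radius in kilometers (assumed value; the Earth is not a perfect sphere).
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Haversine distance in kilometers between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against rounding errors before asin.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // Approximate city-center coordinates, used only as sample input.
        System.out.println(haversineKm(44.9744, -93.2484, 30.2672, -97.7431));
    }
}
```

Sorting by distance then amounts to computing this value per candidate document and ordering on it, which is why the cost of each calculation matters.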
7. Lucene 2.9/Solr 1.4 Features for Spatial Search
Lucene/Solr are excellent for dealing with unstructured text
2.9/1.4 adds:
Better Numeric handling for range searches
Spatial contrib module (Lucene 2.9 only; coming to Solr in 1.5) with features for:
• Creating Cartesian Tiers (Grids)
• Geohashes
• Calculating distances
• Filter implementations
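To make the geohash idea concrete, here is a minimal sketch of the standard geohash encoding in plain Java. It is a stand-alone illustration and not the code from the Lucene spatial contrib; the class and method names are assumptions. Longitude and latitude bits are interleaved, and every five bits become one base-32 character, so nearby points tend to share a prefix.

```java
final class GeohashSketch {
    // Standard geohash base-32 alphabet (omits a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a point (in degrees) as a geohash of the requested length. */
    static String encode(double lat, double lon, int length) {
        double latMin = -90.0, latMax = 90.0;
        double lonMin = -180.0, lonMax = 180.0;
        StringBuilder hash = new StringBuilder(length);
        boolean lonTurn = true; // bits alternate, starting with longitude
        int bitCount = 0;
        int value = 0;
        while (hash.length() < length) {
            if (lonTurn) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value = value << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value = value << 1;       latMax = mid; }
            }
            lonTurn = !lonTurn;
            if (++bitCount == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bitCount = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(57.64911, 10.40744, 5)); // prints "u4pru"
    }
}
```

Because a shorter hash is always a prefix of a longer one for the same point, a prefix match on an indexed geohash term can act as a coarse proximity filter.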
8. Query Parsing
Query parsing is often the most difficult to get right
User error, ambiguity in names
Mixture of topic and location: bars in Minneapolis MN
Geocoding translates addresses, POIs into lat/lon or other
Several publicly available services: geonames.org, Google Maps
Often have built-in throttles, so may not be effective for prod.
Query logs are invaluable for developing an effective parser
9. Filtering
Range queries can significantly slow down search if done improperly
Goal: reduce the number of terms to evaluate
Solution 1:
New Trie-based numeric capabilities
Solution 2:
Cartesian Tiers
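A bounding-box filter maps onto two numeric range constraints, one on latitude and one on longitude. The sketch below shows one way to derive those ranges from a center point and a radius, in plain Java. It is an approximation under stated assumptions: a spherical Earth of radius 6371 km, no handling of the poles or the 180° meridian, and hypothetical class and method names.

```java
final class BoundingBoxSketch {
    // Mean Earth radius in kilometers (assumed value).
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Returns {minLat, minLon, maxLat, maxLon} in degrees around a center point. */
    static double[] around(double lat, double lon, double radiusKm) {
        double dLat = Math.toDegrees(radiusKm / EARTH_RADIUS_KM);
        // Meridians converge away from the equator, so widen the longitude span.
        double dLon = Math.toDegrees(
                radiusKm / (EARTH_RADIUS_KM * Math.cos(Math.toRadians(lat))));
        return new double[] { lat - dLat, lon - dLon, lat + dLat, lon + dLon };
    }

    /** Cheap pre-filter: is the point inside the box? */
    static boolean contains(double[] box, double lat, double lon) {
        return lat >= box[0] && lat <= box[2] && lon >= box[1] && lon <= box[3];
    }

    public static void main(String[] args) {
        double[] box = around(44.9744, -93.2484, 10.0);
        System.out.println(java.util.Arrays.toString(box));
        System.out.println(contains(box, 44.98, -93.25));
    }
}
```

The box check is cheap, so an exact distance calculation only needs to run on the documents that pass it.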
10. Cartesian Tiers
Divide up the space into grids and assign it an id
Each tier breaks the space down into 2^tier grids
Sample code using Lucene spatial contrib:
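import org.apache.lucene.spatial.tier.projections.CartesianTierPlotter;
import org.apache.lucene.spatial.tier.projections.SinusoidalProjector;
// Plotter for tier 10, using a sinusoidal projection and the field prefix "spatial"
CartesianTierPlotter pl = new CartesianTierPlotter(10, new SinusoidalProjector(), "spatial");
// Id of the grid box on this tier that contains the given point
double boxId = pl.getTierBoxId(latitude, longitude);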
See http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html
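The grid idea can also be shown without the Lucene classes. The sketch below is a simplified stand-in, not the CartesianTierPlotter algorithm: it assumes 2^tier cells along each axis, skips the map projection, and uses a hypothetical row-major id. It only illustrates how a point maps to a single box id per tier, so that a filter can match on ids instead of evaluating numeric ranges.

```java
final class GridTierSketch {
    /** Box id for a point (in degrees) on the given tier, with 2^tier cells per axis. */
    static long boxId(double lat, double lon, int tier) {
        long cellsPerAxis = 1L << tier;
        // Scale each coordinate to [0, cellsPerAxis) and clamp the upper edge.
        long col = Math.min(cellsPerAxis - 1, (long) ((lon + 180.0) / 360.0 * cellsPerAxis));
        long row = Math.min(cellsPerAxis - 1, (long) ((lat + 90.0) / 180.0 * cellsPerAxis));
        return row * cellsPerAxis + col; // hypothetical row-major id
    }

    public static void main(String[] args) {
        // Two nearby points compared on a coarse and on a finer tier.
        System.out.println(boxId(44.9744, -93.2484, 4) + " " + boxId(44.98, -93.25, 4));
        System.out.println(boxId(44.9744, -93.2484, 10) + " " + boxId(44.98, -93.25, 10));
    }
}
```

Indexing the ids for several tiers lets a query pick the tier whose box size best matches the search radius.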
11. What’s next?
Tighter integration in Solr
Work already under way
Native field types, query parsing support, faceting support
Resources
java-user@lucene.apache.org, solr-user@lucene.apache.org
https://issues.apache.org/jira/browse/SOLR-773
http://lucene.apache.org/java/2_9_1/api/contrib-spatial/index.html
Many, many more general resources on the web
13. Where is my Data?
• Files stored across the network – desktop, external drives, databases, etc.
• Many distinct data formats
• Massive datasets keep getting bigger.
• Poor cataloging tools
• Limited metadata
14. Voyager Solution
Voyager is a search engine for your geographic data.
• Find data with simple text search and geographic constraints
• Keep data in its existing location (no need to import to a new system)
• Tools to work with search results
19. Data Extraction
• For each result, we extract basic information:
- ESRI ArcObjects
- GDAL
- PDFBox
- Geotools
- Tika
- etc
20. Geographic Search in Solr
• Need to search by ‘extent’ not point
• Works well with a standard RTree
• Built a custom Lucene Filter to intersect/search within a given extent.
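The two extent relations mentioned above, intersect and within, reduce to a few comparisons on rectangle corners. The sketch below is plain Java and is not Voyager's filter; the Extent class and its field names are hypothetical, and extents that cross the 180° meridian are not handled.

```java
final class Extent {
    final double minX, minY, maxX, maxY;

    Extent(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True if the two rectangles overlap or touch. */
    boolean intersects(Extent other) {
        return minX <= other.maxX && other.minX <= maxX
            && minY <= other.maxY && other.minY <= maxY;
    }

    /** True if this rectangle lies entirely inside the other one. */
    boolean within(Extent other) {
        return minX >= other.minX && maxX <= other.maxX
            && minY >= other.minY && maxY <= other.maxY;
    }

    public static void main(String[] args) {
        Extent query = new Extent(-97.0, 30.0, -96.0, 31.0);
        Extent doc = new Extent(-96.8, 30.2, -96.2, 30.8);
        System.out.println(doc.intersects(query) + " " + doc.within(query));
    }
}
```

A filter built on these checks would test each document's indexed extent against the query extent; a spatial index such as an RTree avoids testing every document.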
21. Work in Progress
• Custom Gazetteer
– “Building 12” > ‘-96.X 30.X -96.X 30.X’
• Named Entity Extraction
– Geographic words that appear in titles / text get indexed with geographic properties
22. Geographic Search in Solr 1.5+
• Standard API, pluggable implementation.
– Standard Qparser, pluggable indexing
• Single input ‘field’ could index multiple Lucene fields.
• Share objects between different parts of the request cycle (only calculate distance once)
• Augment results with calculated value
– Manual or from function query
25. YP.com (beta)
Local Search Site
Focused on providing relevant results
Uses Solr for search
AT&T Proprietary (Restricted) Only for use by authorized individuals or any above-designated team(s) within the AT&T companies and not for general distribution
26. Technical Challenges
Relevancy
• Topically relevant results
• Constrained by contextual geographical search
• Local relevancy is not just keyword and location – ratings, brands, etc.
Scalability
• 10s of millions of records
• Response time less than 200ms
• Fault resistant
• More than 150 million searches per month
28. Custom Relevance Model
Topical + Geographical + Social
Topical
• Complex handling of multiword queries
• Field boosts for certain fields
• Dismax to handle complex queries
Geographical
• Distance modulation based on business density
• LocalSolr as a geographic filter
• Ability to modulate score based on business density
Social
• Business with 4.5 stars and 200 reviews is more relevant than 5.0 star 1 review
• CustomScoreQuery to tie all different scores together
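The social signal described above, where 4.5 stars from 200 reviews should outrank 5.0 stars from 1 review, can be approximated with a Bayesian average. This is a generic technique chosen for illustration and not the YP.com relevance model, which the slides do not spell out. The prior mean of 3.5 and the prior weight of 10 are hypothetical tuning values.

```java
final class RatingSketch {
    /**
     * Pulls a business's average rating toward a prior mean; the fewer
     * reviews it has, the stronger the pull.
     */
    static double bayesianAverage(double avgStars, int reviewCount,
                                  double priorMean, double priorWeight) {
        return (priorMean * priorWeight + avgStars * reviewCount)
             / (priorWeight + reviewCount);
    }

    public static void main(String[] args) {
        // Hypothetical prior: 3.5 stars, weighted like 10 reviews.
        System.out.println(bayesianAverage(4.5, 200, 3.5, 10.0));
        System.out.println(bayesianAverage(5.0, 1, 3.5, 10.0));
    }
}
```

With these assumed parameters the business with 200 reviews scores higher than the one with a single perfect review, which matches the ordering the slide calls for.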
29. Geographic Sharding
Score Combinations
Performance was better
Provisioning is a bit complex
30. Search Architecture
[Architecture diagram; labels: API Layer, Search Slaves, Masters, shards, replication, Feeder / Document Pipeline, rows]
32. Bottom Line
Solr has enabled us to innovate faster
• Quick iterations of relevancy model and functionality
• Open Platform with much more flexibility
• Scalable Architecture to meet our business needs
Thus, delivering value to our consumers
33. Resources
http://bit.ly/lucid-local