Building Local/Geo Search
with Apache Lucene and Solr
Agenda



   Grant Ingersoll, Lucid Imagination
      Introduction
      Basics of geo-spatial search
      Tools available in Lucene and Solr
   Ryan McKinley, Voyager GIS
      Spatial search in action
   Sameer Maggon, AT&T Interactive
      How Solr powers local search at YP.com


                              Lucid Imagination, Inc.
Introductions
   Grant Ingersoll
         Lucene/Solr committer
         Co-author of upcoming “Taming Text”


   Ryan McKinley
         Lucene/Solr committer
         Co-founder of Voyager GIS


   Sameer Maggon
         Search Eng. Team lead at AT&T Interactive
         Active user of Lucene since 2001

Use Cases



      Asset Management
        “Dude, where’s my map?”
      Social Networking
        Find all friends near me
      Targeted, local search results and ads
        “restaurants in Austin Texas”
        “Starbucks, 55313”
      Business Intelligence
        Restrict doc set for analysis by location

Spatial Search Concepts



      Spatial Data Types
        Points (latitude/longitude)
        Lines
        Shapes


      Maps and overlays
        Streets, POI
                                         http://www.openstreetmap.org/?lat=44.9744&lon=-93.2484&zoom=14&layers=B000FTFT

      Integration with unstructured text
        Metadata, descriptions, user reviews, etc.

Application Needs



      Query Parsing
      Efficient distance calculations
        Euclidean, Great Circle (Haversine), Vincenty’s
      Filtering
        Bounding Box
      Sort by Distance
      Relevance Enhancement
      Faceting
      Advanced: shape intersections, routes
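
As an illustration of the great-circle option above, here is a minimal Haversine sketch in plain Java. The class name, method name, and the mean Earth radius constant are assumptions made for this example; it is not the Lucene spatial contrib implementation.

```java
final class GreatCircle {
    /** Mean Earth radius in kilometers (an approximation chosen for this sketch). */
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in kilometers between two lat/lon points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        // Haversine term: sin^2(dPhi/2) + cos(phi1) * cos(phi2) * sin^2(dLambda/2)
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                 + Math.cos(phi1) * Math.cos(phi2)
                 * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        // Clamp to 1.0 to guard against floating-point overshoot before asin.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

Euclidean distance is cheaper but only reasonable over short ranges; Vincenty's formulae model the Earth as an ellipsoid and cost more per calculation.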

Lucene 2.9/Solr 1.4 Features for Spatial Search



      Lucene/Solr are excellent for dealing with unstructured text


      2.9/1.4 adds:
        Better Numeric handling for range searches


        Spatial contrib module (Lucene 2.9 only; coming in Solr 1.5) with features for:
        • Creating Cartesian Tiers (Grids)
        • Geohashes
        • Calculating distances
        • Filter implementations
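
To make the geohash idea concrete, here is a small sketch of the standard encoding: longitude and latitude bits are interleaved (longitude first) and every five bits become one base-32 character. The class and method names are made up for this example and do not mirror the contrib API.

```java
final class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a lat/lon pair (degrees) into a geohash of the requested length. */
    static String encode(double lat, double lon, int length) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder hash = new StringBuilder(length);
        boolean useLon = true; // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < length) {
            if (useLon) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonLo = mid; }
                else            { value = value << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            useLon = !useLon;
            if (++bits == 5) { // five bits make one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

Points that are close together usually share a hash prefix, so a prefix match on an indexed term can act as a coarse spatial filter. Points on either side of a cell boundary can still have different prefixes.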
Query Parsing



      Query parsing is often the most difficult to get right
        User error, ambiguity in names
        Mixture of topic and location: bars in Minneapolis MN
      Geocoding translates addresses, POIs into lat/lon or other
        Several publicly available services: geonames.org, Google Maps
        Often have built-in throttles, so may not be effective for prod.


      Query logs are invaluable for developing an effective parser
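
A deliberately naive sketch of separating topic from location in queries such as "bars in Minneapolis MN" or "Starbucks, 55313". The class, its fields, and the two splitting rules are hypothetical; a production parser would lean on a gazetteer, a geocoder, and query logs rather than string matching.

```java
import java.util.Locale;

final class LocalQuery {
    final String topic;
    final String location; // empty when no location part was detected

    LocalQuery(String topic, String location) {
        this.topic = topic;
        this.location = location;
    }

    /** Splits on the last " in ", else on the last comma, else treats everything as topic. */
    static LocalQuery parse(String raw) {
        String q = raw.trim();
        int in = q.toLowerCase(Locale.ROOT).lastIndexOf(" in ");
        if (in >= 0) {
            // " in " is four characters long, so the location starts at in + 4.
            return new LocalQuery(q.substring(0, in).trim(), q.substring(in + 4).trim());
        }
        int comma = q.lastIndexOf(',');
        if (comma >= 0) {
            return new LocalQuery(q.substring(0, comma).trim(), q.substring(comma + 1).trim());
        }
        return new LocalQuery(q, "");
    }
}
```

The location string would then go to a geocoder to obtain a lat/lon. Names that themselves contain "in" or commas are exactly the ambiguity the slide warns about.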



Filtering



       Range queries can significantly slow down search if done improperly
       Goal: reduce the number of terms to evaluate
       Solution 1:
            New Trie-based numeric capabilities
       Solution 2:
            Cartesian Tiers
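
Whichever solution is used, the range filter needs bounds. A minimal sketch of deriving a latitude/longitude bounding box from a center point and radius follows; the resulting min/max values are what two numeric range clauses would be built from. The class name and Earth radius constant are assumptions, and the sketch ignores boxes that touch a pole or cross the antimeridian.

```java
final class BoundingBox {
    static final double EARTH_RADIUS_KM = 6371.0088; // mean radius, an approximation

    final double minLat, maxLat, minLon, maxLon;

    BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
        this.minLat = minLat;
        this.maxLat = maxLat;
        this.minLon = minLon;
        this.maxLon = maxLon;
    }

    /** Box enclosing the circle of radiusKm around (lat, lon); inputs and outputs in degrees. */
    static BoundingBox around(double lat, double lon, double radiusKm) {
        double angular = radiusKm / EARTH_RADIUS_KM; // radius as an angle in radians
        double dLat = Math.toDegrees(angular);
        // The longitude span widens with latitude; asin form keeps the whole circle inside the box.
        double dLon = Math.toDegrees(Math.asin(Math.sin(angular) / Math.cos(Math.toRadians(lat))));
        return new BoundingBox(lat - dLat, lat + dLat, lon - dLon, lon + dLon);
    }

    boolean contains(double lat, double lon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }
}
```

The box is a cheap first pass: documents in its corners lie outside the circle, so an exact distance check is still needed on the survivors.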




Cartesian Tiers



     Divide up the space into grids and assign it an id
       Each tier breaks the space down into 2^tier grids
       Sample code using Lucene spatial contrib:
   // Tier 10, sinusoidal projection, field-name prefix "spatial"
   CartesianTierPlotter pl = new CartesianTierPlotter(
       10, new SinusoidalProjector(), "spatial");
   double boxId = pl.getTierBoxId(latitude, longitude);
      See
   http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html
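
To show the grid idea without the library, here is a simplified stand-in that is not CartesianTierPlotter's actual numbering or projection. It assumes a plain latitude/longitude grid with 2^tier cells along each axis and a hypothetical row-major cell id.

```java
final class GridTier {
    /** Returns a row-major cell id for (lat, lon) in degrees at the given tier. */
    static long cellId(double lat, double lon, int tier) {
        long cellsPerAxis = 1L << tier; // 2^tier cells along each axis (an assumption of this sketch)
        // Scale each coordinate into [0, cellsPerAxis) and clamp the upper edge into the last cell.
        long col = Math.min(cellsPerAxis - 1, (long) ((lon + 180.0) / 360.0 * cellsPerAxis));
        long row = Math.min(cellsPerAxis - 1, (long) ((lat + 90.0) / 180.0 * cellsPerAxis));
        return row * cellsPerAxis + col;
    }
}
```

Indexing the cell id at several tiers lets a query pick the tier whose cell size best matches the search radius and then match a handful of ids instead of scanning a numeric range.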

What’s next?



      Tighter integration in Solr
        Work already under way
        Native field types, query parsing support, faceting support


      Resources
        java-user@lucene.apache.org, solr-user@lucene.apache.org
        https://issues.apache.org/jira/browse/SOLR-773
        http://lucene.apache.org/java/2_9_1/api/contrib-spatial/index.html
        Many, many more general resources on the web
Voyager Spatial Data Search
                       Ryan McKinley
               Co-founder, Voyager GIS
Where is my Data?
• Files stored across the network – desktop,
  external drives, databases etc.
• Many distinct data formats
• Massive datasets keep getting bigger.
• Poor cataloging tools
• Limited metadata
Voyager Solution
Voyager is a search engine for your geographic data.

• Find data with simple text search and
  geographic constraints
• Keep data in its existing location (no need to
  import to a new system)
• Tools to work with search results
Implementation
• Data Discovery / Extraction
• Solr search
• Wicket UI
Data Extraction
• For each result, we extract basic information:




- ESRI ArcObjects
- GDAL
- PDFBox
- Geotools
- Tika
- etc
Geographic Search in Solr
• Need to search by ‘extent’ not point
• Works well with a standard RTree
• Built a custom Lucene Filter to
  intersect/search within a given extent.
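
The two extent relations mentioned above, intersects and within, can be sketched for axis-aligned rectangles as follows. The Extent class is hypothetical and is not Voyager's filter; it assumes extents that do not cross the antimeridian.

```java
final class Extent {
    final double minX, minY, maxX, maxY; // x = longitude, y = latitude, in degrees

    Extent(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True when the two rectangles share at least one point (touching edges count). */
    boolean intersects(Extent other) {
        return minX <= other.maxX && maxX >= other.minX
            && minY <= other.maxY && maxY >= other.minY;
    }

    /** True when this rectangle lies entirely inside the other one. */
    boolean isWithin(Extent other) {
        return minX >= other.minX && maxX <= other.maxX
            && minY >= other.minY && maxY <= other.maxY;
    }
}
```

A filter would apply one of these tests between each document's stored extent and the query extent; a spatial index such as an R-tree avoids running the test against every document.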
Work in Progress
• Custom Gazetteer
  – “Building 12” > ‘-96.X 30.X -96.X 30.X’


• Named Entity Extraction
  – Geographic words that appear in titles / text get
    indexed with geographic properties
Geographic Search in Solr 1.5+
• Standard API, pluggable implementation.
  – Standard Qparser, pluggable indexing
• Single input ‘field’ could index multiple Lucene
  fields.
• Share objects between different parts of the
  request cycle (only calculate distance once)
• Augment results with calculated value
  – Manual or from function query
How Solr powers local search at YP.com



           Sameer Maggon
           November 18, 2009




© 2008 AT&T Intellectual Property. All rights reserved.
AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.
YP.com
        Technical Challenges
        Custom Relevance Model
        Scalability / Architecture
        Conclusion




YP.com (beta)


Local Search Site


Focused on providing
relevant results


Uses Solr for search




                       AT&T Proprietary (Restricted) Only for use by authorized individuals or any above-   3
                        designated team(s) within the AT&T companies and not for general distribution
Technical Challenges



        Relevancy
          • Topically relevant results
          • Constrained by contextual geographical search
          • Local relevancy is not just keyword and location: ratings, brands, etc.

        Scalability
          • 10s of millions of records
          • Response time less than 200ms
          • Fault resistant
          • More than 150 million searches per month




Custom Relevance Model


        Topical + Geographical + Social

        Topical
          • Complex handling of multiword queries
          • Field boosts for certain fields
          • Dismax to handle complex queries

        Geographical
          • Distance modulation based on business density
          • LocalSolr as a geographic filter
          • Ability to modulate score based on business density

        Social
          • A business with 4.5 stars and 200 reviews is more relevant than one with 5.0 stars and 1 review
          • CustomScoreQuery to tie all the different scores together
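
The slides do not say how the social score is computed. One common way to get the stated behavior (4.5 stars over 200 reviews outranking 5.0 stars over 1 review) is a Bayesian average, sketched below. The class, the prior mean of 3.5, and the prior weight of 10 are all assumptions for illustration, not YP.com's model.

```java
final class RatingScore {
    /**
     * Pulls a business's average rating toward a prior mean; the fewer reviews it has,
     * the stronger the pull. priorWeight acts like a count of "virtual" reviews.
     */
    static double bayesianAverage(double avgRating, int reviewCount,
                                  double priorMean, double priorWeight) {
        return (priorWeight * priorMean + reviewCount * avgRating)
             / (priorWeight + reviewCount);
    }
}
```

With a prior mean of 3.5 and weight 10, the 200-review business scores about 4.45 and the single-review business about 3.64. A value like this could be one of the inputs that a custom score combines with the topical and geographic scores.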




Geographic Sharding


                                                           Score Combinations

                                                           Performance was better


                                                           Provisioning is a bit complex




Search Architecture

        [Architecture diagram; labels: API Layer, Search Slaves (shards, rows), replication, Masters, Feeder / Document Pipeline]

Bottom Line



Solr has enabled us to innovate faster
   • Quick iterations of relevancy model and functionality
   • Open Platform with much more flexibility
   • Scalable Architecture to meet our business needs




Thus, delivering value to our consumers
Resources




       http://bit.ly/lucid-local




Q&A


Lucid Imagination, Inc.
http://bit.ly/lucid-local

Mais conteúdo relacionado

Destaque

Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Lucidworks (Archived)
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Lucidworks (Archived)
 
Tennis
TennisTennis
Tennis
aritz
 
Spanish bombss
Spanish bombssSpanish bombss
Spanish bombss
tanica
 
Cancer
CancerCancer
Cancer
tanica
 

Destaque (20)

Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"Solr Cluster installation tool "Anuenue"
Solr Cluster installation tool "Anuenue"
 
Juan gris
Juan grisJuan gris
Juan gris
 
情報科学演習 09
情報科学演習 09情報科学演習 09
情報科学演習 09
 
Ecma 262 5th Edition を読む #5 第9条
Ecma 262 5th Edition を読む #5 第9条Ecma 262 5th Edition を読む #5 第9条
Ecma 262 5th Edition を読む #5 第9条
 
What’s new in apache solr 1.4
What’s new in apache solr 1.4What’s new in apache solr 1.4
What’s new in apache solr 1.4
 
Updated: Sources of Funding
Updated:  Sources of FundingUpdated:  Sources of Funding
Updated: Sources of Funding
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
Van gogh
Van goghVan gogh
Van gogh
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
корея
кореякорея
корея
 
Impact of open source search on the intelligence community
Impact of open source search on the intelligence communityImpact of open source search on the intelligence community
Impact of open source search on the intelligence community
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Tennis
TennisTennis
Tennis
 
Jonh Lennon
Jonh LennonJonh Lennon
Jonh Lennon
 
Oslb office365
Oslb office365Oslb office365
Oslb office365
 
20101023 ie9 cache
20101023 ie9 cache20101023 ie9 cache
20101023 ie9 cache
 
Spanish bombss
Spanish bombssSpanish bombss
Spanish bombss
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
 
Cancer
CancerCancer
Cancer
 
Learn How to Master Solr1 4
Learn How to Master Solr1 4Learn How to Master Solr1 4
Learn How to Master Solr1 4
 

Semelhante a Building Local/Geo Search with Apache Lucene and Solr

Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
SEO CAMP
 
Bringing Geospatial Business Intelligence to the Enterprise
Bringing Geospatial Business Intelligenceto the EnterpriseBringing Geospatial Business Intelligenceto the Enterprise
Bringing Geospatial Business Intelligence to the Enterprise
mkarren
 
AWS Total Cost of Ownership Hong Kong and Taiwan
AWS Total Cost of Ownership Hong Kong and TaiwanAWS Total Cost of Ownership Hong Kong and Taiwan
AWS Total Cost of Ownership Hong Kong and Taiwan
Amazon Web Services
 
Etendez votre datacenter avec aws v4
Etendez votre datacenter avec aws v4Etendez votre datacenter avec aws v4
Etendez votre datacenter avec aws v4
Amazon Web Services
 

Semelhante a Building Local/Geo Search with Apache Lucene and Solr (20)

Local Search using Solr at YP.com
Local Search using Solr at YP.comLocal Search using Solr at YP.com
Local Search using Solr at YP.com
 
Solr the intelligent search engine
Solr the intelligent search engineSolr the intelligent search engine
Solr the intelligent search engine
 
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
 
The Next Generation of Big Data Analytics
The Next Generation of Big Data AnalyticsThe Next Generation of Big Data Analytics
The Next Generation of Big Data Analytics
 
Bringing Geospatial Business Intelligence to the Enterprise
Bringing Geospatial Business Intelligenceto the EnterpriseBringing Geospatial Business Intelligenceto the Enterprise
Bringing Geospatial Business Intelligence to the Enterprise
 
7 dee finding the right methodologies marshall sponder - 9-12-12 - submitted
7 dee finding the right methodologies   marshall sponder - 9-12-12 - submitted7 dee finding the right methodologies   marshall sponder - 9-12-12 - submitted
7 dee finding the right methodologies marshall sponder - 9-12-12 - submitted
 
Introduction to FluentData - The Micro ORM
Introduction to FluentData - The Micro ORMIntroduction to FluentData - The Micro ORM
Introduction to FluentData - The Micro ORM
 
Esri Application on AWS Cloud Webinar
Esri Application on AWS Cloud WebinarEsri Application on AWS Cloud Webinar
Esri Application on AWS Cloud Webinar
 
Database@Home - Maps and Spatial Analyses: How to use them
Database@Home - Maps and Spatial Analyses: How to use themDatabase@Home - Maps and Spatial Analyses: How to use them
Database@Home - Maps and Spatial Analyses: How to use them
 
Being a mobile entrepreneur
Being a mobile entrepreneurBeing a mobile entrepreneur
Being a mobile entrepreneur
 
AWS Total Cost of Ownership Hong Kong and Taiwan
AWS Total Cost of Ownership Hong Kong and TaiwanAWS Total Cost of Ownership Hong Kong and Taiwan
AWS Total Cost of Ownership Hong Kong and Taiwan
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Etendez votre datacenter avec aws v4
Etendez votre datacenter avec aws v4Etendez votre datacenter avec aws v4
Etendez votre datacenter avec aws v4
 
FME Geo Enabling Field Sales Team
FME Geo Enabling Field Sales TeamFME Geo Enabling Field Sales Team
FME Geo Enabling Field Sales Team
 
Enterprise Location Intelligence
Enterprise Location IntelligenceEnterprise Location Intelligence
Enterprise Location Intelligence
 
Présentation IBM InfoSphere MDM 11.3
Présentation IBM InfoSphere MDM 11.3Présentation IBM InfoSphere MDM 11.3
Présentation IBM InfoSphere MDM 11.3
 
2012 06 hortonworks paris hug
2012 06 hortonworks paris hug2012 06 hortonworks paris hug
2012 06 hortonworks paris hug
 
Domain driven design
Domain driven designDomain driven design
Domain driven design
 

Mais de Lucidworks (Archived)

Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Lucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Lucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Lucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Lucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Lucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Lucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
Lucidworks (Archived)
 

Mais de Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Apidays New York 2024 - The value of a flexible API Management solution for O...

Building Local/Geo Search with Apache Lucene and Solr

  • 1. Building Local/Geo Search with Apache Lucene and Solr
  • 2. Agenda: Grant Ingersoll, Lucid Imagination (Introduction; Basics of geo-spatial search; Tools available in Lucene and Solr). Ryan McKinley, Voyager GIS (Spatial search in action). Sameer Maggon, AT&T Interactive (How Solr powers local search at YP.com). Lucid Imagination, Inc.
  • 3. Introductions: Grant Ingersoll (Lucene/Solr committer; co-author of the upcoming “Taming Text”). Ryan McKinley (Lucene/Solr committer; co-founder of Voyager GIS). Sameer Maggon (Search Engineering team lead at AT&T Interactive; active user of Lucene since 2001).
  • 4. Use Cases: Asset management (“Dude, where’s my map?”). Social networking (find all friends near me). Targeted, local search results and ads (“restaurants in Austin Texas”, “Starbucks, 55313”). Business intelligence (restrict the doc set for analysis by location).
  • 5. Spatial Search Concepts: Spatial data types: points (latitude/longitude), lines, shapes. Maps and overlays: streets, POI (example map: http://www.openstreetmap.org/?lat=44.9744&lon=-93.2484&zoom=14&layers=B000FTFT). Integration with unstructured text: metadata, descriptions, user reviews, etc.
  • 6. Application Needs: Query parsing. Efficient distance calculations (Euclidean, Great Circle (Haversine), Vincenty’s). Filtering (bounding box). Sort by distance. Relevance enhancement. Faceting. Advanced: shape intersections, routes.
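
Slide 6 names Great Circle (Haversine) as one of the distance calculations an application needs. A minimal sketch of that formula in plain Java follows. The class name, method name, and the mean Earth radius constant are assumptions for illustration and are not taken from the Lucene spatial contrib.

```java
final class Haversine {
    // Mean Earth radius in kilometres (a common approximation; an assumption here).
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in kilometres between two lat/lon points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);

        // Haversine term: sin^2(dPhi/2) + cos(phi1) * cos(phi2) * sin^2(dLambda/2)
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                 + Math.cos(phi1) * Math.cos(phi2)
                 * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);

        // Clamp to 1.0 to guard against rounding pushing sqrt(a) slightly above 1.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

Calling `Haversine.distanceKm(0, 0, 0, 1)` gives the length of one degree of longitude at the equator, roughly 111 km. Euclidean distance is cheaper but only reasonable over short ranges, and Vincenty's formulae are more accurate on an ellipsoid at a higher cost.
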
  • 7. Lucene 2.9/Solr 1.4 Features for Spatial Search: Lucene/Solr are excellent for dealing with unstructured text. 2.9/1.4 adds: better numeric handling for range searches; a spatial contribution (Lucene 2.9 only, coming in Solr 1.5) with features for: • Creating Cartesian Tiers (Grids) • Geohashes • Calculating distances • Filter implementations
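
Slide 7 lists geohashes among the spatial contrib features. As a rough illustration of the idea, the sketch below encodes a lat/lon pair with the standard public geohash scheme: longitude and latitude bits are interleaved, starting with longitude, and every 5 bits become one base-32 character. The class and method names are hypothetical, and this is not the contrib module's own implementation.

```java
final class Geohash {
    // The geohash base-32 alphabet (digits plus lowercase letters without a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a point (degrees) into a geohash string of the requested length. */
    static String encode(double lat, double lon, int length) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder hash = new StringBuilder(length);
        boolean useLon = true;   // bits alternate: lon, lat, lon, lat, ...
        int bitCount = 0;
        int value = 0;

        while (hash.length() < length) {
            if (useLon) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonLo = mid; }
                else            { value = value << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            useLon = !useLon;

            // Every 5 bits map to one character of the alphabet.
            if (++bitCount == 5) {
                hash.append(BASE32.charAt(value));
                bitCount = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

Because each extra character only refines the cell, a shorter hash is always a prefix of a longer hash for the same point, which is what makes prefix matching on an indexed geohash term usable as a coarse proximity filter.
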
  • 8. Query Parsing: Query parsing is often the most difficult part to get right: user error, ambiguity in names, and a mixture of topic and location (“bars in Minneapolis MN”). Geocoding translates addresses and POIs into lat/lon or another location format. Several services are publicly available (geonames.org, Google Maps), but they often have built-in throttles, so they may not be effective for production. Query logs are invaluable for developing an effective parser.
  • 9. Filtering: Range queries can significantly slow down search if done improperly. Goal: reduce the number of terms to evaluate. Solution 1: new Trie-based numeric capabilities. Solution 2: Cartesian Tiers.
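
The bounding-box filter from slide 6 pairs naturally with the numeric range support mentioned on slide 9: the box's minimum and maximum latitude and longitude become the endpoints of two range queries (for example with Lucene 2.9's `NumericRangeQuery`). The sketch below only computes those endpoints. The class name and array layout are hypothetical, and it assumes the search circle does not touch a pole or cross the ±180° meridian.

```java
final class BoundingBox {
    // Mean Earth radius in kilometres (an assumption; same approximation as a spherical model).
    static final double EARTH_RADIUS_KM = 6371.0088;

    /**
     * Returns {minLat, minLon, maxLat, maxLon} in degrees for a circle of
     * radiusKm around (lat, lon). Assumes no pole or date-line crossing.
     */
    static double[] around(double lat, double lon, double radiusKm) {
        double angular = radiusKm / EARTH_RADIUS_KM;          // radius as an angle in radians
        double dLat = Math.toDegrees(angular);
        // Widest longitude offset reached by the circle at this latitude.
        double dLon = Math.toDegrees(
                Math.asin(Math.sin(angular) / Math.cos(Math.toRadians(lat))));
        return new double[] { lat - dLat, lon - dLon, lat + dLat, lon + dLon };
    }

    /** Cheap pre-filter: true if the point lies inside the box. */
    static boolean contains(double[] box, double lat, double lon) {
        return lat >= box[0] && lat <= box[2] && lon >= box[1] && lon <= box[3];
    }
}
```

The box is a superset of the circle, so documents that pass it still need an exact distance check if a strict radius is required.
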
  • 10. Cartesian Tiers: Divide the space into grids and assign each grid an id. Each tier breaks the space down into 2^tier grids. Sample code using Lucene spatial contrib: CartesianTierPlotter pl = new CartesianTierPlotter(10, new SinusoidalProjector(), "spatial"); pl.getTierBoxId(latitude, longitude); See http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html
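
To make the grid idea on slide 10 concrete, here is a deliberately simplified sketch that maps a point to a cell id at a given tier. It is not the algorithm used by `CartesianTierPlotter` or `SinusoidalProjector`: it assumes a plain equirectangular grid with 2^tier cells along each axis and a row-major id, and the class and method names are hypothetical.

```java
final class GridTiers {
    /**
     * Returns a cell id for (lat, lon) in degrees at the given tier.
     * Assumption: 2^tier cells per axis on an unprojected lat/lon grid.
     */
    static long boxId(double lat, double lon, int tier) {
        long cellsPerAxis = 1L << tier;

        // Scale each coordinate into [0, cellsPerAxis) and truncate to a cell index.
        long col = (long) ((lon + 180.0) / 360.0 * cellsPerAxis);
        long row = (long) ((lat + 90.0) / 180.0 * cellsPerAxis);

        // Points exactly on the upper edge (lat = 90 or lon = 180) fall into the last cell.
        col = Math.min(cellsPerAxis - 1, col);
        row = Math.min(cellsPerAxis - 1, row);

        return row * cellsPerAxis + col;
    }
}
```

Indexing one id per tier lets a query pick the tier whose cell size best matches the search radius and then match a handful of cell ids as exact terms, rather than evaluating a wide numeric range.
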
  • 11. What’s next? Tighter integration in Solr: work is already under way on native field types, query parsing support, and faceting support. Resources: java-user@lucene.apache.org, solr-user@lucene.apache.org; https://issues.apache.org/jira/browse/SOLR-773; http://lucene.apache.org/java/2_9_1/api/contrib-spatial/index.html; many, many more general resources on the web.
  • 12. Voyager Spatial Data Search Ryan McKinley Co-founder, Voyager GIS
  • 13. Where is my Data? • Files stored across the network – desktop, external drives, databases etc. • Many distinct data formats • Massive datasets keep getting bigger. • Poor cataloging tools • Limited metadata
  • 14. Voyager Solution Voyager is a search engine for your geographic data. • Find data with simple text search and geographic constraints • Keep data in its existing location (no need to import to a new system) • Tools to work with search results
  • 15.
  • 16.
  • 17.
  • 18. Implementation • Data Discovery / Extraction • Solr search • Wicket UI
  • 19. Data Extraction • For each result, we extract basic information: - ESRI ArcObjects - GDAL - PDFBox - Geotools - Tika - etc
  • 20. Geographic Search in Solr • Need to search by ‘extent’ not point • Works well with a standard RTree • Built a custom Lucene Filter to intersect/search within a given extent.
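
Slide 20 describes searching by extent rather than by point. The two predicates such a filter has to answer, "intersects" and "within", can be sketched for axis-aligned rectangles as below. The `Extent` class is hypothetical, it ignores extents that wrap the ±180° meridian, and it shows neither the RTree nor the custom Lucene Filter the slide refers to.

```java
final class Extent {
    final double minX, minY, maxX, maxY;

    Extent(double minX, double minY, double maxX, double maxY) {
        if (minX > maxX || minY > maxY) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True if the two rectangles share at least one point (edges count). */
    boolean intersects(Extent other) {
        return minX <= other.maxX && other.minX <= maxX
            && minY <= other.maxY && other.minY <= maxY;
    }

    /** True if the other rectangle lies entirely inside this one. */
    boolean contains(Extent other) {
        return minX <= other.minX && other.maxX <= maxX
            && minY <= other.minY && other.maxY <= maxY;
    }
}
```

With the query window as `q` and a dataset's footprint as `d`, `q.intersects(d)` corresponds to an "intersect" search and `q.contains(d)` to a "within" search.
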
  • 21. Work in Progress • Custom Gazetteer – “Building 12” > ‘-96.X 30.X -96.X 30.X’ • Named Entity Extraction – Geographic words that appear in titles / text get indexed with geographic properties
  • 22. Geographic Search in Solr 1.5+ • Standard API, pluggable implementation. – Standard QParser, pluggable indexing • A single input ‘field’ could index multiple Lucene fields. • Share objects between different parts of the request cycle (only calculate distance once) • Augment results with a calculated value – Manual or from a function query
  • 23. How Solr powers local search at YP.com Sameer Maggon November 18, 2009 © 2008 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.
  • 24. YP.com; Technical Challenges; Custom Relevance Model; Scalability / Architecture; Conclusion
  • 25. YP.com (beta): Local search site. Focused on providing relevant results. Uses Solr for search. AT&T Proprietary (Restricted): Only for use by authorized individuals or any above-designated team(s) within the AT&T companies and not for general distribution.
  • 26. Technical Challenges: Relevancy: topically relevant results; constrained by contextual geographical search; local relevancy is not just keyword and location (ratings, brands, etc.). Scalability: 10s of millions of records; response time less than 200ms; fault resistant; more than 150 million searches per month.
  • 27. Custom Relevance Model: Topical + Geographical + Social. Topical: complex handling of multiword queries. Geographical: distance modulation based on business density. Social: a business with 4.5 stars and 200 reviews is more relevant than one with 5.0 stars and 1 review.
  • 28. Custom Relevance Model: Topical + Geographical + Social. Topical: complex handling of multiword queries. Geographical: distance modulation based on business density. Social: a business with 4.5 stars and 200 reviews is more relevant than one with 5.0 stars and 1 review. Field boosts for certain fields. Dismax to handle complex queries. LocalSolr as a geographic filter. Ability to modulate score based on business density. CustomScoreQuery to tie all the different scores together.
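
The slides do not say how YP.com weighs star ratings against review counts. One common way to get the stated behaviour (4.5 stars from 200 reviews outranking 5.0 stars from 1 review) is a Bayesian average that pulls sparsely reviewed businesses toward a prior. The sketch below is an assumption for illustration only: the class name, the prior mean, and the prior weight are all hypothetical.

```java
final class RatingScore {
    /**
     * Bayesian average of a star rating.
     * priorMean:   rating assumed for a business with no reviews (hypothetical value).
     * priorWeight: number of "virtual" reviews backing that prior (hypothetical value).
     */
    static double bayesianAverage(double avgStars, int reviewCount,
                                  double priorMean, double priorWeight) {
        double totalStars = avgStars * reviewCount;
        return (priorWeight * priorMean + totalStars) / (priorWeight + reviewCount);
    }
}
```

With a prior mean of 3.5 and a prior weight of 10, the 4.5-star business with 200 reviews scores about 4.45, while the 5.0-star business with a single review scores about 3.64. A value like this could be one of the inputs combined by a custom score, alongside the topical and geographical components.
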
  • 29. Geographic Sharding: Score combinations. Performance was better. Provisioning is a bit complex.
  • 30. Search Architecture (diagram labels): Search; Slaves; Masters; shards; API Layer; replication; Feeder / Document Pipeline; rows.
  • 31. Bottom Line: Solr has enabled us to innovate faster. • Quick iterations of relevancy model and functionality • Open platform with much more flexibility • Scalable architecture to meet our business needs
  • 32. Bottom Line: Solr has enabled us to innovate faster. • Quick iterations of relevancy model and functionality • Open platform with much more flexibility • Scalable architecture to meet our business needs. Thus, delivering value to our consumers.
  • 33. Resources: http://bit.ly/lucid-local