Natural language search in Solr
             Tommaso Teofili, Sourcesense
   t.teofili@sourcesense.com, October 19th 2011
Agenda
 ▪ An approach to natural language search in Solr
 ▪ Main points
  • Solr-UIMA integration module
  • Custom Lucene analyzers for UIMA
  • OSS NLP algorithms in Lucene/Solr
  • Orchestrating blocks to build a sample system able to understand natural language queries
 ▪ Results
My Background
 ▪ Software engineer at Sourcesense
  • Enterprise search consultant
 ▪ Member of the Apache Software Foundation
  • UIMA
  • Clerezza
  • Stanbol
  • DirectMemory
  • ...
Google in ‘99
Google today
The Challenge
 ▪ Improved recall/precision
  • ‘articles about science’ (concepts)
  • ‘movies by K. Spacey’ vs ‘movies with K. Spacey’
 ▪ Easier experience for non-expert users
  • ‘people working at Google’ - ‘cities near London’
 ▪ Horizontal domains (e.g. Google)
 ▪ Vertical domains
Hurdles
 ▪ understanding documents’ text and user queries
 ▪ extracting domain-specific and domain-wide entities and concepts
 ▪ index/search performance
Use Case
 ▪ search engine for an online movies magazine
 ▪ Solr based
 ▪ non-technical users
 ▪ time / cost
  • Solr 3.x setup: 2 mins
  • NLS setup / tweak: 5 days
 ▪ expecting
  • improved recall / precision
  • more time (clicks) on site ($)
Online movies magazine
General approach
 ▪ Natural language processing
 ▪ Processing documents at indexing time
  • document text analysis
  • write enriched text in (dedicated) fields
  • add custom types / payloads to terms
 ▪ Processing queries at searching time
  • query analysis
  • higher boosts to entities/concepts
  • in-sentence search
  • ...
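
As a rough illustration of the "in-sentence search" idea above, the following sketch checks whether all query terms occur inside a single sentence of a document. It is only an assumption-laden toy: it uses the JDK's `BreakIterator` as a stand-in for an NLP sentence detector, matches by plain substring rather than by indexed tokens, and the class and method names (`InSentenceSearch`, `matchesWithinSentence`) are hypothetical, not part of Solr or of the talk's code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class InSentenceSearch {

    // Split text into sentences with the JDK's rule-based BreakIterator
    // (a stand-in for a real NLP sentence splitter).
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                result.add(s);
            }
        }
        return result;
    }

    // True if at least one sentence contains every query term
    // (case-insensitive substring match, not token match).
    static boolean matchesWithinSentence(String text, String... terms) {
        for (String sentence : sentences(text)) {
            String lower = sentence.toLowerCase(Locale.ENGLISH);
            boolean all = true;
            for (String term : terms) {
                if (!lower.contains(term.toLowerCase(Locale.ENGLISH))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "Kevin Spacey stars in the film. The director was born in London.";
        System.out.println(matchesWithinSentence(doc, "spacey", "film"));
        System.out.println(matchesWithinSentence(doc, "spacey", "london"));
    }
}
```

In a real index the sentence boundaries would more likely be recorded at indexing time (for example as annotations or position gaps), so that the check does not require re-reading the stored text at query time.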
NLP
 ▪ AI discipline
  • Computers understanding and managing information written in human language
 ▪ analyze text at various levels
 ▪ incrementally enrich / give structure
 ▪ extract concepts and named entities
Technical detail
 ▪ NLP algorithms plugged via Apache UIMA
 ▪ Indexing time
  • UpdateProcessor plugin (solr/contrib/uima)
  • Custom tokenizers/filters
 ▪ Search time
  • Custom QParserPlugin
Why Apache UIMA?
 ▪ OASIS standard for UIM (Unstructured Information Management)
 ▪ Apache top-level project (TLP) since March 2010
 ▪ Deploy pipelines of Analysis Engines (AEs)
 ▪ AEs wrap NLP algorithms
 ▪ Scaling capabilities
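
To make the "pipeline of Analysis Engines" idea more concrete, here is a minimal sketch in plain Java. It deliberately does not use the UIMA API: `ToyCas`, `ToyEngine` and `Annotation` are hypothetical stand-ins that only mimic the pattern of several engines adding annotations to one shared analysis structure, each building on the previous one. The "entity" rule (runs of capitalized tokens) is likewise a toy assumption, not a real NER algorithm.

```java
import java.util.ArrayList;
import java.util.List;

class ToyPipeline {

    // A span of text labelled with a type (simplified stand-in for an annotation).
    static final class Annotation {
        final String type;
        final int begin;
        final int end;
        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Shared analysis state: the text plus all annotations added so far.
    static final class ToyCas {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();
        ToyCas(String text) { this.text = text; }
        String covered(Annotation a) { return text.substring(a.begin, a.end); }
    }

    // One processing step of the pipeline.
    interface ToyEngine { void process(ToyCas cas); }

    // Engine 1: annotate whitespace-separated tokens.
    static final ToyEngine TOKENIZER = cas -> {
        int i = 0;
        int n = cas.text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(cas.text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(cas.text.charAt(i))) i++;
            if (i > start) cas.annotations.add(new Annotation("Token", start, i));
        }
    };

    // Engine 2: relies on engine 1; merges runs of capitalized tokens into "Entity" spans.
    static final ToyEngine CAPITALIZED_ENTITIES = cas -> {
        List<Annotation> found = new ArrayList<>();
        int begin = -1;
        int end = -1;
        for (Annotation a : cas.annotations) {
            if (!a.type.equals("Token")) continue;
            if (Character.isUpperCase(cas.text.charAt(a.begin))) {
                if (begin < 0) begin = a.begin;
                end = a.end;
            } else if (begin >= 0) {
                found.add(new Annotation("Entity", begin, end));
                begin = -1;
            }
        }
        if (begin >= 0) found.add(new Annotation("Entity", begin, end));
        cas.annotations.addAll(found);
    };

    // Run the engines in order over the same shared structure.
    static ToyCas run(String text, ToyEngine... engines) {
        ToyCas cas = new ToyCas(text);
        for (ToyEngine e : engines) e.process(cas);
        return cas;
    }

    // Collect the covered text of every annotation of the given type.
    static List<String> coveredTexts(ToyCas cas, String type) {
        List<String> out = new ArrayList<>();
        for (Annotation a : cas.annotations) {
            if (a.type.equals(type)) out.add(cas.covered(a));
        }
        return out;
    }

    public static void main(String[] args) {
        ToyCas cas = run("movies with Kevin Spacey", TOKENIZER, CAPITALIZED_ENTITIES);
        System.out.println(coveredTexts(cas, "Entity"));
    }
}
```

The point of the pattern is that each engine can be swapped for a better algorithm without touching the others, as long as it reads and writes the same annotation types.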
NLP and OSS
 ▪ Sentence Split
  • OpenNLP, UIMA Addons, StanfordNLP
 ▪ PoS tagging
  • OpenNLP, UIMA Addons, StanfordNLP
 ▪ Chunking/Parsing
  • OpenNLP, StanfordNLP
 ▪ NER
  • OpenNLP, UIMA Addons, Stanbol, StanfordNLP
 ▪ Clustering/Classifying
  • Mahout, OpenNLP, StanfordNLP
 ▪ ...
Solr NLS architecture
UIMA Update Processor
Lucene analysis & UIMA
 ▪ Type: denotes the lexical type of a token
 ▪ Payload: a byte array stored at each term position
 ▪ tokenize / filter tokens covered by a certain annotation type
 ▪ store UIMA annotations’ features in types / payloads
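
The sketch below illustrates, under simplifying assumptions, what "tokenize only what a given annotation type covers and store its features in types / payloads" could look like. It does not use the Lucene or UIMA APIs: `Annotation` and `Token` are hypothetical classes, the feature value is simply encoded as UTF-8 bytes, and the character offsets in `main` are hand-written for the example sentence.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TypeAwareTokens {

    // Hypothetical stand-in for an annotation: a typed span with one feature value.
    static final class Annotation {
        final String type;
        final String feature;
        final int begin;
        final int end;
        Annotation(String type, String feature, int begin, int end) {
            this.type = type;
            this.feature = feature;
            this.begin = begin;
            this.end = end;
        }
    }

    // Hypothetical stand-in for an index token carrying a lexical type and a payload.
    static final class Token {
        final String term;
        final String type;
        final byte[] payload;
        Token(String term, String type, byte[] payload) {
            this.term = term;
            this.type = type;
            this.payload = payload;
        }
    }

    // Emit one token per word covered by an annotation of the wanted type.
    // The annotation type becomes the token type; its feature becomes the payload.
    static List<Token> tokenize(String text, List<Annotation> annotations, String wantedType) {
        List<Token> tokens = new ArrayList<>();
        for (Annotation a : annotations) {
            if (!a.type.equals(wantedType)) continue;
            for (String word : text.substring(a.begin, a.end).trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                byte[] payload = a.feature.getBytes(StandardCharsets.UTF_8);
                tokens.add(new Token(word.toLowerCase(Locale.ROOT), a.type, payload));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Kevin Spacey visited London";
        List<Annotation> annotations = Arrays.asList(
                new Annotation("Person", "actor", 0, 12),
                new Annotation("Place", "city", 21, 27));
        for (Token t : tokenize(text, annotations, "Person")) {
            System.out.println(t.term + " type=" + t.type
                    + " payload=" + new String(t.payload, StandardCharsets.UTF_8));
        }
    }
}
```

At query time the type and payload can then be used to treat such tokens differently, for instance to score a match on a `Person` token higher than a match on ordinary text.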
UIMA type-aware tokenizer
Solr NLS QParser
 ▪ analyze user query
 ▪ extract (and query on) concepts / entities
 ▪ use types/PoS in the query for
  • boosting terms
  • synonym expansion
 ▪ search within sentences
 ▪ faceting / clustering using entities
 ▪ identify ‘place queries’ and expand Solr spatial queries (for filtering / boosting)
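
As a hedged sketch of the "boosting terms" step, the code below rewrites a user query so that recognized entities become quoted phrases with a boost, using the standard Lucene query syntax `"phrase"^boost`. It is not the talk's QParserPlugin: the entity list is a hand-supplied assumption standing in for the output of an NER step, the class name `BoostedQueryBuilder` is hypothetical, and overlapping or nested entities are not handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class BoostedQueryBuilder {

    // Wrap each known entity found in the query in quotes and append ^boost.
    // Matching is case-insensitive and restricted to word boundaries.
    static String boostEntities(String query, List<String> entities, float boost) {
        String result = query;
        for (String entity : entities) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(entity) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            // $0 re-inserts the matched text, preserving the user's casing.
            result = p.matcher(result).replaceAll("\"$0\"^" + boost);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: an earlier analysis step recognized this entity in the query.
        List<String> entities = Arrays.asList("kevin spacey");
        System.out.println(boostEntities("movies with Kevin Spacey", entities, 5.0f));
    }
}
```

The resulting string could be handed to a standard query parser; a dedicated parser plugin would more plausibly build the boosted query objects directly instead of going through string rewriting.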
Scaling architecture
Performance
 ▪ basic (in memory)
  • slower with NRT indexing
  • search could be significantly impacted
 ▪ REST (SimpleServer)
  • faster
  • need to explicitly digest results
 ▪ UIMA-AS
  • fast also with NRT indexing
  • fast search
  • scales nicely with lots of data
DisMax vs NLS
Wrap up
 ▪ general-purpose architecture
 ▪ generally improved recall / precision
 ▪ the accuracy of the NLP algorithms makes the difference
 ▪ lots of OSS alternatives
 ▪ performance can be kept good
Sources
 ▪ Resources
  • http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/
  • https://github.com/tteofili/le11-nls
 ▪ Links
  • http://wiki.apache.org/solr/SolrUIMA
  • http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html
Thanks
 ▪ http://www.sourcesense.com
 ▪ t.teofili@sourcesense.com
 ▪ @tteofili
