SlideShare uma empresa Scribd logo
1 de 29
Baixar para ler offline
Improved Search with Lucene 4.0
            Robert Muir, Lucid Imagination
    robert.muir@lucidimagination.com, 10/19/2011
Introductions
 What: Examine improvements in 4.0
 Who am I?
  • Lucene Committer & PMC Member
  • Lucid Imagination Employee
 Many big changes coming!
 Examine three general areas:
  • Indexing Improvements
  • Search Improvements
  • Performance Improvements
 Future improvements

                     2
Indexing Improvements


 Codecs
  • Flexible index format
 Index DocValues
  • Efficient per-document lookup values
 Miscellaneous Improvements
  • Binary terms
  • Additional index statistics



                       3
Codecs: Introduction


 Problem: “one size fits all” index format
 How to integrate future improvements?
   • Example: more efficient index compression
   • But need to support old format, too.
 How to optimize for your data?
   • Without digging into the guts of indexer!
 Idea: format should be pluggable



                          4
Codecs: Example use case
 Accelerate near-realtime reopen
 Speed up delete by term on ID field
   • “Primary key lookup”
 Idea: optimize ID field for this use case




                            5
Codecs: available formats
   Standard: Lucene 4.0 index format
   Pulsing: inline low frequency terms' postings
   Memory: loads field entirely into RAM
   Appending: supports append-only filesystem
   SimpleText: plaintext (not for production!!!!!)
   PreFlex: supports Lucene 3.x index format




                          6
Codecs: Configuration
Lucene: use CodecProvider class
Solr: specify in schema.xml




                          7
Codecs: Configuration




            8
Index DocValues: Introduction


 Problem: lookup a per-document value
  • RAM-resident for things like scoring factors (e.g. pagerank)
  • Disk-resident for things like document versioning
 Existing workarounds in Lucene have limitations
  • Scoring limited to one byte norm
  • FieldCache requires uninversion to build
  • Stored fields are slow, geared towards summary results
 Idea: add performant, flexible per-document lookups.




                                 9
Index DocValues: Limitations


 New feature, recently added to Lucene

✔ External scoring factors
✔ Memory-resident and disk-based lookup
✗ Docvalues sort-by-term
✗ Internal scoring factors (die norms, die)
✗ Integration with Solr



                         10
Index DocValues: Example




             11
Binary Terms


 In Lucene 4.0, index terms are byte[]
  • Do not need to be unicode text
 Example: Localized Sort and Range
  • Previous versions of Lucene: special encoder
  • Lucene 4.0: 50% space savings




                        12
Localized Sort/Range Example
Lucene: use CollationAnalyzer class
Solr: specify in schema.xml




                          13
Additional Index Statistics


 New statistics in Lucene 4.0:
  •   totalTermFreq(term): number of occurrences
  •   sumTotalTermFreq(field): number of tokens
  •   sumDocFreq(field): number of postings
  •   docCount(field): number of docs with value
 Available from Lucene APIs
 Available from function queries
 Supports additional scoring algorithms...



                            14
Search Improvements


 Improved Scoring API
  • Additional Scoring algorithms
 Spellchecking improvements
 Miscellaneous
  • Query parsing improvements
  • Deep paging support




                      15
Scoring: Introduction


 Problem: “baked-in” vector space model
 How to integrate additional algorithms?
  • Example: Language Models
  • Before 4.0: write custom Queries
  • Before 4.0: track certain statistics yourself
 How to customize for your data?
  • Without digging into the guts of postings lists!
 Idea: separate “matching” from “scoring”



                             16
Scoring: additional algorithms


      BM25
      Language Models
      Divergence from Randomness
      Information-based Models




                    17
Scoring: Configuration
Lucene: use Similarity class
Solr: specify in schema.xml




                          18
Spellchecking Improvements


 New DirectSpellChecker
  • No additional index needed
 Better Suggestions
  • Levenshtein vs. n-gram
 Exposes more configuration options
  • Tune to your collection




                       19
Spellchecking: Configuration
Lucene: use DirectSpellChecker class
Solr: specify in solrconfig.xml




                          20
Queryparsing Improvements

 Regular expression queries
  • /mycompany.(com|org|net)/
 Specify number of edits for fuzzy
  • foobar~2
 Wildcard escaping
  • crazy?*
 Range query syntax improvements
  • Mix inclusive/exclusive bounds
  • Support open-ended syntax in Lucene


                          21
Deep-paging support

 Problem: users who page deep into results :)
 Normal paging in lucene:
  • Page 1: search(“foo”, 20) ← look at 1-20
  • Page 2: search(“foo”, 40) ← look at 21-40
 Deep paging in lucene:
  • Page 1: searchAfter(null, “foo”, 20)
  • Page 2: searchAfter(lastResult, “foo”, 20)




                           22
Performance Improvements



 Concurrent Flushing
 Fast Fuzzy Query
 Improved RAM Efficiency




                   23
Concurrent Flushing




http://people.apache.org/~mikemccand/lucenebench/
Fast Fuzzy Query




In Lucene 3 this thing is < 1 QPS!
Improved RAM Efficiency
 Lucene 4.0 uses much less memory
  • Terms Index, Suggester, Synonyms : finite-state
  • Fieldcache: packed integers / utf-8
 Example with Wikipedia collection:
  “Memory footprint reduction from 389M to 90M
  after some off-the wall sorting and faceting”
Future Improvements
 Block Index Compression
  • PFOR-delta, Simple8b, …
 Positional iterators from Scorers
  • Offsets in postings lists (fast highlighting)
  • Proximity Scoring
 Structured/Section Scoring (e.g. BM25F)
 Improved flexibility through Codec
  • stored fields, term vectors, …
 Faster filtered search


                           27
Conclusion
 Lucene 4.0 will have many improvements,
  architectural, and API changes.
 A few of these were introduced here:
  • Ability to customize the index format
  • Additional scoring algorithms
  • Performance Improvements
 More are currently under development.
 For more information, look at CHANGES.txt in
  subversion, JIRA, ...


                          28
More Information

 Lucene Website
  • http://lucene.apache.org
 Users List
  • java-user@lucene.apache.org
 Lucene Revolution presentations
  •   http://www.lucidimagination.com/devzone/events/conferences/revolution/2010
  •   http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

 Contact Info
  •   robert.muir@lucidimagination.com
  •   http://www.lucidimagination.com/blog/author/robert-muir



                                         29

Mais conteúdo relacionado

Mais procurados

Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
Swapnil & Patil
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Lucidworks
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
Rahul Jain
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 

Mais procurados (20)

Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Search at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, TwitterSearch at Twitter: Presented by Michael Busch, Twitter
Search at Twitter: Presented by Michael Busch, Twitter
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 

Semelhante a Improved Search with Lucene 4.0 - Robert Muir

Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
lucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
lucenerevolution
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
GokulD
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept Location
Sonia Haiduc
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
Databricks
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Databricks
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
WO Community
 

Semelhante a Improved Search with Lucene 4.0 - Robert Muir (20)

Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Pharo 10 and beyond
 Pharo 10 and beyond Pharo 10 and beyond
Pharo 10 and beyond
 
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
 
VA Smalltalk Update
VA Smalltalk UpdateVA Smalltalk Update
VA Smalltalk Update
 
AD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension LibraryAD1545 - Extending the XPages Extension Library
AD1545 - Extending the XPages Extension Library
 
Budapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User GroupBudapest Spring MUG 2016 - MongoDB User Group
Budapest Spring MUG 2016 - MongoDB User Group
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Combining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept LocationCombining IR with Relevance Feedback for Concept Location
Combining IR with Relevance Feedback for Concept Location
 
MongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewMongoDB 3.2 Feature Preview
MongoDB 3.2 Feature Preview
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced edition
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
SplunkLive London 2014 Developer Presentation
SplunkLive London 2014  Developer PresentationSplunkLive London 2014  Developer Presentation
SplunkLive London 2014 Developer Presentation
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2
 
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JSCross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
Cross Site Collection Navigation using SPFx, Powershell PnP & PnP-JS
 

Mais de lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution
 

Mais de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...A Novel methodology for handling Document Level Security in Search Based Appl...
A Novel methodology for handling Document Level Security in Search Based Appl...
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Improved Search with Lucene 4.0 - Robert Muir

  • 1. Improved Search with Lucene 4.0 Robert Muir, Lucid Imagination robert.muir@lucidimagination.com, 10/19/2011
  • 2. Introductions  What: Examine improvements in 4.0  Who am I? • Lucene Committer & PMC Member • Lucid Imagination Employee  Many big changes coming!  Examine three general areas: • Indexing Improvements • Search Improvements • Performance Improvements  Future improvements 2
  • 3. Indexing Improvements  Codecs • Flexible index format  Index DocValues • Efficient per-document lookup values  Miscellaneous Improvements • Binary terms • Additional index statistics 3
  • 4. Codecs: Introduction  Problem: “one size fits all” index format  How to integrate future improvements? • Example: more efficient index compression • But need to support old format, too.  How to optimize for your data? • Without digging into the guts of indexer!  Idea: format should be pluggable 4
  • 5. Codecs: Example use case  Accelerate near-realtime reopen  Speed up delete by term on ID field • “Primary key lookup”  Idea: optimize ID field for this use case 5
  • 6. Codecs: available formats  Standard: Lucene 4.0 index format  Pulsing: inline low frequency terms' postings  Memory: loads field entirely into RAM  Appending: supports append-only filesystem  SimpleText: plaintext (not for production!!!!!)  PreFlex: supports Lucene 3.x index format 6
  • 7. Codecs: Configuration Lucene: use CodecProvider class Solr: specify in schema.xml 7
  • 9. Index DocValues: Introduction  Problem: lookup a per-document value • RAM-resident for things like scoring factors (e.g. pagerank) • Disk-resident for things like document versioning  Existing workarounds in Lucene have limitations • Scoring limited to one byte norm • FieldCache requires uninversion to build • Stored fields are slow, geared towards summary results  Idea: add performant, flexible per-document lookups. 9
  • 10. Index DocValues: Limitations  New feature, recently added to Lucene ✔ External scoring factors ✔ Memory-resident and disk-based lookup ✗ Docvalues sort-by-term ✗ Internal scoring factors (die norms, die) ✗ Integration with Solr 10
  • 12. Binary Terms  In Lucene 4.0, index terms are byte[] • Do not need to be unicode text  Example: Localized Sort and Range • Previous versions of Lucene: special encoder • Lucene 4.0: 50% space savings 12
  • 13. Localized Sort/Range Example Lucene: use CollationAnalyzer class Solr: specify in schema.xml 13
  • 14. Additional Index Statistics  New statistics in Lucene 4.0: • totalTermFreq(term): number of occurrences • sumTotalTermFreq(field): number of tokens • sumDocFreq(field): number of postings • docCount(field): number of docs with value  Available from Lucene APIs  Available from function queries  Supports additional scoring algorithms... 14
  • 15. Search Improvements  Improved Scoring API • Additional Scoring algorithms  Spellchecking improvements  Miscellaneous • Query parsing improvements • Deep paging support 15
  • 16. Scoring: Introduction  Problem: “baked-in” vector space model  How to integrate additional algorithms? • Example: Language Models • Before 4.0: write custom Queries • Before 4.0: track certain statistics yourself  How to customize for your data? • Without digging into the guts of postings lists!  Idea: separate “matching” from “scoring” 16
  • 17. Scoring: additional algorithms  BM25  Language Models  Divergence from Randomness  Information-based Models 17
  • 18. Scoring: Configuration Lucene: use Similarity class Solr: specify in schema.xml 18
  • 19. Spellchecking Improvements  New DirectSpellChecker • No additional index needed  Better Suggestions • Levenshtein vs. n-gram  Exposes more configuration options • Tune to your collection 19
  • 20. Spellchecking: Configuration Lucene: use DirectSpellChecker class Solr: specify in solrconfig.xml 20
  • 21. Queryparsing Improvements  Regular expression queries • /mycompany.(com|org|net)/  Specify number of edits for fuzzy • foobar~2  Wildcard escaping • crazy?*  Range query syntax improvements • Mix inclusive/exclusive bounds • Support open-ended syntax in Lucene 21
  • 22. Deep-paging support  Problem: users who page deep into results :)  Normal paging in lucene: • Page 1: search(“foo”, 20) ← look at 1-20 • Page 2: search(“foo”, 40) ← look at 21-40  Deep paging in lucene: • Page 1: searchAfter(null, “foo”, 20) • Page 2: searchAfter(lastResult, “foo”, 20) 22
  • 23. Performance Improvements  Concurrent Flushing  Fast Fuzzy Query  Improved RAM Efficiency 23
  • 25. Fast Fuzzy Query In Lucene 3 this thing is < 1 QPS!
  • 26. Improved RAM Efficiency  Lucene 4.0 uses much less memory • Terms Index, Suggester, Synonyms : finite-state • Fieldcache: packed integers / utf-8  Example with Wikipedia collection: “Memory footprint reduction from 389M to 90M after some off-the wall sorting and faceting”
  • 27. Future Improvements  Block Index Compression • PFOR-delta, Simple8b, …  Positional iterators from Scorers • Offsets in postings lists (fast highlighting) • Proximity Scoring  Structured/Section Scoring (e.g. BM25F)  Improved flexibility through Codec • stored fields, term vectors, …  Faster filtered search 27
  • 28. Conclusion  Lucene 4.0 will have many improvements, architectural, and API changes.  A few of these were introduced here: • Ability to customize the index format • Additional scoring algorithms • Performance Improvements  More are currently under development.  For more information, look at CHANGES.txt in subversion, JIRA, ... 28
  • 29. More Information  Lucene Website • http://lucene.apache.org  Users List • java-user@lucene.apache.org  Lucene Revolution presentations • http://www.lucidimagination.com/devzone/events/conferences/revolution/2010 • http://www.lucidimagination.com/devzone/events/conferences/revolution/2011  Contact Info • robert.muir@lucidimagination.com • http://www.lucidimagination.com/blog/author/robert-muir 29