Lucene today, tomorrow and beyond
                Simon Willnauer
                Apache Lucene Core Committer & PMC Chair
                simonw@apache.org / simon.willnauer@searchworkings.org




Thursday, October 20, 2011
Who am I?



       • Lucene Core Committer
       • Project Management Committee Chair (PMC)
       • Apache Member
       • BerlinBuzzwords Co-Founder
       • Addicted to OpenSource
       • Apache Solr & Lucene User / Consultant / Promoter




http://www.searchworkings.org

       • Community Portal targeting OpenSource Search




What makes this talk different?




       • Most of the talks here present what Lucene can do or what
          people do with Lucene, right?

       • This talk will show what Lucene can’t do today (trunk) but might be
          doing in the future.

       • I won’t talk about what people are going to do in the future - maybe next
          time :)




Let’s go back in time a bit

       [Timeline figure, year axis 2001 to 2014, annotated “Happy Birthday!”.
        Milestones in order:]
       • Lucene joined the ASF
       • Lucene 1.2
       • Lucene 1.4
       • Lucene becomes Apache TLP
       • Lucene 2.0
       • Lucene 2.1 & 2.2
       • Lucene 2.3
       • Lucene 2.4
       • Lucene 2.9 & 3.0
       • Lucene & Solr Merge
       • Lucene 3.1 - 3.4
       • Lucene 4.0 ?

And who did all the work?




                                                                                                            Created from Lucene core CHANGES.TXT




 Especially “via” is interesting since we use this for contributions from non-committers (FooBar via $committer_name)

Let’s make this a fair game!




                             28 committers from 8 different countries

And the companies




Where are we now - once 4.0 is out?



       • Lucene 4.0 contains a ton of smallish improvements
       • Lots of refined APIs
       • Large speed improvements
       • New modules
       • And lots of paths to explore for the future!




Some random improvements

       • FuzzyQuery speedup by 20000% (yes 20k!)
       • Indexing throughput improvements 200% to 280%
       • Document Filtering speedup up to 480%
       • Loading term dictionaries up to 30x faster using 10% of the memory
          compared to 3.x

       • 600000 key-value lookups/second
       • Tremendous reduction of GC needs at runtime


                      Your mileage may vary!
Flexible Indexing & Codecs

       • Lets you customize the low-level index structure per field
       • Yields significant performance gains depending on the use-case
       • Highly optimized data-structures
       • Enables future improvements thanks to per-codec backwards compatibility
       • Lets you decide on memory consumption




IndexDocValues

       • Value per field & document - similar to FieldCache
       • Type-safe and efficient on-disk & in-memory access
       • Soon updatable
       • More flexible than FieldCache
       • Fast loading times




Flexible Scoring

       • New ranking models in addition to VSM
       • Adds key statistics to Lucene index to support other scoring models
       • Decoupled matching from ranking
       • Powerful Similarity API (can use IndexDocValues)




What else?

       • DocumentsWriterPerThread
          • High throughput incremental indexing
          • Preparation for RT-Search
       • AutomatonQuery (FuzzyQuery)
          • Query as a Deterministic Finite Automaton (DFA)
          • Levenshtein Automata for fast Fuzzy Queries (up to 20000%
                improvement over 3.x)
          • Flexible Automata concatenation
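
To make the automaton idea concrete, here is a small sketch in plain Java. It is an assumption-laden illustration, not Lucene's implementation: instead of precompiling a DFA, it simulates the Levenshtein automaton lazily, using one row of the edit-distance table as the current state. `FuzzySketch` and its methods are hypothetical names. What it shows is the key property: a candidate term can be rejected as soon as the state is dead, without computing a full edit distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: the automaton state is one row of the edit-distance table.
final class FuzzySketch {

    /** Initial state: distance from the empty input to each prefix of the query term. */
    static int[] start(String query) {
        int[] row = new int[query.length() + 1];
        for (int j = 0; j < row.length; j++) {
            row[j] = j;
        }
        return row;
    }

    /** Transition: consume one input character and return the next state. */
    static int[] step(int[] row, String query, char c) {
        int[] next = new int[row.length];
        next[0] = row[0] + 1;
        for (int j = 1; j < row.length; j++) {
            int substitute = row[j - 1] + (query.charAt(j - 1) == c ? 0 : 1);
            int delete = row[j] + 1;
            int insert = next[j - 1] + 1;
            next[j] = Math.min(substitute, Math.min(delete, insert));
        }
        return next;
    }

    /** A state is dead once no entry is within the edit budget. */
    static boolean canStillMatch(int[] row, int maxEdits) {
        for (int d : row) {
            if (d <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    /** Runs the candidate through the automaton, rejecting early on a dead state. */
    static boolean matches(String query, String candidate, int maxEdits) {
        int[] row = start(query);
        for (int i = 0; i < candidate.length(); i++) {
            row = step(row, query, candidate.charAt(i));
            if (!canStillMatch(row, maxEdits)) {
                return false;
            }
        }
        return row[query.length()] <= maxEdits;
    }

    /** Filters a term list; a real term dictionary would be walked in sorted order. */
    static List<String> fuzzy(String query, List<String> terms, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (matches(query, term, maxEdits)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "lucine", "lucent", "solr", "luke");
        System.out.println(fuzzy("lucene", terms, 1)); // prints [lucene, lucine, lucent]
    }
}
```

A precompiled DFA goes further than this sketch: because the set of reachable states is finite for a fixed edit budget, the transitions can be computed once per query and then matched against many terms.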



This is what we get with Lucene 4.0 (roughly)

       • What is missing in this picture?
       • Where are we going?
       • What comes after 4.0?
       • What is not going to make it into 4.0?


       All this boils down to: “What do WE & YOU want
       Lucene to become in the future?”




Lucene - a Full Text Search Library




                     CORE SEARCH
                  FEATURES! - LIMITATIONS?



Positions - not a first class citizen

       • We have:
          • Spans (Near, First, MultiTerm...)
          • PhraseQuery (sloppy & strict)
       • The Problem:
          • Either use “common” query hierarchy or Spans
          • Score ALL or NOTHING
          • Scoring lots of documents takes ages




Positions - not a first class citizen

       • Solutions?
          • Multi-Phase searches
             • Collect documents without positions
             • Re-score top N based on position data
          • Query hierarchy can be complex
             • We need an API with the same granularity as Scorer
          • Span semantics should not be bound to a query
             • Divorce scoring & matching for positions
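
The multi-phase idea can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: documents are token arrays, the query has exactly two distinct terms, and `TwoPhaseSketch` with its methods is a hypothetical name, not a Lucene API. Phase 1 matches and ranks using term counts only; phase 2 looks at positions, but only for the top N hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a two-phase search over tokenized documents.
final class TwoPhaseSketch {

    static final String[][] DOCS = tokenize(
            "the old night keeper keeps the keep in the town",
            "in the big old house in the big old gown",
            "the house in the town had the big old keep",
            "where the old night keeper never did sleep",
            "the night keeper keeps the keep in the night",
            "and keeps in the dark and sleeps in the light");

    static String[][] tokenize(String... texts) {
        String[][] docs = new String[texts.length][];
        for (int i = 0; i < texts.length; i++) {
            docs[i] = texts[i].split(" ");
        }
        return docs;
    }

    /** Phase 1: cheap score from term counts only; 0 unless both terms occur. */
    static int cheapScore(String[] doc, String a, String b) {
        int countA = 0;
        int countB = 0;
        for (String token : doc) {
            if (token.equals(a)) countA++;
            if (token.equals(b)) countB++;
        }
        return (countA > 0 && countB > 0) ? countA + countB : 0;
    }

    /** Phase 2: positional score, 1 / smallest gap between the two terms. */
    static double proximityScore(String[] doc, String a, String b) {
        int lastA = -1;
        int lastB = -1;
        int best = Integer.MAX_VALUE;
        for (int i = 0; i < doc.length; i++) {
            if (doc[i].equals(a)) lastA = i;
            if (doc[i].equals(b)) lastB = i;
            if (lastA >= 0 && lastB >= 0) {
                best = Math.min(best, Math.abs(lastA - lastB));
            }
        }
        return best == Integer.MAX_VALUE ? 0.0 : 1.0 / best;
    }

    /** Collects matches without positions, then re-scores only the top N by proximity. */
    static List<Integer> search(String[][] docs, String a, String b, int topN) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.length; id++) {
            if (cheapScore(docs[id], a, b) > 0) {
                hits.add(id);
            }
        }
        hits.sort(Comparator.comparingInt((Integer id) -> cheapScore(docs[id], a, b)).reversed());
        List<Integer> top = new ArrayList<>(hits.subList(0, Math.min(topN, hits.size())));
        top.sort(Comparator.comparingDouble((Integer id) -> proximityScore(docs[id], a, b)).reversed());
        return top;
    }

    public static void main(String[] args) {
        // Document ids are 0-based array indexes here.
        System.out.println(search(DOCS, "old", "keep", 10)); // prints [2, 0]
    }
}
```

Both matching documents tie in phase 1, and only the position-aware second phase moves the one with adjacent terms to the front. The expensive positional work is bounded by N rather than by the number of matches.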


Positions - not a first class citizen

       • What about highlighting?
          • The implementation is a mess
          • Tons of if (query instanceof FooQuery)
          • Hard to extend for custom queries
       • First steps are already taken!
          • http://svn.apache.org/repos/asf/lucene/dev/branches/positions/
          • Scorer lets you pull positions for any query - Help Wanted!




Updates - Huh? Incremental you know!

       • Everybody wants it, right?
          • Updating a field without reindexing the entire doc? Yeah!
          • Watch out, this does not come for free!
       • You can’t simply update a field - it’s an inverted index!
          • Term -> [ (docID, freq) ] (how do you update this?)
          • Lucene is write once - no in-place updates (which is good!)
       • Could we write per-field, per-segment deltas and merge them on
          IndexReader open?! - seems tricky?

       • Lots of paths need to be explored - maybe “appending fields”?

Updates - Huh? Incremental you know!

       Documents:
          1   The old night keeper keeps the keep in the town
          2   In the big old house in the big old gown.
          3   The house in the town had the big old keep
          4   Where the old night keeper never did sleep.
          5   The night keeper keeps the keep in the night
          6   And keeps in the dark and sleeps in the light.

       term      freq   posting list
       and       1      6
       big       2      2 3
       dark      1      6
       did       1      4
       gown      1      2
       had       1      3
       house     2      2 3
       in        5      1 2 3 5 6
       keep      3      1 3 5
       keeper    3      1 4 5
       keeps     3      1 5 6
       light     1      6
       never     1      4
       night     3      1 4 5
       old       4      1 2 3 4
       sleep     1      4
       sleeps    1      6
       the       6      1 2 3 4 5 6
       town      2      1 3
       where     1      4

       Updated document:
          2   In the small old house in the big old gown.
       • update freq & postings
       • insert new term
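
The cost of such an update can be sketched with a toy inverted index in plain Java. This is a deliberately mutable in-memory map, which is exactly what Lucene's write-once segments are not; `InvertedIndexSketch` and its methods are hypothetical names. It indexes the six documents above as Term -> [ (docID, freq) ] and reports which posting lists a one-word change to document 2 has to touch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: a mutable in-memory inverted index, term -> (docID -> freq).
final class InvertedIndexSketch {

    final TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();
    final Map<Integer, String> stored = new HashMap<>();

    /** Lowercases, strips punctuation and counts term occurrences in one document. */
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z ]", "");
        for (String term : cleaned.split(" +")) {
            if (!term.isEmpty()) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return freqs;
    }

    void add(int docId, String text) {
        stored.put(docId, text);
        for (Map.Entry<String, Integer> e : termFreqs(text).entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).put(docId, e.getValue());
        }
    }

    int docFreq(String term) {
        TreeMap<Integer, Integer> list = postings.get(term);
        return list == null ? 0 : list.size();
    }

    /** Replaces an existing document; returns the terms whose posting lists changed. */
    Set<String> update(int docId, String newText) {
        Map<String, Integer> oldFreqs = termFreqs(stored.get(docId));
        Map<String, Integer> newFreqs = termFreqs(newText);
        Set<String> touched = new TreeSet<>();
        for (String term : oldFreqs.keySet()) {
            if (!newFreqs.containsKey(term)) {
                postings.get(term).remove(docId);
                if (postings.get(term).isEmpty()) {
                    postings.remove(term);
                }
                touched.add(term);
            }
        }
        for (Map.Entry<String, Integer> e : newFreqs.entrySet()) {
            if (!e.getValue().equals(oldFreqs.get(e.getKey()))) {
                postings.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).put(docId, e.getValue());
                touched.add(e.getKey());
            }
        }
        stored.put(docId, newText);
        return touched;
    }

    /** Builds the index for the six example documents, numbered from 1. */
    static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The old night keeper keeps the keep in the town");
        index.add(2, "In the big old house in the big old gown.");
        index.add(3, "The house in the town had the big old keep");
        index.add(4, "Where the old night keeper never did sleep.");
        index.add(5, "The night keeper keeps the keep in the night");
        index.add(6, "And keeps in the dark and sleeps in the light.");
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = sample();
        System.out.println(index.update(2, "In the small old house in the big old gown."));
        // prints [big, small]
    }
}
```

Changing a single word rewrites the entry for "big" (its in-document frequency drops from 2 to 1) and inserts a posting list for the new term "small". With sorted, compressed, write-once posting lists neither change can be made in place.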




Updates - Huh? Incremental you know!

       • Much easier (and closer) for not-indexed values
          • IndexDocValues
       • Assumption:
          • Document Title OR Body changes are infrequent
          • PageRank OR User-Ratings change very frequently
       • Maybe available in 4.0
       • Bottom Line: this is still far away but on the list!
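
Why non-indexed values are the easier case can be sketched in a few lines of plain Java. This is an assumption about the general shape of a per-document value column, not the IndexDocValues API: `DocValuesSketch` and its methods are hypothetical. A value such as a user rating lives in one slot per document, so changing it is a single write that touches no posting list.

```java
import java.util.Arrays;

// Hypothetical sketch: one numeric value per document, addressed by document id.
final class DocValuesSketch {

    private final long[] values; // slot i holds the value for document i

    DocValuesSketch(int maxDoc) {
        this.values = new long[maxDoc];
    }

    long get(int docId) {
        return values[docId];
    }

    /** Updating a value is one array write; no terms or postings are involved. */
    void set(int docId, long value) {
        values[docId] = value;
    }

    /** Returns up to n document ids ordered by descending value. */
    int[] topByValue(int n) {
        Integer[] ids = new Integer[values.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (x, y) -> Long.compare(values[y], values[x]));
        int[] top = new int[Math.min(n, ids.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = ids[i];
        }
        return top;
    }

    public static void main(String[] args) {
        DocValuesSketch ratings = new DocValuesSketch(3);
        ratings.set(0, 3);
        ratings.set(1, 5);
        ratings.set(2, 1);
        System.out.println(Arrays.toString(ratings.topByValue(2))); // prints [1, 0]
        ratings.set(2, 9); // a frequent change such as a new user rating
        System.out.println(Arrays.toString(ratings.topByValue(2))); // prints [2, 1]
    }
}
```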




The JVM - or is it the JIT?

       • Unpredictable Mr. JIT




                             Grouping benchmark changes Spans? WTF?

The JVM - or is it the JIT?

       • The cost of a virtual method call




                                ConjunctionScorer Code Specialization
The JVM - or is it the JIT?

       • Lucene has a lot of HOT loops
          • Each TermScorer needs DocID & TermFreq for every possible hit
          • Calling DocsEnum#next() & #freq() adds up
          • Inlining seems unreliable

       • Solutions?




Possible Solutions / Paths to explore

       • Native Code / Generation (that’s gonna be fun!)
       • Code Specialization
          • Can bring 50% to 100% performance improvements
       • ByteCode Generation & Query Compilation
          • Prototypes for FunctionQuery yield 300% speed improvements
       • Bulk Reading APIs - BulkPostings branch - watch out, it’s hairy
          • Reading more than one DocID / TermFreq at a time
          • More than one step backwards - API wise
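
The difference between per-document and bulk reading can be sketched in plain Java. The interfaces below are hypothetical and are not the API of the BulkPostings branch; they only contrast the two access patterns. The one-by-one loop pays two interface calls per hit, while the bulk loop pays one call per buffer and then runs a tight loop over plain arrays.

```java
// Hypothetical sketch contrasting per-document iteration with bulk reads.
final class BulkPostingsSketch {

    /** Per-document access: one call for the doc id, one for the frequency. */
    interface DocsIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int next();
        int freq();
    }

    /** Bulk access: fills caller-owned buffers of equal length, returns the count (0 when exhausted). */
    interface BulkDocsReader {
        int fill(int[] docs, int[] freqs);
    }

    /** Array-backed postings implementing both access styles. */
    static final class ArrayPostings implements DocsIterator, BulkDocsReader {
        private final int[] docs;
        private final int[] freqs;
        private int pos = -1;    // cursor for next() / freq()
        private int bulkPos = 0; // cursor for fill()

        ArrayPostings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        @Override
        public int next() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        @Override
        public int freq() {
            return freqs[pos];
        }

        @Override
        public int fill(int[] outDocs, int[] outFreqs) {
            int n = Math.min(outDocs.length, docs.length - bulkPos);
            System.arraycopy(docs, bulkPos, outDocs, 0, n);
            System.arraycopy(freqs, bulkPos, outFreqs, 0, n);
            bulkPos += n;
            return n;
        }
    }

    /** Two interface calls per hit. */
    static long sumFreqsOneByOne(DocsIterator it) {
        long sum = 0;
        while (it.next() != DocsIterator.NO_MORE_DOCS) {
            sum += it.freq();
        }
        return sum;
    }

    /** One interface call per buffer, then a tight loop over a plain array. */
    static long sumFreqsBulk(BulkDocsReader reader, int bufferSize) {
        int[] docs = new int[bufferSize];
        int[] freqs = new int[bufferSize];
        long sum = 0;
        for (int n = reader.fill(docs, freqs); n > 0; n = reader.fill(docs, freqs)) {
            for (int i = 0; i < n; i++) {
                sum += freqs[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] docs = {1, 3, 5, 8};
        int[] freqs = {3, 1, 2, 3};
        System.out.println(sumFreqsOneByOne(new ArrayPostings(docs, freqs))); // prints 9
        System.out.println(sumFreqsBulk(new ArrayPostings(docs, freqs), 3));  // prints 9
    }
}
```

The bulk variant also shows why the slide calls this a step backwards API-wise: the caller now owns buffers, counts and cursor handling that the iterator used to hide.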


ByteCode generation

       • Specializing Queries at Runtime?
          • Might bring nice speed improvements per use-case
          • Problems arise with testing and correctness?
       • Could help tremendously with bulk postings
          • Some people say the API is unusable (Uwe?)
          • Maybe you don’t need to use it at all?
          • Would be nice if you could specify your query on a very high level and
                Lucene generates optimal code for you?




The Future beyond the core



       • Users have two options
          • Nothing - plain Lucene (well it’s a lot already - a lot to code)
          • All - Solr / ElasticSearch etc.


       • I’d like something in between, you?



<dream>Lucene 5.0</dream>

       • actually, XML is backwards: { “dream” : “Lucene 5.0” }
       • Solr has grown, grown large and is showing its age!
       • 95% of the time I only want one or two “services” Solr provides
          • still I have to use it - all or nothing!
          • I have to set up a (to me) heavyweight container (5 years ago Jetty /
                Tomcat was lightweight - times ‘r changing)
          • I have to figure out this documentation - fair enough!




{“dream” : “Lucene 5.0”}

       • Can we get this more modular, lightweight & lean?
          • I’d rather do some coding than configure 2 lines of XML, you?




       [Diagram: “Modules” at the center.
        today: Suggestions, Faceting, Grouping, Spellchecking, Join
        tomorrow: Replication, Durability / Recovery, CoreUtils]
Isn’t this what Solr is?

       • Not quite!
          • Lucene tries to provide APIs where you can hardly take anything
                away
          • When I think of Solr, you can hardly add anything
       • Everybody should be able to build their own $Solr
       • How hard will it be to draw the line?
       • Who is going to benefit?




Back to {“dream” : “Lucene 5.0”}

       • Can we go one step further?
       [Diagram labels: HTTP - Module, Service - Module]




       • ElasticSearch did a great job making things dead simple!
          • we should follow this example and less might be more eventually!
       • Taking it as far as ElasticSearch (all or nothing again) does not seem the
          right path for Lucene but simple is good, no?




Disclaimer




       • This was my personal vision, maybe not the one other people have.

       • Let’s see what the community wants / needs - It’s all about the users!




Questions




                             Thank you!




 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

Willnauer today tomorrow_and_beyond_eurocon2011

  • 1. Lucene today, tomorrow and beyond Simon Willnauer Apache Lucene Core Committer & PMC Chair simonw@apache.org / simon.willnauer@searchworkings.org Thursday, October 20, 2011
  • 2. Who am I?
      • Lucene Core Committer
      • Project Management Committee (PMC) Chair
      • Apache Member
      • BerlinBuzzwords Co-Founder
      • Addicted to OpenSource
      • Apache Solr & Lucene User / Consultant / Promoter
  • 3. http://www.searchworkings.org
      • Community Portal targeting OpenSource Search
  • 4. What makes this talk different?
      • Most of the talks here present what Lucene can do or what people do with Lucene, right?
      • This talk will show what Lucene can’t do today (trunk) but might be doing in the future.
      • I won’t talk about what people are going to do in the future - maybe next time :)
  • 5. Let’s go back in time a bit
      [Timeline figure with a year axis from 2001 onward. Labels (order approximate): Lucene joined the ASF; Lucene becomes Apache TLP; Lucene 1.2; Lucene 1.4; Lucene 2.0; Lucene 2.1 & 2.2; Lucene 2.3; Lucene 2.4; Lucene 2.9 & 3.0; Lucene & Solr Merge; Lucene 3.1 - 3.4; Lucene 4.0 ?; annotation: “Happy Birthday!”]
  • 6. And who did all the work?
      [Figure created from Lucene core CHANGES.TXT]
      “via” is especially interesting, since we use it for contributions from non-committers (FooBar via $committer_name)
  • 7. Let’s make this a fair game!
      28 committers from 8 different countries
  • 8. And the companies
  • 9. Where are we now - once 4.0 is out?
      • Lucene 4.0 contains a ton of smallish improvements
      • Lots of refined APIs
      • Large speed improvements
      • New modules
      • And lots of paths to explore for the future!
  • 10. Some random improvements
      • FuzzyQuery speedup by 20000% (yes 20k!)
      • Indexing throughput improvements of 200% to 280%
      • Document filtering speedup of up to 480%
      • Loading term dictionaries up to 30x faster using 10% of the memory compared to 3.x
      • 600000 key-value lookups/second
      • Tremendous reduction of GC needs at runtime
      Your mileage may vary!
  • 11. Flexible Indexing & Codecs
      • Lets you customize the low-level index structure per field
      • Yields significant performance gains depending on the use-case
      • Highly optimized data-structures
      • Allows future improvements due to per-codec backwards compatibility
      • Lets you decide on memory consumption
  • 12. IndexDocValues
      • Value per field & document - similar to FieldCache
      • Type-safe and efficient on-disk & in-memory access
      • Soon update-able
      • More flexible than FieldCache
      • Fast loading times
  • 13. Flexible Scoring
      • New ranking models in addition to VSM
      • Adds key statistics to the Lucene index to support other scoring models
      • Decoupled matching from ranking
      • Powerful Similarity API (can use IndexDocValues)
  • 14. What else?
      • DocumentsWriterPerThread
          • High throughput incremental indexing
          • Preparation for RT-Search
      • AutomatonQuery (FuzzyQuery)
          • Query as a Deterministic Finite Automaton (DFA)
          • Levenshtein Automata for fast Fuzzy Queries (up to 20000% improvement over 3.x)
          • Flexible Automata concatenation
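To make the "query as an automaton" idea concrete, here is a minimal sketch in plain Java. It is not Lucene's AutomatonQuery or its Levenshtein automata code (which, as far as the slide indicates, works on a precomputed DFA); it swaps in a simpler technique and simulates the automaton lazily, using one row of the edit-distance table as the state. The class and method names are made up for illustration.

```java
/** Illustrative only: lazily simulated Levenshtein automaton, not Lucene's implementation. */
final class FuzzyAutomatonSketch {
    private final String term;
    private final int maxEdits;

    FuzzyAutomatonSketch(String term, int maxEdits) {
        this.term = term;
        this.maxEdits = maxEdits;
    }

    /** Start state: matching the first i characters of the term against "" costs i edits. */
    int[] start() {
        int[] row = new int[term.length() + 1];
        for (int i = 0; i < row.length; i++) {
            row[i] = i;
        }
        return row;
    }

    /** Transition: consume one character of the candidate and produce the next state. */
    int[] step(int[] state, char c) {
        int[] next = new int[state.length];
        next[0] = state[0] + 1; // c is an extra character before any term character
        for (int i = 1; i < state.length; i++) {
            int substitute = state[i - 1] + (term.charAt(i - 1) == c ? 0 : 1);
            int extraChar = state[i] + 1;      // c has no counterpart in the term
            int skippedChar = next[i - 1] + 1; // term character i has no counterpart
            next[i] = Math.min(substitute, Math.min(extraChar, skippedChar));
        }
        return next;
    }

    /** A state is dead once every entry exceeds maxEdits; no suffix can recover from it. */
    boolean canStillMatch(int[] state) {
        for (int cost : state) {
            if (cost <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    boolean isAccepting(int[] state) {
        return state[state.length - 1] <= maxEdits;
    }

    /** Runs the candidate through the automaton, stopping early on a dead state. */
    boolean accepts(String candidate) {
        int[] state = start();
        for (int i = 0; i < candidate.length(); i++) {
            state = step(state, candidate.charAt(i));
            if (!canStillMatch(state)) {
                return false;
            }
        }
        return isAccepting(state);
    }
}
```

The early exit on a dead state is what makes an automaton attractive when walking a sorted term dictionary: whole ranges of terms sharing a rejected prefix can be skipped instead of being compared one by one.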
  • 15. This is what we get with Lucene 4.0 (roughly)
      • What is missing in this picture?
      • Where are we going?
      • What comes after 4.0?
      • What is not going to make it into 4.0?
      All this boils down to: “What do WE & YOU want Lucene to become in the future?”
  • 16. Lucene - a Full Text Search Library
      CORE SEARCH FEATURES! - LIMITATIONS?
  • 17. Positions - not a first class citizen
      • We have:
          • Spans (Near, First, MultiTerm...)
          • PhraseQuery (sloppy & strict)
      • The Problem:
          • Either use the “common” query hierarchy or Spans
          • Score ALL or NOTHING
          • Scoring lots of documents takes ages
  • 18. Positions - not a first class citizen
      • Solutions?
      • Multi-Phase searches
          • Collect documents without positions
          • Re-score top N based on position data
      • Query hierarchy can be complex
      • We need an API with the same granularity as Scorer
      • Span semantics should not be bound to a query
      • Divorce scoring & matching for positions
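The multi-phase idea on this slide can be sketched as follows. This is plain Java under stated assumptions, not a Lucene API: the `Doc` class, the two-term query, and both scoring formulas are hypothetical stand-ins. Phase 1 ranks every match by a cheap score that needs no positions; phase 2 reads positions only for the top N and re-ranks them with a proximity boost.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: two-phase ranking for a hypothetical two-term query. */
final class TwoPhaseSearchSketch {
    /** Hypothetical document: sorted positions of query terms A and B. */
    static final class Doc {
        final int id;
        final int[] posA;
        final int[] posB;

        Doc(int id, int[] posA, int[] posB) {
            this.id = id;
            this.posA = posA;
            this.posB = posB;
        }
    }

    /** Phase 1 score: term frequencies only (array lengths stand in for stored freqs). */
    static double cheapScore(Doc d) {
        return d.posA.length + d.posB.length;
    }

    /** Phase 2 boost: larger when the closest A/B pair is nearer together. */
    static double proximityBoost(Doc d) {
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < d.posA.length && j < d.posB.length) {
            best = Math.min(best, Math.abs(d.posA[i] - d.posB[j]));
            if (d.posA[i] < d.posB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return 1.0 / (1.0 + best);
    }

    /** Returns doc ids: all matches ranked cheaply, then only the top N re-ranked with positions. */
    static List<Integer> search(List<Doc> docs, int topN) {
        List<Doc> matches = new ArrayList<>();
        for (Doc d : docs) {
            if (d.posA.length > 0 && d.posB.length > 0) {
                matches.add(d);
            }
        }
        matches.sort(Comparator.comparingDouble((Doc d) -> -cheapScore(d)));
        List<Doc> top = new ArrayList<>(matches.subList(0, Math.min(topN, matches.size())));
        top.sort(Comparator.comparingDouble((Doc d) -> -(cheapScore(d) + proximityBoost(d))));
        List<Integer> ids = new ArrayList<>();
        for (Doc d : top) {
            ids.add(d.id);
        }
        return ids;
    }
}
```

The point of the split is that the expensive position data is touched for at most N documents, regardless of how many documents match.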
  • 19. Positions - not a first class citizen
      • What about highlighting?
          • The implementation is a mess
          • Tons of if (query instanceof FooQuery)
          • Hard to extend for custom queries
      • First steps are already taken!
          • http://svn.apache.org/repos/asf/lucene/dev/branches/positions/
          • Scorer can expose positions for any query - Help Wanted!
  • 20. Updates - Huh? Incremental you know!
      • Everybody wants it, right?
      • Updating a field without reindexing the entire doc? Yeah!
      • Watch out, this does not come for free!
      • You can’t simply update a field - it’s an inverted index!
          • Term -> [ (docID, freq) ] (how to update this?)
      • Lucene is write once - no in-place updates (which is good!)
      • We have to write per-field, per-segment deltas and merge them on IndexReader open?! - seems tricky?
      • Lots of paths need to be explored - maybe “appending fields”?
  • 21. Updates - Huh? Incremental you know!
      Documents:
        1  The old night keeper keeps the keep in the town
        2  In the big old house in the big old gown.
        3  The house in the town had the big old keep
        4  Where the old night keeper never did sleep.
        5  The night keeper keeps the keep in the night
        6  And keeps in the dark and sleeps in the light.
      Term dictionary:
        term     freq   posting list
        and      1      6
        big      2      2 3
        dark     1      6
        did      1      4
        gown     1      2
        had      1      3
        house    2      2 3
        in       5      1 2 3 5 6
        keep     3      1 3 5
        keeper   3      1 4 5
        keeps    3      1 5 6
        light    1      6
        never    1      4
        night    3      1 4 5
        old      4      1 2 3 4
        sleep    1      4
        sleeps   1      6
        the      6      1 2 3 4 5 6
        town     2      1 3
        where    1      4
      Updated document 2: In the small old house in the big old gown.
        • update freq & postings
        • insert new term
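The example on this slide can be reproduced with a small sketch in plain Java (not Lucene code; the class, the tokenizer, and the method names are assumptions made for illustration). It builds Term -> (docID -> within-document frequency) for the six documents and then reports which dictionary entries a change to document 2 would touch.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: a toy inverted index over the six documents on the slide. */
final class InvertedIndexSketch {
    /** The six example documents, keyed by docID. */
    static Map<Integer, String> slideDocs() {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "The old night keeper keeps the keep in the town");
        docs.put(2, "In the big old house in the big old gown.");
        docs.put(3, "The house in the town had the big old keep");
        docs.put(4, "Where the old night keeper never did sleep.");
        docs.put(5, "The night keeper keeps the keep in the night");
        docs.put(6, "And keeps in the dark and sleeps in the light.");
        return docs;
    }

    /** Naive tokenizer (an assumption): lowercase, split on anything that is not a-z. */
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    /** Builds term -> (docID -> frequency of the term inside that document). */
    static TreeMap<String, TreeMap<Integer, Integer>> build(Map<Integer, String> docs) {
        TreeMap<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> tf : termFreqs(doc.getValue()).entrySet()) {
                index.computeIfAbsent(tf.getKey(), t -> new TreeMap<>())
                     .put(doc.getKey(), tf.getValue());
            }
        }
        return index;
    }

    /** Terms whose entry for a document changes when its text goes from oldText to newText. */
    static Set<String> termsTouchedByUpdate(String oldText, String newText) {
        Map<String, Integer> before = termFreqs(oldText);
        Map<String, Integer> after = termFreqs(newText);
        Set<String> touched = new TreeSet<>();
        for (String term : before.keySet()) {
            if (!before.get(term).equals(after.get(term))) {
                touched.add(term);
            }
        }
        for (String term : after.keySet()) {
            if (!after.get(term).equals(before.get(term))) {
                touched.add(term);
            }
        }
        return touched;
    }
}
```

Calling `termsTouchedByUpdate` with the old and new text of document 2 yields `big` and `small`: one existing entry needs a new frequency and one term has to be inserted, even though only a single word of the document changed.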
  • 22. Updates - Huh? Incremental you know!
      • Much easier (and closer) for not-indexed values
          • IndexDocValues
      • Assumption:
          • Document Title OR Body change infrequently
          • PageRank OR User-Ratings change very frequently
      • Maybe available in 4.0
      • Bottom Line: this is still far away but on the list!
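Why per-document values are the easier case can be seen in a minimal sketch, assuming a purely in-memory column with one slot per docID. This is not the IndexDocValues API, and it deliberately ignores how write-once segments would persist such a change; all names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative only: one long value per document, e.g. a user rating. */
final class DocValuesColumnSketch {
    private final long[] values; // slot i holds the value of docID i

    DocValuesColumnSketch(int maxDoc) {
        this.values = new long[maxDoc];
    }

    long get(int docId) {
        return values[docId];
    }

    /** Updating a value touches exactly one slot; no term dictionary or posting list is involved. */
    void set(int docId, long value) {
        values[docId] = value;
    }

    /** Returns up to n docIDs ordered by descending value. */
    int[] topDocs(int n) {
        Integer[] ids = new Integer[values.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> Long.compare(values[b], values[a]));
        int[] top = new int[Math.min(n, ids.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = ids[i];
        }
        return top;
    }
}
```

Compared with changing an indexed field, a rating update here is a single array write, which is why frequently changing values such as PageRank or user ratings are the natural first candidates.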
  • 23. The JVM - or is it the JIT?
      • Unpredictable Mr. JIT
      [Figure: Grouping benchmark changes Spans? WTF?]
  • 24. The JVM - or is it the JIT?
      • The cost of a virtual method call
      [Figure labels: ConjunctionScorer, Code Specialization]
  • 25. The JVM - or is it the JIT?
      • Lucene has a lot of HOT loops
      • Each TermScorer needs DocID & TermFreq for every possible hit
      • Calling DocsEnum#next() & #freq() adds up
      • Inlining seems unreliable
      • Solutions?
  • 26. Possible Solutions / Paths to explore
      • Native Code / Generation (that’s gonna be fun!)
      • Code Specialization
          • Can bring 50% to 100% performance improvements
      • ByteCode Generation & Query Compilation
          • Prototypes for FunctionQuery yield 300% speed improvements
      • Bulk Reading APIs - BulkPostings branch - watch out, it’s hairy
          • Reading more than one DocID / TermFreq at a time
          • More than one step backwards - API wise
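The difference between per-hit calls and bulk reading can be sketched as below. Both interfaces are hypothetical and written only for this example; they are not the DocsEnum API or the API of the BulkPostings branch. The doc-at-a-time loop makes two interface calls per hit, while the bulk loop makes one call per block and then runs over plain arrays.

```java
/** Illustrative only: doc-at-a-time versus bulk access to one posting list. */
final class BulkPostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical per-hit interface, loosely in the spirit of next()/freq(). */
    interface DocAtATime {
        int next();
        int freq();
    }

    /** Hypothetical bulk interface: fills both buffers, returns how many entries were written. */
    interface BulkReader {
        int read(int[] docs, int[] freqs);
    }

    /** Array-backed posting list offering both access styles with independent cursors. */
    static final class ArrayPostings implements DocAtATime, BulkReader {
        private final int[] docs;
        private final int[] freqs;
        private int upto = -1;    // cursor for next()/freq()
        private int bulkUpto = 0; // cursor for read()

        ArrayPostings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        @Override
        public int next() {
            upto++;
            return upto < docs.length ? docs[upto] : NO_MORE_DOCS;
        }

        @Override
        public int freq() {
            return freqs[upto];
        }

        @Override
        public int read(int[] docBuffer, int[] freqBuffer) {
            int n = Math.min(docBuffer.length, docs.length - bulkUpto);
            System.arraycopy(docs, bulkUpto, docBuffer, 0, n);
            System.arraycopy(freqs, bulkUpto, freqBuffer, 0, n);
            bulkUpto += n;
            return n;
        }
    }

    /** Two interface calls per hit. */
    static long sumFreqsOneByOne(DocAtATime postings) {
        long sum = 0;
        while (postings.next() != NO_MORE_DOCS) {
            sum += postings.freq();
        }
        return sum;
    }

    /** One interface call per block; the inner loop only touches arrays. */
    static long sumFreqsBulk(BulkReader postings, int blockSize) {
        int[] docs = new int[blockSize];
        int[] freqs = new int[blockSize];
        long sum = 0;
        int n;
        while ((n = postings.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                sum += freqs[i];
            }
        }
        return sum;
    }
}
```

The sketch also hints at why the slide calls this a step backwards API-wise: the caller now manages buffers and block boundaries itself instead of asking for the next hit.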
  • 27. ByteCode generation
      • Specializing Queries at Runtime?
      • Might bring nice speed improvements per use-case
      • Problems arise with testing and correctness?
      • Could help tremendously with bulk postings
          • Some people say the API is unusable (Uwe?)
          • Maybe you don’t need to use it at all?
      • Would be nice if you could specify your query on a very high level and Lucene generates optimal code for you?
  • 28. The Future beyond the core
      • Users have two options
          • Nothing - plain Lucene (well, it’s a lot already - a lot to code)
          • All - Solr / ElasticSearch etc.
      • I’d like something in between, you?
  • 29. <dream>Lucene 5.0</dream>
      • actually, XML is backwards: { “dream” : “Lucene 5.0” }
      • Solr has grown, grown large and is showing its age!
      • 95% of the time I only want one or two “services” Solr provides
          • still I have to use it - all or nothing!
          • I have to set up a (to me) heavyweight container (5 years ago Jetty / Tomcat was lightweight - times ‘r changing)
          • I have to figure out the documentation - fair enough!
  • 30. {“dream” : “Lucene 5.0”}
      • Can we get this more modular, lightweight & lean?
      • I’d rather do some coding than configure 2 lines of XML, you?
      [Figure: modules today and tomorrow. Labels: Suggestions, Replication, Faceting, Modules, CoreUtils, Grouping, Spellchecking, Durability / Recovery, Join]
  • 31. Isn’t this what Solr is?
      • Not quite!
      • Lucene tries to provide APIs where you can hardly take anything away
      • When I think of Solr, you can hardly add anything
      • Everybody should be able to build their own $Solr
      • How hard will it be to draw the line?
      • Who is going to benefit?
  • 32. Back to {“dream” : “Lucene 5.0”}
      • Can we go one step further?
      [Figure labels: Service - Module, HTTP - Module]
      • ElasticSearch did a great job making things dead simple!
      • We should follow this example; less might be more eventually!
      • Taking it as far as ElasticSearch (all or nothing again) does not seem the right path for Lucene, but simple is good, no?
  • 33. Disclaimer
      • This was my personal vision, maybe not the one other people have.
      • Let’s see what the community wants / needs - It’s all about the users!
  • 34. Questions
      Thank you!