Apache Lucene EuroCon: Preview
                         www.lucene-eurocon.org



Apache Lucene EuroCon                              20 May 2010
Overview
  A link to download these slides will be available after the webcast
  is complete. An on-demand replay will be ready in ~48 hours.

  • Introduction

  • Near Real Time Search: Yonik Seeley

  • Munching & Crunching: Andrzej Białecki

  • Solr in the Cloud: Mark Miller

  • Practical Relevance: Grant Ingersoll

  • Q&A

Near Real Time Search


                               Yonik Seeley


Near Real-Time Search
        Shorter times until updates are searchable/visible

        Lucene 2.9 first laid the groundwork with per-segment searching
                Per-segment FieldCache entries for sorting and FunctionQueries
                NRT IndexWriter.getReader()
                        Makes new segments available before background merging is done
                        Doesn’t require a commit/fsync first

        Solr still needs
                Per-segment faceting
                Per-segment caching
                Per-segment statistics (and anything else that uses FieldCache)
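
The per-segment caching idea can be sketched as follows. This is a toy illustration, not Lucene's FieldCache: the class name, the `build_entry` callback and the segment names are all hypothetical. It only shows why a reopen that adds one segment needs to build one new entry rather than rebuild everything.

```python
class PerSegmentCache:
    """Toy cache keyed by segment name (hypothetical; not Lucene's FieldCache)."""

    def __init__(self, build_entry):
        self._build_entry = build_entry  # function: segment name -> cache entry
        self._entries = {}
        self.builds = 0                  # number of entries built so far

    def refresh(self, live_segments):
        # Drop entries for segments that were merged away or deleted.
        for name in list(self._entries):
            if name not in live_segments:
                del self._entries[name]
        # Build entries only for segments that have not been seen before.
        for name in live_segments:
            if name not in self._entries:
                self._entries[name] = self._build_entry(name)
                self.builds += 1
        return self._entries


cache = PerSegmentCache(build_entry=lambda name: f"entry-for-{name}")
cache.refresh(["seg1", "seg2", "seg3"])          # initial open: three builds
cache.refresh(["seg1", "seg2", "seg3", "seg4"])  # reopen: one more build
print(cache.builds)  # 4
```

A cache keyed on the whole index would have built seven entries' worth of data here (three, then four); keying on segments keeps the cost of a reopen proportional to what changed.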


Existing single-valued faceting algorithm
     Request: q=Juggernaut&facet=true&facet.field=hero

     Documents matching the base query “Juggernaut”: 0, 2, 7

     Lucene FieldCache Entry (StringIndex) for the “hero” field
             order: for each doc, an index into the lookup array
                     5, 3, 5, 1, 4, 5, 2, 1
             lookup: the string values
                     (null), batman, flash, spiderman, superman, wolverine

     For each matching doc: look up its order value, increment that accumulator slot
             accumulator: 0, 1, 0, 0, 0, 2
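
The counting loop in this diagram can be sketched in a few lines. The arrays below are the values read off the slide; the function name is made up for illustration, and this is a simplified model rather than Solr's actual implementation.

```python
def facet_counts(order, lookup, matching_docs):
    """Count facet values for the matching docs using order/lookup arrays."""
    accumulator = [0] * len(lookup)
    for doc in matching_docs:
        accumulator[order[doc]] += 1  # one array lookup and one increment per doc
    # Translate non-zero slots back to their string values; slot 0 is "(null)".
    return {lookup[i]: n for i, n in enumerate(accumulator)
            if n > 0 and lookup[i] is not None}


order = [5, 3, 5, 1, 4, 5, 2, 1]  # per doc: index into lookup
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
counts = facet_counts(order, lookup, matching_docs=[0, 2, 7])
print(counts)  # {'batman': 1, 'wolverine': 2}
```

Docs 0 and 2 both point at slot 5 (wolverine) and doc 7 points at slot 1 (batman), which reproduces the accumulator shown on the slide.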
Per-segment single-valued faceting algorithm
     Base DocSet (0, 2, 7): lookup and increment, done separately per segment

     Segment1: FieldCache Entry, accumulator1 (0, 3, 5, 0, 1, 2), thread1
     Segment2: FieldCache Entry, accumulator2 (0, 2, 1, 0), thread2
     Segment3: FieldCache Entry, accumulator3 (1, 3, 0, 4), thread3
     Segment4: FieldCache Entry, accumulator4 (0, 1, 0), thread4

     FieldCache + accumulator merger (Priority queue)
             Priority queue: flash, 5; Batman, 3
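
A rough sketch of the per-segment variant: each segment is counted independently (optionally on its own thread) and the per-segment results, already sorted by term, are merged. The segment layout (`doc_base`, `order`, `lookup`) and the sample data are assumptions made for this example, and `heapq.merge` stands in for the priority-queue merger on the slide.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def count_segment(segment, base_docs):
    """Return (term, count) pairs for one segment, sorted by term."""
    order, lookup, doc_base = segment["order"], segment["lookup"], segment["doc_base"]
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        local = doc - doc_base               # global doc id -> segment-local id
        if 0 <= local < len(order):
            accumulator[order[local]] += 1
    # lookup is sorted and slot 0 means "no value", so skip it.
    return [(lookup[i], n) for i, n in enumerate(accumulator) if i > 0 and n > 0]


def facet_per_segment(segments, base_docs, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(lambda s: count_segment(s, base_docs), segments))
    # Merge the sorted per-segment streams, then sum counts for equal terms.
    merged = heapq.merge(*per_segment, key=itemgetter(0))
    return {term: sum(n for _, n in group)
            for term, group in groupby(merged, key=itemgetter(0))}


segments = [
    {"doc_base": 0, "lookup": [None, "batman", "wolverine"], "order": [2, 1, 2, 1]},
    {"doc_base": 4, "lookup": [None, "flash", "superman", "wolverine"], "order": [2, 3, 1, 1]},
]
totals = facet_per_segment(segments, base_docs=[0, 2, 5, 7])
print(totals)  # {'flash': 1, 'wolverine': 3}
```

The merge step is the extra cost relative to a single index-wide FieldCache entry: the same term can have a different slot number in every segment, so counts can only be combined by comparing the term values themselves.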
Per-segment faceting
          Enable with facet.method=fcs

          Controllable multi-threading
                  facet.field={!threads=4}myfield

          Disadvantages
                  Larger memory use (FieldCaches + accumulators)
                  Slower (extra FieldCache merge step needed)

          Advantages
                  Rebuilds FieldCache entries only for new segments (NRT friendly)
                  Multi-threaded
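
As an illustration of how these parameters might appear in a request, the snippet below builds a query string with `facet.method=fcs` and the `threads` local parameter from the slide. The host, port, path and the `hero` field name are assumptions for the example; only the parameter names come from the slides.

```python
from urllib.parse import urlencode, parse_qs

params = [
    ("q", "Juggernaut"),
    ("facet", "true"),
    ("facet.method", "fcs"),                # per-segment faceting
    ("facet.field", "{!threads=4}hero"),    # local param: up to 4 threads
]
query_string = urlencode(params)            # percent-encodes {, !, = and }

# Hypothetical endpoint; adjust to the actual Solr installation.
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```

Encoding matters here because the local-parameter syntax contains characters (`{`, `!`, `=`, `}`) that are not safe to send raw in a URL.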

Per-segment faceting performance comparison
        Test index: 10M documents, 18 segments, single-valued field

A       Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

        Time for request*             facet.method=fc           facet.method=fcs
        static index                  3 ms                      244 ms
        quickly changing index        1388 ms                   267 ms

B       Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

        Time for request*             facet.method=fc           facet.method=fcs
        static index                  26 ms                     34 ms
        quickly changing index        741 ms                    94 ms

        *complete request time, measured externally
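
To make the trade-off easier to read, the ratios between the two methods can be computed directly from the figures in tests A and B. The numbers are copied from the slide; the helper name is made up.

```python
# Request times in milliseconds, copied from tests A and B.
timings = {
    "A": {"static": {"fc": 3, "fcs": 244}, "changing": {"fc": 1388, "fcs": 267}},
    "B": {"static": {"fc": 26, "fcs": 34}, "changing": {"fc": 741, "fcs": 94}},
}


def fc_over_fcs(test, index_state):
    """Ratio of fc time to fcs time; values below 1 mean fc is faster."""
    t = timings[test][index_state]
    return t["fc"] / t["fcs"]


for test in ("A", "B"):
    for state in ("static", "changing"):
        print(test, state, round(fc_over_fcs(test, state), 2))
```

On a static index fc is faster in both tests (by a wide margin in test A), while on a quickly changing index fcs is roughly 5 times faster in test A and nearly 8 times faster in test B.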




                        Munching & Crunching
                        Lucene index post-processing and applications




                                      Andrzej Białecki

                          <andrzej.bialecki@lucidimagination.com>

Munching & Crunching Agenda
                 Post-processing
                        Splitting, merging, sorting, pruning

                 Tiered search

                 Bitwise search

                 Map-reduce indexing models




Post-processing
     Isn't it better to build it right from the start?

     Some parameters are difficult to get right...
                  Minimizing index size while retaining search quality
                  Correcting impact of unexpected common words
                  Creating evenly-sized shards

      ...perhaps impossible to get at all during indexing
                  Adding collection-wide factors not computed by Lucene (e.g. avg. length)
                  Optimizing top-N results for common queries
                  Fitting indexes that are too large into RAM


Merging, splitting, sorting, pruning
     Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter 

     Sorting postings by impact and “early termination” search

     Index pruning:
         What data to remove and how?
         Pruning strategies
         Challenges
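
A minimal sketch of impact-sorted postings, early termination and one possible pruning rule, for a single-term query only. The postings, the impact scores and the "keep the top N percent" rule are assumptions chosen for illustration; the slides do not specify which pruning strategies the talk covers.

```python
def sort_by_impact(postings):
    """postings: list of (doc_id, impact). Highest impact first, ties by doc id."""
    return sorted(postings, key=lambda p: (-p[1], p[0]))


def top_n(sorted_postings, n):
    """Single-term top-N: stop after n entries instead of scanning the whole list."""
    results = []
    for doc_id, _impact in sorted_postings:
        if len(results) == n:
            break  # early termination: later postings cannot outrank these
        results.append(doc_id)
    return results


def prune(sorted_postings, keep_percent):
    """Hypothetical pruning rule: keep only the highest-impact postings."""
    keep = max(1, len(sorted_postings) * keep_percent // 100)
    return sorted_postings[:keep]


postings = [(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.1), (5, 0.7)]
by_impact = sort_by_impact(postings)
pruned = prune(by_impact, keep_percent=60)   # keeps 3 of 5 postings

print(top_n(by_impact, 2))  # [2, 5]
print(top_n(pruned, 2))     # [2, 5]
```

For this simple case the pruned list returns the same top 2 as the full list, which is the property that pruning tries to preserve; multi-term queries are harder because a document with a low impact for one term can still rank highly overall.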




Tiered search
     Assuming we CAN prune effectively, while maintaining good
      search quality...

     [Diagram: one search box in front of three index tiers]
             RAM: 70% pruned
             SSD: 30% pruned
             HDD: 0% pruned
             ? (marked beside the tiers in the diagram)


Tiered search
     Assuming we CAN prune effectively, while maintaining good
      search quality...

     [Diagram: one search box per index tier]
             search box 1: RAM, 70% pruned
             search box 2: SSD, 30% pruned
             search box 3: HDD, 0% pruned
             ? (marked beside the tiers in the diagram)
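
One way to picture the single-search-box variant is a loop that tries the most heavily pruned tier first and falls through to the next tier when it cannot supply enough hits. The escalation rule used here (fewer hits than requested) is purely an assumption; the diagram leaves that question open with a "?". The tier contents are invented.

```python
def tiered_search(tiers, term, wanted):
    """tiers: list of (name, index), ordered from most pruned to unpruned."""
    hits = []
    for name, index in tiers:
        hits = index.get(term, [])
        if len(hits) >= wanted:
            return name, hits[:wanted]   # this tier is good enough, stop here
    return tiers[-1][0], hits            # even the full index has fewer hits


tiers = [
    ("RAM", {"lucene": [2, 5]}),                               # 70% pruned
    ("SSD", {"lucene": [2, 5, 3], "solr": [7]}),               # 30% pruned
    ("HDD", {"lucene": [2, 5, 3, 1, 4], "solr": [7, 8]}),      # 0% pruned
]

print(tiered_search(tiers, "lucene", 2))  # ('RAM', [2, 5])
print(tiered_search(tiers, "lucene", 3))  # ('SSD', [2, 5, 3])
print(tiered_search(tiers, "solr", 2))    # ('HDD', [7, 8])
```

Common queries that are fully answered from the small RAM tier never touch slower storage; only the less common ones pay the cost of the larger tiers.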

Bit-wise search
     Given a bit pattern query:
                 1010 1001 0101 0001

     Find best matching bit patterns in documents

     Applications:
         Fuzzy “fingerprinting”
         De-duplication
         Plagiarism detection

     BitwiseSearcher and Solr BitwiseField design
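
"Best matching bit patterns" can be read as smallest Hamming distance, which is what the sketch below assumes. It is a brute-force illustration using the query pattern from the slide, with invented document fingerprints; it does not describe how BitwiseSearcher or the Solr BitwiseField are actually designed.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(query, fingerprints, k):
    """Return the k document ids whose fingerprints are closest to the query."""
    ranked = sorted(fingerprints,
                    key=lambda doc_id: (hamming(query, fingerprints[doc_id]), doc_id))
    return ranked[:k]


query = 0b1010_1001_0101_0001
fingerprints = {
    "doc-a": 0b1010_1001_0101_0001,  # identical to the query
    "doc-b": 0b1010_1001_0101_0011,  # 1 bit differs
    "doc-c": 0b0101_0110_1010_1110,  # every one of the 16 bits differs
    "doc-d": 0b1010_1001_0000_0001,  # 2 bits differ
}

print(best_matches(query, fingerprints, 3))  # ['doc-a', 'doc-b', 'doc-d']
```

Near-duplicates show up as small distances, which is why this kind of matching suits de-duplication and plagiarism detection on fuzzy fingerprints.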

Massive indexing
     Map-reduce indexing models
         Google model
         Nutch model
         Modified Nutch model
         Hadoop contrib/indexing model

     Tradeoff analysis and recommendations
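
For readers unfamiliar with the general idea, here is a toy, single-process sketch of map-reduce style indexing: the map step assigns each document to a shard and emits its terms, and the reduce step builds one small inverted index per shard. It is not meant to reproduce any of the four models listed above; the sharding rule (CRC32 of the document id) and the sample documents are assumptions.

```python
import zlib
from collections import defaultdict


def map_phase(docs, num_shards):
    """Emit (shard, (term, doc_id)) pairs; the shard is chosen by hashing the doc id."""
    for doc_id, text in docs.items():
        shard = zlib.crc32(doc_id.encode("utf-8")) % num_shards
        for term in set(text.lower().split()):
            yield shard, (term, doc_id)


def reduce_phase(pairs):
    """Group the emitted pairs into one inverted index per shard."""
    shards = defaultdict(lambda: defaultdict(list))
    for shard, (term, doc_id) in pairs:
        shards[shard][term].append(doc_id)
    return {shard: {term: sorted(ids) for term, ids in index.items()}
            for shard, index in shards.items()}


docs = {"d1": "Lucene index", "d2": "Solr uses Lucene", "d3": "Hadoop index"}
shards = reduce_phase(map_phase(docs, num_shards=2))

# A search has to consult every shard and combine the postings.
found = sorted(d for index in shards.values() for d in index.get("lucene", []))
print(found)  # ['d1', 'd2']
```

Where the index-building work happens (in the map step, the reduce step, or a later merge) is the kind of choice that distinguishes real models and drives their trade-offs.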








            Solr in the Cloud



                                      Mark Miller
Some of the Complications?

                    Dealing with config files

                    Setting up high availability

                    Status of cluster

                    Reshaping/Rebalancing cluster




Improvements: High Level Goals

                        Improve...

                           Shared/Central Config

                           High Availability and Fault Tolerance

                           Cluster Resizing/Rebalancing

                           Open/Standard ZK schema

                           Cluster status

Enter Solr Cloud and ZooKeeper
    ZooKeeper is basically a highly available distributed filesystem

    Config and cluster state ‘live’ in ZooKeeper

    Solr is alerted to changes in cluster state by ZK

    Solr gets a built-in load-balancing implementation that can read
       cluster state from ZK

    Clients don’t need to know about shards - or can choose logical
       shards
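
The interaction between cluster state and load balancing can be illustrated without any ZooKeeper client at all. In the sketch below a plain dictionary stands in for the cluster state, and `on_state_change` stands in for the notification Solr would receive from ZK. The class, the state layout and the host names are all hypothetical and do not reflect the actual Solr Cloud schema.

```python
import itertools


class ClusterStateBalancer:
    """Round-robin over the live replicas of a logical shard (toy model)."""

    def __init__(self, cluster_state):
        self._state = cluster_state
        self._counter = itertools.count()

    def on_state_change(self, new_state):
        # In a real system this would be triggered by a change notification.
        self._state = new_state

    def pick(self, shard):
        live = [r["url"] for r in self._state[shard] if r["live"]]
        if not live:
            raise LookupError(f"no live replica for {shard}")
        return live[next(self._counter) % len(live)]


state = {"shard1": [
    {"url": "http://host1:8983/solr", "live": True},
    {"url": "http://host2:8983/solr", "live": True},
]}
balancer = ClusterStateBalancer(state)
first, second = balancer.pick("shard1"), balancer.pick("shard1")

# host1 goes down; the balancer is told and stops routing to it.
balancer.on_state_change({"shard1": [
    {"url": "http://host1:8983/solr", "live": False},
    {"url": "http://host2:8983/solr", "live": True},
]})
after_failure = balancer.pick("shard1")
print(first, second, after_failure)
```

The client only names the logical shard; which physical replica serves the request is decided from the shared state, which is what gives search-side fault tolerance.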


What’s Been Done So Far

                    A lot of ‘base’ work - ZooKeeper Mode

                    Shared/Central config

                    Built-in search-side fault tolerance

                    Very simple cluster status




The Future?

                    Index-side fault tolerance

                    Cluster resizing/rebalancing/elasticity

                    More Solr/ZK tools?

                    Lots of other little fun improvements




Practical Relevance


                                Grant Ingersoll

                          Apache Lucene EuroCon 2010
                            Prague, Czech Republic

Why Tune Relevance?
       Better search results = Less time searching, more time acting



       Less time searching = Happier, more effective users



       Happier, more effective users = $, €, £, Kč (earned/saved)



       $, €, £, Kč (earned/saved) = Big fat raise for you!


Testing Relevance
         A/B testing
         Log Analysis
         Empirical
          Top 50 queries, plus random sample

         Ask
          Ratings/Reviews
          Focus Groups

         Also: Ad Hoc, TREC, etc.
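
The "top 50 queries, plus random sample" approach could be scripted against a query log along these lines. The function name, the log format (one query string per entry) and the sample sizes are assumptions for the example.

```python
import random
from collections import Counter


def evaluation_queries(query_log, top_k=50, sample_size=20, seed=0):
    """Pick the most frequent queries plus a random sample of the remaining ones."""
    counts = Counter(query_log)
    top = [query for query, _ in counts.most_common(top_k)]
    rest = sorted(set(counts) - set(top))
    rng = random.Random(seed)            # fixed seed keeps the sample repeatable
    sample = rng.sample(rest, min(sample_size, len(rest)))
    return top, sample


log = ["solr"] * 5 + ["lucene"] * 3 + ["nutch", "hadoop", "tika", "mahout"]
top, sample = evaluation_queries(log, top_k=2, sample_size=2)
print(top)     # ['solr', 'lucene']
print(sample)  # two of the remaining, less frequent queries
```

The head queries cover most of the traffic, while the random sample keeps the evaluation from overfitting to them.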




Understand your…
  Domain
          Types of documents
          Languages present
          Document structures, metadata and other features
          Lexical resources: jargon, synonyms, abbreviations...
          Relationships between documents
  Users
          Sophistication/Expertise
          Search and Discovery needs
          Known Item vs. Keyword
  Tolerance for Pain
          Managers
          Business Interests
          Release cycles
          Obsession with finding the one true relevance model
          (hint, it doesn’t exist)
          “explain() blindness”
Phrases
       Almost always a win to automatically add phrase query
        variations to all multiword queries
              Even better to detect key phrases

       In Solr, with the Dismax handler, use the &pf and &ps options
        to automatically add phrase boosts
       Using a large slop factor can simulate an AND query while
        rewarding close proximity
       See also the ComplexPhraseQuery in contrib/queryparser
       Consider SpanQuery and derivatives
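
As a sketch of the Dismax suggestion, the request below adds phrase boosts with `pf` (phrase fields) and `ps` (phrase slop). The field names `title` and `body`, the boost values and the slop of 3 are invented for the example; selecting the handler with `defType=dismax` is one common way to do it and may differ in a given setup.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "defType": "dismax",
    "q": "apache lucene",
    "qf": "title body",        # fields searched for the individual terms
    "pf": "title^10 body^5",   # extra boost when the terms occur as a phrase
    "ps": "3",                 # slop allowed for that phrase match
}
query_string = urlencode(params)
print(query_string)
```

With a small `ps` only near-exact phrases are rewarded; raising it moves toward the behaviour described above, where any document containing all the terms gets some boost and closer terms get more.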
Resources
       ACM SIGIR - http://sigir.org/

       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search

       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr

       Open Relevance Project:
        http://lucene.apache.org/openrelevance


Q&A
                                      SLIDES POSTED AT:
                                       BIT.LY/EXPERTS1








                             Thank You

Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Lucene and Solr Experts Round Table

  • 1. Apache Lucene Eurocon: Preview www.lucene-eurocon.org Apache Lucene EuroCon 20 May 2010
  • 2. Overview
      • Introduction
      • Near Real Time Search: Yonik Seeley
      • Munching & Crunching: Andrzej Białecki
      • Solr in the Cloud: Mark Miller
      • Practical Relevance: Grant Ingersoll
      • Q&A
      A link to download these slides will be available after the webcast is complete. An on-demand replay will be ready in ~48 hours.
  • 3. Near Real Time Search (Yonik Seeley)
  • 4. Near Real-Time Search
      • Shorter times until updates are searchable/visible
      • Lucene 2.9 first laid the groundwork with per-segment searching
          • Per-segment FieldCache entries for sorting and FunctionQueries
          • NRT IndexWriter.getReader(): makes new segments available before merging is done in the background; doesn’t cause a commit/fsync first
      • Solr still needs
          • Per-segment faceting
          • Per-segment caching
          • Per-segment statistics (and anything else that uses FieldCache)
  • 5. Existing single-valued faceting algorithm
      • Request: q=Juggernaut&facet=true&facet.field=hero
      • Documents matching the base query “Juggernaut”: 0, 2, 7
      • Lucene FieldCache entry (StringIndex) for the “hero” field
          • order: for each doc, an index into the lookup array
          • lookup: the string values: (null), batman, flash, spiderman, superman, wolverine
      • accumulator: one counter per lookup entry, incremented for each matching doc
      [Figure: the order, lookup, and accumulator arrays for this example]
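The counting loop on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not Solr's implementation: the function name `facet_counts` and the contents of the `order` array are assumptions made up for the example, and only the `order`/`lookup`/accumulator roles come from the slide.

```python
from typing import Dict, Iterable, List, Optional


def facet_counts(order: List[int],
                 lookup: List[Optional[str]],
                 matching_docs: Iterable[int]) -> Dict[str, int]:
    """Count matching documents per term with a StringIndex-like structure.

    order[doc] is an index into lookup; lookup[0] is None and stands for
    documents that have no value in the field.
    """
    accumulator = [0] * len(lookup)      # one counter per unique term
    for doc in matching_docs:
        accumulator[order[doc]] += 1     # the "increment" step from the slide
    # Report only real terms that were actually hit.
    return {term: count
            for term, count in zip(lookup, accumulator)
            if term is not None and count > 0}


# Illustrative data (assumed): six lookup slots, eight documents.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]
counts = facet_counts(order, lookup, matching_docs=[0, 2, 7])
print(counts)
```

The point of the design is that counting never touches strings: each matching document costs one array read and one increment, and the strings are only consulted when the result is reported.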
  • 6. Per-segment single-valued faceting algorithm
      [Figure: the base DocSet is spread over Segment1 to Segment4. Each segment has its own FieldCache entry and its own accumulator (accumulator1 to accumulator4), and is processed by its own thread (thread1 to thread4) using the same inc/lookup steps. The per-segment results feed a FieldCache + accumulator merger (priority queue), which fills a priority queue of results such as “flash, 5” and “Batman, 3”.]
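A rough Python sketch of the per-segment flow on slide 6 follows. It assumes that each segment's `lookup` array is sorted by term, so that the per-segment results can be merged in term order with a heap; the function names, the segment data, and the use of `heapq.merge` are illustrative choices, not the actual Solr code.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def count_segment(segment, matching_docs):
    """Count one segment; returns (term, count) pairs in lookup (term) order."""
    order, lookup = segment
    accumulator = [0] * len(lookup)
    for doc in matching_docs:            # segment-local doc ids
        accumulator[order[doc]] += 1
    return [(lookup[i], c) for i, c in enumerate(accumulator)
            if c and lookup[i] is not None]


def per_segment_facets(segments, docs_per_segment, top_n=10, threads=4):
    # One counting task per segment, run on a small thread pool.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, docs_per_segment))
    # Merge the sorted per-segment lists in term order, then sum equal terms.
    merged = heapq.merge(*partials, key=itemgetter(0))
    totals = [(term, sum(c for _, c in group))
              for term, group in groupby(merged, key=itemgetter(0))]
    # Keep the top_n terms by count.
    return heapq.nlargest(top_n, totals, key=itemgetter(1))


# Two made-up segments, each with its own order and (sorted) lookup arrays.
segments = [
    ([1, 2, 2, 0], [None, "batman", "flash"]),
    ([1, 2, 2], [None, "flash", "wolverine"]),
]
docs_per_segment = [[0, 1, 2], [0, 1, 2]]
top = per_segment_facets(segments, docs_per_segment, top_n=2)
print(top)
```

Because ordinals are only meaningful inside one segment, the merge has to work on term strings. That is the extra step the single-index algorithm avoids.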
  • 7. Per-segment faceting
      • Enable with facet.method=fcs
      • Controllable multi-threading: facet.field={!threads=4}myfield
      • Disadvantages
          • Larger memory use (FieldCaches + accumulators)
          • Slower (extra FieldCache merge step needed)
      • Advantages
          • Rebuilds FieldCache entries only for new segments (NRT friendly)
          • Multi-threaded
  • 8. Per-segment faceting performance comparison
      Test index: 10M documents, 18 segments, single-valued field

      A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
         Time for request*         facet.method=fc    facet.method=fcs
         static index              3 ms               244 ms
         quickly changing index    1388 ms            267 ms

      B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
         Time for request*         facet.method=fc    facet.method=fcs
         static index              26 ms              34 ms
         quickly changing index    741 ms             94 ms

      *complete request time, measured externally
  • 9. Munching & Crunching: Lucene index post-processing and applications. Andrzej Białecki <andrzej.bialecki@lucidimagination.com>
  • 10. Munching & Crunching: Agenda
      • Post-processing
      • Splitting, merging, sorting, pruning
      • Tiered search
      • Bitwise search
      • Map-reduce indexing models
  • 11. Post-processing
      • Isn't it better to build it right from the start?
      • Some parameters are difficult to get right...
          • Minimizing index size while retaining search quality
          • Correcting impact of unexpected common words
          • Creating evenly-sized shards
      • ...perhaps impossible to get at all during indexing
          • Adding collection-wide factors not computed by Lucene (e.g. avg. length)
          • Optimizing top-N results for common queries
          • Fitting too large indexes in RAM
  • 12. Merging, splitting, sorting, pruning
      • Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter
      • Sorting postings by impact and “early termination” search
      • Index pruning
          • What data to remove and how?
          • Pruning strategies
          • Challenges
  • 13. Tiered search
      • Assuming we CAN prune effectively, while maintaining good search quality...
      [Figure: one search box over three index tiers: RAM, 70% pruned; SSD, 30% pruned; HDD, 0% pruned. The diagram also carries a “?”.]
  • 14. Tiered search
      • Assuming we CAN prune effectively, while maintaining good search quality...
      [Figure: search box 1 over RAM, 70% pruned; search box 2 over SSD, 30% pruned; search box 3 over HDD, 0% pruned. The diagram also carries a “?”.]
  • 15. Bit-wise search
      • Given a bit pattern query: 1010 1001 0101 0001
      • Find best matching bit patterns in documents
      • Applications
          • Fuzzy “fingerprinting”
          • De-duplication
          • Plagiarism detection
      • BitwiseSearcher and Solr BitwiseField design
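One way to read "best matching bit patterns" is as a nearest-neighbour search under Hamming distance (the number of differing bits). The slide does not say which measure BitwiseSearcher uses, so the sketch below is an assumption throughout: the function names, the document fingerprints, and the brute-force scan are all made up for illustration.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(query: int, fingerprints: Dict[str, int],
                 top_n: int = 3) -> List[str]:
    """Return document names ordered by increasing distance to the query."""
    ranked = sorted(fingerprints.items(),
                    key=lambda item: (hamming_distance(query, item[1]), item[0]))
    return [name for name, _ in ranked[:top_n]]


# The query pattern from the slide, plus three hypothetical documents.
query = 0b1010_1001_0101_0001
fingerprints = {
    "doc-a": 0b1010_1001_0101_0001,   # identical to the query
    "doc-b": 0b1010_1001_0101_0011,   # differs in one bit
    "doc-c": 0b0101_0110_1010_1110,   # bitwise complement of the query
}
print(best_matches(query, fingerprints, top_n=2))
```

A linear scan like this is only practical for small collections. An index-backed searcher would need a way to avoid comparing the query against every document.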
  • 16. Massive indexing
      • Map-reduce indexing models
          • Google model
          • Nutch model
          • Modified Nutch model
          • Hadoop contrib/indexing model
      • Tradeoff analysis and recommendations
  • 17. Solr in the Cloud (Mark Miller)
  • 19. Some of the Complications?
      • Dealing with config files
      • Setting up high availability
      • Status of cluster
      • Reshaping/Rebalancing cluster
  • 20. Improvements: High Level Goals. Improve...
      • Shared/Central Config
      • High Availability and Fault Tolerance
      • Cluster Resizing/Rebalancing
      • Open/Standard ZK schema
      • Cluster status
  • 21. Enter Solr Cloud and ZooKeeper
      • ZooKeeper is basically a highly available distributed filesystem
      • Config and cluster state ‘live’ in ZooKeeper
      • Solr is alerted to changes in cluster state by ZK
      • Solr gets a built-in load balancing implementation that can read cluster state from ZK
      • Clients don’t need to know about shards, or can choose logical shards
  • 22. What’s Been Done So Far
      • A lot of ‘base’ work: ZooKeeper Mode
      • Shared/Central config
      • Built-in search-side fault tolerance
      • Very simple cluster status
  • 23. The Future?
      • Index-side fault tolerance
      • Cluster resizing/rebalancing/elasticity
      • More Solr/ZK tools?
      • Lots of other little fun improvements
  • 24. Practical Relevance. Grant Ingersoll. Apache Lucene EuroCon 2010, Prague, Czech Republic
  • 25. Why Tune Relevance?
      • Better search results = Less time searching, more time acting
      • Less time searching = Happier, more effective users
      • Happier, more effective users = $, €, £, Kč (earned/saved)
      • $, €, £, Kč (earned/saved) = Big fat raise for you!
  • 26. Testing Relevance
      • A/B testing
      • Log Analysis
      • Empirical
      • Top 50 queries, plus random sample
      • Ask
      • Ratings/Reviews
      • Focus Groups
      • Also: Ad Hoc, TREC, etc.
  • 27. Understand your…
      • Domain
          • Types of documents
          • Languages present
          • Document structures, metadata and other features
          • Lexical resources: jargon, synonyms, abbreviations...
          • Relationships between documents
      • Users
          • Sophistication/Expertise
          • Search and Discovery needs
          • Known Item vs. Keyword
      • Tolerance for Pain
          • Managers
          • Business Interests
          • Release cycles
          • Obsession in finding the one true relevance model (hint, it doesn’t exist)
          • “explain() blindness”
  • 28. Phrases
      • Almost always a win to automatically add phrase query variations to all multiword queries
      • Even better to detect key phrases
      • In Solr, with the Dismax handler, use the &pf and &ps options to automatically add phrase boosts
      • Using a large slop factor can simulate an AND query while rewarding close proximity
      • See also the ComplexPhraseQuery in contrib/queryparser
      • Consider SpanQuery and derivatives
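As a small illustration of the Dismax advice on slide 28, the Python sketch below builds the query string for a request that adds phrase boosts through pf and ps. The parameter names (defType, q, qf, pf, ps) are Solr's dismax parameters; the field names `title` and `body`, the boost `^2`, the slop value, and the helper `dismax_params` are assumptions chosen for the example.

```python
from typing import Dict, List
from urllib.parse import urlencode


def dismax_params(user_query: str,
                  query_fields: List[str],
                  phrase_fields: List[str],
                  phrase_slop: int) -> Dict[str, str]:
    """Build request parameters for a dismax query with phrase boosting."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(query_fields),     # fields the individual terms are matched in
        "pf": " ".join(phrase_fields),    # fields in which the whole query is boosted as a phrase
        "ps": str(phrase_slop),           # slop allowed for those phrase matches
    }


# Hypothetical field names and boost values.
params = dismax_params("near real time search",
                       query_fields=["title", "body"],
                       phrase_fields=["title^2", "body"],
                       phrase_slop=3)
query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended to a Solr select URL. Raising `ps` loosens the phrase requirement, which is how a large slop ends up rewarding proximity rather than exact phrases.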
  • 29. Resources
      • ACM SIGIR: http://sigir.org/
      • http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search
      • http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr
      • Open Relevance Project: http://lucene.apache.org/openrelevance
  • 30. Q&A. Slides posted at: BIT.LY/EXPERTS1
  • 31. Thank You