Apache Lucene EuroCon: Preview
                         www.lucene-eurocon.org



Apache Lucene EuroCon                              20 May 2010
Overview

  • Introduction

  • Near Real Time Search: Yonik Seeley

  • Munching & Crunching: Andrzej Białecki

  • Solr in the Cloud: Mark Miller

  • Practical Relevance: Grant Ingersoll

  • Q&A

Near Real Time Search


                               Yonik Seeley


Near Real-Time Search
        Shorter times until updates are searchable/visible

        Lucene 2.9 first laid the groundwork w/ per-segment searching
                Per-segment FieldCache entries for sorting and FunctionQueries
                NRT IndexWriter.getReader()
                        Makes new segments available before background merging is done
                        Doesn’t require a commit/fsync first

        Solr still needs
                Per-segment faceting
                Per-segment caching
                Per-segment statistics (and anything else that uses FieldCache)


Existing single-valued faceting algorithm
        Request: q=Juggernaut&facet=true&facet.field=hero

        Documents matching the base query “Juggernaut”: 0, 2, 7

        Lucene FieldCache entry (StringIndex) for the “hero” field
                order (for each doc, an index into the lookup array): 5, 3, 5, 1, 4, 5, 2, 1
                lookup (the string values): (null), batman, flash, spiderman, superman, wolverine

        For each matching doc, look up its order value and increment that accumulator slot
                accumulator: 0, 1, 0, 0, 0, 2
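
The counting loop on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not Solr's implementation: the arrays reuse the values shown on the slide, and the function names `count_ords` and `to_facets` are made up for the example.

```python
# Values from the slide: FieldCache (StringIndex) entry for the "hero" field.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]   # order[doc] is an index into lookup
base_docs = [0, 2, 7]              # docs matching the base query "Juggernaut"


def count_ords(base_docs, order, num_terms):
    """Increment one accumulator slot per matching document."""
    accumulator = [0] * num_terms
    for doc in base_docs:
        accumulator[order[doc]] += 1
    return accumulator


def to_facets(accumulator, lookup):
    """Turn non-zero slots into term -> count, skipping the null slot 0."""
    return {lookup[i]: n for i, n in enumerate(accumulator) if i > 0 and n > 0}


accumulator = count_ords(base_docs, order, len(lookup))
facets = to_facets(accumulator, lookup)
print(accumulator)  # [0, 1, 0, 0, 0, 2]
print(facets)       # {'batman': 1, 'wolverine': 2}
```

The cost per request is one array read and one increment per matching document, which is why this method is fast on a static index but has to rebuild the whole FieldCache entry when the index changes.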
Per-segment single-valued faceting algorithm
        Base DocSet (docs 0, 2, 7) is looked up against each segment
        Segment1 to Segment4: each has its own FieldCache entry, accumulator and thread
                accumulator1 (thread1): 0, 3, 5, 0, 1, 2
                accumulator2 (thread2): 0, 2, 1, 0
                accumulator3 (thread3): 1, 3, 0, 4
                accumulator4 (thread4): 0, 1, 0
        FieldCache + accumulator merger (priority queue) combines the per-segment counts
                Priority queue: flash, 5; Batman, 3
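
A minimal Python sketch of the per-segment idea follows. It is an assumption-laden illustration, not Solr's code: the segment contents are invented (they do not reproduce the numbers on the slide), document ids are taken to be segment-local, and `heapq.merge` stands in for the priority-queue merger.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter

# Invented example data: one FieldCache-like entry per segment.
# lookup[0] is the "no value" slot; the remaining terms are sorted.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 1], "docs": [0, 1, 2]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2], "docs": [0, 1, 2]},
    {"lookup": [None, "batman", "wolverine"], "order": [1, 0, 2, 1], "docs": [0, 1, 3]},
]


def count_segment(segment):
    """Count one segment; returns (term, count) pairs in term order."""
    lookup, order = segment["lookup"], segment["order"]
    acc = [0] * len(lookup)
    for doc in segment["docs"]:
        acc[order[doc]] += 1
    return [(lookup[i], acc[i]) for i in range(1, len(lookup)) if acc[i]]


def facet_per_segment(segments, threads=4, limit=10):
    # One counting task per segment, run on a small thread pool.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(count_segment, segments))
    # Merge the term-sorted lists and sum counts of equal terms.
    merged = heapq.merge(*per_segment, key=itemgetter(0))
    totals = [(term, sum(count for _, count in group))
              for term, group in groupby(merged, key=itemgetter(0))]
    # Highest counts first, ties broken by term.
    return sorted(totals, key=lambda tc: (-tc[1], tc[0]))[:limit]


top = facet_per_segment(segments, threads=4, limit=2)
print(top)  # [('flash', 4), ('batman', 3)]
```

Because ordinals differ between segments, the merge has to compare term strings rather than ordinals, which is the extra step the single-entry method avoids.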
Per-segment faceting
          Enable with facet.method=fcs

          Controllable multi-threading
                  facet.field={!threads=4}myfield

          Disadvantages
                  Larger memory use (FieldCaches + accumulators)
                  Slower (extra FieldCache merge step needed)

          Advantages
                  Rebuilds FieldCache entries only for new segments (NRT friendly)
                  Multi-threaded
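
As a sketch of how the two parameters on this slide fit into a request, the snippet below builds a query string with Python's standard library. The base URL, query and field name are placeholders chosen for the example; only `facet.method=fcs` and the `{!threads=4}` local parameter come from the slide.

```python
from urllib.parse import urlencode


def facet_request_url(base_url, query, field, threads=None):
    """Build a facet request that uses per-segment faceting (fcs)."""
    # Optional local parameter, written as on the slide: {!threads=4}myfield
    local = "{!threads=%d}" % threads if threads else ""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.method", "fcs"),
        ("facet.field", local + field),
    ]
    return base_url + "?" + urlencode(params)


# Placeholder host and handler path, assumed for illustration only.
url = facet_request_url("http://localhost:8983/solr/select", "Juggernaut", "myfield", threads=4)
print(url)
```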

Per-segment faceting performance comparison
        Test index: 10M documents, 18 segments, single valued field

        Test A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

        Time for request*             facet.method=fc           facet.method=fcs
        static index                  3 ms                      244 ms
        quickly changing index        1388 ms                   267 ms

        Test B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

        Time for request*             facet.method=fc           facet.method=fcs
        static index                  26 ms                     34 ms
        quickly changing index        741 ms                    94 ms

        *complete request time, measured externally
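
To make the trade-off in these tables easier to read, the short calculation below divides the fc time by the fcs time for each row. The timings are copied from the tables; the dictionary layout and names are just for this example. A ratio above 1 means fcs was faster.

```python
# Request times in ms, copied from tests A and B above.
timings_ms = {
    ("A", "static"): {"fc": 3, "fcs": 244},
    ("A", "changing"): {"fc": 1388, "fcs": 267},
    ("B", "static"): {"fc": 26, "fcs": 34},
    ("B", "changing"): {"fc": 741, "fcs": 94},
}

# fc time divided by fcs time, rounded to two decimals.
ratios = {case: round(t["fc"] / t["fcs"], 2) for case, t in timings_ms.items()}

for (test, index_state), ratio in ratios.items():
    print(f"test {test}, {index_state} index: fc/fcs = {ratio}")
```

On the quickly changing index fcs is roughly 5 to 8 times faster, while on the static index fc wins, by a wide margin in test A where there are many unique terms and few matching documents.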




                        Munching & Crunching
                        Lucene index post-processing and applications




                                      Andrzej Białecki

                          <andrzej.bialecki@lucidimagination.com>

Munching & Crunching Agenda
                 Post-processing
                        Splitting, merging, sorting, pruning

                 Tiered search

                 Bitwise search

                 Map-reduce indexing models




Post-processing
     Isn't it better to build it right from the start?

     Some parameters are difficult to get right...
                  Minimizing index size while retaining search quality
                  Correcting impact of unexpected common words
                  Creating evenly-sized shards

      ...perhaps impossible to get at all during indexing
                  Adding collection-wide factors not computed by Lucene (e.g. avg. length)
                  Optimizing top-N results for common queries
                  Fitting too large indexes in RAM


Merging, splitting, sorting, pruning
     Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter 

     Sorting postings by impact and “early termination” search

     Index pruning:
         What data to remove and how?
         Pruning strategies
         Challenges




Tiered search
     Assuming we CAN prune effectively, while maintaining good
      search quality...
     [Diagram: a single search box in front of three index tiers: RAM (70% pruned), SSD (30% pruned), HDD (0% pruned); a “?” is drawn next to the tiers]


Tiered search
     Assuming we CAN prune effectively, while maintaining good
      search quality...
     [Diagram: three search boxes, one per tier: search box 1 on RAM (70% pruned), search box 2 on SSD (30% pruned), search box 3 on HDD (0% pruned); a “?” is drawn next to the tiers]
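
One possible reading of the single-search-box layout is sketched below in Python. The slides leave the fall-through criterion open (the “?”), so the rule used here, moving to the next tier when the current one returns fewer than k hits, is purely an assumption. The tiers are hypothetical in-memory dictionaries, not real Lucene indexes.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the tiers: each maps a term to the doc ids
# that survived pruning in that tier. Heavier pruning means fewer doc ids.
Tier = Tuple[str, Dict[str, List[int]]]

tiers: List[Tier] = [
    ("RAM", {"lucene": [1, 4]}),
    ("SSD", {"lucene": [1, 4, 9], "solr": [2]}),
    ("HDD", {"lucene": [1, 4, 9, 12], "solr": [2, 3, 8]}),
]


def tiered_search(tiers: List[Tier], term: str, k: int) -> Tuple[str, List[int]]:
    """Query tiers from most pruned to unpruned.

    ASSUMED rule (not from the slides): stop at the first tier that
    returns at least k hits; otherwise use whatever the last tier has.
    """
    name, hits = "", []
    for name, index in tiers:
        hits = index.get(term, [])
        if len(hits) >= k:
            break
    return name, hits[:k]


print(tiered_search(tiers, "lucene", 2))  # ('RAM', [1, 4])
print(tiered_search(tiers, "lucene", 3))  # ('SSD', [1, 4, 9])
print(tiered_search(tiers, "solr", 2))    # ('HDD', [2, 3])
```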

Bit-wise search
     Given a bit pattern query:
                 1010 1001 0101 0001

     Find best matching bit patterns in documents

     Applications:
         Fuzzy “fingerprinting”
         De-duplication
         Plagiarism detection

     BitwiseSearcher and Solr BitwiseField design
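
The slide does not say how “best matching” is measured. The sketch below assumes Hamming distance (the number of differing bits) as the measure and ranks a few invented document fingerprints against the query pattern from the slide. It illustrates the idea only and is unrelated to the actual BitwiseSearcher design.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(query: int, fingerprints: dict, top_n: int):
    """Return (doc, distance) pairs, closest first, ties broken by doc name."""
    scored = [(doc, hamming(query, bits)) for doc, bits in fingerprints.items()]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))[:top_n]


query = 0b1010_1001_0101_0001  # bit pattern from the slide

# Invented 16-bit fingerprints for illustration.
fingerprints = {
    "doc1": 0b1010_1001_0101_0001,  # identical to the query
    "doc2": 0b1010_1001_0101_0011,  # one bit flipped
    "doc3": 0b0101_0110_1010_1110,  # every bit flipped
    "doc4": 0b1010_1001_0000_0001,  # two bits flipped
}

print(best_matches(query, fingerprints, 3))
# [('doc1', 0), ('doc2', 1), ('doc4', 2)]
```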

Massive indexing
     Map-reduce indexing models
         Google model
         Nutch model
         Modified Nutch model
         Hadoop contrib/indexing model

     Tradeoff analysis and recommendations








            Solr in the Cloud



                                      Mark Miller
Some of the Complications?

                    Dealing with config files

                    Setting up high availability

                    Status of cluster

                    Reshaping/Rebalancing cluster




Improvements: High Level Goals

                        Improve...

                           Shared/Central Config

                           High Availability and Fault Tolerance

                           Cluster Resizing/Rebalancing

                           Open/Standard ZK schema

                           Cluster status

Enter Solr Cloud and ZooKeeper
    ZooKeeper is basically a highly available distributed filesystem

    Config and cluster state ‘live’ in ZooKeeper

    Solr is alerted to changes in cluster state by ZK

    Solr gets a built-in load balancing implementation that can read cluster state from ZK

    Clients don’t need to know about shards - or can choose logical shards


What’s Been Done So Far

                    A lot of ‘base’ work - ZooKeeper Mode

                    Shared/Central config

                    Built-in search-side fault tolerance

                    Very simple cluster status




The Future?

                    Index-side fault tolerance

                    Cluster resizing/rebalancing/elasticity

                    More Solr/ZK tools?

                    Lots of other little fun improvements




Practical Relevance


                                Grant Ingersoll

                          Apache Lucene EuroCon 2010
                            Prague, Czech Republic

Why Tune Relevance?
       Better search results = Less time searching, more time acting



       Less time searching = Happier, more effective users



       Happier, more effective users = $, €, £, Kč (earned/saved)



       $, €, £, Kč (earned/saved) = Big fat raise for you!


Testing Relevance
         A/B testing
         Log Analysis
         Empirical
          Top 50 queries, plus random sample

         Ask
          Ratings/Reviews
          Focus Groups

         Also: Ad Hoc, TREC, etc.
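
The “top 50 queries, plus random sample” selection can be sketched as below. The function name, the sample size and the idea of reading queries from a flat list are assumptions made for the example; the slide only names the selection rule.

```python
import random
from collections import Counter


def evaluation_queries(query_log, top_n=50, sample_size=20, seed=0):
    """Pick the most frequent queries plus a random sample of the rest."""
    counts = Counter(query_log)
    top = [query for query, _ in counts.most_common(top_n)]
    # Sample from the remaining distinct queries; seeded for repeatability.
    rest = sorted(set(counts) - set(top))
    rng = random.Random(seed)
    sample = rng.sample(rest, min(sample_size, len(rest)))
    return top, sample


# Tiny invented log for illustration.
log = ["solr"] * 5 + ["lucene"] * 3 + ["nutch", "tika", "mahout"]
top, sample = evaluation_queries(log, top_n=2, sample_size=2)
print(top)     # ['solr', 'lucene']
print(sample)  # two of the remaining queries
```

The frequent queries cover what most users see, while the random sample keeps rare queries from being ignored when relevance changes are judged.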




Understand your…
  Domain
          Types of documents
          Languages present
          Document structures, metadata and other features
          Lexical resources: jargon, synonyms, abbreviations...
          Relationships between documents
  Users
          Sophistication/Expertise
          Search and Discovery needs
          Known Item vs. Keyword
  Tolerance for Pain
          Managers
          Business Interests
          Release cycles
          Obsession with finding the one true relevance model (hint, it doesn’t exist)
          “explain() blindness”
Phrases
       Almost always a win to automatically add phrase query
        variations to all multiword queries
              Even better to detect key phrases

       In Solr, with the Dismax handler, use the &pf and &ps options
        to automatically add phrase boosts
       Using a large slop factor can simulate an AND query while
        rewarding close proximity
       See also the ComplexPhraseQuery in contrib/queryparser
       Consider SpanQuery and derivatives
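
A Dismax request with phrase boosting might be built as in the sketch below. The `qf`, `pf` and `ps` parameters are Dismax options; the base URL, the field names `title` and `body`, and the slop value of 100 are placeholders chosen for the example, not recommendations from the talk.

```python
from urllib.parse import urlencode


def dismax_phrase_url(base_url, user_query, fields, phrase_slop):
    """Build a Dismax request that also boosts phrase matches."""
    params = [
        ("defType", "dismax"),
        ("q", user_query),
        ("qf", " ".join(fields)),   # fields searched term by term
        ("pf", " ".join(fields)),   # fields where the whole query is tried as a phrase
        ("ps", str(phrase_slop)),   # slop allowed for the pf phrase match
    ]
    return base_url + "?" + urlencode(params)


# Placeholder host, handler path and field names, assumed for illustration.
url = dismax_phrase_url(
    "http://localhost:8983/solr/select",
    "near real time search",
    ["title", "body"],
    phrase_slop=100,
)
print(url)
```

With a large `ps`, documents that contain all the query words anywhere near each other receive the phrase boost, which is the "simulate an AND query while rewarding close proximity" effect described on the slide.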
Resources
       ACM SIGIR - http://sigir.org/

       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search

       http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr

       Open Relevance Project:
        http://lucene.apache.org/openrelevance


Q&A
                                      SLIDES POSTED AT:
                                       BIT.LY/EXPERTS1








                             Thank You

08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Lucene and Solr Experts Round Table

  • 1. Apache Lucene Eurocon: Preview www.lucene-eurocon.org Apache Lucene EuroCon 20 May 2010
  • 2. Overview
      ◦ Introduction
      ◦ Near Real Time Search: Yonik Seeley
      ◦ Munching & Crunching: Andrzej Białecki
      ◦ Solr in the Cloud: Mark Miller
      ◦ Practical Relevance: Grant Ingersoll
      ◦ Q&A
      A link to download these slides will be available after the webcast is complete. An on-demand replay will be ready in ~48 hours.
  • 3. Near Real Time Search: Yonik Seeley
  • 4. Near Real-Time Search
      ◦ Shorter times until updates are searchable/visible
      ◦ Lucene 2.9 first laid the groundwork w/ per-segment searching
          ▪ Per-segment FieldCache entries for sorting and FunctionQueries
          ▪ NRT IndexWriter.getReader(): makes new segments available before merging is done in the background, and doesn't cause a commit/fsync first
      ◦ Solr still needs
          ▪ Per-segment faceting
          ▪ Per-segment caching
          ▪ Per-segment statistics (and anything else that uses FieldCache)
  • 5. Existing single-valued faceting algorithm
      ◦ Request: q=Juggernaut&facet=true&facet.field=hero
      ◦ [Diagram: the documents matching the base query "Juggernaut" are looked up in the Lucene FieldCache entry (StringIndex) for the "hero" field. "order" holds, for each doc, an index into the lookup array; "lookup" holds the string values: (null), batman, flash, spiderman, superman, wolverine. For each matching doc, the accumulator slot at that index is incremented.]
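The counting loop on this slide can be sketched in a few lines of Python. This is only an illustration of the order/lookup/accumulator idea, not Solr's implementation: the function name `facet_counts` and the contents of `order` and the matching doc ids are made-up sample data, since the numbers in the slide's diagram did not survive extraction.

```python
from typing import Dict, List, Optional, Sequence


def facet_counts(
    doc_ids: Sequence[int],
    order: Sequence[int],
    lookup: Sequence[Optional[str]],
) -> Dict[str, int]:
    """Count facet values for the docs in doc_ids.

    order[doc] is an index into lookup; lookup[0] is the null slot
    used for documents that have no value in the field.
    """
    accumulator: List[int] = [0] * len(lookup)
    for doc in doc_ids:
        accumulator[order[doc]] += 1  # one increment per matching doc
    return {
        term: count
        for term, count in zip(lookup, accumulator)
        if term is not None and count > 0
    }


# Sample data (assumed, for illustration only).
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [2, 3, 2, 0, 1, 2, 4, 1]  # one entry per document in the index
base_docs = [0, 2, 7]  # docs matching the base query

counts = facet_counts(base_docs, order, lookup)
```

The cost is one array read and one increment per matching document, which is why this approach is fast on a static index; the expensive part is building `order` and `lookup` for the whole index after it changes.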
  • 6. Per-segment single-valued faceting algorithm
      ◦ [Diagram: the base DocSet is spread over Segment1 to Segment4. Each segment has its own FieldCache entry and its own accumulator (accumulator1 to accumulator4), filled by its own thread (thread1 to thread4) with the same lookup-and-increment step. A FieldCache + accumulator merger (priority queue) combines the per-segment counts into the final result, e.g. flash, 5 and Batman, 3.]
  • 7. Per-segment faceting
      ◦ Enable with facet.method=fcs
      ◦ Controllable multi-threading: facet.field={!threads=4}myfield
      ◦ Disadvantages
          ▪ Larger memory use (FieldCaches + accumulators)
          ▪ Slower (extra FieldCache merge step needed)
      ◦ Advantages
          ▪ Rebuilds FieldCache entries only for new segments (NRT friendly)
          ▪ Multi-threaded
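A rough Python sketch of the per-segment approach follows, assuming each segment carries its own sorted lookup array with segment-local indexes, so counts have to be merged by term rather than by index. The names `count_segment`, `merge_counts` and `facet_per_segment` and the sample segments are hypothetical, and `heapq.merge` stands in for the priority-queue merger in the diagram; Solr's actual code may differ.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence, Tuple

# A segment here is (matching local doc ids, order array, sorted lookup array).
Segment = Tuple[Sequence[int], Sequence[int], Sequence[Optional[str]]]
TermCount = Tuple[str, int]


def count_segment(segment: Segment) -> List[TermCount]:
    """Count facet values in one segment using its own accumulator."""
    docs, order, lookup = segment
    accumulator = [0] * len(lookup)
    for doc in docs:
        accumulator[order[doc]] += 1
    # lookup is sorted and slot 0 is the null slot, so the output is sorted by term.
    return [(lookup[i], accumulator[i]) for i in range(1, len(lookup)) if accumulator[i]]


def merge_counts(per_segment: List[List[TermCount]]) -> List[TermCount]:
    """Merge sorted per-segment counts, summing counts of equal terms."""
    merged: List[TermCount] = []
    for term, count in heapq.merge(*per_segment):
        if merged and merged[-1][0] == term:
            merged[-1] = (term, merged[-1][1] + count)
        else:
            merged.append((term, count))
    return merged


def facet_per_segment(segments: List[Segment], threads: int = 4) -> List[TermCount]:
    """Count each segment on its own thread, then merge; highest counts first."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(count_segment, segments))
    return sorted(merge_counts(per_segment), key=lambda tc: (-tc[1], tc[0]))


# Sample data (assumed, for illustration only).
segment1: Segment = ([0, 1, 2], [1, 2, 2], [None, "batman", "flash"])
segment2: Segment = ([0, 2], [1, 0, 2], [None, "flash", "superman"])
result = facet_per_segment([segment1, segment2], threads=2)
```

The extra merge step is the cost listed under "Disadvantages"; the benefit is that a new segment only needs its own small order/lookup arrays built, while the arrays of unchanged segments are reused.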
  • 8. Per-segment faceting performance comparison
      Test index: 10M documents, 18 segments, single valued field

      A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
          Time for request*         facet.method=fc    facet.method=fcs
          static index              3 ms               244 ms
          quickly changing index    1388 ms            267 ms

      B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
          Time for request*         facet.method=fc    facet.method=fcs
          static index              26 ms              34 ms
          quickly changing index    741 ms             94 ms

      *complete request time, measured externally
  • 9. Munching & Crunching: Lucene index post-processing and applications
      Andrzej Białecki <andrzej.bialecki@lucidimagination.com>
  • 10. Munching & Crunching: Agenda
      ◦ Post-processing
      ◦ Splitting, merging, sorting, pruning
      ◦ Tiered search
      ◦ Bitwise search
      ◦ Map-reduce indexing models
  • 11. Post-processing
      ◦ Isn't it better to build it right from the start?
      ◦ Some parameters are difficult to get right...
          ▪ Minimizing index size while retaining search quality
          ▪ Correcting impact of unexpected common words
          ▪ Creating evenly-sized shards
      ◦ ...perhaps impossible to get at all during indexing
          ▪ Adding collection-wide factors not computed by Lucene (e.g. avg. length)
          ▪ Optimizing top-N results for common queries
          ▪ Fitting too large indexes in RAM
  • 12. Merging, splitting, sorting, pruning
      ◦ Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter
      ◦ Sorting postings by impact and "early termination" search
      ◦ Index pruning:
          ▪ What data to remove and how?
          ▪ Pruning strategies
          ▪ Challenges
  • 13. Tiered search
      ◦ Assuming we CAN prune effectively, while maintaining good search quality...
      ◦ [Diagram: one search box holding three index tiers: RAM, 70% pruned; SSD, 30% pruned; HDD, 0% pruned; annotated with a "?".]
  • 14. Tiered search
      ◦ Assuming we CAN prune effectively, while maintaining good search quality...
      ◦ [Diagram: three search boxes, one per tier: search box 1, RAM, 70% pruned; search box 2, SSD, 30% pruned; search box 3, HDD, 0% pruned; annotated with a "?".]
  • 15. Bit-wise search
      ◦ Given a bit pattern query: 1010 1001 0101 0001
      ◦ Find best matching bit patterns in documents
      ◦ Applications:
          ▪ Fuzzy "fingerprinting"
          ▪ De-duplication
          ▪ Plagiarism detection
      ◦ BitwiseSearcher and Solr BitwiseField design
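One way to read "best matching bit patterns" is ranking by Hamming distance, i.e. the number of differing bits. The slide does not say how BitwiseSearcher scores matches, so the sketch below is an assumption for illustration only; `hamming_distance`, `best_matches` and the document fingerprints are hypothetical.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(
    query: int, fingerprints: Dict[str, int], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Return up to top_n (doc id, distance) pairs, closest first."""
    scored = sorted(
        (hamming_distance(query, fp), doc_id) for doc_id, fp in fingerprints.items()
    )
    return [(doc_id, dist) for dist, doc_id in scored[:top_n]]


# The query pattern from the slide; the document fingerprints are made up.
query = 0b1010_1001_0101_0001
fingerprints = {
    "doc-a": 0b1010_1001_0101_0001,  # identical to the query
    "doc-b": 0b1010_1001_0101_0011,  # differs in one bit
    "doc-c": 0b0101_0110_1010_1110,  # bitwise complement of the query
}
top = best_matches(query, fingerprints, top_n=2)
```

For de-duplication or plagiarism detection, documents whose fingerprints fall within a small distance of the query would be treated as near-duplicates; the threshold is application specific.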
  • 16. Massive indexing
      ◦ Map-reduce indexing models
          ▪ Google model
          ▪ Nutch model
          ▪ Modified Nutch model
          ▪ Hadoop contrib/indexing model
      ◦ Tradeoff analysis and recommendations
  • 17. Solr in the Cloud: Mark Miller
  • 18. [Image-only slide; no text recovered]
  • 19. Some of the Complications?
      ◦ Dealing with config files
      ◦ Setting up high availability
      ◦ Status of cluster
      ◦ Reshaping/Rebalancing cluster
  • 20. Improvements: High Level Goals
      Improve...
      ◦ Shared/Central Config
      ◦ High Availability and Fault Tolerance
      ◦ Cluster Resizing/Rebalancing
      ◦ Open/Standard ZK schema
      ◦ Cluster status
  • 21. Enter Solr Cloud and ZooKeeper
      ◦ ZooKeeper is basically a highly available distributed filesystem
      ◦ Config and cluster state 'live' in ZooKeeper
      ◦ Solr is alerted to changes in cluster state by ZK
      ◦ Solr gets a built-in load balancing impl that can read cluster state from ZK
      ◦ Clients don't need to know about shards, or can choose logical shards
  • 22. What's Been Done So Far
      ◦ A lot of 'base' work: ZooKeeper Mode
      ◦ Shared/Central config
      ◦ Built-in search side fault tolerance
      ◦ Very simple cluster status
  • 23. The Future?
      ◦ Index side fault tolerance
      ◦ Cluster resizing/rebalancing/elasticity
      ◦ More Solr/ZK tools?
      ◦ Lots of other little fun improvements
  • 24. Practical Relevance: Grant Ingersoll
      Apache Lucene EuroCon 2010, Prague, Czech Republic
  • 25. Why Tune Relevance?
      ◦ Better search results = Less time searching, more time acting
      ◦ Less time searching = Happier, more effective users
      ◦ Happier, more effective users = $, €, £, Kč (earned/saved)
      ◦ $, €, £, Kč (earned/saved) = Big fat raise for you!
  • 26. Testing Relevance
      ◦ A/B testing
      ◦ Log Analysis
      ◦ Empirical
      ◦ Top 50 queries, plus random sample
      ◦ Ask
      ◦ Ratings/Reviews
      ◦ Focus Groups
      ◦ Also: Ad Hoc, TREC, etc.
  • 27. Understand your…
      ◦ Domain
          ▪ Types of documents
          ▪ Languages present
          ▪ Document structures, metadata and other features
          ▪ Lexical resources: jargon, synonyms, abbreviations...
          ▪ Relationships between documents
      ◦ Users
          ▪ Sophistication/Expertise
          ▪ Search and Discovery needs
          ▪ Known Item vs. Keyword
      ◦ Tolerance for Pain
          ▪ Managers
          ▪ Business Interests
          ▪ Release cycles
          ▪ Obsession in finding the one true relevance model (hint, it doesn't exist)
          ▪ "explain() blindness"
  • 28. Phrases
      ◦ Almost always a win to automatically add phrase query variations to all multiword queries
      ◦ Even better to detect key phrases
      ◦ In Solr, with the Dismax handler, use the &pf and &ps options to automatically add phrase boosts
      ◦ Using a large slop factor can simulate an AND query while rewarding close proximity
      ◦ See also the ComplexPhraseQuery in contrib/queryparser
      ◦ Consider SpanQuery and derivatives
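As a small illustration of the &pf and &ps options mentioned above, the sketch below builds a Dismax request query string with Python's standard library. The helper name `dismax_params`, the field names `title` and `body`, and the boost and slop values are assumptions chosen for the example, not recommendations from the talk.

```python
from typing import Dict, Sequence
from urllib.parse import urlencode


def dismax_params(
    user_query: str,
    query_fields: Sequence[str],
    phrase_fields: Sequence[str],
    phrase_slop: int,
) -> Dict[str, str]:
    """Build Dismax parameters that add a phrase boost to a multiword query."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(query_fields),   # fields searched for the individual terms
        "pf": " ".join(phrase_fields),  # fields boosted when the terms appear as a phrase
        "ps": str(phrase_slop),         # slop allowed when matching the pf phrase
    }


# Hypothetical field names and boosts.
params = dismax_params(
    "near real time search",
    query_fields=["title", "body"],
    phrase_fields=["title^10", "body^2"],
    phrase_slop=3,
)
query_string = urlencode(params)  # append to the select URL of a Solr instance
```

Raising `ps` loosens the phrase match, which is the "large slop factor" effect described on the slide: documents containing all terms still match through `qf`, and those where the terms sit close together receive the extra `pf` boost.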
  • 29. Resources
      ◦ ACM SIGIR: http://sigir.org/
      ◦ http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search
      ◦ http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr
      ◦ Open Relevance Project: http://lucene.apache.org/openrelevance
  • 30. Q&A
      Slides posted at: bit.ly/EXPERTS1
  • 31. Thank You