



Munching & crunching
Lucene index post-processing and applications



              Andrzej Białecki

  <andrzej.bialecki@lucidimagination.com>
Intro
   Started using Lucene in 2003 (1.2-dev?)
   Created Luke – the Lucene Index Toolbox
   Nutch, Hadoop committer, Lucene PMC member
   Nutch project lead
Munching and crunching? But really...
   Stir your imagination
   Think outside the box
   Show some unorthodox use and practical applications
   Close ties to scalability, performance, distributed search and
    query latency
Agenda
  ●   Post-processing
      ●   Splitting, merging, sorting, pruning

  ●   Tiered search
  ●   Bit-wise search
  ●   (Map-reduce indexing models)




Apache Lucene EuroCon   20 May 2010
Why post-process indexes?
     Isn't it better to build them right from the start?
     Sometimes it's not convenient or feasible
         Correcting impact of unexpected common words
         Targeting specific index size or composition:
             Creating evenly-sized shards
             Re-balancing shards across servers
             Fitting indexes completely in RAM

     … and sometimes impossible to do it right
         Trimming index size while retaining quality of top-N results


Merging indexes
     It's easy to merge several small indexes into one
     Fundamental Lucene operation during indexing
      (SegmentMerger)
         Command-line utilities exist: IndexMergeTool
         API:
             IndexWriter.addIndexes(IndexReader...)
             IndexWriter.addIndexesNoOptimize(Directory...)
             Hopefully a more flexible API on the flex branch

     Solr: through CoreAdmin action=mergeindexes
             Note: schema must be compatible
Splitting indexes
     IndexSplitter tool:
         Moves whole segments to standalone indexes
             Pros: nearly no IO/CPU involved – just rename &
              create new SegmentInfos file
             Cons:
                 Requires a multi-segment index!
                 Very limited control over content of resulting
                  indexes → MergePolicy

     [Diagram: the original index (segments_2, with segments _0, _1 and _2) is split into three new indexes, each with its own segments_0 file]
Splitting indexes, take 2
   MultiPassIndexSplitter tool:
     Uses an IndexReader that keeps the list of deletions in memory
     The source index remains unmodified
     For each partition:
       Marks all source documents not in the partition as deleted
       Writes a target split using IndexWriter.addIndexes(IndexReader)
          IndexWriter knows how to skip deleted documents
       Removes the “deleted” mark from all source documents
   Pros:
     Arbitrary splits possible (even partially overlapping)
     Source index remains intact
   Cons:
     Reads complete index N times – I/O is O(N * indexSize)
     Takes twice as much space (source index remains intact)
      … but maybe it's a feature?

   [Diagram: original index with documents d1 to d4; pass 1 applies deletion list del1 and writes d1 and d3 to one new index, pass 2 applies del2 and writes d2 and d4 to another]
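
The multi-pass loop can be modelled in a few lines of Python. This is only a toy sketch and not the MultiPassIndexSplitter code: the "index" is a plain list, and `multi_pass_split` and `partition_of` are hypothetical names chosen for illustration.

```python
def multi_pass_split(source_docs, partition_of, num_partitions):
    """Toy model of multi-pass splitting: one full pass over the source per partition.

    source_docs  -- list of documents (never modified)
    partition_of -- hypothetical function: doc ID -> set of partition numbers
    """
    targets = []
    for part in range(num_partitions):
        # "Mark as deleted" every source document that is not in this partition.
        deleted = {doc_id for doc_id in range(len(source_docs))
                   if part not in partition_of(doc_id)}
        # Copy the live documents; this mimics a writer that skips deleted docs.
        targets.append([doc for doc_id, doc in enumerate(source_docs)
                        if doc_id not in deleted])
        # The marks live only in memory, so discarding the set "undeletes" everything.
    return targets


docs = ["d1", "d2", "d3", "d4"]
# Two disjoint partitions: even doc IDs and odd doc IDs.
splits = multi_pass_split(docs, lambda doc_id: {doc_id % 2}, 2)
# Overlapping partitions: document 0 goes to both outputs.
overlapping = multi_pass_split(docs, lambda d: {0, 1} if d == 0 else {d % 2}, 2)
```

Each partition costs one complete read of the source, which is where the O(N * indexSize) I/O comes from.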

Splitting indexes, take 3
     SinglePassSplitter
         Uses the same processing workflow as
          SegmentMerger, only with multiple outputs
             Write new SegmentInfos and FieldInfos
             Merge (pass-through) stored fields
             Merge (pass-through) term dictionary
             Merge (pass-through) postings with payloads
             Merge (pass-through) term vectors
         Renumbers document id-s on-the-fly to form
          contiguous space
         Pros: flexibility as with MultiPassIndexSplitter
         Status: work started, to be contributed soon...

     [Diagram: source documents 1 2 3 4 5 6 7 8 9 10 ... (stored fields, term dict, postings+payloads, term vectors) pass through a partitioner; documents 1 3 5… and 2 4 6… go to separate outputs, each renumbered to 1' 2' 3' 4' 5' 6'... with its own stored fields, terms, postings and term vectors]
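
A rough Python sketch of the single-pass idea, under the assumption that an index can be reduced to a term → doc ID list map. The function and parameter names are made up for illustration, and doc IDs are zero-based here. Each posting is read once and routed to its output with a renumbered ID.

```python
from collections import defaultdict


def single_pass_split(postings, num_docs, partitioner, num_parts):
    """Split term -> [old doc IDs] postings into num_parts outputs in one pass."""
    # Assign each old doc ID to a partition and give it the next free ID there,
    # so every output gets a contiguous ID space starting at 0.
    new_id = {}
    next_free = [0] * num_parts
    for old in range(num_docs):
        part = partitioner(old)
        new_id[old] = (part, next_free[part])
        next_free[part] += 1

    # Stream every posting exactly once and route it to its target index.
    outputs = [defaultdict(list) for _ in range(num_parts)]
    for term, doc_ids in postings.items():
        for old in doc_ids:
            part, renumbered = new_id[old]
            outputs[part][term].append(renumbered)
    return [dict(out) for out in outputs]


postings = {"lucene": [0, 1, 2, 3, 4], "index": [1, 4, 5]}
# Round-robin partitioner: even doc IDs to output 0, odd doc IDs to output 1.
parts = single_pass_split(postings, 6, lambda old: old % 2, 2)
```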
Splitting indexes, summary
     SinglePassSplitter – best tradeoff of flexibility/IO/CPU
     Interesting scenarios with SinglePassSplitter:
         Split by ranges, round-robin, by field value, by frequency, to a target size, etc...
         “Extract” handful of documents to a separate index
         “Move” documents between indexes:
             “extract” from source
             Add to target (merge)
             Delete from source
         Now the source index may reside on a network FS – the amount of IO is
          O(1 * indexSize)

Index sorting - introduction
     “Early termination” technique
         If full execution of a query takes too long then terminate and estimate

     Termination conditions:
         Number of documents – LimitedCollector in Nutch
         Time – TimeLimitingCollector
          (see also extended LUCENE-1720 TimeLimitingIndexReader)

     Problems:
         Difficult to estimate total hits
         Important docs may not be collected if they have high docID-s


Index sorting - details
     Define a global ordering of documents (e.g. PageRank,
      popularity, quality, etc)
         Documents with good rank should generally score higher
     Sort (internal) ID-s by this ordering, descending
     Map from old to new ID-s to follow this ordering
     Change the ID-s in postings

     Original index (early termination == poor):
         doc ID      0 1 2 3 4 5 6 7
         rank        c e h f a d g b
     ID mapping:
         old doc ID  4 7 0 5 1 3 6 2
         new doc ID  0 1 2 3 4 5 6 7
     Sorted index (early termination == good):
         doc ID      0 1 2 3 4 5 6 7
         rank        a b c d e f g h
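
The ID mapping can be reproduced with a short Python sketch. It is a toy model, not the Nutch IndexSorter: ranks are the letters a to h (with "a" as the best), and `sort_index` is a hypothetical helper.

```python
def sort_index(ranks, postings):
    """ranks[i] is the rank of old doc i; smaller sorts first ('a' is best)."""
    # new doc ID -> old doc ID, best-ranked document first
    new_to_old = sorted(range(len(ranks)), key=lambda old: ranks[old])
    old_to_new = {old: new for new, old in enumerate(new_to_old)}
    # Rewrite every posting list with the new IDs and restore ascending order.
    remapped = {term: sorted(old_to_new[d] for d in docs)
                for term, docs in postings.items()}
    return new_to_old, remapped


ranks = list("cehfadgb")                      # rank of old doc IDs 0..7
new_to_old, remapped = sort_index(ranks, {"fox": [2, 4, 7]})
```

After sorting, a collector that stops after the first few doc IDs of a posting list has already seen the best-ranked matches.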
Index sorting - summary
     Implementation in Nutch: IndexSorter
         Based on PageRank – sorts by decreasing page quality
         Uses FilterIndexReader

     NOTE: “Early termination” will (significantly) reduce quality of
      results with non-sorted indexes – use both or neither




Index pruning
     Quick refresh on the index composition:
         Stored fields
         Term dictionary
         Term frequency data
         Positional data (postings)
             With or without payload data
         Term frequency vectors

     Number of documents may run into the millions
     Number of terms is commonly well into the millions
         Not to mention individual postings …
Index pruning & top-N retrieval
     N is usually << 1000
     Very often search quality is judged based on top-20
     Question:
       Do we really need to keep and process ALL terms and ALL
        postings for a good-quality top-N search for common
        queries?




Index pruning hypothesis
     There should be a way to remove some of the less important
      data
         While retaining the quality of top-N results!
     Question: what data is less important?
     Some answers:
         That of poorly-scoring documents
         That of common (less selective) terms
     Dynamic pruning: skips less relevant data during query
      processing → runtime cost...
     But can we do this work in advance (static pruning)?
What do we need for top-N results?
     Work backwards
     “Foreach” common query:
         Run it against the full index
         Record the top-N matching documents

     “Foreach” document in results:
         Record terms and term positions that contributed to the score

     Finally: remove all non-recorded postings and terms
     First proposed by D. Carmel (2001) for single term queries
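
The "work backwards" procedure can be sketched in Python as follows. This is a deliberately naive model: the index is a term → {doc ID: TF} map, a query is a list of terms, and the score is a plain sum of term frequencies rather than Lucene's real scoring. All names are hypothetical.

```python
def prune_by_query_log(index, queries, top_n):
    """index: term -> {doc_id: tf}. queries: list of term lists. Returns a pruned copy."""
    keep = set()
    for query in queries:
        # Run the query against the full index; the score is a plain sum of TFs.
        scores = {}
        for term in query:
            for doc_id, tf in index.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0) + tf
        top_docs = sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
        # Record every (term, doc) posting that contributed to a top-N score.
        for doc_id in top_docs:
            for term in query:
                if doc_id in index.get(term, {}):
                    keep.add((term, doc_id))
    # Drop all non-recorded postings, and terms left without any postings.
    pruned = {}
    for term, docs in index.items():
        kept = {d: tf for d, tf in docs.items() if (term, d) in keep}
        if kept:
            pruned[term] = kept
    return pruned


index = {"brown": {0: 3, 1: 1, 2: 2}, "fox": {0: 1, 2: 4}, "quick": {1: 1}}
pruned = prune_by_query_log(index, [["brown"], ["fox"]], top_n=1)
```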
… but it's too simplistic:
      Before pruning:   0 quick    1 brown    2 fox
      After pruning:    0 quick    1 brown    2 fox

      Query 1: brown       - topN(full) == topN(pruned)
      Query 2: “brown fox” - topN(full) != topN(pruned)

      Hmm, what about less common queries?
          80/20 rule of “good enough”?

      Term-level is too primitive
          Document-centric pruning
          Impact-centric pruning
          Position-centric pruning
Smarter pruning
     Not all term positions are equally important
     Metrics of term and position importance:
         Plain in-document term frequency (TF)
         TF-IDF score obtained from top-N results
          of TermQuery (Carmel method)
         Residual IDF – measure of term
          informativeness (selectivity)
         Key-phrase positions, or term clusters
         Kullback-Leibler divergence from a
          language model

     [Plot: frequency (Freq) versus term, comparing the corpus language model with the document language model]
Applications
     Obviously, performance-related
         Some papers claim a modest impact on quality when pruning up to 60% of
          postings
         See LUCENE-1812 for some benchmarks confirming this claim

     Removal / restructuring of (some) stored content
     Legacy indexes, or ones created with a fossilized external chain




Stored field pruning
     Some stored data can be compacted, removed, or restructured
     Use case: source text for generating “snippets”
         Split content into sentences
         Reorder sentences by a static “importance” score (e.g. how many rare terms they
          contain)
             NOTE: this may use collection wide statistics!
         Remove the bottom x% of sentences
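
A small Python sketch of the snippet use case. It assumes a precomputed collection-wide document-frequency table and treats a term as "rare" when its document frequency falls below a cutoff. The names `prune_sentences`, `rare_below` and `drop_fraction` are illustrative only.

```python
import re


def prune_sentences(text, doc_freq, rare_below, drop_fraction):
    """Keep the most 'important' sentences of a stored field.

    doc_freq   -- collection-wide term -> document frequency (assumed precomputed)
    rare_below -- a term counts as rare if its document frequency is below this
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def importance(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(1 for w in words if doc_freq.get(w, 0) < rare_below)

    # Stable sort: highest-scoring sentences first, ties keep document order.
    ranked = sorted(sentences, key=importance, reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]


doc_freq = {"the": 100, "index": 40, "is": 100, "big": 30, "lucene": 5,
            "prunes": 2, "postings": 3, "it": 100, "what": 90,
            "luke": 1, "inspects": 2, "segments": 4}
text = ("The index is big. Lucene prunes postings. "
        "It is what it is. Luke inspects segments.")
kept = prune_sentences(text, doc_freq, rare_below=10, drop_fraction=0.5)
```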




LUCENE-1812: contrib/pruning tools and API
     Based on FilterIndexReader
     Produces output indexes via
      IndexWriter.addIndexes(IndexReader[])

     Design:
         PruningReader – subclass of FilterIndexReader with necessary boilerplate and
          hooks for pruning policies
         StorePruningPolicy – implements rules for modifying stored fields (and list of field
          names)
         TermPruningPolicy – implements rules for modifying term dictionary, postings and
          payloads
         PruningTool – command-line utility to configure and run PruningReader
Details of LUCENE-1812
     [Diagram: the source index (stored fields, term dict, postings+payloads, term vectors) is read through a PruningReader, where StorePruningPolicy applies to the stored fields and TermPruningPolicy to the term data; IndexWriter writes the target index (stored fields, term dict, postings+payloads, term vectors) via IW.addIndexes(IndexReader...)]

     IndexWriter consumes source data filtered via PruningReader
     Internal document ID-s are preserved – suitable for bitset ops
      and retrieval by internal ID
         If source index has no deletions
         If target index is empty
API: StorePruningPolicy
     May remove (some) fields from (some) documents
     May as well modify the values
     May rename / add fields




API: TermPruningPolicy
     Thresholds (in the order of precedence):
         Per term
         Per field
         Default

     Plain TF pruning – TFTermPruningPolicy
         Removes all postings for a term where TF (in-document term frequency) is below
          a threshold

     Top-N term-level – CarmelTermPruningPolicy
         TermQuery search for top-N docs
         Removes all postings for a term outside the top-N docs
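
The threshold precedence and plain TF pruning can be illustrated with a toy Python model. This is not the contrib/pruning API: postings are a (field, term) → {doc ID: TF} map, and keying the per-term thresholds by (field, term) is an assumption made for this sketch.

```python
def tf_threshold(field, term, per_term, per_field, default):
    """Resolve the threshold: per-term first, then per-field, then the default."""
    if (field, term) in per_term:
        return per_term[(field, term)]
    if field in per_field:
        return per_field[field]
    return default


def tf_prune(postings, per_term, per_field, default):
    """postings: (field, term) -> {doc_id: tf}. Drops postings with TF below the threshold."""
    pruned = {}
    for (field, term), docs in postings.items():
        limit = tf_threshold(field, term, per_term, per_field, default)
        kept = {d: tf for d, tf in docs.items() if tf >= limit}
        if kept:
            pruned[(field, term)] = kept
    return pruned


postings = {("body", "lucene"): {0: 1, 1: 3},
            ("body", "the"): {0: 5, 1: 9},
            ("title", "lucene"): {0: 1}}
result = tf_prune(postings,
                  per_term={("body", "the"): 8},   # stricter for a common word
                  per_field={"title": 1},          # keep everything in titles
                  default=2)
```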

Results so far...
     TF pruning:
         Term query recall very good
         Phrase query recall very poor – expected...

     Carmel pruning – slightly better term position selection, but
      still heavy negative impact on phrase queries
     Recognizing and keeping key phrases would help
         Use query log for frequent-phrase mining?
         Use collocation miner (Mahout)?
         Savings on pruning will be smaller, but quality will significantly improve


References
     Static Index Pruning for Information Retrieval Systems, Carmel
      et al., SIGIR'01
     A document-centric approach to static index pruning in text
      retrieval systems, Büttcher & Clarke, CIKM'06
     Locality-based pruning methods for web search, de Moura et al.,
      ACM TOIS '08
     Pruning strategies for mixed-mode querying, Anh & Moffat,
      CIKM'06

Index pruning applied ...
     Index 1: A heavily pruned index that fits in RAM:
         excellent speed
         poor search quality for many less-common query types
     Index 2: Slightly pruned index that fits partially in RAM:
         good speed, good quality for many common query types,
         still poor quality for some other rare query types
     Index 3: Full index on disk:
         Slow speed
         Excellent quality for all query types
     QUESTION: Can we come up with a combined search strategy?
Tiered search
     [Diagram: three tiers. Search box 1 holds a 70% pruned index in RAM, search box 2 a 30% pruned index on SSD, search box 3 the full (0% pruned) index on HDD; a predict / evaluate step selects the tier for a query]

    Can we predict the best tier without actually running the query?
    How to evaluate if the predictor was right?
Tiered search: tier selector and evaluator
     Best tier can be predicted (often enough):
         Carmel pruning yields excellent results for simple term queries
         Phrase-based pruning yields good results for phrase queries (though less often)

     Quality evaluator: when is predictor wrong?
         Could be very complex, based on gold standard and qrels
         Could be very simple: acceptable number of results

     Fall-back strategy:
         Serial: poor latency, but minimizes load on bulkier tiers
         Partially parallel:
             submit to the next tier only the border-line queries
             Pick the first acceptable answer – reduces latency
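
A minimal Python sketch of the serial fall-back strategy, using the simplest evaluator mentioned above (an acceptable number of results). The tiers are stand-in dictionaries and every name here is hypothetical; a real setup would query actual indexes.

```python
def tiered_search(query, tiers, min_hits):
    """Serial fall-back: try the cheapest tier first, move on if the answer is not acceptable.

    tiers -- list of (name, search_fn), ordered from most pruned to least pruned
    """
    hits = []
    for position, (name, search) in enumerate(tiers):
        hits = search(query)
        is_last = position == len(tiers) - 1
        # Simplest possible evaluator: accept if enough results came back.
        if len(hits) >= min_hits or is_last:
            return name, hits
    return None, hits  # only reached when no tiers were configured


ram = {"lucene": [1, 2, 3]}
ssd = {"lucene": [1, 2, 3, 4], "luke": [7, 8]}
hdd = {"lucene": [1, 2, 3, 4, 5], "luke": [7, 8], "nutch crawler": [9]}
tiers = [("RAM", lambda q: ram.get(q, [])),
         ("SSD", lambda q: ssd.get(q, [])),
         ("HDD", lambda q: hdd.get(q, []))]
```

A partially parallel variant would submit border-line queries to the next tier at the same time and take the first acceptable answer; that is not shown here.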
Tiered versus distributed
     Both applicable to indexes and query loads exceeding single
      machine capabilities
     Distributed sharded search:
         increases latency for all queries (send + execute + integrate from all shards)
             … plus replicas to increase QPS:
                 Increases hardware / management costs
                 While not improving latency

     Tiered search:
         Excellent latency for common queries
         More complex to build and maintain
         Arguably lower hardware cost for comparable scale / QPS
Tiered search benefits
     Majority of common queries handled by first tier: RAM-based,
      high QPS, low latency
     Partially parallel mode reduces average latency for more
      complex queries
     Hardware investment likely smaller than for distributed search
      setup of comparable QPS / latency




Example Lucene API for tiered search
                                      Could be implemented as
                                      a Solr SearchComponent...




Lucene implementation details




References
     Efficiency trade-offs in two-tier web search systems,
      Baeza-Yates et al., SIGIR'09
     ResIn: A combination of results caching and index pruning for
      high-performance web search engines, Skobeltsyn et al.,
      SIGIR'08
     Three-level caching for efficient query processing in large Web
      search engines, Long & Suel, WWW'05



Bit-wise search
     Given a bit pattern query:
      1010 1001 0101 0001
     Find documents with matching bit patterns in a field
     Applications:
         Permission checking
         De-duplication
         Plagiarism detection

     Two variants: non-scoring (filtering) and scoring

Non-scoring bitwise search (LUCENE-2460)
     Builds a Filter from intersection of:
         DocIdSet of documents matching a Query
         Integer value and operation (AND, OR, XOR)
         “Value source” that caches integer values of
          a field (from FieldCache)
     Corresponding Solr field type and
      QParser: SOLR-1913
     Useful for filtering (not scoring)

     Example (query “type:a”, op=AND, val=0x01 → Filter):
         docID   0    1    2    3    4
         flags   0x01 0x02 0x03 0x04 0x05
         type    a    b    b    a    a
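
The filter example can be modelled in Python. This sketch does not reproduce the LUCENE-2460 patch: it assumes that a document passes when applying the operation to its flags and the given value yields a non-zero result, and it uses a plain list in place of cached field values.

```python
import operator

OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def bitwise_filter(query_docs, flags, op, val):
    """Return the doc IDs from query_docs whose flags pass the bitwise test.

    flags -- list indexed by doc ID, standing in for cached field values
    Assumption: a document passes when (flags[doc] op val) is non-zero.
    """
    combine = OPS[op]
    return [doc for doc in sorted(query_docs) if combine(flags[doc], val) != 0]


flags = [0x01, 0x02, 0x03, 0x04, 0x05]
types = ["a", "b", "b", "a", "a"]
type_a = {doc for doc, t in enumerate(types) if t == "a"}   # stands in for "type:a"
matches = bitwise_filter(type_a, flags, "AND", 0x01)
```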

Scoring bitwise search (SOLR-1918)
     BooleanQuery in disguise:
          1010 = Y-1000 | N-0100 | Y-0010 | N-0001
     Solr 32-bit BitwiseField
         Analyzer creates the bitmasks field
         Currently supports only single value per field
         Creates BooleanQuery from query int value
     Useful when searching for best
      matching (ranked) bit patterns

     Example:
         docID   D1      D2      D3
         flags   1010    1011    0011
         bits    Y1000   Y1000   N1000
                 N0100   N0100   N0100
                 Y0010   Y0010   Y0010
                 N0001   Y0001   Y0001
         Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001
         Results:
             D1 matches 4 of 4 → #1
             D2 matches 3 of 4 → #2
             D3 matches 2 of 4 → #3
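
The Y/N token encoding and the resulting ranking can be reproduced with a short Python sketch. It counts shared tokens only, which is a simplification: a real BooleanQuery score also depends on the similarity in use. The helper names are hypothetical, and 4 bits are used instead of 32 to match the example.

```python
def bit_tokens(value, width=4):
    """Encode an int as one token per bit: Y<mask> if the bit is set, N<mask> if not."""
    tokens = []
    for shift in range(width - 1, -1, -1):
        mask = 1 << shift
        prefix = "Y" if value & mask else "N"
        tokens.append(prefix + format(mask, "0{}b".format(width)))
    return tokens


def rank_by_bits(query_value, docs, width=4):
    """docs: name -> flags. Score = number of query tokens the document shares."""
    query = set(bit_tokens(query_value, width))
    scored = [(len(query & set(bit_tokens(flags, width))), name)
              for name, flags in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


ranking = rank_by_bits(0b1010, {"D1": 0b1010, "D2": 0b1011, "D3": 0b0011})
```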
Summary
     Index post-processing covers a range of useful scenarios:
         Merging and splitting, remodeling, extracting, moving ...
         Pruning less important data

     Tiered search + pruned indexes:
         High performance
         Practically unchanged quality
         Less hardware

     Bitwise search:
         Filtering by matching bits
         Ranking by best matching patterns
Meta-summary
     Stir your imagination
     Think outside the box
     Show some unorthodox use and practical applications
     Close ties to scalability, performance, distributed search and
      query latency




Q&A




Thank you!




Massive indexing with map-reduce
     Map-reduce indexing models
         Google model
         Nutch model
         Modified Nutch model
         Hadoop contrib/indexing model

     Tradeoff analysis and recommendations




Google model
   Map():
    IN: <seq, docText>
       terms = analyze(docText)
       foreach (term)
            emit(term, <seq,position>)

   Reduce():
    IN: <term, list(<seq,pos>)>
       foreach(<seq,pos>)
            docId = calculate(seq, taskId)
            Postings(term).append(docId, pos)

       Pros: analysis on the map side
       Cons:
           Too many tiny intermediate records → Combiner
           DocID synchronization across map and reduce tasks
           Lucene: very difficult (impossible?) to create index this way
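
An in-memory Python simulation of this model, with a dictionary standing in for the shuffle phase. The analyzer is a plain lowercase split, and the doc ID formula is an invented placeholder for calculate(seq, taskId), which the slide does not define.

```python
from collections import defaultdict


def map_doc(seq, doc_text):
    """Map side: analyze the text and emit one (term, (seq, position)) record per token."""
    for position, term in enumerate(doc_text.lower().split()):
        yield term, (seq, position)


def reduce_term(term, records, task_id=0, docs_per_task=1000):
    """Reduce side: turn (seq, pos) records into a posting list for one term."""
    postings = []
    for seq, pos in sorted(records):
        doc_id = task_id * docs_per_task + seq   # placeholder for calculate(seq, taskId)
        postings.append((doc_id, pos))
    return postings


def build_postings(docs):
    grouped = defaultdict(list)                  # stands in for the shuffle phase
    for seq, text in enumerate(docs):
        for term, record in map_doc(seq, text):
            grouped[term].append(record)
    return {term: reduce_term(term, records) for term, records in grouped.items()}


inverted = build_postings(["quick brown fox", "brown fox"])
```

Even this tiny input produces one intermediate record per token, which is why the slide points to a Combiner.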
Nutch model (also in SOLR-1301)
   Map():
    IN: <seq, docPart>
       docId = docPart.get(“url”)
       emit(docId, docPart)

   Reduce():
    IN: <docId, list(docPart)>
       doc = luceneDoc(list(docPart))
       indexWriter.addDocument(doc)

       Pros: easy to build Lucene index
       Cons:
           Analysis on the reduce side
           Many costly merge operations (large indexes built from scratch on reduce side)
            (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
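
The same kind of in-memory simulation for the Nutch model. Document parts are plain dictionaries, a list stands in for the IndexWriter, and all names are hypothetical. The point is that parts are grouped by URL first and the whole document is assembled only on the reduce side.

```python
from collections import defaultdict


def map_part(seq, doc_part):
    """Map side: key each part by the URL of the document it belongs to."""
    return doc_part["url"], doc_part


def reduce_doc(doc_id, parts):
    """Reduce side: assemble one document from its parts (analysis would happen here)."""
    doc = {"id": doc_id}
    for part in parts:
        for field, value in part.items():
            if field != "url":
                doc[field] = value
    return doc


def index_parts(parts):
    grouped = defaultdict(list)                  # stands in for the shuffle phase
    for seq, part in enumerate(parts):
        doc_id, value = map_part(seq, part)
        grouped[doc_id].append(value)
    # A plain list stands in for indexWriter.addDocument(doc).
    return [reduce_doc(doc_id, grouped[doc_id]) for doc_id in sorted(grouped)]


parts = [{"url": "http://a.example/", "content": "hello"},
         {"url": "http://b.example/", "content": "world"},
         {"url": "http://a.example/", "anchor": "greeting"}]
indexed = index_parts(parts)
```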
Modified Nutch model (N/A...)
   Map():
    IN: <seq, docPart>
       docId = docPart.get(“url”)
       ts = analyze(docPart)
       emit(docId, <docPart,ts>)

   Reduce():
    IN: <docId, list(<docPart,ts>)>
       doc = luceneDoc(list(<docPart,ts>))
       indexWriter.addDocument(doc)

       Pros:
           Analysis on map side
           Easy to build Lucene index
       Cons:
           Many costly merge operations (large indexes built from scratch on reduce side)
            (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
Hadoop contrib/indexing model
   Map():
    IN: <seq, docText>
       doc = luceneDoc(docText)
       indexWriter.addDocument(doc)
       emit(random, indexData)

   Reduce():
    IN: <random, list(indexData)>
       foreach(indexData)
            indexWriter.addIndexes(indexData)

       Pros:
           Analysis on the map side
           Many merges on the map side
           Supports also other operations (deletes, updates)
       Cons:
           Serialization is costly, records are big and require more RAM to sort
Massive indexing - summary
     If you first need to collect document parts → SOLR-1301 model
     If you use complex analysis → Hadoop contrib/index
         NOTE: there is no good integration yet of Solr and Hadoop contrib/index module...





Hadoop as an Analytic Platform: Why Not?Hadoop as an Analytic Platform: Why Not?
Hadoop as an Analytic Platform: Why Not?
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Solr
SolrSolr
Solr
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
Index types
Index typesIndex types
Index types
 
WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0
WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0
WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
From Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalabilityFrom Lucene to Elasticsearch, a short explanation of horizontal scalability
From Lucene to Elasticsearch, a short explanation of horizontal scalability
 
Lucene
LuceneLucene
Lucene
 
Lucandra
LucandraLucandra
Lucandra
 
Inverted index
Inverted indexInverted index
Inverted index
 
WSO2 Big Data Analytics Platform
WSO2 Big Data Analytics PlatformWSO2 Big Data Analytics Platform
WSO2 Big Data Analytics Platform
 
An introduction to inverted index
An introduction to inverted indexAn introduction to inverted index
An introduction to inverted index
 
Implementing Data Virtualization for Data Warehouses and Master Data Manageme...
Implementing Data Virtualization for Data Warehouses and Master Data Manageme...Implementing Data Virtualization for Data Warehouses and Master Data Manageme...
Implementing Data Virtualization for Data Warehouses and Master Data Manageme...
 
Introduction to solr
Introduction to solrIntroduction to solr
Introduction to solr
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Using Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index ExplosionUsing Solr Cloud to Tame an Index Explosion
Using Solr Cloud to Tame an Index Explosion
 

Semelhante a Munching & crunching - Lucene index post-processing

Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?lucenerevolution
 
Parallel Programming in .NET
Parallel Programming in .NETParallel Programming in .NET
Parallel Programming in .NETSANKARSAN BOSE
 
Logging presentation
Logging presentationLogging presentation
Logging presentationJatan Malde
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJoseph Kuo
 
ITECH Kenya presentation on OpenMRS Developers Forum
ITECH Kenya presentation on OpenMRS Developers ForumITECH Kenya presentation on OpenMRS Developers Forum
ITECH Kenya presentation on OpenMRS Developers Forumdjazayeri
 
Nt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNicole Gomez
 
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...Maarten Balliauw
 
ICPC 2012 - Mining Source Code Descriptions
ICPC 2012 - Mining Source Code DescriptionsICPC 2012 - Mining Source Code Descriptions
ICPC 2012 - Mining Source Code DescriptionsSebastiano Panichella
 
Developing Actors in Azure with .net
Developing Actors in Azure with .netDeveloping Actors in Azure with .net
Developing Actors in Azure with .netMarco Parenzan
 
Easy Data Object Relational Mapping Tool
Easy Data Object Relational Mapping ToolEasy Data Object Relational Mapping Tool
Easy Data Object Relational Mapping ToolHasitha Guruge
 
Google Megastore
Google MegastoreGoogle Megastore
Google Megastorebergwolf
 
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...7mind
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 

Semelhante a Munching & crunching - Lucene index post-processing (20)

Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?
 
Binary Instance Loading
Binary Instance LoadingBinary Instance Loading
Binary Instance Loading
 
Parallel Programming in .NET
Parallel Programming in .NETParallel Programming in .NET
Parallel Programming in .NET
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Logging presentation
Logging presentationLogging presentation
Logging presentation
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and IgniteJCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
JCConf 2016 - Cloud Computing Applications - Hazelcast, Spark and Ignite
 
ITECH Kenya presentation on OpenMRS Developers Forum
ITECH Kenya presentation on OpenMRS Developers ForumITECH Kenya presentation on OpenMRS Developers Forum
ITECH Kenya presentation on OpenMRS Developers Forum
 
Nt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language AnalysisNt1310 Unit 3 Language Analysis
Nt1310 Unit 3 Language Analysis
 
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
ICPC 2012 - Mining Source Code Descriptions
ICPC 2012 - Mining Source Code DescriptionsICPC 2012 - Mining Source Code Descriptions
ICPC 2012 - Mining Source Code Descriptions
 
Uml2
Uml2Uml2
Uml2
 
Developing Actors in Azure with .net
Developing Actors in Azure with .netDeveloping Actors in Azure with .net
Developing Actors in Azure with .net
 
Easy Data Object Relational Mapping Tool
Easy Data Object Relational Mapping ToolEasy Data Object Relational Mapping Tool
Easy Data Object Relational Mapping Tool
 
Memory models in c#
Memory models in c#Memory models in c#
Memory models in c#
 
Google Megastore
Google MegastoreGoogle Megastore
Google Megastore
 
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
 
Azure Digital Twins
Azure Digital TwinsAzure Digital Twins
Azure Digital Twins
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 

Último

Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 

Último (20)

Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 

Munching & crunching - Lucene index post-processing

  • 1. Munching & crunching
      Lucene index post-processing and applications
      Andrzej Białecki <andrzej.bialecki@lucidimagination.com>
  • 2. Intro
      ● Started using Lucene in 2003 (1.2-dev?)
      ● Created Luke – the Lucene Index Toolbox
      ● Nutch, Hadoop committer, Lucene PMC member
      ● Nutch project lead
  • 3. Munching and crunching? But really...
      ● Stir your imagination
      ● Think outside the box
      ● Show some unorthodox use and practical applications
      ● Close ties to scalability, performance, distributed search and query latency
  • 4. Agenda
      ● Post-processing
          ● Splitting, merging, sorting, pruning
      ● Tiered search
      ● Bit-wise search
      ● (Map-reduce indexing models)
      Apache Lucene EuroCon 20 May 2010
  • 5. Why post-process indexes?
      ● Isn't it better to build them right from the start?
      ● Sometimes it's not convenient or feasible
          ● Correcting impact of unexpected common words
          ● Targeting specific index size or composition:
              ● Creating evenly-sized shards
              ● Re-balancing shards across servers
              ● Fitting indexes completely in RAM
      ● … and sometimes impossible to do it right
          ● Trimming index size while retaining quality of top-N results
  • 6. Merging indexes
      ● It's easy to merge several small indexes into one
      ● Fundamental Lucene operation during indexing (SegmentMerger)
          ● Command-line utilities exist: IndexMergeTool
          ● API:
              ● IndexWriter.addIndexes(IndexReader...)
              ● IndexWriter.addIndexesNoOptimize(Directory...)
              ● Hopefully a more flexible API on the flex branch
      ● Solr: through CoreAdmin action=mergeindexes
              ● Note: schema must be compatible
  • 7. Splitting indexes
      ● IndexSplitter tool:
          ● Moves whole segments to standalone indexes
      ● Pros: nearly no IO/CPU involved – just rename & create new SegmentInfos file
      ● Cons:
          ● Requires a multi-segment index!
          ● Very limited control over content of resulting indexes → MergePolicy
      [Diagram: original index (segments_2) with segments _0, _1, _2 → new indexes, each with its own segments_0]
  • 8. Splitting indexes, take 2
      ● MultiPassIndexSplitter tool:
          ● Uses an IndexReader that keeps the list of deletions in memory
          ● The source index remains unmodified
          ● For each partition:
              ● Marks all source documents not in the partition as deleted
              ● Writes a target split using IndexWriter.addIndexes(IndexReader)
                  ● IndexWriter knows how to skip deleted documents
              ● Removes the “deleted” mark from all source documents
      ● Pros:
          ● Arbitrary splits possible (even partially overlapping)
          ● Source index remains intact
      ● Cons:
          ● Reads complete index N times – I/O is O(N * indexSize)
          ● Takes twice as much space (source index remains intact)
            … but maybe it's a feature?
      [Diagram: original index with documents d1, d2, d3, d4 and deletion lists del1, del2; pass 1 writes d1, d2 and pass 2 writes d3, d4 to the new indexes]
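The multi-pass procedure on this slide can be modelled in a few lines of plain Python. This is only a toy sketch, not Lucene code: the "index" is a list of documents, and `multi_pass_split` and `partition_of` are hypothetical names introduced for illustration.

```python
def multi_pass_split(docs, num_parts, partition_of):
    """Toy model of multi-pass splitting: one full read of the source per partition."""
    parts = []
    passes = 0
    for part in range(num_parts):
        # "Mark as deleted" every source document that is not in this partition.
        deleted = {i for i, doc in enumerate(docs) if partition_of(i, doc) != part}
        # Copy the live documents to the target; deleted ones are skipped.
        parts.append([doc for i, doc in enumerate(docs) if i not in deleted])
        passes += 1  # the in-memory deletion marks are discarded before the next pass
    return parts, passes


docs = ["d1", "d2", "d3", "d4"]
# Hypothetical partitioner: first half to partition 0, second half to partition 1.
parts, passes = multi_pass_split(docs, 2, lambda i, doc: 0 if i < 2 else 1)
print(parts)   # [['d1', 'd2'], ['d3', 'd4']]
print(passes)  # 2
```

The source list is never modified, and the number of full reads equals the number of partitions, which is the O(N * indexSize) cost listed under "Cons".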
  • 9. Splitting indexes, take 3
      ● SinglePassSplitter
          ● Uses the same processing workflow as SegmentMerger, only with multiple outputs
              ● Write new SegmentInfos and FieldInfos
              ● Merge (pass-through) stored fields
              ● Merge (pass-through) term dictionary
              ● Merge (pass-through) postings with payloads
              ● Merge (pass-through) term vectors
          ● Renumbers document id-s on-the-fly to form contiguous space
      ● Pros: flexibility as with MultiPassIndexSplitter
      ● Status: work started, to be contributed soon...
      [Diagram: source documents 1 2 3 4 5 6 7 8 9 10 ... (stored fields, term dict, postings+payloads, term vectors) pass through a partitioner; documents 1 3 5 … and 2 4 6 … go to separate outputs (stored, terms, postings, term vectors), each renumbered to 1' 2' 3' 4' 5' 6' ...]
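A minimal sketch of the single-pass idea, assuming a toy index that is just a list of documents: every document is read once, routed by a partitioner, and given the next free ID in its output so that each output's ID space stays contiguous. The names `single_pass_split` and `partition_of` are illustrative only and do not come from the SinglePassSplitter code.

```python
def single_pass_split(docs, num_parts, partition_of):
    """Toy single-pass split: read each document once and route it to one output."""
    outputs = [[] for _ in range(num_parts)]
    id_maps = [{} for _ in range(num_parts)]  # per output: old doc ID -> new doc ID
    for old_id, doc in enumerate(docs):
        part = partition_of(old_id, doc)
        new_id = len(outputs[part])  # next free ID keeps the output contiguous
        id_maps[part][old_id] = new_id
        outputs[part].append(doc)
    return outputs, id_maps


docs = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
# Round-robin partitioner, as in the odd/even split of the slide's diagram.
outputs, id_maps = single_pass_split(docs, 2, lambda old_id, doc: old_id % 2)
print(outputs[0])  # ['doc1', 'doc3', 'doc5']
print(outputs[1])  # ['doc2', 'doc4', 'doc6']
print(id_maps[1])  # {1: 0, 3: 1, 5: 2}
```

Swapping the lambda for a range check, a field lookup or a size counter gives the other split strategies without a second read of the source.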
  • 10. Splitting indexes, summary
      ● SinglePassSplitter – best tradeoff of flexibility/IO/CPU
      ● Interesting scenarios with SinglePassSplitter:
          ● Split by ranges, round-robin, by field value, by frequency, to a target size, etc...
          ● “Extract” handful of documents to a separate index
          ● “Move” documents between indexes:
              ● “extract” from source
              ● Add to target (merge)
              ● Delete from source
          ● Now the source index may reside on a network FS – the amount of IO is O(1 * indexSize)
  • 11. Index sorting - introduction
      ● “Early termination” technique
          ● If full execution of a query takes too long then terminate and estimate
      ● Termination conditions:
          ● Number of documents – LimitedCollector in Nutch
          ● Time – TimeLimitingCollector (see also extended LUCENE-1720 TimeLimitingIndexReader)
      ● Problems:
          ● Difficult to estimate total hits
          ● Important docs may not be collected if they have high docID-s
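The document-count termination condition can be illustrated with a small stand-alone function. This is a sketch of the general idea only; `collect_with_limit` is a hypothetical name and is not the Nutch LimitedCollector or the Lucene TimeLimitingCollector.

```python
def collect_with_limit(matching_doc_ids, max_docs):
    """Collect hits in doc ID order and stop once max_docs have been collected."""
    collected = []
    for doc_id in matching_doc_ids:
        if len(collected) >= max_docs:
            # Terminated early: the total number of hits is unknown from here on.
            return collected, True
        collected.append(doc_id)
    return collected, False


# Matches arrive in increasing doc ID order, as they would from a postings list.
matches = [0, 3, 6, 9, 12, 998]
hits, terminated = collect_with_limit(matches, max_docs=5)
print(hits)        # [0, 3, 6, 9, 12]
print(terminated)  # True
```

Document 998 is never seen, however relevant it is, which is the "high docID" problem listed on the slide.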
  • 12. Index sorting - details
      ● Define a global ordering of documents (e.g. PageRank, popularity, quality, etc)
          ● Documents with good rank should generally score higher
      ● Sort (internal) ID-s by this ordering, descending
      ● Map from old to new ID-s to follow this ordering
      ● Change the ID-s in postings
      [Diagram]
          original index (early termination == poor)
              doc ID:      0 1 2 3 4 5 6 7
              rank:        c e h f a d g b
          ID mapping
              old doc ID:  4 7 0 5 1 3 6 2
              new doc ID:  0 1 2 3 4 5 6 7
          sorted index (early termination == good)
              doc ID:      0 1 2 3 4 5 6 7
              rank:        a b c d e f g h
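The ID mapping in the diagram can be reproduced with a short sketch. It assumes the letter ranks from the slide, with "a" as the best rank, and a toy postings structure (term to list of doc IDs); `build_id_mapping` and `remap_postings` are hypothetical helper names, not the Nutch IndexSorter API.

```python
def build_id_mapping(ranks):
    """Return (new_to_old, old_to_new) so that new doc ID 0 is the best-ranked document."""
    # In this toy model a smaller rank value means a better document ('a' is best).
    new_to_old = sorted(range(len(ranks)), key=lambda old_id: ranks[old_id])
    old_to_new = [0] * len(ranks)
    for new_id, old_id in enumerate(new_to_old):
        old_to_new[old_id] = new_id
    return new_to_old, old_to_new


def remap_postings(postings, old_to_new):
    """Rewrite doc IDs in every postings list and restore increasing doc ID order."""
    return {term: sorted(old_to_new[d] for d in doc_ids)
            for term, doc_ids in postings.items()}


ranks = list("cehfadgb")  # rank of old doc IDs 0..7, as in the diagram
new_to_old, old_to_new = build_id_mapping(ranks)
print(new_to_old)  # [4, 7, 0, 5, 1, 3, 6, 2]
print(remap_postings({"lucene": [2, 4, 7]}, old_to_new))  # {'lucene': [0, 1, 7]}
```

After remapping, the two best documents for "lucene" sit at the front of the postings list, so a collector that stops early still sees them.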
  • 13. Index sorting - summary
      ● Implementation in Nutch: IndexSorter
          ● Based on PageRank – sorts by decreasing page quality
          ● Uses FilterIndexReader
      ● NOTE: “Early termination” will (significantly) reduce quality of results with non-sorted indexes – use both or neither
  • 14. Index pruning
      ● Quick refresh on the index composition:
          ● Stored fields
          ● Term dictionary
          ● Term frequency data
          ● Positional data (postings)
              ● With or without payload data
          ● Term frequency vectors
      ● Number of documents may be into millions
      ● Number of terms commonly is well into millions
      ● Not to mention individual postings …
  • 15. Index pruning & top-N retrieval
      ● N is usually << 1000
      ● Very often search quality is judged based on top-20
      ● Question:
          ● Do we really need to keep and process ALL terms and ALL postings for a good-quality top-N search for common queries?
  • 16. Index pruning hypothesis
      ● There should be a way to remove some of the less important data
          ● While retaining the quality of top-N results!
      ● Question: what data is less important?
      ● Some answers:
          ● That of poorly-scoring documents
          ● That of common (less selective) terms
      ● Dynamic pruning: skips less relevant data during query processing → runtime cost...
      ● But can we do this work in advance (static pruning)?
  • 17. What do we need for top-N results?
      ● Work backwards
      ● “Foreach” common query:
          ● Run it against the full index
          ● Record the top-N matching documents
      ● “Foreach” document in results:
          ● Record terms and term positions that contributed to the score
      ● Finally: remove all non-recorded postings and terms
      ● First proposed by D. Carmel (2001) for single term queries
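For single-term queries, the "work backwards" recipe reduces to keeping, for each term, only the postings of its N best-scoring documents. The sketch below assumes a toy postings structure (term to a mapping of doc ID to score) and a hypothetical helper `prune_top_n`; it illustrates the idea and is not the published algorithm in full.

```python
def prune_top_n(postings, n):
    """postings: term -> {doc_id: score}. Keep only each term's n best-scoring postings."""
    pruned = {}
    for term, scored_docs in postings.items():
        # Rank by descending score; ties are broken by doc ID to stay deterministic.
        top = sorted(scored_docs.items(), key=lambda item: (-item[1], item[0]))[:n]
        pruned[term] = dict(top)
    return pruned


def top_n(postings, term, n):
    """Doc IDs of the n best matches for a single-term query."""
    ranked = sorted(postings.get(term, {}).items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:n]]


full = {
    "brown": {0: 0.9, 1: 0.2, 2: 0.5},
    "fox": {0: 0.1, 1: 0.8, 2: 0.3},
}
pruned = prune_top_n(full, n=2)
print(pruned)  # {'brown': {0: 0.9, 2: 0.5}, 'fox': {1: 0.8, 2: 0.3}}
print(top_n(full, "brown", 2) == top_n(pruned, "brown", 2))  # True
```

A third of the postings are gone, yet every single-term top-2 result is unchanged, which is exactly the guarantee this construction gives and no more.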
  • 18. … but it's too simplistic:
      [Diagram: a document with positions 0 quick, 1 brown, 2 fox, shown before pruning and after pruning]
          Query 1: brown - topN(full) == topN(pruned)
          Query 2: “brown fox” - topN(full) != topN(pruned)
      ● Hmm, what about less common queries?
          ● 80/20 rule of “good enough”?
      ● Term-level is too primitive
          ● Document-centric pruning
          ● Impact-centric pruning
          ● Position-centric pruning
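The failure mode can be shown with a toy positional index for one document. The slide transcript does not preserve which postings the pruned diagram dropped, so the example assumes that only the posting for "brown" survived; `phrase_match` is a hypothetical helper.

```python
def phrase_match(positions, phrase):
    """positions: term -> list of positions in one document. True if the phrase occurs."""
    first, *rest = phrase
    for start in positions.get(first, []):
        # Every following term must appear at the next consecutive position.
        if all(start + offset in positions.get(term, [])
               for offset, term in enumerate(rest, 1)):
            return True
    return False


full = {"quick": [0], "brown": [1], "fox": [2]}
pruned = {"brown": [1]}  # assumption: term-level pruning kept only "brown"

print(phrase_match(full, ["brown"]), phrase_match(pruned, ["brown"]))  # True True
print(phrase_match(full, ["brown", "fox"]), phrase_match(pruned, ["brown", "fox"]))  # True False
```

The single-term query behaves identically on both versions, while the phrase query silently loses the document.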
  • 19. Smarter pruning
      ● Not all term positions are equally important
      ● Metrics of term and position importance:
          ● Plain in-document term frequency (TF)
          ● TF-IDF score obtained from top-N results of TermQuery (Carmel method)
          ● Residual IDF – measure of term informativeness (selectivity)
          ● Key-phrase positions, or term clusters
          ● Kullback-Leibler divergence from a language model →
      [Diagram: frequency (Freq) plotted per Term for the corpus language model and the document language model]
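One way to turn the Kullback-Leibler metric into a per-term importance score is to look at each term's contribution P(t|d) · log(P(t|d) / P(t|C)) to the divergence between the document model and the corpus model. The sketch below assumes unsmoothed maximum-likelihood models and corpus counts that include the document itself; the function name and the sample counts are made up for illustration.

```python
import math
from collections import Counter


def kl_term_contributions(doc_terms, corpus_counts):
    """Per-term contribution P(t|d) * log(P(t|d) / P(t|C)) to KL(document || corpus)."""
    doc_counts = Counter(doc_terms)
    doc_total = sum(doc_counts.values())
    corpus_total = sum(corpus_counts.values())
    contributions = {}
    for term, count in doc_counts.items():
        p_doc = count / doc_total
        # Assumes every document term also occurs in the corpus counts.
        p_corpus = corpus_counts[term] / corpus_total
        contributions[term] = p_doc * math.log(p_doc / p_corpus)
    return contributions


corpus_counts = {"the": 1000, "index": 90, "lucene": 10}  # made-up corpus statistics
scores = kl_term_contributions(["the", "the", "lucene", "lucene"], corpus_counts)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['lucene', 'the']
```

Terms that are much more frequent in the document than in the corpus come out on top, so a pruning policy built on this score would drop "the" before "lucene".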
  • 20. Applications
      ● Obviously, performance-related
          ● Some papers claim a modest impact on quality when pruning up to 60% of postings
          ● See LUCENE-1812 for some benchmarks confirming this claim
      ● Removal / restructuring of (some) stored content
          ● Legacy indexes, or ones created with a fossilized external chain
  • 21. Stored field pruning
      ● Some stored data can be compacted, removed, or restructured
      ● Use case: source text for generating “snippets”
          ● Split content into sentences
          ● Reorder sentences by a static “importance” score (e.g. how many rare terms they contain)
              ● NOTE: this may use collection wide statistics!
          ● Remove the bottom x% of sentences
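The snippet use case can be sketched as follows. The sentence splitter is a naive regular expression, "rare" is taken to mean a document frequency below a chosen cut-off, and the document-frequency table is invented for the example; `prune_sentences` and its parameters are hypothetical names.

```python
import re


def prune_sentences(text, doc_freq, rare_below, drop_fraction):
    """Reorder sentences by number of distinct rare terms and drop the bottom fraction."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def importance(sentence):
        words = set(re.findall(r"\w+", sentence.lower()))
        # doc_freq holds collection-wide statistics: term -> number of documents.
        return sum(1 for word in words if doc_freq.get(word, 0) < rare_below)

    ranked = sorted(sentences, key=importance, reverse=True)  # stable for equal scores
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]


doc_freq = {"the": 1000, "index": 50, "is": 900, "big": 300, "it": 800, "case": 400,
            "lucene": 3, "prunes": 1, "postings": 5, "quickly": 20}
text = "The index is big. Lucene prunes postings quickly. It is the case."
kept = prune_sentences(text, doc_freq, rare_below=10, drop_fraction=0.34)
print(kept)  # ['Lucene prunes postings quickly.', 'The index is big.']
```

The stored field shrinks, and the sentence most likely to make a useful snippet now comes first.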
  • 22. LUCENE-1812: contrib/pruning tools and API
      ● Based on FilterIndexReader
      ● Produces output indexes via IndexWriter.addIndexes(IndexReader[])
      ● Design:
          ● PruningReader – subclass of FilterIndexReader with necessary boilerplate and hooks for pruning policies
          ● StorePruningPolicy – implements rules for modifying stored fields (and list of field names)
          ● TermPruningPolicy – implements rules for modifying term dictionary, postings and payloads
          ● PruningTool – command-line utility to configure and run PruningReader
  • 23. Details of LUCENE-1812
      [Diagram: source index (stored fields, term dict, postings+payloads, term vectors) → PruningReader, with StorePruningPolicy applied to stored fields and TermPruningPolicy applied to term dict and postings+payloads → IW.addIndexes(IndexReader...) → IndexWriter → target index (stored fields, term dict, postings+payloads, term vectors)]
      ● IndexWriter consumes source data filtered via PruningReader
      ● Internal document ID-s are preserved – suitable for bitset ops and retrieval by internal ID
          ● If source index has no deletions
          ● If target index is empty
  • 24. API: StorePruningPolicy
      ● May remove (some) fields from (some) documents
      ● May as well modify the values
      ● May rename / add fields
  • 25. API: TermPruningPolicy
      ● Thresholds (in the order of precedence):
          ● Per term
          ● Per field
          ● Default
      ● Plain TF pruning – TFTermPruningPolicy
          ● Removes all postings for a term where TF (in-document term frequency) is below a threshold
      ● Top-N term-level – CarmelTermPruningPolicy
          ● TermQuery search for top-N docs
          ● Removes all postings for a term outside the top-N docs
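The threshold precedence and the plain TF rule can be modelled without Lucene. The sketch below is not the TFTermPruningPolicy implementation: the postings structure ((field, term) to a mapping of doc ID to TF) and the names `tf_threshold` and `prune_by_tf` are assumptions made for illustration.

```python
def tf_threshold(field, term, per_term, per_field, default):
    """Resolve the threshold: a per-term setting beats per-field, which beats the default."""
    if (field, term) in per_term:
        return per_term[(field, term)]
    if field in per_field:
        return per_field[field]
    return default


def prune_by_tf(postings, per_term, per_field, default):
    """postings: (field, term) -> {doc_id: tf}. Drop postings whose TF is below threshold."""
    pruned = {}
    for (field, term), docs in postings.items():
        limit = tf_threshold(field, term, per_term, per_field, default)
        kept = {doc_id: tf for doc_id, tf in docs.items() if tf >= limit}
        if kept:  # in this toy model a term with no postings left is dropped entirely
            pruned[(field, term)] = kept
    return pruned


postings = {
    ("body", "lucene"): {0: 1, 1: 3},
    ("body", "the"): {0: 5, 1: 9},
    ("title", "lucene"): {0: 1},
}
result = prune_by_tf(postings,
                     per_term={("body", "the"): 8},
                     per_field={"title": 1},
                     default=2)
print(result)
# {('body', 'lucene'): {1: 3}, ('body', 'the'): {1: 9}, ('title', 'lucene'): {0: 1}}
```

The common word "the" gets a stricter per-term limit, while the short title field keeps even single occurrences.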
  • 26. Results so far...
      ● TF pruning:
          ● Term query recall very good
          ● Phrase query recall very poor – expected...
      ● Carmel pruning – slightly better term position selection, but still heavy negative impact on phrase queries
      ● Recognizing and keeping key phrases would help
          ● Use query log for frequent-phrase mining?
          ● Use collocation miner (Mahout)?
          ● Savings on pruning will be smaller, but quality will significantly improve
  • 27. References
      ● Static Index Pruning for Information Retrieval Systems, Carmel et al, SIGIR'01
      ● A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM'06
      ● Locality-based pruning methods for web search, de Moura et al, ACM TOIS '08
      ● Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM'06
  • 28. Index pruning applied ...
      ● Index 1: A heavily pruned index that fits in RAM:
          ● excellent speed
          ● poor search quality for many less-common query types
      ● Index 2: Slightly pruned index that fits partially in RAM:
          ● good speed, good quality for many common query types,
          ● still poor quality for some other rare query types
      ● Index 3: Full index on disk:
          ● Slow speed
          ● Excellent quality for all query types
      ● QUESTION: Can we come up with a combined search strategy?
  • 29. Tiered search
      [Diagram: three tiers with predict and evaluate steps between them: search box 1 (RAM, 70% pruned), search box 2 (SSD, 30% pruned), search box 3 (HDD, 0% pruned)]
      ● Can we predict the best tier without actually running the query?
      ● How to evaluate if the predictor was right?
  • 30. Tiered search: tier selector and evaluator
     Best tier can be predicted (often enough):
        Carmel pruning yields excellent results for simple term queries
        Phrase-based pruning yields good results for phrase queries (though less often)
     Quality evaluator: when is the predictor wrong?
        Could be very complex, based on gold standard and qrels
        Could be very simple: acceptable number of results
     Fall-back strategy:
        Serial: poor latency, but minimizes load on bulkier tiers
        Partially parallel:
           Submit to the next tier only the border-line queries
           Pick the first acceptable answer – reduces latency
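The serial fall-back strategy with the simplest evaluator (an acceptable number of results) could look roughly like this. It is a sketch in plain Python rather than Lucene or Solr code; `keyword_tier`, `acceptable` and `tiered_search` are hypothetical names, and each tier is modelled as a function from a query string to a list of hits.

```python
def keyword_tier(docs):
    """Build a toy tier: a search function over a small list of documents."""
    return lambda query: [doc for doc in docs if query in doc.split()]


def acceptable(hits, min_hits):
    """Simplest quality evaluator from the slide: enough results came back."""
    return len(hits) >= min_hits


def tiered_search(query, tiers, min_hits=3):
    """Serial fall-back over (name, search_fn) tiers, smallest tier first.

    Returns (tier_name, hits) from the first tier whose answer is acceptable.
    If none is, the answer of the last (full) tier is returned as-is.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, hits = None, []
    for name, search in tiers:
        hits = search(query)
        if acceptable(hits, min_hits):
            break  # good enough: bulkier tiers are never touched
    return name, hits
```

A partially parallel variant would submit border-line queries to the next tier at the same time and take the first acceptable answer; that needs a predictor, which is left out here.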
  • 31. Tiered versus distributed
     Both applicable to indexes and query loads exceeding single machine capabilities
     Distributed sharded search:
        Increases latency for all queries (send + execute + integrate from all shards)
        … plus replicas to increase QPS:
           Increases hardware / management costs
           While not improving latency
     Tiered search:
        Excellent latency for common queries
        More complex to build and maintain
        Arguably lower hardware cost for comparable scale / QPS
  • 32. Tiered search benefits
     Majority of common queries handled by first tier: RAM-based, high QPS, low latency
     Partially parallel mode reduces average latency for more complex queries
     Hardware investment likely smaller than for distributed search setup of comparable QPS / latency
  • 33. Example Lucene API for tiered search
     Could be implemented as a Solr SearchComponent...
  • 34. Lucene implementation details
  • 35. References
     Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR'09
     ResIn: A combination of results caching and index pruning for high-performance web search engines, Skobeltsyn et al., SIGIR'08
     Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW'05
  • 36. Bit-wise search
     Given a bit pattern query: 1010 1001 0101 0001
     Find documents with matching bit patterns in a field
     Applications:
        Permission checking
        De-duplication
        Plagiarism detection
     Two variants: non-scoring (filtering) and scoring
  • 37. Non-scoring bitwise search (LUCENE-2460)
     Builds a Filter from intersection of:
        DocIdSet of documents matching a Query
        Integer value and operation (AND, OR, XOR)
        “Value source” that caches integer values of a field (from FieldCache)
     Corresponding Solr field type and QParser: SOLR-1913
     Useful for filtering (not scoring)
     [Diagram]
        docID   0     1     2     3     4
        flags   0x01  0x02  0x03  0x04  0x05
        type    a     b     b     a     a
        Query “type:a” combined with op=AND, val=0x01 on flags → Filter
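The diagram's example can be replayed on plain Python lists. This is only a sketch of the idea, not the LUCENE-2460 code: `bitwise_filter` is a hypothetical helper, and it assumes that a document passes when the result of applying the operation to its flags and the query value is non-zero.

```python
import operator

# Supported operations, as listed on the slide.
OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def bitwise_filter(candidate_docs, flags, op, value):
    """Intersect a set of candidate doc ids with a bitwise test on their flags.

    Assumption: a doc passes when (flags[doc] op value) != 0.
    """
    combine = OPS[op]
    return [doc for doc in sorted(candidate_docs) if combine(flags[doc], value) != 0]


# Data from the slide: per-doc flags and type values, indexed by docID.
flags = [0x01, 0x02, 0x03, 0x04, 0x05]
types = ["a", "b", "b", "a", "a"]

# Stand-in for the DocIdSet of the query "type:a".
type_a_docs = {doc for doc, t in enumerate(types) if t == "a"}

matching = bitwise_filter(type_a_docs, flags, "AND", 0x01)
```

Under that assumption, docs 0 and 4 pass (both are of type "a" and have the lowest bit set), while doc 3 is of type "a" but has flags 0x04 and is filtered out.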
  • 38. Scoring bitwise search (SOLR-1918)
     BooleanQuery in disguise:
        1010 = Y-1000 | N-0100 | Y-0010 | N-0001
     Solr 32-bit BitwiseField
        Analyzer creates the bitmasks field
        Currently supports only single value per field
        Creates BooleanQuery from query int value
     Useful when searching for best matching (ranked) bit patterns
     [Diagram]
        docID   D1     D2     D3
        flags   1010   1011   0011
        bits    Y1000  Y1000  N1000
                N0100  N0100  N0100
                Y0010  Y0010  Y0010
                N0001  Y0001  Y0001
        Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001
        Results:
           D1 matches 4 of 4 → #1
           D2 matches 3 of 4 → #2
           D3 matches 2 of 4 → #3
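The token scheme can be reproduced in a few lines of Python. This is an illustrative sketch, not the SOLR-1918 analyzer: `bit_tokens` and `rank_by_matching_bits` are hypothetical names, a 4-bit width is used to match the slide, and the score is simply the number of matching tokens, whereas a real BooleanQuery would apply Lucene's own scoring.

```python
def bit_tokens(value, width=4):
    """Turn an integer into one token per bit: Y<mask> if set, N<mask> if not."""
    tokens = []
    for bit in range(width - 1, -1, -1):
        mask = 1 << bit
        prefix = "Y" if value & mask else "N"
        tokens.append(f"{prefix}{mask:0{width}b}")
    return tokens


def rank_by_matching_bits(query_value, docs, width=4):
    """Rank docs by how many of the query's bit tokens they contain.

    Assumption: score = count of shared tokens, like a BooleanQuery of
    optional clauses where each matching clause contributes equally.
    """
    query = set(bit_tokens(query_value, width))
    scored = [
        (name, len(query & set(bit_tokens(value, width))))
        for name, value in docs.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


ranking = rank_by_matching_bits(0b1010, {"D1": 0b1010, "D2": 0b1011, "D3": 0b0011})
```

For the slide's data this gives D1 with 4 matching tokens, D2 with 3 and D3 with 2, the same order as in the diagram.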
  • 39. Summary
     Index post-processing covers a range of useful scenarios:
        Merging and splitting, remodeling, extracting, moving ...
        Pruning less important data
     Tiered search + pruned indexes:
        High performance
        Practically unchanged quality
        Less hardware
     Bitwise search:
        Filtering by matching bits
        Ranking by best matching patterns
  • 40. Meta-summary
     Stir your imagination
     Think outside the box
     Show some unorthodox use and practical applications
     Close ties to scalability, performance, distributed search and query latency
  • 42. Thank you!
  • 43. Massive indexing with map-reduce
     Map-reduce indexing models
        Google model
        Nutch model
        Modified Nutch model
        Hadoop contrib/indexing model
     Tradeoff analysis and recommendations
  • 44. Google model
     Map():
        IN: <seq, docText>
        terms = analyze(docText)
        foreach (term)
           emit(term, <seq,position>)
     Reduce():
        IN: <term, list(<seq,pos>)>
        foreach (<seq,pos>)
           docId = calculate(seq, taskId)
           Postings(term).append(docId, pos)
     Pros: analysis on the map side
     Cons:
        Too many tiny intermediate records → Combiner
        DocID synchronization across map and reduce tasks
        Lucene: very difficult (impossible?) to create index this way
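A single-process imitation of this model shows the data flow: the map side analyzes text and emits one record per term occurrence, and the reduce side assembles postings. Everything here is a sketch under assumptions: the whitespace `analyze`, the `calculate_doc_id` formula (task offset plus sequence number) and the in-memory shuffle are illustrative stand-ins, not part of any real framework.

```python
from collections import defaultdict


def analyze(doc_text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return doc_text.lower().split()


def map_document(seq, doc_text):
    """Map: emit (term, (seq, position)) for every term occurrence."""
    for position, term in enumerate(analyze(doc_text)):
        yield term, (seq, position)


def calculate_doc_id(seq, task_id, docs_per_task=1000):
    """Hypothetical docId scheme: each task owns a block of doc ids."""
    return task_id * docs_per_task + seq


def reduce_term(term, occurrences, task_id=0):
    """Reduce: build the posting list for one term as (docId, position) pairs."""
    return [(calculate_doc_id(seq, task_id), pos) for seq, pos in sorted(occurrences)]


def build_postings(docs, task_id=0):
    """Run map, an in-memory shuffle (group by term), then reduce."""
    grouped = defaultdict(list)
    for seq, text in docs:
        for term, occurrence in map_document(seq, text):
            grouped[term].append(occurrence)
    return {term: reduce_term(term, occs, task_id) for term, occs in grouped.items()}


postings = build_postings([(0, "Lucene index"), (1, "index pruning index")])
```

Even in this toy version the number of intermediate records equals the number of term occurrences, which is the "too many tiny records" cost the slide mentions.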
  • 45. Nutch model (also in SOLR-1301)
     Map():
        IN: <seq, docPart>
        docId = docPart.get(“url”)
        emit(docId, docPart)
     Reduce():
        IN: <docId, list(docPart)>
        doc = luceneDoc(list(docPart))
        indexWriter.addDocument(doc)
     Pros: easy to build Lucene index
     Cons:
        Analysis on the reduce side
        Many costly merge operations (large indexes built from scratch on reduce side)
        (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
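The same kind of single-process imitation works for this model: map only re-keys document parts by URL, and all analysis and indexing happens in reduce. The dict-based document parts, the whitespace tokenization and the term-to-URL index are assumptions for illustration; they stand in for `luceneDoc(...)` and `indexWriter.addDocument(...)`.

```python
from collections import defaultdict


def map_part(seq, doc_part):
    """Map: key each document part by its URL; no analysis happens here."""
    yield doc_part["url"], doc_part


def reduce_document(doc_id, parts, index):
    """Reduce: assemble one document from its parts, analyze it, and index it."""
    doc = {}
    for part in parts:
        doc.update(part)  # stand-in for luceneDoc(list(docPart))
    for field, value in doc.items():
        if field == "url":
            continue
        for term in value.lower().split():  # analysis on the reduce side
            index[term].add(doc_id)  # stand-in for indexWriter.addDocument(doc)


def build_index(doc_parts):
    """Run map, an in-memory shuffle (group by docId), then reduce."""
    grouped = defaultdict(list)
    for seq, part in enumerate(doc_parts):
        for doc_id, value in map_part(seq, part):
            grouped[doc_id].append(value)
    index = defaultdict(set)
    for doc_id, parts in grouped.items():
        reduce_document(doc_id, parts, index)
    return dict(index)


index = build_index([
    {"url": "http://a", "content": "Lucene rocks"},
    {"url": "http://a", "anchor": "search library"},
    {"url": "http://b", "content": "Hadoop rocks"},
])
```

The two parts for `http://a` arrive separately and are only joined in reduce, which is why this model suits pipelines that first need to collect document parts.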
  • 46. Modified Nutch model (N/A...)
     Map():
        IN: <seq, docPart>
        docId = docPart.get(“url”)
        ts = analyze(docPart)
        emit(docId, <docPart,ts>)
     Reduce():
        IN: <docId, list(<docPart,ts>)>
        doc = luceneDoc(list(<docPart,ts>))
        indexWriter.addDocument(doc)
     Pros:
        Analysis on map side
        Easy to build Lucene index
     Cons:
        Many costly merge operations (large indexes built from scratch on reduce side)
        (plus currently needs copy from local FS to HDFS – see LUCENE-2373)
  • 47. Hadoop contrib/indexing model
     Map():
        IN: <seq, docText>
        doc = luceneDoc(docText)
        indexWriter.addDocument(doc)
        emit(random, indexData)
     Reduce():
        IN: <random, list(indexData)>
        foreach (indexData)
           indexWriter.addIndexes(indexData)
     Pros:
        Analysis on the map side
        Many merges on the map side
        Supports also other operations (deletes, updates)
     Cons:
        Serialization is costly, records are big and require more RAM to sort
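In this model the map side already produces index data and the reduce side only merges it. A toy Python version, with small term-to-docs dicts standing in for serialized index segments, might look like the following. The function names, the seeded random key and the dict-based "index data" are assumptions for the sketch, not the contrib/index API.

```python
import random


def map_to_index_data(seq, doc_text, num_reducers, rng):
    """Map: analyze and index one document, emit (random key, index data)."""
    index_data = {}
    for term in doc_text.lower().split():
        index_data.setdefault(term, set()).add(seq)
    yield rng.randrange(num_reducers), index_data


def reduce_merge(index_parts):
    """Reduce: merge small indexes into one (stand-in for addIndexes)."""
    merged = {}
    for part in index_parts:
        for term, doc_ids in part.items():
            merged.setdefault(term, set()).update(doc_ids)
    return merged


def build_shards(docs, num_reducers=2, seed=0):
    """Run map, group by the random key, then merge per reducer."""
    rng = random.Random(seed)
    grouped = {}
    for seq, text in docs:
        for key, index_data in map_to_index_data(seq, text, num_reducers, rng):
            grouped.setdefault(key, []).append(index_data)
    return {key: reduce_merge(parts) for key, parts in grouped.items()}


shards = build_shards([(0, "lucene index"), (1, "index pruning"), (2, "tiered search")])
```

Because the key is random, documents spread across output shards; the intermediate records are whole index fragments, which is where the serialization and sorting cost comes from.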
  • 48. Massive indexing - summary
     If you first need to collect document parts → SOLR-1301 model
     If you use complex analysis → Hadoop contrib/index
        NOTE: there is no good integration yet of Solr and Hadoop contrib/index module...