SlideShare uma empresa Scribd logo
1 de 66
Baixar para ler offline
Is Your IndexReader Really
  Atomic or Maybe Slow?
         Uwe Schindler
    SD DataSolutions GmbH,
 uschindler@sd-datasolutions.de

                                  1
My Background
• I am committer and PMC member of Apache Lucene and Solr. My
  main focus is on development of Lucene Java.
• Implemented fast numerical search and maintaining the new
  attribute-based text analysis API. Well known as Generics and
  Sophisticated Backwards Compatibility Policeman.
• Working as consultant and software architect for SD DataSolutions
  GmbH in Bremen, Germany. The main task is maintaining PANGAEA
  (Publishing Network for Geoscientific & Environmental Data) where
  I implemented the portal's geo-spatial retrieval functions with
  Apache Lucene Core.
• Talks about Lucene at various international conferences like the
  previous Lucene Revolution, Lucene Eurocon, ApacheCon EU/US,
  Berlin Buzzwords, and various local meetups.

                                                                      2
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   3
Lucene Index Structure
• Lucene was the first full text search engine
  that supported document additions and
  updates
• Snapshot isolation ensures consistency


          Segmented index structure
  Committing changes creates new segments
                                                 4
Segments in Lucene




                      • Each index consists of various segments
                        placed in the index directory. All documents
                        are added to new new segment files, merged with other
                        on-disk files after flushing.
Image: Doug Cutting




                                   *) The term “optimal” does not mean all indexes must be optimized!
                                                                                                  5
Segments in Lucene




                      • Each index consists of various segments
                        placed in the index directory. All documents
                        are added to new new segment files, merged with other
                        on-disk files after flushing.
                      • Lucene writes segments incrementally and can merge
                        them.
Image: Doug Cutting




                                   *) The term “optimal” does not mean all indexes must be optimized!
                                                                                                  6
Segments in Lucene




                      • Each index consists of various segments
                        placed in the index directory. All documents
                        are added to new new segment files, merged with other
                        on-disk files after flushing.
                      • Lucene writes segments incrementally and can merge
                        them.
Image: Doug Cutting




                      • Optimized* index consists of one segment.
                                   *) The term “optimal” does not mean all indexes must be optimized!
                                                                                                  7
Lucene merges while indexing
   all of English Wikipedia




                                    Video by courtesy of:
                                                    8
                Mike McCandless’ blog, http://goo.gl/kI53f
Indexes in Lucene (up to version 3.6)
• Each segment (“atomic index”) is a completely
  functional index:
  – SegmentReader implements the IndexReader
    interface for single segments
• Composite indexes
  – DirectoryReader implements the IndexReader
    interface on top of a set of SegmentReaders
  – MultiReader is an abstraction of multiple
    IndexReaders combined to one virtual index

                                                  9
Composite Indexes (up to version 3.6)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate global
    document IDs -> segment document IDs (binary
    search)
  – FieldCache: duplicate instances for single segments
    and composite view (memory!!!)
                                                          10
Merging Term Index and Postings
                         Term 1

Term 1   Term 1          Term 2

Term 2   Term 3          Term 3

Term 4   Term 5          Term 4

Term 7   Term 6          Term 5

Term 8   Term 8          Term 6

Term 9   Term 9          Term 7
                         Term 8
                         Term 9


                                   11
Merging Term Index and Postings
                         Term 1

Term 1   Term 1          Term 2

Term 2   Term 3          Term 3

Term 4   Term 5          Term 4

Term 7   Term 6          Term 5

Term 8   Term 8          Term 6

Term 9   Term 9          Term 7
                         Term 8
                         Term 9


                                   12
Composite Indexes (up to version 3.6)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate
    global document IDs -> segment document IDs
    (binary search)
  – FieldCache: duplicate instances for single
    segments and composite view (memory!!!)
                                                   13
Searching before Version 2.9
• IndexSearcher used the underlying index
  always as a “single” (atomic) index:
  – Queries are executed on the atomic view of a
    composite index
  – Slowdown for queries that scan term dictionary
    (MultiTermQuery) or hit lots of documents,
    facetting




                                                          Image: © The Walt Disney Company
    => recommendation to “optimize” index
  – On every index change, FieldCache used for
    sorting had to be reloaded completely

                                                     14
Lucene 2.9 and later:
             Per segment search

Search is executed separately on each index segment:




             Atomic view no longer used!
                                                       15
Per Segment Search: Pros
• No live term dictionary merging
• Possibility to parallelize
   – ExecutorService in IndexSearcher since Lucene 3.1+
   – Do not optimize to make this work!

• Sorting only needs per-segment FieldCache
   – Cheap reopen after index changes!

• Filter cache per segment
   – Cheap reopen after index changes!
                                                          16
Per Segment Search: Cons
• Query/Filter API changes:
  – Scorer / Filter’s DocIdSet no longer use global
    document IDs
• Slower sorting by string terms
  – Term ords are only comparable inside each
    segment
  – String comparisons needed after segment
    traversal
  – Use numeric sorting if possible!!!
    (Lucene supports missing values since version 3.4 [buggy], corrected 3.5+)

                                                                                 17
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   18
Composite Indexes (up to version 3.6)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate global
    document IDs -> segment document IDs (binary
    search)
  – FieldCache: duplicate instances for single segments
    and composite view (memory!!!)
                                                          19
Composite Indexes (version 4.0)
Atomic “views” on multi-segment index:
  – Term Dictionary: on-the-fly merged & sorted term
    index (priority queue for TermEnum,…)
  – Postings: posting lists for each term appended,
    convert document IDs to be global
  – Metadata: doc frequency, term frequency,…
  – Stored fields, term vectors, deletions: delegate global
    document IDs -> segment document IDs (binary
    search)
  – FieldCache: duplicate instances for single segments
    and composite view (memory!!!)
                                                          20
Early Lucene 4.0
                                   • Only “historic” IndexReader interface
                                     available since Lucene 1.0
                                   • 80% of all methods of composite IndexReaders
                                     throwed UnsupportedOperationException
                                     – This affected all user-facing APIs
                                       (SegmentReader was hidden / marked experimental)
                                   • No compile time safety!
Image: © The Walt Disney Company




                                     – Query Scorers and Filters need term dictionary and
                                       postings, throwing UOE when executed on composite
                                       reader


                                                                                          21
Image: © The Walt Disney Company




                                   Heavy Committing™ !!!




 22
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)




                                                               23
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!




                                                               24
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!




                                                                25
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…




                                                                26
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…

•   Unmodifiable: No more deletions / changing of norms possible!
     – Applies to all readers in Lucene 4.0 !!!




                                                                    27
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…

•   Unmodifiable: No more deletions / changing of norms possible!
     – Applies to all readers in Lucene 4.0 !!!

•   Passed to IndexSearcher
     – just as before!

                                                                    28
IndexReader
•   Most generic (abstract) API to an index
     – superclass of more specific types
     – cannot be subclassed directly (no public constructor)

•   Does not know anything about “inverted index” concept
     – no terms, no postings!

•   Access to stored fields (and term vectors) by document ID
     – to display / highlight search results only!

•   Some very limited metadata
     – number of documents,…

•   Unmodifiable: No more deletions / changing of norms possible!
     – Applies to all readers in Lucene 4.0 !!!

•   Passed to IndexSearcher
     – just as before!

•   Allows IndexReader.open() for backwards compatibility (deprecated)   29
AtomicReader
• Inherits from IndexReader




                              30
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)




                                                 31
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)
• Full term dictionary and postings API




                                                 32
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)
• Full term dictionary and postings API




                                                 33
AtomicReader
• Inherits from IndexReader
• Access to “atomic” indexes (single segments)
• Full term dictionary and postings API
• Access to DocValues (new in Lucene 4.0) and
  norms



                                                 34
CompositeReader
• No additional functionality on top of
  IndexReader




                                          35
CompositeReader
• No additional functionality on top of
  IndexReader
• Provides getSequentialSubReaders() to
  retrieve all child readers




                                          36
CompositeReader
• No additional functionality on top of
  IndexReader
• Provides getSequentialSubReaders() to
  retrieve all child readers
• DirectoryReader and MultiReader implement
  this class



                                          37
DirectoryReader
• Abstract class, defines interface for:




                                           38
DirectoryReader
• Abstract class, defines interface for:
   – access to on-disk indexes (on top of Directory class)
   – access to commit points, index metadata, index
     version, isCurrent() for reopen support
   – defines abstract openIfChanged (for cheap reopening
     of indexes)
   – child readers are always AtomicReader instances




                                                         39
DirectoryReader
• Abstract class, defines interface for:
   – access to on-disk indexes (on top of Directory class)
   – access to commit points, index metadata, index version,
     isCurrent() for reopen support
   – defines abstract openIfChanged (for cheap reopening of
     indexes)
   – child readers are always AtomicReader instances
• Provides static factory methods for opening indexes
   – well-known from IndexReader in Lucene 1 to 3
   – factories return internal DirectoryReader implementation
     (StandardDirectoryReader with SegmentReaders as childs)

                                                               40
Basic Search Example
Image: © The Walt Disney Company




                                       Looks familiar, doesn’t it?

                                                                     41
“Arrgh! I still need terms and postings on my
DirectoryReader!!! What should I do? Optimize
to only have one segment? Help me please!!!”
          (example question on java-user@lucene.apache.org)




                                                              42
What to do?

• Calm down
• Take a break and drink a beer*
• Don’t optimize / force merge your index!!!




     *) Robert’s solution                      43
Solutions
• Most efficient way:
  – Retrieve atomic leaves from your composite:
    reader.getTopReaderContext().leaves()
  – Iterate over sub-readers, do the work
    (possibly parallelized)
  – Merge results




                                                  44
Solutions
• Most efficient way:
  – Retrieve atomic leaves from your composite:
    reader.getTopReaderContext().leaves()
  – Iterate over sub-readers, do the work
    (possibly parallelized)
  – Merge results


• Otherwise: wrap your CompositeReader:

                                                  45
(Slow) Solution

AtomicReader
  SlowCompositeReaderWrapper.wrap(IndexReader r)


• Wraps IndexReaders of any kind as atomic reader,
  providing terms, postings, deletions, doc values
   – Internally uses same algorithms like previous Lucene readers
   – Segment-merging uses this to merge segments, too




                                                                    46
(Slow) Solution

AtomicReader
  SlowCompositeReaderWrapper.wrap(IndexReader r)


• Wraps IndexReaders of any kind as atomic reader,
  providing terms, postings, deletions, doc values
   – Internally uses same algorithms like previous Lucene readers
   – Segment-merging uses this to merge segments, too


• Solr always provides an AtomicReader for convenience
  through SolrIndexSearcher. Plugin writers should use:
AtomicReader
  rd = mySolrIndexSearcher.getAtomicReader()
                                                                    47
Other readers
• FilterAtomicReader
  – Was FilterIndexReader in 3.x
    (but now solely works on atomic readers)
  – Allows to filter terms, postings, deletions
  – Useful for index splitters (e.g., PKIndexSplitter,
    MultiPassIndexSplitter)
    (provide own getLiveDocs() method, merge to IndexWriter)

• ParallelAtomicReader, -CompositeReader


                                                               48
IndexReader Reopening
• Reopening solely provided by directory-based
  DirectoryReader instances




                                             49
IndexReader Reopening
• Reopening solely provided by directory-based
  DirectoryReader instances
• No more reopen for:
  – AtomicReader: they are atomic, no refresh possible
  – MultiReader: reopen child readers separately, create
    new MultiReader on top of reopened readers
  – Parallel*Reader, FilterAtomicReader: reopen
    wrapped readers, create new wrapper afterwards


                                                           50
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   51
IndexReaderContext
AtomicReaderContext
CompositeReaderContext


WTF ?!?
                         52
The Problem
              SuperComposite: MultiReader


    CompositeA:                     CompositeB:
   DirectoryReader                DirectoryReader
Atomic0   Atomic1   Atomic2    Atomic0   Atomic1   Atomic2

Doc0      Doc0      Doc0        Doc0     Doc0      Doc0
Doc1      Doc1      Doc1        Doc1     Doc1      Doc1
Doc2      Doc2      Doc2        Doc2     Doc2      Doc2
Doc3      Doc3      Doc3        Doc3     Doc3      Doc3




                                                             53
The Problem
              SuperComposite: MultiReader


    CompositeA:                     CompositeB:
   DirectoryReader                DirectoryReader
Atomic0   Atomic1   Atomic2    Atomic0   Atomic1   Atomic2

Doc0      Doc4      Doc8        Doc0     Doc0      Doc0
Doc1      Doc5      Doc9        Doc1     Doc1      Doc1
Doc2      Doc6      Doc10       Doc2     Doc2      Doc2
Doc3      Doc7      Doc11       Doc3     Doc3      Doc3




                                                             54
The Problem
              SuperComposite: MultiReader


    CompositeA:                     CompositeB:
   DirectoryReader                DirectoryReader
Atomic0   Atomic1   Atomic2    Atomic0   Atomic1   Atomic2

Doc0      Doc4      Doc8        Doc12    Doc16     Doc20

Doc1      Doc5      Doc9        Doc13    Doc17     Doc21

Doc2      Doc6      Doc10       Doc14    Doc18     Doc22

Doc3      Doc7      Doc11       Doc15    Doc19     Doc23




                                                             55
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()




                                             56
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct
  descendants) and leaves (down to lowest level
  AtomicReaders)




                                                        57
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct
  descendants) and leaves (down to lowest level
  AtomicReaders)
• Each atomic leave has a docBase (document ID
  offset)




                                                    58
Solution
• Each IndexReader instance provides:
  IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct
  descendants) and leaves (down to lowest level
  AtomicReaders)
• Each atomic leave has a docBase (document ID offset)
• IndexSearcher passes a context instance relative to its
  own top-level reader to each Query-Scorer / Filter
   – allows to access complete reader tree up to the current top-
     level reader
   – Allows to get the “global” document ID
                                                              59
Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

                                   60
Summary


• Up to Lucene 3.6 indexes are still accessible on
  any hierarchy level with same interface
• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     61
Summary
• Lucene moved from global searches to per-
  segment search in Lucene 2.9
• Up to Lucene 3.6 indexes are still accessible on
  any hierarchy level with same interface
• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     62
Summary


• Up to Lucene 3.6 indexes are still accessible on
  any hierarchy level with same interface
• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     63
Summary




• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     64
Summary




• Lucene 4.0 will split the IndexReader class into
  several abstract interfaces
• IndexReaderContexts will support per-segment
  search preserving top-level document IDs
                                                     65
Contact
                  Uwe Schindler
                uschindler@apache.org
                http://www.thetaphi.de
                        @thetaph1




    SD DataSolutions GmbH
         Wätjenstr. 49
    28213 Bremen, Germany
      +49 421 40889785-0
http://www.sd-datasolutions.de

                                         66

Mais conteúdo relacionado

Mais procurados

le guide swebok
le guide swebokle guide swebok
le guide swebok
sammiiaa
 
Søren nielsen ihc pitfalls - leica 2011
Søren nielsen ihc pitfalls - leica 2011Søren nielsen ihc pitfalls - leica 2011
Søren nielsen ihc pitfalls - leica 2011
Leica-Microsystems
 

Mais procurados (20)

Muduo network library
Muduo network libraryMuduo network library
Muduo network library
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
MongoDB Overview
MongoDB OverviewMongoDB Overview
MongoDB Overview
 
Schema on read with runtime fields
Schema on read with runtime fieldsSchema on read with runtime fields
Schema on read with runtime fields
 
le guide swebok
le guide swebokle guide swebok
le guide swebok
 
Selenium sunum
Selenium sunumSelenium sunum
Selenium sunum
 
Doctorat bouderba
Doctorat bouderbaDoctorat bouderba
Doctorat bouderba
 
Søren nielsen ihc pitfalls - leica 2011
Søren nielsen ihc pitfalls - leica 2011Søren nielsen ihc pitfalls - leica 2011
Søren nielsen ihc pitfalls - leica 2011
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
evaluacion ecografica del hombro. ultrasonido musculoesqueletico
evaluacion ecografica del hombro. ultrasonido musculoesqueleticoevaluacion ecografica del hombro. ultrasonido musculoesqueletico
evaluacion ecografica del hombro. ultrasonido musculoesqueletico
 
Einführung in Suchmaschinen und Solr
Einführung in Suchmaschinen und SolrEinführung in Suchmaschinen und Solr
Einführung in Suchmaschinen und Solr
 
Vagrant e Docker a confronto;scegliere ed iniziare
Vagrant e  Docker a confronto;scegliere ed iniziareVagrant e  Docker a confronto;scegliere ed iniziare
Vagrant e Docker a confronto;scegliere ed iniziare
 
01 la programmation batch - les debuts
01   la programmation batch - les debuts01   la programmation batch - les debuts
01 la programmation batch - les debuts
 
Linux FEDORA
Linux FEDORALinux FEDORA
Linux FEDORA
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Getting started with DSpace 7 REST API
Getting started with DSpace 7 REST APIGetting started with DSpace 7 REST API
Getting started with DSpace 7 REST API
 
Introduction to Smart Data Models
Introduction to Smart Data ModelsIntroduction to Smart Data Models
Introduction to Smart Data Models
 
High Performance JavaScript - WebDirections USA 2010
High Performance JavaScript - WebDirections USA 2010High Performance JavaScript - WebDirections USA 2010
High Performance JavaScript - WebDirections USA 2010
 
آموزش سیستم های عامل - بخش سوم
آموزش سیستم های عامل - بخش سومآموزش سیستم های عامل - بخش سوم
آموزش سیستم های عامل - بخش سوم
 
The Cybercriminal Underground: Understanding and categorising criminal market...
The Cybercriminal Underground: Understanding and categorising criminal market...The Cybercriminal Underground: Understanding and categorising criminal market...
The Cybercriminal Underground: Understanding and categorising criminal market...
 

Destaque

Proposal for nested document support in Lucene
Proposal for nested document support in LuceneProposal for nested document support in Lucene
Proposal for nested document support in Lucene
Mark Harwood
 

Destaque (7)

Lucene with Bloom filtered segments
Lucene with Bloom filtered segmentsLucene with Bloom filtered segments
Lucene with Bloom filtered segments
 
Patterns for large scale search
Patterns for large scale searchPatterns for large scale search
Patterns for large scale search
 
Lucene KV-Store
Lucene KV-StoreLucene KV-Store
Lucene KV-Store
 
Proposal for nested document support in Lucene
Proposal for nested document support in LuceneProposal for nested document support in Lucene
Proposal for nested document support in Lucene
 
Grouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/SolrGrouping and Joining in Lucene/Solr
Grouping and Joining in Lucene/Solr
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Scaling Solr with Solr Cloud
Scaling Solr with Solr CloudScaling Solr with Solr Cloud
Scaling Solr with Solr Cloud
 

Semelhante a Is Your Index Reader Really Atomic or Maybe Slow?

Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
rcmuir
 
Introduction to couchbase
Introduction to couchbaseIntroduction to couchbase
Introduction to couchbase
Dipti Borkar
 

Semelhante a Is Your Index Reader Really Atomic or Maybe Slow? (20)

Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Elasticsearch Architechture
Elasticsearch ArchitechtureElasticsearch Architechture
Elasticsearch Architechture
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Elastic search
Elastic searchElastic search
Elastic search
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Glusterfs session #9 index xlator
Glusterfs session #9   index xlatorGlusterfs session #9   index xlator
Glusterfs session #9 index xlator
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Introduction to couchbase
Introduction to couchbaseIntroduction to couchbase
Introduction to couchbase
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Lucene 101
Lucene 101Lucene 101
Lucene 101
 
Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.
 
IntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and PerformanceIntelliJ IDEA Architecture and Performance
IntelliJ IDEA Architecture and Performance
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 

Mais de lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution
 

Mais de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Is Your Index Reader Really Atomic or Maybe Slow?

  • 1. Is Your IndexReader Really Atomic or Maybe Slow? Uwe Schindler SD DataSolutions GmbH, uschindler@sd-datasolutions.de 1
  • 2. My Background • I am committer and PMC member of Apache Lucene and Solr. My main focus is on development of Lucene Java. • Implemented fast numerical search and maintaining the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility Policeman. • Working as consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data) where I implemented the portal's geo-spatial retrieval functions with Apache Lucene Core. • Talks about Lucene at various international conferences like the previous Lucene Revolution, Lucene Eurocon, ApacheCon EU/US, Berlin Buzzwords, and various local meetups. 2
  • 3. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 3
  • 4. Lucene Index Structure • Lucene was the first full text search engine that supported document additions and updates • Snapshot isolation ensures consistency  Segmented index structure  Committing changes creates new segments 4
  • 5. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. Image: Doug Cutting *) The term “optimal” does not mean all indexes must be optimized! 5
  • 6. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. • Lucene writes segments incrementally and can merge them. Image: Doug Cutting *) The term “optimal” does not mean all indexes must be optimized! 6
  • 7. Segments in Lucene • Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing. • Lucene writes segments incrementally and can merge them. Image: Doug Cutting • Optimized* index consists of one segment. *) The term “optimal” does not mean all indexes must be optimized! 7
  • 8. Lucene merges while indexing all of English Wikipedia Video by courtesy of: 8 Mike McCandless’ blog, http://goo.gl/kI53f
  • 9. Indexes in Lucene (up to version 3.6) • Each segment (“atomic index”) is a completely functional index: – SegmentReader implements the IndexReader interface for single segments • Composite indexes – DirectoryReader implements the IndexReader interface on top of a set of SegmentReaders – MultiReader is an abstraction of multiple IndexReaders combined to one virtual index 9
  • 10. Composite Indexes (up to version 3.6) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 10
  • 11. Merging Term Index and Postings Term 1 Term 1 Term 1 Term 2 Term 2 Term 3 Term 3 Term 4 Term 5 Term 4 Term 7 Term 6 Term 5 Term 8 Term 8 Term 6 Term 9 Term 9 Term 7 Term 8 Term 9 11
  • 12. Merging Term Index and Postings Term 1 Term 1 Term 1 Term 2 Term 2 Term 3 Term 3 Term 4 Term 5 Term 4 Term 7 Term 6 Term 5 Term 8 Term 8 Term 6 Term 9 Term 9 Term 7 Term 8 Term 9 12
  • 13. Composite Indexes (up to version 3.6) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 13
  • 14. Searching before Version 2.9 • IndexSearcher used the underlying index always as a “single” (atomic) index: – Queries are executed on the atomic view of a composite index – Slowdown for queries that scan term dictionary (MultiTermQuery) or hit lots of documents, facetting Image: © The Walt Disney Company => recommendation to “optimize” index – On every index change, FieldCache used for sorting had to be reloaded completely 14
  • 15. Lucene 2.9 and later: Per segment search Search is executed separately on each index segment: Atomic view no longer used! 15
  • 16. Per Segment Search: Pros • No live term dictionary merging • Possibility to parallelize – ExecutorService in IndexSearcher since Lucene 3.1+ – Do not optimize to make this work! • Sorting only needs per-segment FieldCache – Cheap reopen after index changes! • Filter cache per segment – Cheap reopen after index changes! 16
  • 17. Per Segment Search: Cons • Query/Filter API changes: – Scorer / Filter’s DocIdSet no longer use global document IDs • Slower sorting by string terms – Term ords are only comparable inside each segment – String comparisons needed after segment traversal – Use numeric sorting if possible!!! (Lucene supports missing values since version 3.4 [buggy], corrected 3.5+) 17
  • 18. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 18
  • 19. Composite Indexes (up to version 3.6) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 19
  • 20. Composite Indexes (version 4.0) Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term index (priority queue for TermEnum,…) – Postings: posting lists for each term appended, convert document IDs to be global – Metadata: doc frequency, term frequency,… – Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search) – FieldCache: duplicate instances for single segments and composite view (memory!!!) 20
  • 21. Early Lucene 4.0 • Only “historic” IndexReader interface available since Lucene 1.0 • 80% of all methods of composite IndexReaders throwed UnsupportedOperationException – This affected all user-facing APIs (SegmentReader was hidden / marked experimental) • No compile time safety! Image: © The Walt Disney Company – Query Scorers and Filters need term dictionary and postings, throwing UOE when executed on composite reader 21
  • 22. Image: © The Walt Disney Company Heavy Committing™ !!! 22
  • 23. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) 23
  • 24. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! 24
  • 25. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! 25
  • 26. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… 26
  • 27. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… • Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! 27
  • 28. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… • Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! • Passed to IndexSearcher – just as before! 28
  • 29. IndexReader • Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor) • Does not know anything about “inverted index” concept – no terms, no postings! • Access to stored fields (and term vectors) by document ID – to display / highlight search results only! • Some very limited metadata – number of documents,… • Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!! • Passed to IndexSearcher – just as before! • Allows IndexReader.open() for backwards compatibility (deprecated) 29
  • 31. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) 31
  • 32. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) • Full term dictionary and postings API 32
  • 33. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) • Full term dictionary and postings API 33
  • 34. AtomicReader • Inherits from IndexReader • Access to “atomic” indexes (single segments) • Full term dictionary and postings API • Access to DocValues (new in Lucene 4.0) and norms 34
  • 35. CompositeReader • No additional functionality on top of IndexReader 35
  • 36. CompositeReader • No additional functionality on top of IndexReader • Provides getSequentialSubReaders() to retrieve all child readers 36
  • 37. CompositeReader • No additional functionality on top of IndexReader • Provides getSequentialSubReaders() to retrieve all child readers • DirectoryReader and MultiReader implement this class 37
  • 38. DirectoryReader • Abstract class, defines interface for: 38
  • 39. DirectoryReader • Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening of indexes) – child readers are always AtomicReader instances 39
  • 40. DirectoryReader • Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening of indexes) – child readers are always AtomicReader instances • Provides static factory methods for opening indexes – well-known from IndexReader in Lucene 1 to 3 – factories return internal DirectoryReader implementation (StandardDirectoryReader with SegmentReaders as childs) 40
  • 41. Basic Search Example Image: © The Walt Disney Company Looks familiar, doesn’t it? 41
  • 42. “Arrgh! I still need terms and postings on my DirectoryReader!!! What should I do? Optimize to only have one segment? Help me please!!!” (example question on java-user@lucene.apache.org) 42
  • 43. What to do? • Calm down • Take a break and drink a beer* • Don’t optimize / force merge your index!!! *) Robert’s solution 43
  • 44. Solutions • Most efficient way: – Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves() – Iterate over sub-readers, do the work (possibly parallelized) – Merge results 44
  • 45. Solutions • Most efficient way: – Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves() – Iterate over sub-readers, do the work (possibly parallelized) – Merge results • Otherwise: wrap your CompositeReader: 45
  • 46. (Slow) Solution AtomicReader SlowCompositeReaderWrapper.wrap(IndexReader r) • Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too 46
  • 47. (Slow) Solution AtomicReader SlowCompositeReaderWrapper.wrap(IndexReader r) • Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too • Solr always provides an AtomicReader for convenience through SolrIndexSearcher. Plugin writers should use: AtomicReader rd = mySolrIndexSearcher.getAtomicReader() 47
  • 48. Other readers • FilterAtomicReader – Was FilterIndexReader in 3.x (but now solely works on atomic readers) – Allows to filter terms, postings, deletions – Useful for index splitters (e.g., PKIndexSplitter, MultiPassIndexSplitter) (provide own getLiveDocs() method, merge to IndexWriter) • ParallelAtomicReader, -CompositeReader 48
  • 49. IndexReader Reopening • Reopening solely provided by directory-based DirectoryReader instances 49
  • 50. IndexReader Reopening • Reopening solely provided by directory-based DirectoryReader instances • No more reopen for: – AtomicReader: they are atomic, no refresh possible – MultiReader: reopen child readers separately, create new MultiReader on top of reopened readers – Parallel*Reader, FilterAtomicReader: reopen wrapped readers, create new wrapper afterwards 50
  • 51. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 51
  • 53. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReader Atomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2 Doc0 Doc0 Doc0 Doc0 Doc0 Doc0 Doc1 Doc1 Doc1 Doc1 Doc1 Doc1 Doc2 Doc2 Doc2 Doc2 Doc2 Doc2 Doc3 Doc3 Doc3 Doc3 Doc3 Doc3 53
  • 54. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReader Atomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2 Doc0 Doc4 Doc8 Doc0 Doc0 Doc0 Doc1 Doc5 Doc9 Doc1 Doc1 Doc1 Doc2 Doc6 Doc10 Doc2 Doc2 Doc2 Doc3 Doc7 Doc11 Doc3 Doc3 Doc3 54
  • 55. The Problem SuperComposite: MultiReader CompositeA: CompositeB: DirectoryReader DirectoryReader Atomic0 Atomic1 Atomic2 Atomic0 Atomic1 Atomic2 Doc0 Doc4 Doc8 Doc12 Doc16 Doc20 Doc1 Doc5 Doc9 Doc13 Doc17 Doc21 Doc2 Doc6 Doc10 Doc14 Doc18 Doc22 Doc3 Doc7 Doc11 Doc15 Doc19 Doc23 55
  • 56. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() 56
  • 57. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() • The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) 57
  • 58. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() • The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) • Each atomic leave has a docBase (document ID offset) 58
  • 59. Solution • Each IndexReader instance provides: IndexReaderContext getTopReaderContext() • The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders) • Each atomic leave has a docBase (document ID offset) • IndexSearcher passes a context instance relative to its own top-level reader to each Query-Scorer / Filter – allows to access complete reader tree up to the current top- level reader – Allows to get the “global” document ID 59
  • 60. Agenda • Motivation / History of Lucene • AtomicReader & CompositeReader • Reader contexts • Wrap up 60
  • 61. Summary • Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 61
  • 62. Summary • Lucene moved from global searches to per- segment search in Lucene 2.9 • Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 62
  • 63. Summary • Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 63
  • 64. Summary • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 64
  • 65. Summary • Lucene 4.0 will split the IndexReader class into several abstract interfaces • IndexReaderContexts will support per-segment search preserving top-level document IDs 65
  • 66. Contact Uwe Schindler uschindler@apache.org http://www.thetaphi.de @thetaph1 SD DataSolutions GmbH Wätjenstr. 49 28213 Bremen, Germany +49 421 40889785-0 http://www.sd-datasolutions.de 66