Presented by Andrzej Bialecki, LucidWorks
This session presents a set of Solr components for easy management of "sidecar indexes" – indexes that extend the main index with additional stored and/or indexed fields. Conceptually this can be viewed as an extension of the ExternalFileField, or as a static join between documents from two collections. This functionality is useful in applications that require very different update regimes for the two parts of the index (e.g. mostly static catalogue items combined with quickly changing clickthrough data).
3. About the speaker
• Started using Lucene in 2003 (1.2-dev…)
• Created Luke – the Lucene Index Toolbox
• Apache Nutch, Hadoop, Solr committer; Lucene PMC member
• LucidWorks engineer
5. Challenge: incremental document updates
• Incremental update (field-level update): modification of a part of a document
• Sounds like fundamentally useful functionality!
• But Lucene / Solr doesn't offer true field-level updates (yet!)
– "Update" is really a sequence of "retrieve old document, update fields, add updated document, delete old document"
– "Atomic update" functionality in Solr is (useful) syntactic sugar
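As a reminder of what that syntactic sugar looks like on the wire, a Solr atomic update request body uses per-field modifiers such as set and inc, while the server still re-indexes the whole document underneath (field names here are illustrative):

```json
[{"id": "doc1",
  "popularity": {"inc": 1},
  "in_stock": {"set": false}}]
```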
6. Common use cases for field updates
• Documents logically composed of two parts with different update schedules
– E.g. mostly static documents with some quickly changing fields
• Two different classes of data in the changing fields
– Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns
– Text fields: e.g. reviews, tags, click-through feedback, user profiles
• Challenge: how to integrate these modifications with the main index content?
– Re-indexing whole documents isn't always an option
7. True full-text (inverted fields) incremental updates
• Very complex issue, with broad impact on many Lucene internals
– The inverted index structure is not optimized for partial document updates
– At least another 6-12 months away?
• LUCENE-4258 – work in progress
8. Handling updates via full re-index
• If the corpus is small, or incremental updates are infrequent… just re-index everything!
• Pros:
– Relatively easy to implement – update the source documents and re-index
– Allows adding all types of data, including e.g. labels as searchable text
• Cons:
– Infeasible for larger corpora or frequent updates, both time-wise and cost-wise
– Requires keeping the source documents around
• Sometimes inconvenient, e.g. when documents are assembled in a complex pipeline
9. Handling updates via Solr’s ExternalFileField
• Pros:
– Simple to implement
– Updates are easy – just file edits, no need to re-index
• Cons:
– Only docId => field : number mappings
– Not suitable for full-text searchable field updates
• E.g. can't support user-generated labels attached to a doc
– Still useful if a simple "popularity"-type metric is sufficient
• Internally implemented as an in-memory ValueSource usable by function queries
Example external file:
doc0=1.5
doc1=2.5
doc2=0.5
…
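For reference, a typical ExternalFileField setup (field and file names here are illustrative) pairs a schema.xml field type with a plain key=value file, named external_<fieldname>, in the index data directory; the values are then usable in function queries and sorting:

```xml
<!-- schema.xml: a numeric field whose values live outside the index -->
<fieldType name="extPopularity" keyField="id" defVal="0"
           class="solr.ExternalFileField" valType="pfloat"/>
<field name="popularity" type="extPopularity" indexed="false" stored="false"/>
```

The file (e.g. data/external_popularity, with the doc0=1.5 lines shown above) can be swapped out at any time and is reloaded without re-indexing.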
10. Numeric DocValues updates
• Since Lucene/Solr 4.6 … to be released Really Soon
• Details can be found in LUCENE-5189
• As simple as:
indexWriter.updateNumericDocValue(term, field, value)
• Neatly solves the problem of numeric updates: popularity, in-stock status, etc.
• Some limitations:
– Massive updates are still somewhat costly until the next merge (like deletes)
– Can only update existing fields
• Obviously doesn't address full-text inverted field updates
11. Lucene ParallelReader overview
• Pretends that two or more IndexReader-s are slices of the same index
– Slices contain data for different fields
– Both stored and inverted parts are supported
– Data for matching docs is joined on the fly
• Structure of all indexes MUST match 1:1 !!!
– The same number of segments
– The same count of docs per segment
– Internal document ID-s must match 1:1
– The list of deletes is taken from the first index
• Sounds cool … but in practice it's rarely used:
– It's very difficult to meet these requirements
– Even more so in the presence of index updates and merges
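For context, the raw Lucene API is roughly the following sketch (in Lucene 4.x the class is ParallelCompositeReader, the successor of the old ParallelReader; directory paths are illustrative, and both indexes must already satisfy the 1:1 structure requirements above):

```java
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelCompositeReader;
import org.apache.lucene.store.FSDirectory;

// Sketch only: the two indexes must match segment-for-segment
// and docId-for-docId, or construction fails.
DirectoryReader main = DirectoryReader.open(FSDirectory.open(new File("main-index")));
DirectoryReader parallel = DirectoryReader.open(FSDirectory.open(new File("parallel-index")));
// Fields f1, f2 are served from 'main'; f3, f4 from 'parallel'.
IndexReader joined = new ParallelCompositeReader(main, parallel);
```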
[Figure: a ParallelReader presents fields f1, f2, f3, f4… as one index; segments and internal docIds of the main IR (fields f1, f2) and the parallel IR (fields f3, f4) line up 1:1]
12. Handling updates via ParallelReader
• Pros:
– All types of data (e.g. searchable full-text labels) can be added
• Cons:
– Must ensure that the other index always matches the structure of the main index
– Complicated and fragile (rebuild on every update?)
– No tools to manage this parallel index in Solr
[Figure: ParallelReader over a main IR (fields f1, f2) and a parallel IR (fields f3, f4) with identical segment layout]
13. Sidecar Index Components for Solr
• Uses the ParallelReader strategy for field updates
– "Main" and "sidecar" data come from two different Solr collections
– The "sidecar" collection is updated independently of the main collection
– The "sidecar" collection is used as a source of document fields for building and updating a parallel index
• Integrates the management of the ParallelReader ("sidecar index") into Solr
– Initial creation of the ParallelReader, including synchronization of internal ID-s
– Tracking of updates and IndexReader.reopen(…) events
• Partly based on a version of the Click Framework in LucidWorks Search
• Available under the Apache License here: http://github.com/LucidWorks/sidecar_index
16. “Main”, “sidecar” collections and parallel index
• The "main" collection contains only the parts of documents with "main" fields
• The "sidecar" collection is a source of documents with "sidecar" fields
• SidecarIndexReaderFactory creates and maintains the parallel index (the sidecar index)
• The "main" collection uses a SidecarIndexReader that acts as a ParallelReader
• The main index is updated as usual, via the "main" collection's IndexWriter
[Figure: inside Solr, Main_collection's SidecarIndexReader combines the main index with a sidecar index built from Sidecar_collection]
17. Implementation details
• SidecarIndexReaderFactory extends Solr's IndexReaderFactory
– newReader(Directory, SolrCore) – initial open
– newReader(IndexWriter, SolrCore) – NRT open
• SidecarIndexReader acts like a ParallelReader
– Solr wants a DirectoryReader, but ParallelReader is not a DirectoryReader
– Basically had to re-implement the logic from ParallelReader
• ParallelReader challenges:
– How to synchronize internal ID-s?
– How to create segments of the same size as those of the main index?
– How to handle deleted documents?
– How to handle updates to the main index?
– How to handle updates to the sidecar data?
18. ParallelReader challenges and solutions
• How to synchronize internal ID-s?
– The "main" collection is traversed sequentially by internal docId
– The primary key is retrieved for each document
– The matching document is found in the "sidecar" collection
– The matching document is added to the "sidecar" index
• This is a very costly phase!
– Random seek and retrieval from the "sidecar" collection
– Primary key lookup is fast …
– … but stored field retrieval and indexing isn't
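The traversal above can be sketched with plain data structures (all names here are hypothetical; the real code works against Lucene readers and stored fields):

```java
import java.util.*;

// Hypothetical sketch: rebuild sidecar documents in main-index docId order.
// mainKeys holds the primary key of each main document, indexed by internal
// docId; sidecarByKey maps primary key -> sidecar fields for that document.
public class SidecarAlign {
    public static List<String> align(List<String> mainKeys,
                                     Map<String, String> sidecarByKey) {
        List<String> sidecarInDocIdOrder = new ArrayList<>();
        for (String key : mainKeys) {                 // sequential docId scan
            String fields = sidecarByKey.get(key);    // random lookup by key
            // a missing sidecar document becomes a dummy, so docIds stay 1:1
            sidecarInDocIdOrder.add(fields != null ? fields : "<dummy>");
        }
        return sidecarInDocIdOrder;
    }

    public static void main(String[] args) {
        List<String> mainKeys = Arrays.asList("D", "B", "A");
        Map<String, String> sidecar = new HashMap<>();
        sidecar.put("A", "f3=x");
        sidecar.put("D", "f3=y");
        // B has no sidecar data, so it gets a dummy slot
        System.out.println(align(mainKeys, sidecar));
    }
}
```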
[Figure: the main index is scanned in docId order (keys D, B, A, F, C, G, E); each key is looked up in the sidecar collection (e.g. q=id:D) and the matching document is appended to the sidecar index in the same order]
19. ParallelReader challenges and solutions
• Optimization 1: don't rebuild data for unmodified segments
• Optimization 2 (cheating): ignore NRT segments
• How to handle deleted docs?
– Insert dummy (empty) documents so that the number and the order of documents still match
[Figure: ParallelReader over main and sidecar indexes; a deleted document (X) in a main segment is matched by a dummy document in the sidecar index, and the NRT segment of the main IR is ignored]
20. Implementation: SidecarMergePolicy
• How to create segments of the same size as those of the "main" index?
• Carefully manage the "sidecar" index creation:
– The IndexWriter uses SerialMergeScheduler to prevent out-of-order merges
– Force a flush when reaching the next target count of documents
– Merges are enforced using a SidecarMergePolicy that tracks the sizes of the "main" index segments
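The flush bookkeeping amounts to accumulating target doc counts from the main index segments (a hypothetical sketch; the real SidecarMergePolicy hooks into Lucene's merge machinery rather than computing a list up front):

```java
import java.util.*;

// Hypothetical sketch: given main-index segment sizes, compute the cumulative
// doc counts at which the sidecar IndexWriter should be force-flushed so that
// sidecar segments line up 1:1 with main segments.
public class FlushPoints {
    public static List<Integer> flushPoints(int[] mainSegmentSizes) {
        List<Integer> points = new ArrayList<>();
        int total = 0;
        for (int size : mainSegmentSizes) {
            total += size;
            points.add(total);   // flush after this many docs are added
        }
        return points;
    }

    public static void main(String[] args) {
        // e.g. main segments of 4, 2 and 1 docs -> flush after 4, 6 and 7
        System.out.println(flushPoints(new int[]{4, 2, 1}));
    }
}
```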
[Figure: SidecarMergePolicy forces sidecar segments to match the main index segment sizes – target sizes: Seg0 – 4 docs, Seg1 – 2 docs, Seg2 – 1 doc]
21. Implementation: SidecarIndexReader
• Re-implements the logic of ParallelReader
– ParallelReader != DirectoryReader
• Exposes the Directory of the "main" index for replication
– Replicas need the "sidecar" collection replica to rebuild the sidecar index locally
– If document routing and shard placement are the same then we don't have to use distributed search – all data will be local
• reopen(…) avoids rebuilding unmodified segments
• reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary
– When there's a major merge in the "main" index
– When the "sidecar" data is updated
• Ref-counting of IndexReaders at different levels is very tricky!
22. Example configuration in solrconfig.xml
<indexReaderFactory name="IndexReaderFactory"
class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
<str name="docIdField">id</str>
<str name="sourceCollection">source</str>
<bool name="enabled">true</bool>
</indexReaderFactory>
23. Example use case: integration of click-through data
• Raw click-through data:
– query, query_time, docId, click_time [, user]
• Aggregated click-through data:
– User-generated popularity score: F(number and timing of clicks per docId)
• Numeric updates
– User-generated labels: F(top-N queries that led to clicks on docId)
• Full-text searchable updates
– User profiles: F(top-N queries per user, top-N docId-s clicked, etc.)
– …
• Queries can now be expanded to score based on TF/IDF in user-generated labels
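One plausible shape for such a popularity function F (the talk doesn't prescribe a formula; the half-life and names here are hypothetical) is an exponentially time-decayed click count:

```java
import java.util.*;

// Hypothetical example of F(number and timing of clicks per docId):
// an exponentially time-decayed click count with a 7-day half-life.
public class ClickScore {
    public static final double HALF_LIFE_MS = 7L * 24 * 3600 * 1000;

    public static double score(List<Long> clickTimes, long now) {
        double s = 0.0;
        for (long t : clickTimes) {
            // each click contributes 2^(-age / halfLife)
            s += Math.pow(2.0, -(now - t) / HALF_LIFE_MS);
        }
        return s;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // one fresh click plus one exactly a half-life old -> 1.0 + 0.5
        System.out.println(score(Arrays.asList(now, now - (long) HALF_LIFE_MS), now));
    }
}
```

Scores like this would be written into the sidecar collection as numeric fields, while the query labels would go in as full-text fields.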
26. Scalability and performance
• Initial full rebuild is very costly
– ~0.6 ms / document
– 1 mln docs = 600 sec = 10 min
– Not even close to "real time" …
• The cost of processing new segments in the "main" index depends on the size of those segments
• Major merge events will trigger a full rebuild
• BUT: search-time cost is negligible
27. Caveats
• The combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track
– The sidecar code is still unstable and occasionally explodes
• Performance of the full rebuild quickly becomes the bottleneck with frequent updates
– So the main use case is massive but infrequent updates of the "sidecar" data
• Code: http://github.com/LucidWorks/sidecar_index
• Fixes and contributions are welcome – the code is Apache licensed