Lucene EuroCon 10 presentation on index post-processing (splitting, merging, sorting, pruning), tiered search, bitwise search, and a few slides on MapReduce indexing models (I ran out of time to show them, but they are there...)
1. Munching & crunching
Lucene index post-processing and applications
Andrzej Białecki
<andrzej.bialecki@lucidimagination.com>
2. Intro
Started using Lucene in 2003 (1.2-dev?)
Created Luke – the Lucene Index Toolbox
Nutch, Hadoop committer, Lucene PMC member
Nutch project lead
3. Munching and crunching? But really...
Stir your imagination
Think outside the box
Show some unorthodox use and practical applications
Close ties to scalability, performance, distributed search and query latency
5. Why post-process indexes?
Isn't it better to build them right from the start?
Sometimes it's not convenient or feasible
Correcting impact of unexpected common words
Targeting a specific index size or composition:
Creating evenly-sized shards
Re-balancing shards across servers
Fitting indexes completely in RAM
… and sometimes impossible to do it right
Trimming index size while retaining quality of top-N results
Apache Lucene EuroCon 20 May 2010
6. Merging indexes
It's easy to merge several small indexes into one
A fundamental Lucene operation during indexing (SegmentMerger)
Command-line utilities exist: IndexMergeTool
API:
IndexWriter.addIndexes(IndexReader...)
IndexWriter.addIndexesNoOptimize(Directory...)
Hopefully a more flexible API on the flex branch
Solr: through CoreAdmin action=mergeindexes
Note: schema must be compatible
7. Splitting indexes
[diagram: original index (segments_2, segments _0 _1 _2) split into new single-segment indexes, each with its own segments_0]
IndexSplitter tool:
Moves whole segments to standalone indexes
Pros: nearly no I/O or CPU involved – just rename segment files and create a new SegmentInfos file
Cons:
Requires a multi-segment index!
Very limited control over the content of the resulting indexes → MergePolicy
8. Splitting indexes, take 2
[diagram: source documents d1–d4 with in-memory deletion marks del1/del2; pass 1 produces an index with d1 d2, pass 2 an index with d3 d4]
MultiPassIndexSplitter tool:
Uses an IndexReader that keeps the list of deletions in memory
The source index remains unmodified
For each partition:
Marks all source documents not in the partition as deleted
Writes a target split using IndexWriter.addIndexes(IndexReader)
IndexWriter knows how to skip deleted documents
Removes the “deleted” mark from all source documents
Pros:
Arbitrary splits possible (even partially overlapping)
Source index remains intact
Cons:
Reads the complete index N times – I/O is O(N * indexSize)
Takes twice as much space (the source index remains intact)
… but maybe that's a feature?
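The per-partition pass described above can be sketched in plain Java. This is an illustrative stand-in, not the actual MultiPassIndexSplitter code: plain lists replace Lucene's readers and writers, and the class and method names are invented for the example.

```java
import java.util.*;

// Toy model of the MultiPassIndexSplitter idea: for each target partition,
// mark every document outside it as deleted (in memory only), copy the
// survivors, and leave the source untouched.
public class MultiPassSplitSketch {
    // docs: the source "index"; partitions: sets of source doc IDs per target
    public static List<List<String>> split(List<String> docs,
                                           List<Set<Integer>> partitions) {
        List<List<String>> results = new ArrayList<>();
        for (Set<Integer> partition : partitions) {       // one full pass per partition
            boolean[] deleted = new boolean[docs.size()]; // deletions kept in memory
            for (int i = 0; i < docs.size(); i++)
                deleted[i] = !partition.contains(i);      // mark docs outside the partition
            List<String> target = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++)
                if (!deleted[i]) target.add(docs.get(i)); // addIndexes() skips deleted docs
            results.add(target);
            // nothing to undo: the marks never touched the source index
        }
        return results;
    }
}
```

Because partitions are arbitrary sets, overlapping splits fall out for free, at the cost of reading the whole source once per partition.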
9. Splitting indexes, take 3
[diagram: source index (stored fields, term dict, postings+payloads, term vectors) fed through a partitioner; e.g. docs 1 3 5… and 2 4 6… become docs 1' 2' 3' 4' 5' 6'… in the new indexes after renumbering]
SinglePassSplitter:
Uses the same processing workflow as SegmentMerger, only with multiple outputs
Writes new SegmentInfos and FieldInfos
Merges (pass-through) stored fields
Merges (pass-through) the term dictionary
Merges (pass-through) postings with payloads
Merges (pass-through) term vectors
Renumbers document IDs on the fly to form a contiguous space
Pros: flexibility as with MultiPassIndexSplitter
Status: work started, to be contributed soon...
10. Splitting indexes, summary
SinglePassSplitter – best tradeoff of flexibility/IO/CPU
Interesting scenarios with SinglePassSplitter:
Split by ranges, round-robin, by field value, by frequency, to a target size, etc...
“Extract” a handful of documents to a separate index
“Move” documents between indexes:
“extract” from the source
add to the target (merge)
delete from the source
Now the source index may reside on a network FS – the amount of I/O is O(1 * indexSize)
11. Index sorting - introduction
“Early termination” technique
If full execution of a query takes too long, terminate early and estimate the remaining results
Termination conditions:
Number of documents – LimitedCollector in Nutch
Time – TimeLimitingCollector
(see also extended LUCENE-1720 TimeLimitingIndexReader)
Problems:
Difficult to estimate total hits
Important docs may not be collected if they have high docIDs
12. Index sorting - details
[diagram: original index with doc IDs 0–7 and ranks c e h f a d g b (early termination == poor); ID mapping old doc IDs 4 7 0 5 1 3 6 2 → new doc IDs 0–7; sorted index with ranks a b c d e f g h (early termination == good)]
Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
Documents with good rank should generally score higher
Sort (internal) IDs by this ordering, descending
Map from old to new IDs to follow this ordering
Change the IDs in the postings
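The old-to-new ID mapping above can be sketched as follows. This is illustrative Java, not Nutch's IndexSorter; the class and method names are invented, and plain arrays stand in for the index structures.

```java
import java.util.*;

// Sketch of the index-sorting idea: sort internal doc IDs by descending rank,
// build an old→new ID mapping, and rewrite a postings list through it.
public class IndexSortSketch {
    // rank[oldId] = static score (e.g. PageRank); returns map where map[oldId] = newId
    public static int[] buildMapping(double[] rank) {
        Integer[] order = new Integer[rank.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(rank[b], rank[a])); // best rank first
        int[] oldToNew = new int[rank.length];
        for (int newId = 0; newId < order.length; newId++)
            oldToNew[order[newId]] = newId;
        return oldToNew;
    }

    // remap one term's postings and re-sort so they stay in increasing doc-ID order
    public static int[] remapPostings(int[] postings, int[] oldToNew) {
        int[] remapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) remapped[i] = oldToNew[postings[i]];
        Arrays.sort(remapped);
        return remapped;
    }
}
```

After remapping, a collector that terminates early still sees the best-ranked documents first, because they now occupy the lowest internal IDs.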
13. Index sorting - summary
Implementation in Nutch: IndexSorter
Based on PageRank – sorts by decreasing page quality
Uses FilterIndexReader
NOTE: “Early termination” will (significantly) reduce the quality of results with non-sorted indexes – use both or neither
14. Index pruning
Quick refresh on the index composition:
Stored fields
Term dictionary
Term frequency data
Positional data (postings)
With or without payload data
Term frequency vectors
The number of documents may run into the millions
The number of terms is commonly well into the millions
Not to mention individual postings …
15. Index pruning & top-N retrieval
N is usually << 1000
Very often search quality is judged based on top-20
Question:
Do we really need to keep and process ALL terms and ALL postings for a good-quality top-N search for common queries?
16. Index pruning hypothesis
There should be a way to remove some of the less important data
While retaining the quality of top-N results!
Question: what data is less important?
Some answers:
That of poorly-scoring documents
That of common (less selective) terms
Dynamic pruning skips less relevant data during query processing → a runtime cost...
But can we do this work in advance (static pruning)?
17. What do we need for top-N results?
Work backwards
“Foreach” common query:
Run it against the full index
Record the top-N matching documents
“Foreach” document in results:
Record terms and term positions that contributed to the score
Finally: remove all non-recorded postings and terms
First proposed by D. Carmel (2001) for single-term queries
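The "work backwards" procedure above can be sketched in plain Java. This is a toy model under simplifying assumptions: common queries are single terms, the "score" of a posting is just its TF, and the index is a nested map; the class and names are invented for the example.

```java
import java.util.*;

// Toy sketch of Carmel-style static pruning: for each common (single-term)
// query, record the term's top-N postings, then drop everything that was
// never recorded - including terms no common query touches.
public class CarmelPruneSketch {
    // index: term -> (docId -> tf); returns a pruned copy
    public static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> index,
            List<String> commonQueries, int n) {
        Map<String, Map<Integer, Integer>> pruned = new HashMap<>();
        for (String term : commonQueries) {
            Map<Integer, Integer> postings = index.get(term);
            if (postings == null) continue;
            postings.entrySet().stream()
                    .sorted((a, b) -> b.getValue() - a.getValue()) // "score" = plain TF here
                    .limit(n)                                      // keep only the top-N docs
                    .forEach(e -> pruned.computeIfAbsent(term, t -> new TreeMap<>())
                                        .put(e.getKey(), e.getValue()));
        }
        return pruned; // non-recorded postings and unqueried terms are gone
    }
}
```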
18. … but it's too simplistic:
[diagram: a document with terms quick (0), brown (1), fox (2), shown before and after pruning]
Query 1: brown – topN(full) == topN(pruned)
Query 2: “brown fox” – topN(full) != topN(pruned)
Hmm, what about less common queries?
80/20 rule of “good enough”?
Term-level pruning is too primitive; alternatives:
Document-centric pruning
Impact-centric pruning
Position-centric pruning
19. Smarter pruning
[diagram: term frequency curves for corpus, document, and key-phrase language models]
Not all term positions are equally important
Metrics of term and position importance:
Plain in-document term frequency (TF)
TF-IDF score obtained from top-N results of TermQuery (Carmel method)
Residual IDF – a measure of term informativeness (selectivity)
Key-phrase positions, or term clusters
Kullback-Leibler divergence from a language model
20. Applications
Obviously, performance-related
Some papers claim a modest impact on quality when pruning up to 60% of postings
See LUCENE-1812 for some benchmarks confirming this claim
Removal / restructuring of (some) stored content
Legacy indexes, or ones created with a fossilized external chain
21. Stored field pruning
Some stored data can be compacted, removed, or restructured
Use case: source text for generating “snippets”
Split content into sentences
Reorder sentences by a static “importance” score (e.g. how many rare terms they contain)
NOTE: this may use collection-wide statistics!
Remove the bottom x% of sentences
22. LUCENE-1812: contrib/pruning tools and API
Based on FilterIndexReader
Produces output indexes via IndexWriter.addIndexes(IndexReader[])
Design:
PruningReader – subclass of FilterIndexReader with the necessary boilerplate and hooks for pruning policies
StorePruningPolicy – implements rules for modifying stored fields (and the list of field names)
TermPruningPolicy – implements rules for modifying the term dictionary, postings and payloads
PruningTool – command-line utility to configure and run PruningReader
23. Details of LUCENE-1812
[diagram: source index (stored fields, term dict, postings+payloads, term vectors) → PruningReader (StorePruningPolicy, TermPruningPolicy) → IndexWriter.addIndexes(IndexReader...) → target index]
IndexWriter consumes source data filtered via PruningReader
Internal document IDs are preserved – suitable for bitset ops and retrieval by internal ID:
If the source index has no deletions
If the target index is empty
24. API: StorePruningPolicy
May remove (some) fields from (some) documents
May also modify the values
May rename / add fields
25. API: TermPruningPolicy
Thresholds (in the order of precedence):
Per term
Per field
Default
Plain TF pruning – TFTermPruningPolicy
Removes all postings for a term where the TF (in-document term frequency) is below a threshold
Top-N term-level – CarmelTermPruningPolicy
TermQuery search for the top-N docs
Removes all postings for a term outside the top-N docs
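The plain TF policy above amounts to a simple threshold filter over a term's postings. A minimal sketch (illustrative names, not the TFTermPruningPolicy code; a map stands in for the postings list):

```java
import java.util.*;

// Sketch of plain TF pruning: keep a term's posting for a document only when
// the in-document term frequency meets the threshold.
public class TfPruningSketch {
    // postings: docId -> tf for one term; minTf: the resolved threshold
    // (per term, falling back to per field, falling back to the default)
    public static Map<Integer, Integer> prune(Map<Integer, Integer> postings, int minTf) {
        Map<Integer, Integer> kept = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : postings.entrySet())
            if (e.getValue() >= minTf) kept.put(e.getKey(), e.getValue());
        return kept;
    }
}
```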
26. Results so far...
TF pruning:
Term query recall very good
Phrase query recall very poor – expected...
Carmel pruning – slightly better term position selection, but still a heavy negative impact on phrase queries
Recognizing and keeping key phrases would help
Use query log for frequent-phrase mining?
Use collocation miner (Mahout)?
Savings on pruning will be smaller, but quality will significantly improve
27. References
Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR'01
A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM'06
Locality-based pruning methods for web search, de Moura et al., ACM TOIS '08
Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM'06
28. Index pruning applied ...
Index 1: A heavily pruned index that fits in RAM:
excellent speed
poor search quality for many less-common query types
Index 2: Slightly pruned index that fits partially in RAM:
good speed, good quality for many common query types,
still poor quality for some other rare query types
Index 3: Full index on disk:
Slow speed
Excellent quality for all query types
QUESTION: Can we come up with a combined search strategy?
29. Tiered search
[diagram: a predictor routes the query to a tier, an evaluator checks the answer – search box 1 (RAM, 70% pruned), search box 2 (SSD, 30% pruned), search box 3 (HDD, 0% pruned)]
Can we predict the best tier without actually running the query?
How do we evaluate whether the predictor was right?
30. Tiered search: tier selector and evaluator
The best tier can be predicted (often enough):
Carmel pruning yields excellent results for simple term queries
Phrase-based pruning yields good results for phrase queries (though less often)
Quality evaluator: when is predictor wrong?
Could be very complex, based on gold standard and qrels
Could be very simple: acceptable number of results
Fall-back strategy:
Serial: poor latency, but minimizes load on bulkier tiers
Partially parallel:
submit to the next tier only the border-line queries
Pick the first acceptable answer – reduces latency
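The serial fall-back strategy above can be sketched in a few lines of Java. This is a hypothetical illustration, not a Lucene or Solr API: tiers are modeled as plain functions, and the evaluator is the simple "acceptable number of results" check from the slide.

```java
import java.util.*;
import java.util.function.Function;

// Sketch of serial tiered fall-back: try the cheapest tier first and
// escalate to a bulkier tier only when the answer looks unacceptable.
public class TieredSearchSketch {
    public static List<String> search(String query,
                                      List<Function<String, List<String>>> tiers,
                                      int minAcceptableHits) {
        List<String> hits = Collections.emptyList();
        for (Function<String, List<String>> tier : tiers) { // ordered cheap -> expensive
            hits = tier.apply(query);
            if (hits.size() >= minAcceptableHits) return hits; // evaluator is satisfied
        }
        return hits; // answer from the last (full) tier, however small
    }
}
```

The partially parallel variant would instead submit borderline queries to the next tier concurrently and take the first acceptable answer, trading load for latency.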
31. Tiered versus distributed
Both applicable to indexes and query loads exceeding a single machine's capabilities
Distributed sharded search:
increases latency for all queries (send + execute + integrate from all shards)
… plus replicas to increase QPS:
Increases hardware / management costs
While not improving latency
Tiered search:
Excellent latency for common queries
More complex to build and maintain
Arguably lower hardware cost for comparable scale / QPS
32. Tiered search benefits
The majority of common queries are handled by the first tier: RAM-based, high QPS, low latency
Partially parallel mode reduces average latency for more complex queries
The hardware investment is likely smaller than for a distributed search setup of comparable QPS / latency
33. Example Lucene API for tiered search
Could be implemented as a Solr SearchComponent...
35. References
Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR'09
ResIn: A combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al., SIGIR'08
Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW'05
36. Bit-wise search
Given a bit pattern query:
1010 1001 0101 0001
Find documents with matching bit patterns in a field
Applications:
Permission checking
De-duplication
Plagiarism detection
Two variants: non-scoring (filtering) and scoring
37. Non-scoring bitwise search (LUCENE-2460)
[diagram: docIDs 0–4 with flags 0x01–0x05 and type a/b; the query “type:a” combined with op=AND val=0x01 produces a Filter]
Builds a Filter from the intersection of:
A DocIdSet of documents matching a Query
An integer value and operation (AND, OR, XOR)
A “value source” that caches the integer values of a field (from FieldCache)
Corresponding Solr field type and QParser: SOLR-1913
Useful for filtering (not scoring)
38. Scoring bitwise search (SOLR-1918)
[diagram: documents D1–D3 with flags 1010, 1011, 0011 indexed as bit terms Y1000/N0100/Y0010/N0001 etc.; query Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001; D1 matches 4 of 4 → #1, D2 matches 3 of 4 → #2, D3 matches 2 of 4 → #3]
A BooleanQuery in disguise: 1010 = Y-1000 | N-0100 | Y-0010 | N-0001
Solr 32-bit BitwiseField:
The analyzer creates the bitmask terms
Currently supports only a single value per field
Creates a BooleanQuery from the query's integer value
Useful when searching for best-matching (ranked) bit patterns
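The ranking behind the D1–D3 example above is just "count the bit positions where the document agrees with the query" – each set bit contributes a Y-term, each clear bit an N-term, and the score is the number of matching pseudo-terms. A plain-Java sketch (illustrative, not the SOLR-1918 code):

```java
// Sketch of the "BooleanQuery in disguise" scoring: every bit of the query
// value becomes a Y (set) or N (clear) pseudo-term; a document scores one
// point per pseudo-term its own bit pattern satisfies.
public class BitwiseScoreSketch {
    public static int score(int queryBits, int docBits, int width) {
        int matches = 0;
        for (int i = 0; i < width; i++) {
            boolean qBit = ((queryBits >> i) & 1) == 1;
            boolean dBit = ((docBits >> i) & 1) == 1;
            if (qBit == dBit) matches++; // Y-term hit if both set, N-term hit if both clear
        }
        return matches;
    }
}
```

Sorting documents by this score reproduces the slide's ranking: D1 (4/4) ahead of D2 (3/4) ahead of D3 (2/4).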
39. Summary
Index post-processing covers a range of useful scenarios:
Merging and splitting, remodeling, extracting, moving ...
Pruning less important data
Tiered search + pruned indexes:
High performance
Practically unchanged quality
Less hardware
Bitwise search:
Filtering by matching bits
Ranking by best matching patterns
40. Meta-summary
Stir your imagination
Think outside the box
Show some unorthodox use and practical applications
Close ties to scalability, performance, distributed search and query latency
43. Massive indexing with map-reduce
Map-reduce indexing models
Google model
Nutch model
Modified Nutch model
Hadoop contrib/indexing model
Tradeoff analysis and recommendations
44. Google model
Map():
  IN: <seq, docText>
  terms = analyze(docText)
  foreach (term)
    emit(term, <seq, position>)
Reduce():
  IN: <term, list(<seq, pos>)>
  foreach (<seq, pos>)
    docId = calculate(seq, taskId)
    Postings(term).append(docId, pos)
Pros: analysis on the map side
Cons:
Too many tiny intermediate records → Combiner
DocID synchronization across map and reduce tasks
Lucene: very difficult (impossible?) to create an index this way
45. Nutch model (also in SOLR-1301)
Map():
  IN: <seq, docPart>
  docId = docPart.get(“url”)
  emit(docId, docPart)
Reduce():
  IN: <docId, list(docPart)>
  doc = luceneDoc(list(docPart))
  indexWriter.addDocument(doc)
Pros: easy to build a Lucene index
Cons:
Analysis on the reduce side
Many costly merge operations (large indexes built from scratch on the reduce side)
(plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
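The data flow of the Nutch model above hinges on the framework grouping map outputs by docId before reduce runs. A plain-Java simulation of that shuffle step (illustrative only; the names and the list-based representation are invented, and the reduce side would then call luceneDoc(parts) and indexWriter.addDocument per entry):

```java
import java.util.*;

// Simulates the MapReduce shuffle for the Nutch model: map emits
// (docId, docPart) pairs; the framework groups all parts of the same
// document under one key so reduce can assemble the full Lucene document.
public class NutchModelSketch {
    // mapOutputs: each element is {docId, docPart}
    public static Map<String, List<String>> shuffle(List<String[]> mapOutputs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapOutputs)
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return grouped; // one entry per document, ready for reduce()
    }
}
```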
46. Modified Nutch model (N/A...)
Map():
  IN: <seq, docPart>
  docId = docPart.get(“url”)
  ts = analyze(docPart)
  emit(docId, <docPart, ts>)
Reduce():
  IN: <docId, list(<docPart, ts>)>
  doc = luceneDoc(list(<docPart, ts>))
  indexWriter.addDocument(doc)
Pros:
Analysis on the map side
Easy to build a Lucene index
Cons:
Many costly merge operations (large indexes built from scratch on the reduce side)
(plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
47. Hadoop contrib/indexing model
Map():
  IN: <seq, docText>
  doc = luceneDoc(docText)
  indexWriter.addDocument(doc)
  emit(random, indexData)
Reduce():
  IN: <random, list(indexData)>
  foreach (indexData)
    indexWriter.addIndexes(indexData)
Pros:
Analysis on the map side
Many merges on the map side
Also supports other operations (deletes, updates)
Cons:
Serialization is costly; records are big and require more RAM to sort
48. Massive indexing - summary
If you first need to collect document parts → SOLR-1301 model
If you use complex analysis → Hadoop contrib/index
NOTE: there is no good integration yet of Solr and Hadoop contrib/index module...