Apache Lucene EuroCon brings together the experts driving innovation in Open Source Search. Check out this special one-hour round table presentation on compelling innovations in Lucene/Solr search....
2. Overview
A link to download these slides will be available after the webcast is complete. An on-demand replay will be ready in ~48 hours.
• Introduction
• Near Real Time Search: Yonik Seeley
• Munching & Crunching: Andrzej Białecki
• Solr in the Cloud: Mark Miller
• Practical Relevance: Grant Ingersoll
• Q&A
Apache Lucene EuroCon 20 May 2010 2
3. Near Real Time Search
Yonik Seeley
4. Near Real-Time Search
Shorter times until updates are searchable/visible
Lucene 2.9 first laid the groundwork w/ per-segment searching
Per-segment FieldCache entries for sorting and FunctionQueries
NRT IndexWriter.getReader()
Make new segments available before merging is done in background
Doesn’t cause commit/fsync first
Solr still needs
Per-segment faceting
Per-segment caching
Per-segment statistics (and anything else that uses FieldCache)
5. Existing single-valued faceting algorithm
Request: q=Juggernaut&facet=true&facet.field=hero
[Diagram: the documents matching the base query “Juggernaut” are looked up in the Lucene FieldCache entry (StringIndex) for the “hero” field.]
order: for each doc, an index into the lookup array
lookup: the string values: (null), batman, flash, spiderman, superman, wolverine
accumulator: one counter per lookup entry; for each matching doc, increment the counter at that doc's order value
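The counting loop in the diagram can be sketched in Python as follows. This is a minimal illustration, not Solr's implementation: the `order` and `lookup` arrays mimic the StringIndex layout named on the slide, while the function name `facet_counts` and all sample data are invented for the example.

```python
def facet_counts(matching_docs, order, lookup):
    """Count facet values for a single-valued field.

    order[doc] is an index into lookup; lookup[0] is None (doc has no value).
    """
    accumulator = [0] * len(lookup)
    for doc in matching_docs:
        accumulator[order[doc]] += 1  # one array increment per matching doc
    # Skip slot 0 (no value) and terms that were never hit.
    return {lookup[i]: n for i, n in enumerate(accumulator) if i > 0 and n > 0}


# Hypothetical data: 8 documents, "hero" values stored as ordinals.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]
matching_docs = [0, 2, 7]  # docs matching the base query

counts = facet_counts(matching_docs, order, lookup)
print(counts)
```

The work per request is proportional to the number of matching documents plus the number of unique terms, which is why this approach is fast once the FieldCache entry exists.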
7. Per-segment faceting
Enable with facet.method=fcs
Controllable multi-threading
facet.field={!threads=4}myfield
Disadvantages
Larger memory use (FieldCaches + accumulators)
Slower (extra FieldCache merge step needed)
Advantages
Rebuilds FieldCache entries only for new segments (NRT friendly)
Multi-threaded
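The trade-off listed above (an extra merge step in exchange for per-segment caches and threading) can be illustrated with a small Python sketch. It is an assumption-laden toy, not the `fcs` implementation: each segment is modelled as a dict with its own `order` and `lookup` arrays, matching doc ids are assumed to be segment-local, and the names `count_segment` and `facet_counts_per_segment` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, matching_docs):
    """Count ordinals inside one segment, then map them back to terms."""
    order, lookup = segment["order"], segment["lookup"]
    acc = [0] * len(lookup)
    for doc in matching_docs:
        acc[order[doc]] += 1
    return Counter({lookup[i]: n for i, n in enumerate(acc) if i > 0 and n > 0})


def facet_counts_per_segment(segments, matches, threads=4):
    """Count each segment on a worker thread, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, matches))
    merged = Counter()
    for partial in partials:
        # Merge step: ordinals differ between segments, so merge by term string.
        merged.update(partial)
    return dict(merged)


# Hypothetical two-segment index; each segment has its own ordinal space.
segments = [
    {"order": [1, 2, 1], "lookup": [None, "batman", "flash"]},
    {"order": [2, 1, 0], "lookup": [None, "flash", "wolverine"]},
]
matches = [[0, 2], [0, 1, 2]]  # segment-local ids of matching docs

merged_counts = facet_counts_per_segment(segments, matches)
print(merged_counts)
```

When a new segment appears, only that segment's `order`/`lookup` pair has to be built; the entries for older segments are reused unchanged.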
8. Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single-valued field
Test A: base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*         facet.method=fc    facet.method=fcs
static index              3 ms               244 ms
quickly changing index    1388 ms            267 ms
Test B: base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*         facet.method=fc    facet.method=fcs
static index              26 ms              34 ms
quickly changing index    741 ms             94 ms
*complete request time, measured externally
9. Munching & Crunching
Lucene index post-processing and applications
Andrzej Białecki
<andrzej.bialecki@lucidimagination.com>
11. Post-processing
Isn't it better to build it right from the start?
Some parameters are difficult to get right...
Minimizing index size while retaining search quality
Correcting impact of unexpected common words
Creating evenly-sized shards
...perhaps impossible to get at all during indexing
Adding collection-wide factors not computed by Lucene (e.g. avg. length)
Optimizing top-N results for common queries
Fitting too large indexes in RAM
12. Merging, splitting, sorting, pruning
Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter
Sorting postings by impact and “early termination” search
Index pruning:
What data to remove and how?
Pruning strategies
Challenges
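As a rough illustration of "sorting postings by impact" with early termination, the Python sketch below handles only the simplest case: a single-term query whose postings are already ordered by descending impact. The tuple layout and the function name are assumptions made for the example, not a Lucene API.

```python
def top_n_early_termination(postings_by_impact, n):
    """Return the first n postings from a list sorted by descending impact.

    Because the list is impact-ordered, no later posting can outrank the
    ones already collected, so scanning stops as soon as n are found.
    """
    results = []
    for doc_id, impact in postings_by_impact:
        if len(results) == n:
            break  # early termination: the rest cannot score higher
        results.append((doc_id, impact))
    return results


# Hypothetical postings for one term, as (doc_id, impact) pairs.
postings = [(12, 0.2), (7, 0.9), (3, 0.5), (40, 0.7)]
impact_sorted = sorted(postings, key=lambda p: p[1], reverse=True)

top2 = top_n_early_termination(impact_sorted, 2)
print(top2)
```

Multi-term queries need more care, since a document with a low impact for one term may still rank highly overall.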
13. Tiered search
Assuming we CAN prune effectively, while maintaining good search quality...
search box
RAM: 70% pruned
SSD: 30% pruned ?
HDD: 0% pruned
14. Tiered search
Assuming we CAN prune effectively, while maintaining good search quality...
search box 1: RAM, 70% pruned
search box 2: SSD, 30% pruned ?
search box 3: HDD, 0% pruned
15. Bit-wise search
Given a bit pattern query:
1010 1001 0101 0001
Find best matching bit patterns in documents
Applications:
Fuzzy “fingerprinting”
De-duplication
Plagiarism detection
BitwiseSearcher and Solr BitwiseField design
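One simple way to define "best matching" for bit patterns is Hamming distance (the number of differing bits). The Python sketch below ranks fingerprints that way; it is only an illustration of the idea and makes no claim about how BitwiseSearcher or BitwiseField actually score. The document ids and fingerprints are invented.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(query, fingerprints, top_n=2):
    """Rank documents by Hamming distance to the query, closest first."""
    ranked = sorted(
        fingerprints.items(),
        key=lambda item: (hamming_distance(query, item[1]), item[0]),
    )
    return [(doc_id, hamming_distance(query, fp)) for doc_id, fp in ranked[:top_n]]


query = 0b1010100101010001  # the pattern from the slide
fingerprints = {  # hypothetical 16-bit document fingerprints
    "doc1": 0b1010100101010001,  # identical to the query
    "doc2": 0b1010100101010011,  # one bit differs
    "doc3": 0b0101011010101110,  # every bit differs
}

matches = best_matches(query, fingerprints)
print(matches)
```

For de-duplication, a small distance threshold on such fingerprints would flag near-duplicates rather than only exact copies.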
16. Massive indexing
Map-reduce indexing models
Google model
Nutch model
Modified Nutch model
Hadoop contrib/indexing model
Tradeoff analysis and recommendations
17. Solr in the Cloud
Mark Miller
19. Some of the Complications?
Dealing with config files
Setting up high availability
Status of cluster
Reshaping/Rebalancing cluster
20. Improvements: High Level Goals
Improve...
Shared/Central Config
High Availability and Fault Tolerance
Cluster Resizing/Rebalancing
Open/Standard ZK schema
Cluster status
21. Enter Solr Cloud and ZooKeeper
ZooKeeper is basically a highly available distributed filesystem
Config and cluster state ‘live’ in ZooKeeper
Solr is alerted to changes in cluster state by ZK
Solr gets a built in load balancing impl that can read cluster state from ZK
Clients don’t need to know about shards - or can choose logical shards
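The interaction described above (state held centrally, nodes alerted to changes, a load balancer reading that state) can be mimicked with a toy Python model. Nothing here talks to a real ZooKeeper: `ClusterState` is an in-memory stand-in, and every class, method, and host name is hypothetical.

```python
import itertools


class ClusterState:
    """In-memory stand-in for cluster state that would live in ZooKeeper."""

    def __init__(self, shards):
        self.shards = shards  # shard name -> {node address: is_live}
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def set_live(self, shard, node, live):
        self.shards[shard][node] = live
        for callback in self.watchers:
            callback(self)  # alert every watcher to the change


class LoadBalancer:
    """Picks one live node per logical shard, rotating round-robin."""

    def __init__(self, state):
        self.counter = itertools.count()
        self.refresh(state)
        state.watch(self.refresh)

    def refresh(self, state):
        self.live = {
            shard: sorted(node for node, up in nodes.items() if up)
            for shard, nodes in state.shards.items()
        }

    def pick_nodes(self):
        turn = next(self.counter)
        return {
            shard: nodes[turn % len(nodes)]
            for shard, nodes in self.live.items()
            if nodes  # shards with no live node are skipped
        }


state = ClusterState({
    "shard1": {"host1:8983": True, "host2:8983": True},
    "shard2": {"host3:8983": True},
})
balancer = LoadBalancer(state)
first = balancer.pick_nodes()
state.set_live("shard1", "host1:8983", False)  # simulate a node going down
second = balancer.pick_nodes()
print(first, second)
```

The client only names logical shards; which physical node answers is decided from the shared state, so a failed node drops out of rotation without any client change.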
22. What’s Been Done So Far
A lot of ‘base’ work - ZooKeeper Mode
Shared/Central config
Built in search side fault tolerance
Very simple cluster status
23. The Future?
Index side fault tolerance
Cluster resizing/rebalancing/elasticity
More Solr/ZK tools?
Lots of other little fun improvements
24. Practical Relevance
Grant Ingersoll
Apache Lucene EuroCon 2010
Prague, Czech Republic
25. Why Tune Relevance?
Better search results = Less time searching, more time acting
Less time searching = Happier, more effective users
Happier, more effective users = $, €, £, Kč (earned/saved)
$, €, £, Kč (earned/saved) = Big fat raise for you!
26. Testing Relevance
A/B testing
Log Analysis
Empirical
Top 50 queries, plus random sample
Ask
Ratings/Reviews
Focus Groups
Also: Ad Hoc, TREC, etc.
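The "top 50 queries, plus random sample" selection can be sketched in Python as below. The log format (one query string per entry), the function name, and the default sample size are assumptions for illustration; the slide only specifies the top 50.

```python
import random
from collections import Counter


def evaluation_queries(query_log, top_k=50, sample_size=20, seed=0):
    """Return the top_k most frequent queries plus a random sample of the rest."""
    counts = Counter(query_log)
    top = [query for query, _ in counts.most_common(top_k)]
    top_set = set(top)
    rest = sorted(query for query in counts if query not in top_set)
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sample = rng.sample(rest, min(sample_size, len(rest)))
    return top, sample


# Hypothetical query log.
log = ["lucene"] * 5 + ["solr"] * 3 + ["faceting", "nrt", "zookeeper"]
top, sample = evaluation_queries(log, top_k=2, sample_size=2)
print(top, sample)
```

The frequent queries cover what most users see, while the random sample keeps the long tail from being ignored.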
27. Understand your…
Domain
Types of documents
Languages present
Document structures, metadata and other features
Lexical resources: jargon, synonyms, abbreviations...
Relationships between documents
Users
Sophistication/Expertise
Search and Discovery needs
Known Item vs. Keyword
Tolerance for Pain
Managers
Business Interests
Release cycles
Obsession in finding the one true relevance model (hint, it doesn’t exist)
“explain() blindness”
28. Phrases
Almost always a win to automatically add phrase query variations to all multiword queries
Even better to detect key phrases
In Solr, with the Dismax handler, use the &pf and &ps options to automatically add phrase boosts
Using a large slop factor can simulate an AND query while rewarding close proximity
See also the ComplexPhraseQuery in contrib/queryparser
Consider SpanQuery and derivatives
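A request using the pf and ps options might be assembled as in the Python sketch below. The parameter names (`defType`, `q`, `qf`, `pf`, `ps`) are Solr Dismax parameters; the field names, boosts, slop value, and the `/select` path are assumptions chosen for the example.

```python
from urllib.parse import urlencode, parse_qs


def dismax_params(user_query, query_fields, phrase_fields, phrase_slop):
    """Build Dismax request parameters with phrase boosting."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(query_fields),   # fields searched term by term
        "pf": " ".join(phrase_fields),  # fields boosted when the whole query matches as a phrase
        "ps": str(phrase_slop),         # slop allowed for the pf phrase match
    }


# Hypothetical fields and boosts.
params = dismax_params("apache lucene", ["title^2", "body"], ["title^10", "body^5"], 3)
query_string = urlencode(params)
print("/select?" + query_string)
```

Raising `ps` loosens the phrase match, so documents with the terms near each other still receive part of the phrase boost.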