Lucene and Solr Experts Round Table

Apache Lucene Eurocon: Preview
www.lucene-eurocon.org

Apache Lucene EuroCon 20 May 2010

Overview

A link to download these slides will be available after the webcast is complete. An on-demand replay will be ready in ~48 hours.

• Introduction

• Near Real Time Search: Yonik Seeley

• Munching & Crunching: Andrzej Białecki

• Solr in the Cloud: Mark Miller

• Practical Relevance: Grant Ingersoll

• Q&A


Near Real Time Search

Yonik Seeley


Near Real-Time Search
Shorter times until updates are searchable/visible

Lucene 2.9 first laid the groundwork w/ per-segment searching
Per-segment FieldCache entries for sorting and FunctionQueries
NRT IndexWriter.getReader()
Make new segments available before merging is done in background
Doesn’t cause commit/fsync first

Solr still needs
Per-segment faceting
Per-segment caching
Per-segment statistics (and anything else that uses FieldCache)


Existing single-valued faceting algorithm
[Figure: request q=Juggernaut&facet=true&facet.field=hero. For each document in the base DocSet (the documents matching the base query “Juggernaut”), the Lucene FieldCache entry (StringIndex) for the “hero” field is consulted. Its "order" array holds, for each doc, an index into the lookup array; its "lookup" array holds the string values: (null), batman, flash, spiderman, superman, wolverine. The accumulator slot at that index is incremented.]
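The counting loop in the figure above can be sketched in a few lines of Python. This is only an illustration of the order/lookup/accumulator idea, not Solr's code: the function name `facet_counts`, the contents of `order` and the document ids in `base_docs` are made up for the example.

```python
def facet_counts(order, lookup, base_docs):
    """Count facet values for the documents in base_docs.

    order[doc] is an index into lookup; index 0 means "no value".
    """
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1  # the "increment" step in the figure
    # Report only real terms (skip slot 0) that were actually hit.
    return {lookup[i]: n for i, n in enumerate(accumulator) if i > 0 and n > 0}


# Hypothetical data: six lookup slots, eight documents.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]   # one lookup index per document
base_docs = [0, 2, 7]              # ids of documents matching the base query

counts = facet_counts(order, lookup, base_docs)
print(counts)
```

The cost is one array read and one increment per matching document, which is why this approach is fast once the FieldCache entry exists.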

Per-segment single-valued faceting algorithm
[Figure: Segment1 to Segment4 each have their own FieldCache entry and their own accumulator (accumulator1 to accumulator4). The base DocSet is processed per segment by thread1 to thread4; each thread looks up the FieldCache index for its documents and increments (inc) its segment's accumulator. A merger (FieldCache + priority queue) then combines the per-segment accumulators into the final counts, e.g. flash, 5 and Batman, 3.]
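A rough Python sketch of the per-segment idea follows. It assumes each segment keeps its own sorted lookup array, so the same term can have a different index in different segments and the partial counts have to be merged by term text. The names `count_segment` and `facet_counts_per_segment` and all data are hypothetical; `heapq.merge` stands in for the priority-queue merger.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Count one segment; returns (term, count) pairs sorted by term."""
    order, lookup = segment["order"], segment["lookup"]
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    # lookup is sorted, so the resulting pairs are sorted by term as well.
    return [(lookup[i], n) for i, n in enumerate(accumulator) if i > 0 and n > 0]


def facet_counts_per_segment(segments, docs_per_segment, threads=4):
    """Count every segment in its own task, then merge by term."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, docs_per_segment))
    merged = {}
    # heapq.merge walks the sorted per-segment lists with a priority queue.
    for term, n in heapq.merge(*partials):
        merged[term] = merged.get(term, 0) + n
    return merged


# Hypothetical segments: note that "flash" is index 2 in one and 1 in the other.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
]
docs_per_segment = [[0, 1, 2, 3], [0, 2]]  # segment-local ids of matching docs

merged = facet_counts_per_segment(segments, docs_per_segment)
print(merged)
```

The extra merge step is the price paid for never having to rebuild one large index-wide lookup array when a small new segment appears.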

Per-segment faceting
Enable with facet.method=fcs

Controllable multi-threading
facet.field={!threads=4}myfield

Disadvantages
Larger memory use (FieldCaches + accumulators)
Slower (extra FieldCache merge step needed)

Advantages
Rebuilds FieldCache entries only for new segments (NRT friendly)
Multi-threaded
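As an illustration, the two parameters from this slide could be put on a request as follows. The host, port and handler path are assumptions (a default local example install), and the query term is borrowed from the earlier faceting slide; only `facet.method=fcs` and the `{!threads=4}` local parameter come from the slide itself.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust for a real deployment.
base_url = "http://localhost:8983/solr/select"

params = [
    ("q", "Juggernaut"),
    ("facet", "true"),
    ("facet.method", "fcs"),                 # per-segment faceting
    ("facet.field", "{!threads=4}myfield"),  # use up to 4 threads for this field
]

url = base_url + "?" + urlencode(params)
print(url)
```

The local parameter is part of the `facet.field` value, so it has to be URL-encoded together with the field name.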


Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

*complete request time, measured externally


Munching & Crunching
Lucene index post-processing and applications

Andrzej Białecki

<andrzej.bialecki@lucidimagination.com>


Munching & Crunching Agenda
Post-processing
Splitting, merging, sorting, pruning

Tiered search

Bitwise search

Map-reduce indexing models


Post-processing
• Isn't it better to build it right from the start?

• Some parameters are difficult to get right...
  • Minimizing index size while retaining search quality
  • Correcting impact of unexpected common words
  • Creating evenly-sized shards

• ...perhaps impossible to get at all during indexing
  • Adding collection-wide factors not computed by Lucene (e.g. avg. length)
  • Optimizing top-N results for common queries
  • Fitting too large indexes in RAM


Merging, splitting, sorting, pruning
• Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter

• Sorting postings by impact and “early termination” search

• Index pruning:
  • What data to remove and how?
  • Pruning strategies
  • Challenges


Tiered search
• Assuming we CAN prune effectively, while maintaining good search quality...

[Figure: a single search box holding three index tiers: RAM (70% pruned), SSD (30% pruned ?), HDD (0% pruned).]

Tiered search
• Assuming we CAN prune effectively, while maintaining good search quality...

[Figure: three search boxes, one tier each: search box 1 on RAM (70% pruned), search box 2 on SSD (30% pruned ?), search box 3 on HDD (0% pruned).]

Bit-wise search
• Given a bit pattern query:
  1010 1001 0101 0001

• Find best matching bit patterns in documents

• Applications:
  • Fuzzy “fingerprinting”
  • De-duplication
  • Plagiarism detection

• BitwiseSearcher and Solr BitwiseField design
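One way to read "best matching" is by Hamming distance, i.e. the number of bit positions in which two patterns differ. The slide does not say which measure BitwiseSearcher uses, so the sketch below is an assumption for illustration only; the function names and document patterns are made up.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def best_matches(query, docs, top_n=3):
    """Return the top_n (doc_id, distance) pairs, closest patterns first."""
    scored = [(hamming_distance(query, bits), doc_id) for doc_id, bits in docs.items()]
    scored.sort()  # by distance, then by doc id for stable ties
    return [(doc_id, distance) for distance, doc_id in scored[:top_n]]


query = 0b1010_1001_0101_0001  # the pattern from the slide

# Hypothetical document fingerprints.
docs = {
    "doc1": 0b1010_1001_0101_0001,  # identical to the query
    "doc2": 0b1010_1001_0101_0011,  # 1 bit differs
    "doc3": 0b0101_0110_1010_1110,  # every bit differs
    "doc4": 0b1010_1001_0000_0001,  # 2 bits differ
}

matches = best_matches(query, docs)
print(matches)
```

A linear scan like this is only workable for small collections; doing the same thing inside the index is the point of the BitwiseSearcher design mentioned above.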


Massive indexing
• Map-reduce indexing models
  • Google model
  • Nutch model
  • Modified Nutch model
  • Hadoop contrib/indexing model

• Tradeoff analysis and recommendations



Solr in the Cloud

Mark Miller

Some of the Complications?

Dealing with config files

Setting up high availability

Status of cluster

Reshaping/Rebalancing cluster


Improvements: High Level Goals

Improve...

• Shared/Central Config

• High Availability and Fault Tolerance

• Cluster Resizing/Rebalancing

• Open/Standard ZK schema

• Cluster status


Enter Solr Cloud and ZooKeeper
ZooKeeper is basically a highly available distributed filesystem

Config and cluster state ‘live’ in ZooKeeper

Solr is alerted to changes in cluster state by ZK

Solr gets a built-in load-balancing implementation that can read cluster state from ZK

Clients don’t need to know about shards, or can choose logical shards
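The interplay of cluster state, change notifications and load balancing can be sketched without a real ZooKeeper. Everything below is hypothetical: the class `ShardBalancer`, the layout of the cluster state and the host names are invented for illustration and are not Solr's implementation. In a real setup, `on_cluster_change` would be driven by a ZooKeeper watch.

```python
import itertools


class ShardBalancer:
    """Round-robin over the live replicas of each logical shard."""

    def __init__(self, cluster_state):
        self._state = cluster_state
        self._counters = {shard: itertools.count() for shard in cluster_state}

    def on_cluster_change(self, new_state):
        """Called when the shared cluster state changes."""
        self._state = new_state
        for shard in new_state:
            self._counters.setdefault(shard, itertools.count())

    def pick(self, shard):
        """Return the URL of a live replica for the given logical shard."""
        live = [r["url"] for r in self._state[shard] if r["live"]]
        if not live:
            raise LookupError("no live replica for " + shard)
        return live[next(self._counters[shard]) % len(live)]


# Hypothetical cluster state: one logical shard with two replicas.
state = {
    "shard1": [
        {"url": "http://host1:8983/solr", "live": True},
        {"url": "http://host2:8983/solr", "live": True},
    ],
}
balancer = ShardBalancer(state)
first = balancer.pick("shard1")
second = balancer.pick("shard1")

# host1 goes down; the balancer is told about the new state.
balancer.on_cluster_change({
    "shard1": [
        {"url": "http://host1:8983/solr", "live": False},
        {"url": "http://host2:8983/solr", "live": True},
    ],
})
third = balancer.pick("shard1")
print(first, second, third)
```

The client only names the logical shard; which physical host answers is decided from the shared state at request time.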


What’s Been Done So Far

A lot of ‘base’ work - ZooKeeper Mode

Shared/Central config

Built in search side fault tolerance

Very simple cluster status


The Future?

Index side fault tolerance

Cluster resizing/rebalancing/elasticity

More Solr/ZK tools?

Lots of other little fun improvements


Practical Relevance

Grant Ingersoll

Apache Lucene EuroCon 2010
Prague, Czech Republic


Why Tune Relevance?
• Better search results = Less time searching, more time acting

• Less time searching = Happier, more effective users

• Happier, more effective users = $, €, £, Kč (earned/saved)

• $, €, £, Kč (earned/saved) = Big fat raise for you!


Testing Relevance
• A/B testing
• Log Analysis
• Empirical
• Top 50 queries, plus random sample

• Ask
• Ratings/Reviews
• Focus Groups

• Also: Ad Hoc, TREC, etc.


Understand your…
Domain
  Types of documents
  Languages present
  Document structures, metadata and other features
  Lexical resources: jargon, synonyms, abbreviations...
  Relationships between documents

Users
  Sophistication/Expertise
  Search and Discovery needs
  Known Item vs. Keyword

Tolerance for Pain
  Managers
  Business Interests
  Release cycles
  Obsession in finding the one true relevance model (hint, it doesn’t exist)
  “explain() blindness”

Phrases
• Almost always a win to automatically add phrase query variations to all multiword queries
  • Even better to detect key phrases

• In Solr, with the Dismax handler, use the &pf and &ps options to automatically add phrase boosts
  • Using a large slop factor can simulate an AND query while rewarding close proximity
  • See also the ComplexPhraseQuery in contrib/queryparser
  • Consider SpanQuery and derivatives
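A request using the phrase-boost options above might be assembled like this. The field names, boosts, query text and slop value are invented for the example; `defType`, `qf`, `pf` and `ps` are the Dismax parameters for the parser, the query fields, the phrase fields and the phrase slop.

```python
from urllib.parse import urlencode

params = {
    "defType": "dismax",
    "q": "apache lucene eurocon",  # multiword user query
    "qf": "title body",            # fields searched term by term
    "pf": "title^10 body^2",       # fields where the whole query is also tried as a phrase
    "ps": "100",                   # large slop: terms only need to be near each other
}

query_string = urlencode(params)
print(query_string)
```

With a large `ps`, documents that contain all the terms anywhere close together get the boost, and the boost grows as the terms get closer.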

Resources
• ACM SIGIR - http://sigir.org/

• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search

• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-Lucene-and-Solr

• Open Relevance Project: http://lucene.apache.org/openrelevance


Q&A
SLIDES POSTED AT:
BIT.LY/EXPERTS1



Thank You

