Solr 3.1 and beyond

Solr 3.1 and Beyond
yonik@lucidimagination.com
October 8, 2010
Lucid Imagination
Yonik Seeley

Agenda
Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0
  Relevancy (Extended Dismax Parser)
  Spatial/Geo Search
  Search Result Grouping / Field Collapsing
  Faceting (Pivot, Range, Per-segment)
  Scalability (Solr Cloud)
  Odds & Ends
  Q&A
10/12/10 3

Solr 3.1? What happened to 1.5?
  Lucene/Solr merged (March 2010)
  Single set of committers
  Single dev mailing list (dev@lucene.apache.org)
  Single shared subversion trunk
  Keep separate downloads, user mailing lists
  Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
  Development
  trunk is now always next major release (currently 4.0)
  branch_3x will be base for all 3.x releases
  Branch together, Release together, Share version numbers

Extended Dismax Parser
  Superset of dismax
&defType=edismax&q=foo&qf=body

  Fixes edge cases where dismax could still throw exceptions
OR AND NOT - "

  Full Lucene syntax support
  Tries Lucene syntax first
  Smart escaping is done if there are syntax errors
  Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
  Fielded queries (e.g. myfield:foo) even in degraded mode
  uf parameter controls what field names may be directly specified in "q"

Extended Dismax Parser (continued)
  boost parameter for multiplicative boost-by-function
  Pure negative query clauses
Example: solr OR (-solr)

  Enhanced term proximity boosting
  pf2=myfield – results in term bigrams in sloppy phrase queries
myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
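The bigram expansion can be sketched in a few lines of Python. This is only an illustration, assuming simple whitespace tokenization; `pf2_phrases` is a hypothetical helper name, not part of Solr.

```python
def pf2_phrases(field, text):
    """Return one phrase clause per pair of adjacent terms (term bigrams)."""
    terms = text.split()
    # zip(terms, terms[1:]) pairs each term with its right-hand neighbour.
    return [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]


clauses = pf2_phrases("myfield", "aa bb cc")
print(" ".join(clauses))
```

A single-term query yields no bigrams, so the sketch returns an empty list in that case.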

  Enhanced stopword handling
  stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield
-> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")

  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
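The stopword behaviour can be mimicked with a toy rewrite in Python. This is a sketch under assumptions: the stopword list and the `rewrite` helper are hypothetical, tokenization is plain whitespace splitting, and the real parser works on analyzed tokens rather than strings.

```python
STOPWORDS = {"is", "a", "the"}  # hypothetical stopword list for illustration


def rewrite(q, field):
    """Drop stopwords from the mandatory clause, keep them in the bigram boost."""
    terms = q.split()
    main = " ".join(t for t in terms if t not in STOPWORDS)
    phrases = " ".join(f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:]))
    return f"+{field}:({main}) ({phrases})"


rewritten = rewrite("solr is awesome", "myfield")
print(rewritten)
```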

Spatial Search
Step1: Index some locations!
<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>
Step2: Decide where you are
&pt=44.0153371,-73.16734
&d=1
&sfield=store
Step3: Profit!
Spatial Filter: &fq={!geofilt}
Bounding Box: &fq={!bbox}
Distance Function: &sort=geodist() asc
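To get a feel for what the filter decides, the distance between the indexed store and the query point can be estimated with the haversine formula. This is a rough sketch assuming that d is given in kilometers and that distance is great-circle distance on a sphere; the radius constant and helper name are illustrative choices, and Solr's internals may differ in detail.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


store = (44.013617, -73.168264)  # value of the "store" field
pt = (44.0153371, -73.16734)     # &pt=
d = 1                            # &d=

distance = haversine_km(*pt, *store)
within = distance <= d  # the kind of test a distance filter performs
print(round(distance, 3), within)
```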

RESULT GROUPING / FIELD COLLAPSING

Field Collapsing Definition
 Field collapsing
  Limit the number of results per category
  “category” normally defined by unique values in a field
 Uses
  Web Search – collapse by web site
  Email threads – collapse by thread id
  Ecommerce/retail
  Show the top 5 items for each store category (music, movies, etc)

Field Collapse on Product Type
Result Grouping by Category

Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
"manu_exact":{
"matches":3,
"groups":[{
"groupValue":"Belkin",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"}]
}},
{
"groupValue":"Apple Computer Inc.",
{
"id":"MA147LL/A",

Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
"price:[0 TO 99.99]":{
"matches":3,
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"},
{
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod"}]
}},
"price:[100 TO *]":{
"matches":3,
{
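A client has to handle both response shapes. The Python sketch below walks a `grouped` section and lists the document ids per group. The sample dictionary is hypothetical data shaped like the truncated responses above, and it assumes that a group.query entry carries a single `doclist` next to `matches`, which the slide excerpt does not show.

```python
def top_docs_per_group(grouped):
    """Map (command, group value) to the ids of the docs returned for that group."""
    summary = {}
    for command, body in grouped.items():
        if "groups" in body:
            # group.field style: one doclist per unique field value
            for group in body["groups"]:
                ids = [doc["id"] for doc in group["doclist"]["docs"]]
                summary[(command, group["groupValue"])] = ids
        else:
            # group.query style (assumed shape): a single doclist for the query
            ids = [doc["id"] for doc in body["doclist"]["docs"]]
            summary[(command, None)] = ids
    return summary


# Hypothetical sample, trimmed to the fields used here.
sample = {
    "manu_exact": {"matches": 3, "groups": [
        {"groupValue": "Belkin",
         "doclist": {"numFound": 2, "start": 0, "docs": [{"id": "IW-02"}]}},
    ]},
    "price:[0 TO 99.99]": {"matches": 3, "doclist": {
        "numFound": 2, "start": 0,
        "docs": [{"id": "IW-02"}, {"id": "F8V7067-APL-KIT"}]}},
}

result = top_docs_per_group(sample)
print(result)
```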

Grouping Params
parameter                        meaning                                                           default
group.field=<field>              Like facet.field – group by unique field values
group.query=<query>              Like facet.query – top docs that also match
group.function=<function query>  Group by unique values produced by the function query
group.limit=<n>                  How many docs per group                                           1
group.sort=<sort spec>           How to sort documents within a group                              Same as "sort" param
rows=<n>                         How many groups to return                                         10
sort=<sort spec>                 How to sort the groups relative to each other (based on top doc)
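These parameters combine into an ordinary query string. As a minimal sketch, the Python below assembles one with the standard library; `grouping_query` is a hypothetical helper, and its defaults simply mirror the table.

```python
from urllib.parse import parse_qs, urlencode


def grouping_query(q, fields=(), queries=(), limit=1, rows=10):
    """Build a URL-encoded query string for a grouped request."""
    params = [("q", q), ("group", "true")]
    # Repeatable parameters are emitted once per value.
    params += [("group.field", f) for f in fields]
    params += [("group.query", gq) for gq in queries]
    params += [("group.limit", limit), ("rows", rows)]
    return urlencode(params)


qs = grouping_query(
    "ipod",
    queries=["price:[0 TO 99.99]", "price:[100 TO *]"],
    limit=5,
)
decoded = parse_qs(qs)  # round-trip to check the encoding
print(qs)
```

Using a list of pairs rather than a dict keeps repeated parameters such as group.query intact.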

Pivot Faceting
  Other names that could have made sense:
  Grid Faceting, Cross-Product Faceting, Matrix Faceting
  Syntax: facet.pivot=field1,field2,field3,…
                   #docs  #docs w/ inStock:true  #docs w/ inStock:false
cat:electronics    14     10                     4
cat:memory         3      3                      0
cat:connector      2      0                      2
cat:graphics card  2      0                      2
cat:hard drive     2      2                      0
facet.pivot=cat,inStock

Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
"facet_pivot":{
"cat,popularity":[{
"field":"cat",
"value":"electronics",
"count":14,
"pivot":[{
"field":"popularity",
"value":"6",
"count":5},
{
"value":"7",
"count":4},
{
"value":"1",
"count":2}]},
{
"field":"cat",
"value":"memory",
"count":3,
"pivot":[]},
[…]
Callouts: "count":14 is 14 docs w/ cat==electronics; "count":5 is 5 docs w/ cat==electronics && popularity==6
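Because pivot entries nest, a small recursive walk is a convenient way to flatten them into rows. The Python sketch below does that for a hypothetical sample shaped like the response above; it assumes every entry carries its own "field" key, which the slide excerpt omits in places.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) rows, where path is a tuple of (field, value) pairs."""
    rows = []
    for entry in entries:
        here = path + ((entry["field"], entry["value"]),)
        rows.append((here, entry["count"]))
        # Recurse into the nested pivot for the next field, if any.
        rows.extend(flatten_pivot(entry.get("pivot", []), here))
    return rows


# Hypothetical sample data for illustration.
sample = [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
    ]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]

rows = flatten_pivot(sample)
for path, count in rows:
    print(" && ".join(f"{f}=={v}" for f, v in path), count)
```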

Range Faceting
•  Like Date faceting, but more generic
http://...&facet=true
&facet.range=price
&facet.range.start=0
&facet.range.end=500
&facet.range.gap=50
"facet_counts":{
"facet_ranges":{
"price":{
"counts":{
"0.0":5,
"50.0":2,
"100.0":0,
"150.0":2,
"200.0":0,
"250.0":1,
"300.0":2,
"350.0":2,
"400.0":0,
"450.0":1},
"gap":50.0,
"start":0.0,
"end":500.0}}}}
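The bucketing behind those counts can be reproduced client-side. The Python sketch below assumes each bucket includes its lower bound and excludes its upper bound, and that values outside [start, end) are ignored; `range_facet` and the price list are illustrative only.

```python
import math


def range_facet(values, start, end, gap):
    """Count values per bucket, keyed by each bucket's lower bound."""
    n_buckets = math.ceil((end - start) / gap)
    counts = {start + i * gap: 0 for i in range(n_buckets)}
    for v in values:
        if start <= v < end:
            i = int((v - start) // gap)  # index of the bucket containing v
            counts[start + i * gap] += 1
    return counts


prices = [11.5, 19.95, 399.0, 650.0]  # hypothetical prices; 650.0 is out of range
counts = range_facet(prices, start=0.0, end=500.0, gap=50.0)
print(counts)
```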

Existing single-valued faceting algorithm
[Figure: Lucene FieldCache Entry (StringIndex) for the "hero" field]
  order: for each doc, an index into the lookup array
  lookup: the string values ((null), batman, flash, spiderman, superman, wolverine)
  Request: q=Juggernaut&facet=true&facet.field=hero
  For each document matching the base query "Juggernaut": lookup its order value, increment the accumulator
  Top counts are collected in a priority queue (Batman, 3; flash, 5)
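The figure's algorithm can be sketched in Python as follows. This is a simplified illustration, not Solr's code: the `order` and `lookup` arrays are small hypothetical stand-ins for a StringIndex, the base document set is made up, and `heapq.nlargest` plays the role of the priority queue.

```python
import heapq

# Hypothetical StringIndex-like arrays for a "hero" field.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]  # order[doc] is an index into lookup


def facet_counts(order, lookup, base_docs, limit):
    """Count field values over the base doc set and return the top terms."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1  # one array lookup and increment per doc
    # Skip slot 0, which stands for documents with no value.
    pairs = [(lookup[i], c) for i, c in enumerate(accumulator) if i > 0 and c > 0]
    return heapq.nlargest(limit, pairs, key=lambda pair: pair[1])


base_docs = [0, 2, 3, 5, 7]  # hypothetical docs matching the base query
top = facet_counts(order, lookup, base_docs, limit=2)
print(top)
```

The cost is proportional to the number of matching documents plus the number of unique terms, which is why the accumulator size matters for fields with many distinct values.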

Per-segment single-valued algorithm
[Figure: the Base DocSet is applied to four segments (Segment1 to Segment4), each with its own FieldCache Entry and accumulator (accumulator1 to accumulator4), processed by thread1 to thread4 via lookup + inc]
  A FieldCache + accumulator merger (Priority queue) combines the per-segment results into the final priority queue (Batman, 3; flash, 5)
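A per-segment variant can be sketched in Python like this. It is an illustration under assumptions, not Solr's implementation: each segment is modelled as a hypothetical (lookup, order) pair with its own term numbering, a thread pool stands in for the worker threads, and the merge step combines counts by term string because ordinals differ between segments.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Count term occurrences within one segment; returns {term: count}."""
    lookup, order = segment
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    return {lookup[i]: c for i, c in enumerate(accumulator) if i > 0 and c > 0}


def facet_per_segment(segments, base_docs_per_segment, limit, threads=4):
    """Count each segment in its own task, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docs_per_segment))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # the extra merge step, keyed by term value
    return merged.most_common(limit)


# Two hypothetical segments; note that "flash" has a different ordinal in each.
segments = [
    ([None, "batman", "flash"], [1, 2, 2, 0]),
    ([None, "flash", "superman"], [1, 1, 2]),
]
base_docs_per_segment = [[0, 1, 2], [0, 1]]

top = facet_per_segment(segments, base_docs_per_segment, limit=2)
print(top)
```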

Per-segment faceting
  Enable with facet.method=fcs
  Controllable multi-threading
facet.field={!threads=4}myfield

  Disadvantages
  Larger memory use (FieldCaches + accumulators)
  Slower (extra FieldCache merge step needed)
  Advantages
  Rebuilds FieldCache entries only for new segments (NRT friendly)
  Multi-threaded

Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field
A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*        facet.method=fc  facet.method=fcs
static index             3 ms             244 ms
quickly changing index   1388 ms          267 ms
B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*        facet.method=fc  facet.method=fcs
static index             26 ms            34 ms
quickly changing index   741 ms           94 ms
*complete request time, measured externally

Faceting Performance Improvements
  For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
  Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
  Optimized deep facet paging – up to 10x faster with really large facet.offsets
  Less memory consumed by field cache entries

SolrCloud
  First steps toward simplifying cluster management
  Integrates ZooKeeper
  Central configuration (schema.xml, solrconfig.xml, etc)
  Tracks live nodes + shards of collections
  Removes need for external load balancers
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr

  Can specify logical shard ids
shards=NY_shard,NJ_shard

  Clients don’t need to know shards at all:
http://localhost:8983/solr/collection1/select?distrib=true
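The structure of the explicit shards value can be shown with a short Python sketch. It assumes that commas separate shards and that "|" separates equivalent replicas of one shard; `parse_shards` and `pick_replicas` are hypothetical helpers that only imitate the load-balancing idea.

```python
import random


def parse_shards(param):
    """Split a shards value into a list of shards, each a list of replica addresses."""
    return [shard.split("|") for shard in param.split(",") if shard]


def pick_replicas(shards, rng=random):
    """Choose one replica per shard, as a load balancer might."""
    return [rng.choice(replicas) for replicas in shards]


value = ("localhost:8983/solr|localhost:8900/solr,"
         "localhost:7574/solr|localhost:7500/solr")
shards = parse_shards(value)
chosen = pick_replicas(shards)
print(shards)
print(chosen)
```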

SolrCloud : The Future
  Eliminate all single points of failure
  Remove Master/Searcher distinction
  Enables near real-time search in a highly scalable environment
  High Availability for Writes
  Eventual consistency model (like Amazon Dynamo, Cassandra)
  Elastic
  Simply add/subtract servers, cluster will rebalance automatically
  By default, Solr will handle document partitioning

Auto-Suggest
  Many people currently use terms component
  Can be slow for a large corpus
  New auto-suggest builds off SpellCheck component
  Compact memory based trie for really fast completions
  Based on a field in the main index, or on a dictionary file
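To illustrate why a trie makes completions fast, here is a minimal dictionary-based trie in Python. It is a teaching sketch only: the class, its method names, and the sample words are hypothetical, and Solr's suggester uses its own, more compact data structure.

```python
class Trie:
    """A simple character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored words starting with prefix, in sorted order."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push children in reverse order so the smallest is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = Trie()
for term in ["ultrasharp", "ultra", "usb"]:
    trie.insert(term)

suggestions = trie.complete("ult")
print(suggestions)
```

Lookup cost depends on the prefix length and the number of completions returned, not on the size of the whole dictionary.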
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
"suggestions":[
"ult",{
"numFound":1,
"startOffset":0,
"endOffset":3,
"suggestion":["ultrasharp"]},
"collation","ultrasharp"]}}

Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id" : "978-0641723445",
      "cat" : ["book","hardcover"],
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "sequence_i" : 1,
      "genre_s" : "fantasy",
      "inStock" : true,
      "price" : 12.50,
      "pages_i" : 384
    }
  }
}'
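The same update can be prepared from a program. The Python sketch below builds an equivalent payload and request object with the standard library, using the URL and document from the curl example. Sending is left out so the sketch runs without a server; passing `request` to `urllib.request.urlopen` would be one way to submit it.

```python
import json
from urllib.request import Request

URL = "http://localhost:8983/solr/update/json"

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# Wrap the document in the "add" command, as in the curl example.
payload = json.dumps({"add": {"doc": doc}}).encode("utf-8")

# Supplying data makes this a POST request.
request = Request(URL, data=payload, headers={"Content-type": "application/json"})
print(request.get_method(), request.full_url)
```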

Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
  Can handle multi-valued fields (see “cat” field in example)
  Completely compatible with the CSV update handler (can round-trip)
  Results are streamed – good for dumping entire parts of the index
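On the client side, the CSV output can be read with any CSV library. The Python sketch below parses the sample rows above and splits the multi-valued "cat" column on commas; it assumes individual values contain no commas of their own.

```python
import csv
import io

response_text = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''

rows = list(csv.DictReader(io.StringIO(response_text)))
for row in rows:
    # The quoted cell holds all values of the multi-valued field.
    row["cat"] = row["cat"].split(",")
    row["price"] = float(row["price"])

for row in rows:
    print(row["name"], row["price"], row["cat"])
```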

http://localhost:8983/solr/browse

Q&A
For more information about Solr visit
www.lucidimagination.com
