Solr 3.1 and beyond
1. Solr 3.1 and Beyond
yonik@lucidimagination.com
October 8, 2010
Lucid Imagination
Yonik Seeley
2. Agenda
Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0
Relevancy (Extended Dismax Parser)
Spatial/Geo Search
Search Result Grouping / Field Collapsing
Faceting (Pivot, Range, Per-segment)
Scalability (Solr Cloud)
Odds & Ends
Q&A
10/12/10 3
3. Solr 3.1? What happened to 1.5?
Lucene/Solr merged (March 2010)
Single set of committers
Single dev mailing list (dev@lucene.apache.org)
Single shared subversion trunk
Keep separate downloads, user mailing lists
Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc)
Development
trunk is now always next major release (currently 4.0)
branch_3x will be base for all 3.x releases
Branch together, Release together, Share version numbers
5. Extended Dismax Parser
Superset of dismax
&defType=edismax&q=foo&qf=body
Fixes edge cases where dismax could still throw exceptions: OR, AND, NOT, -, "
Full lucene syntax support
Tries lucene syntax first
Smart escaping is done if syntax errors
Optionally supports treating "and"/"or" as AND/OR in lucene syntax
Fielded queries (e.g. myfield:foo) even in degraded mode
uf parameter controls what field names may be directly specified in “q”
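A minimal sketch of how a client might assemble the request parameters named on this slide (defType, q, qf, uf). The helper name edismax_params and the example field names "title" and "body" are assumptions for illustration, not part of Solr.

```python
from urllib.parse import urlencode, parse_qs


def edismax_params(q, qf, uf=None):
    """Build the request parameters for an edismax query.

    q  -- the raw user query
    qf -- the field(s) to search when the query names no field
    uf -- optional: field names the user may reference directly in q
    """
    params = {"defType": "edismax", "q": q, "qf": qf}
    if uf is not None:
        params["uf"] = uf
    return params


# "title" and "body" are hypothetical field names.
query_string = urlencode(edismax_params("title:foo OR bar", "body", uf="title"))
print(query_string)
```

The resulting string would be appended to a select URL; urlencode takes care of escaping characters such as ":" and spaces.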
6. Extended Dismax Parser (continued)
boost parameter for multiplicative boost-by-function
Pure negative query clauses
Example: solr OR (-solr)
Enhanced term proximity boosting
pf2=myfield – results in term bigrams in sloppy phrase queries
myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
Enhanced stopword handling
stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield
-> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
Currently controlled by the absence of StopWordFilter in index analyzer, and
presence in query analyzer
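The pf2 rewriting shown on this slide can be illustrated with a few lines of Python. This is only a sketch of the term-bigram idea, not Solr's implementation; the function name pf2_phrases is hypothetical.

```python
def pf2_phrases(field, terms):
    """Return one two-term phrase clause per adjacent pair of query terms."""
    return ['{}:"{} {}"'.format(field, first, second)
            for first, second in zip(terms, terms[1:])]


# The two examples from the slide.
print(pf2_phrases("myfield", ["aa", "bb", "cc"]))
print(pf2_phrases("myfield", ["solr", "is", "awesome"]))
```

Note that the stopword "is" still appears in the bigrams, matching the slide's point that stopwords are kept in the optional proximity boosting part.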
10. Field Collapsing Definition
Field collapsing
Limit the number of results per category
“category” normally defined by unique values in a field
Uses
Web Search – collapse by web site
Email threads – collapse by thread id
Ecommerce/retail
Show the top 5 items for each store category (music, movies, etc.)
13. Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
14. Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
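The two response shapes above differ: grouping by field yields a "groups" list with one doclist per distinct value, while grouping by query yields a single doclist per query. A client could flatten both as sketched below. The function name docs_per_group is hypothetical, and the sample dict contains only the entries that appear complete on the slides, combined into one structure purely for illustration.

```python
def docs_per_group(grouped):
    """Map each group label to the ids of the documents returned for it."""
    result = {}
    for key, entry in grouped.items():
        if "groups" in entry:
            # group.field style: one doclist per distinct field value
            for group in entry["groups"]:
                label = "{}={}".format(key, group["groupValue"])
                result[label] = [doc["id"] for doc in group["doclist"]["docs"]]
        else:
            # group.query style: a single doclist for the whole query
            result[key] = [doc["id"] for doc in entry["doclist"]["docs"]]
    return result


# Abbreviated sample assembled from the complete parts of the slide output.
sample = {
    "manu_exact": {
        "matches": 3,
        "groups": [
            {"groupValue": "Belkin",
             "doclist": {"numFound": 2, "start": 0, "docs": [
                 {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
        ],
    },
    "price:[0 TO 99.99]": {
        "matches": 3,
        "doclist": {"numFound": 2, "start": 0, "docs": [
            {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"},
            {"id": "F8V7067-APL-KIT", "name": "Belkin Mobile Power Cord for iPod"}]},
    },
}

print(docs_per_group(sample))
```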
15. Grouping Params
parameter                        meaning                                                           default
group.field=<field>              Like facet.field – group by unique field values
group.query=<query>              Like facet.query – top docs that also match
group.function=<function query>  Group by unique values produced by the function query
group.limit=<n>                  How many docs per group                                           1
group.sort=<sort spec>           How to sort documents within a group                              Same as "sort" param
rows=<n>                         How many groups to return                                         10
sort=<sort spec>                 How to sort the groups relative to each other (based on top doc)
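Because group.field and group.query may each be repeated, a request is easiest to build from a list of name/value pairs rather than a dict. The sketch below assumes the defaults listed in the table (group.limit=1, rows=10); the helper name grouping_query and the query q=ipod are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qsl


def grouping_query(q, fields=(), queries=(), limit=1, rows=10):
    """Build a query string with one group.field / group.query per entry."""
    pairs = [("q", q), ("group", "true")]
    pairs += [("group.field", field) for field in fields]
    pairs += [("group.query", query) for query in queries]
    pairs += [("group.limit", str(limit)), ("rows", str(rows))]
    return urlencode(pairs)


# Two price-range groups, up to 5 documents in each.
qs = grouping_query("ipod",
                    queries=["price:[0 TO 99.99]", "price:[100 TO *]"],
                    limit=5)
print(qs)
```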
22. Per-segment faceting
Enable with facet.method=fcs
Controllable multi-threading
facet.field={!threads=4}myfield
Disadvantages
Larger memory use (FieldCaches + accumulators)
Slower (extra FieldCache merge step needed)
Advantages
Rebuilds FieldCache entries only for new segments (NRT friendly)
Multi-threaded
23. Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

*complete request time, measured externally
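To make the trade-off in these timings easier to read, the ratios can be computed directly. The snippet below only does arithmetic on the figures from the slide; the dictionary layout and the name fcs_speedup are assumptions for illustration.

```python
# Request times in milliseconds as (fc, fcs), copied from the slide.
timings_ms = {
    ("100-doc base set, 100,000 unique terms", "static index"): (3, 244),
    ("100-doc base set, 100,000 unique terms", "quickly changing index"): (1388, 267),
    ("1,000,000-doc base set, 100 unique terms", "static index"): (26, 34),
    ("1,000,000-doc base set, 100 unique terms", "quickly changing index"): (741, 94),
}


def fcs_speedup(fc_ms, fcs_ms):
    """How many times faster fcs is than fc (below 1 means fcs is slower)."""
    return fc_ms / fcs_ms


speedups = {case: round(fcs_speedup(fc, fcs), 2)
            for case, (fc, fcs) in timings_ms.items()}
for case, factor in speedups.items():
    print(case, factor)
```

On these numbers fcs comes out roughly 5x to 8x faster on the quickly changing index and slower on the static index, which is consistent with the advantages and disadvantages listed for per-segment faceting.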
24. Faceting Performance Improvements
For facet.method=enum, faster initial population of the filterCache (i.e. first-time faceting): gains range from 30% to 32x
Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
Optimized deep facet paging – up to 10x faster with really large facet.offset values
Less memory consumed by field cache entries
26. SolrCloud
First steps toward simplifying cluster management
Integrates Zookeeper
Central configuration (schema.xml, solrconfig.xml, etc)
Tracks live nodes + shards of collections
Removes need for external load balancers
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
Can specify logical shard ids
shards=NY_shard,NJ_shard
Clients don’t need to know shards at all:
http://localhost:8983/solr/collection1/select?distrib=true
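One way to read the shards syntax above, assuming "," separates shards and "|" separates interchangeable servers for the same shard, is sketched below. The function name parse_shards is hypothetical and this is not how Solr itself parses the parameter.

```python
def parse_shards(shards):
    """Split a shards value into a list of shards, each a list of alternatives."""
    return [[server.strip() for server in shard.split("|")]
            for shard in shards.split(",")
            if shard.strip()]


explicit = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
logical = parse_shards("NY_shard,NJ_shard")
print(explicit)
print(logical)
```

With explicit addresses each shard lists its alternatives; with logical shard ids each shard has a single entry that the cluster state would have to resolve.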
27. SolrCloud : The Future
Eliminate all single points of failure
Remove Master/Searcher distinction
Enables near real-time search in a highly scalable environment
High Availability for Writes
Eventual consistency model (like Amazon Dynamo, Cassandra)
Elastic
Simply add/subtract servers, cluster will rebalance automatically
By default, Solr will handle document partitioning
29. Auto-Suggest
Many people currently use terms component
Can be slow for a large corpus
New auto-suggest builds off SpellCheck component
Compact memory based trie for really fast completions
Based on a field in the main index, or on a dictionary file
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
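The response to this request, shown on this slide, uses the SpellCheck component's layout: "suggestions" is a flat list that alternates names and values. A client might unpack it as sketched here; the function name completions is hypothetical, and the sample dict mirrors the slide's output.

```python
def completions(response):
    """Return {typed prefix: [suggested completions]} from a suggest response."""
    flat = response["spellcheck"]["suggestions"]
    found = {}
    # The list alternates name, value, name, value, ...
    for name, value in zip(flat[0::2], flat[1::2]):
        if isinstance(value, dict):  # skips plain entries such as "collation"
            found[name] = value["suggestion"]
    return found


sample_response = {
    "spellcheck": {
        "suggestions": [
            "ult", {"numFound": 1, "startOffset": 0, "endOffset": 3,
                    "suggestion": ["ultrasharp"]},
            "collation", "ultrasharp",
        ]
    }
}

print(completions(sample_response))
```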
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
30. Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id" : "978-0641723445",
      "cat" : ["book","hardcover"],
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "sequence_i" : 1,
      "genre_s" : "fantasy",
      "inStock" : true,
      "price" : 12.50,
      "pages_i" : 384
    }
  }
}'
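The same update could be prepared from Python using only the standard library. This is a sketch that builds the request without sending it; the URL and document come from the slide, and whether a given Solr build accepts it depends on the JSON update handler being configured at that path.

```python
import json
from urllib.request import Request

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# Wrap the document in the "add" command used by the curl example.
body = json.dumps({"add": {"doc": doc}}).encode("utf-8")

request = Request(
    "http://localhost:8983/solr/update/json",
    data=body,
    headers={"Content-type": "application/json"},
)
# urllib.request.urlopen(request) would send it; it is not called here
# so that the sketch runs without a Solr server.
print(request.get_method(), request.full_url)
```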
31. Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
Can handle multi-valued fields (see “cat” field in example)
Completely compatible with the CSV update handler (can round-trip)
Results are streamed – good for dumping entire parts of the index
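A client can read this output with any CSV parser. The sketch below uses Python's csv module on the rows from the slide and splits the multi-valued "cat" field on commas, which assumes the values themselves contain no commas.

```python
import csv
import io

csv_text = (
    "name,price,cat,popularity\n"
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1\n'
    'Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1\n'
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # The multi-valued field arrives as one quoted, comma-joined cell.
    row["cat"] = row["cat"].split(",")

for row in rows:
    print(row["name"], row["price"], row["cat"])
```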