At Basis Technologies' Open Source Search conference I talked about a project I did this past year and the lessons, both good and bad, that we learned.
3. Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
• Member of Apache Software
Foundation
• SOLR-284 UpdateRichDocuments
(July 07)
• Fascinated by the art of software
development
Tuesday, October 2, 2012
4. 2nd edition!
CO-AUTHOR
5. Telling some war stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
6. Not an intro to
SolrCloud!
• Great tutorials given by Tomás Fernández
Löbbe from LucidWorks yesterday!
7. Background for Client
X’s Project
• Big Data is any data set that is primarily at
rest due to the difficulty of working with it.
• 100’s of millions of documents to search
• Limited selection of tools available.
• Aggressive timeline.
• All the data must be searched per query.
• On Solr 3.x line
8. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
9. Boy meets Girl Story
[Diagram: Metadata and Content Files flow into an Ingest Pipeline, which feeds several Solr instances]
12. Make it easy to change
approach
13. Make it easy to change
sharding
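// The strategy is loaded by class name, so the sharding approach can be
// swapped by changing one string rather than the indexing code.
public void run(Map options, List<SolrInputDocument> docs)
        throws InstantiationException, IllegalAccessException, ClassNotFoundException {
    IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
            "com.o19s.solr.ModShardIndexStrategy").newInstance();
    indexStrategy.configure(options);
    // Each document is handed to the strategy, which decides where it goes.
    for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
    }
}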
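As a rough illustration of what a mod-based sharding strategy might do internally, here is a minimal sketch. The slide only names `IndexStrategy` and `ModShardIndexStrategy`; the class below, its `shardFor` method, and the use of plain string ids in place of `SolrInputDocument` are assumptions made so the example runs without SolrJ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: route each document id to one of N shards by hash modulo. */
class ModShardSketch {
    private final List<List<String>> shards = new ArrayList<>();

    ModShardSketch(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<String>());
        }
    }

    /** floorMod keeps the result in [0, numShards) even for negative hash codes. */
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shards.size());
    }

    /** Stand-in for sending the document to the Solr core that owns the shard. */
    void addDocument(String docId) {
        shards.get(shardFor(docId)).add(docId);
    }

    int shardSize(int shard) {
        return shards.get(shard).size();
    }
}
```

Because the shard is a pure function of the id, the same document always lands on the same core, which matters when a document is re-indexed.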
14. Separate JVM from Solr
Cores
• Step 1: Fire up empty Solr instances on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2)
• Step 4: Create an “aggregator” core, passing in the URLs of the Cores. (&property.shards=)
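The value handed to the aggregator core in Step 4 follows the form of Solr's distributed-search `shards` parameter: comma-separated `host:port/solr/core` entries. A minimal sketch of building it, assuming every host uses the same port and core name; the helper class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles the shards list for an aggregator core. */
class ShardsParam {
    /** Joins host:port/solr/core entries with commas. */
    static String build(List<String> hosts, int port, String core) {
        List<String> entries = new ArrayList<>();
        for (String host : hosts) {
            entries.add(host + ":" + port + "/solr/" + core);
        }
        return String.join(",", entries);
    }
}
```

Calling `ShardsParam.build` with the hosts from Step 1, port 8983, and the core name `run2` yields the string to append after `&property.shards=`.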
17. Simple Pipeline
• Simple pipeline
• mv is atomic
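The idea behind "mv is atomic" can be sketched in Java: write the file in a staging directory, then rename it into the directory the next pipeline stage watches, so that stage never sees a half-written file. The directory names and the helper below are assumptions, and the atomic guarantee only holds when both directories are on the same file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical hand-off between two pipeline stages using an atomic rename. */
class AtomicHandoff {
    /** Writes content under staging, then moves the finished file into inbox. */
    static Path publish(Path staging, Path inbox, String name, String content)
            throws IOException {
        Path tmp = staging.resolve(name);
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        // Throws AtomicMoveNotSupportedException if the file system
        // cannot perform the move as a single atomic operation.
        return Files.move(tmp, inbox.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }
}
```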
18. Don’t Move Files
• SCP across machines is slow/error prone
• NFS share, single point of failure.
• Clustered file system like GFS (Global File
System) can have “fencing” issues
• HDFS shines here.
• ZooKeeper shines here.
19. Can you test your
changes?
20. JVM tuning is a black art
-verbose:gc
-XX:+PrintGCDetails
-server
-Xmx8G
-Xms8G
-XX:MaxPermSize=256m
-XX:PermSize=256m
-XX:+AggressiveHeap
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=16
-XX:+UseParallelOldGC
23. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
24. Using Solr as key/value store
[Diagram: Metadata and Content Files flow into an Ingest Pipeline, which uses a Solr Key/Value Cache and feeds several Solr instances]
25. Using Solr as key/value store
• Thousands of queries per second without real time get.
http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• How fast with real time get?
http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
26. Push schema definition
to the application
• Not “schema less”
• Just different owner of schema!
• Schema may have common set of fields like
id, type, timestamp, version
• Nothing required.
q=intensity_i:[0 TO 70]&fq=TYPE:streetlamp_monitor
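When the application owns the schema, it has to choose field names that tell Solr the type, as with `intensity_i` in the query above. A minimal sketch, assuming the dynamic-field suffix conventions from Solr's example schema (`*_i` int, `*_l` long, `*_f` float, `*_d` double, `*_b` boolean, `*_s` string); the helper class and the `fieldNameFor` method are hypothetical.

```java
/** Hypothetical helper: derive a dynamic-field name from a value's Java type. */
class DynamicFieldNames {
    /** Appends the suffix matching the value's type, e.g. ("intensity", 70) gives "intensity_i". */
    static String fieldNameFor(String baseName, Object value) {
        if (value instanceof Integer) return baseName + "_i";
        if (value instanceof Long) return baseName + "_l";
        if (value instanceof Float) return baseName + "_f";
        if (value instanceof Double) return baseName + "_d";
        if (value instanceof Boolean) return baseName + "_b";
        // Everything else falls back to an untokenized string field.
        return baseName + "_s";
    }
}
```

This only works if the Solr schema actually declares matching `dynamicField` patterns, so the suffix list should be checked against the deployed schema.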
27. Don’t do expensive
things in Solr
• Tika content extraction aka Solr Cell
• UpdateRequestProcessorChain
29. Beware JavaBin
[Diagram: Metadata and Content Files flow into an Ingest Pipeline, which uses a Solr Key/Value Cache and feeds several Solr instances]
32. Beware JavaBin
[Diagram: the same pipeline, now talking to both Solr 3.4 and Solr 4 instances]
Which SolrJ version do I use?
33. No JavaBin
Give me /update/avro!
• Avoid Jarmaggeddon
• Reflection? Ugh.
34. Avro!
• Supports serialization of data readable from
multiple languages
• It’s smart XML, w/o the XML!
• Handles forward and backward compatibility between
versions of an object
• Compact and fast to read.
35. Avro!
[Diagram: the same ingest pipeline, Solr Key/Value Cache, and Solr instances, with .avro as the exchange format]
36. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes
37. Upgrade Lucene
Indexes Easily
• Don’t reindex!
• Try out new versions of
Lucene based search engines.
David Lyle
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
48. POOLED ENVIRONMENT
think Cloud!
49. Do I need Failover?
• Can I build quickly?
• Do I have a reliable cluster of servers?
• Am I spread across data centers?
• Failover is sooo 90’s...
50. Telling some stories
• Prototyping
• Application Development
• Maintaining Your Big Search Indexes