3. Who am I?
• Principal at OpenSource Connections
  - a Solr/Lucene search consultancy
• Member of the Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 2007)
• Fascinated by the art of software development
7. Not an intro to cloud computing
• See "Indexing Big Data on Amazon AWS" by Scott Stults @ 1:15 Thursday
• See "How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud" by Seshu Simhadi @ 2:55 Thursday
8. Not an intro to SolrCloud!
• See "How SolrCloud Changes the User Experience in a Sharded Environment" by Erick Erickson @ 2:55 Today
• See "Solr 4: The SolrCloud Architecture" by Mark Miller @ 10:45 Tomorrow
9. My assumptions for Client X
• Big Data is any data set that is primarily at rest due to the difficulty of working with it.
• Limited selection of tools available.
• Aggressive timeline.
• All the data must be searched per query.
• On the Solr 3.x line.
15. Make it easy to change sharding
public void run(Map options, List<SolrInputDocument> docs)
    throws InstantiationException, IllegalAccessException, ClassNotFoundException {
  // Load the sharding strategy by name so it can be swapped via configuration
  IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
      "com.o19s.solr.ModShardIndexStrategy").newInstance();
  indexStrategy.configure(options);
  for (SolrInputDocument doc : docs) {
    indexStrategy.addDocument(doc);
  }
}
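For context, here is a minimal, runnable sketch of the plumbing that run() method relies on. IndexStrategy and ModShardIndexStrategy are named on the slide, but the internals below (a comma-separated "shards" option and id-hash-mod routing) are assumptions, and a plain String id stands in for SolrInputDocument to keep the sketch free of Solr dependencies.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the interface the reflective Class.forName() call expects.
interface IndexStrategy {
    void configure(Map<String, String> options);
    void addDocument(String docId);
}

class ModShardIndexStrategy implements IndexStrategy {
    private List<String> shardUrls;

    @Override
    public void configure(Map<String, String> options) {
        // e.g. options.get("shards") -> "http://s1/solr,http://s2/solr"
        shardUrls = List.of(options.get("shards").split(","));
    }

    // Pick a shard by hashing the document id mod the shard count,
    // so the same id always lands on the same shard.
    String shardFor(String docId) {
        int bucket = Math.abs(docId.hashCode() % shardUrls.size());
        return shardUrls.get(bucket);
    }

    @Override
    public void addDocument(String docId) {
        // A real strategy would POST the document to the chosen shard URL.
        System.out.println(docId + " -> " + shardFor(docId));
    }
}

public class ShardDemo {
    public static void main(String[] args) {
        IndexStrategy strategy = new ModShardIndexStrategy();
        strategy.configure(Map.of("shards", "http://s1/solr,http://s2/solr"));
        strategy.addDocument("DOC45242");
    }
}
```

Because the strategy is chosen by class name, switching from mod sharding to, say, time-based sharding is a configuration change, not a code change.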
16. Separate JVM from Solr Cores
• Step 1: Fire up empty Solrs on all the servers (nohup &).
• Step 2: Verify they started cleanly.
• Step 3: Create cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2).
• Step 4: Create an "aggregator" core, passing in the URLs of the cores (&property.shards=).
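The four steps might look like this on the command line. Treat it as an untested sketch: the hostnames and core names follow the slide's examples, and the exact admin URL depends on how the core admin handler is mapped in your solr.xml.

```
# Step 1: start an empty Solr on each server, detached from the shell
nohup java -jar start.jar > solr.log 2>&1 &

# Step 2: verify it came up cleanly
curl 'http://search1.o19s.com:8983/solr/admin/ping'

# Step 3: create a core for this run
curl 'http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2&instanceDir=run2'

# Step 4: create the "aggregator" core, passing shard URLs as a core property
curl 'http://search1.o19s.com:8983/solr/admin/cores?action=CREATE&name=run2_agg&instanceDir=agg&property.shards=search1.o19s.com:8983/solr/run2,search2.o19s.com:8983/solr/run2'
```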
20. Don't Move Files
• scp across machines is slow and error prone.
• An NFS share is a single point of failure.
• A clustered file system like GFS (Global File System) can have "fencing" issues.
• HDFS shines here.
• ZooKeeper shines here.
27. Using Solr as a key/value store
• Thousands of queries per second without real time get:
  http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
• ??? with real time get?
  http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
28. Using Solr as key/value store
[Diagram: an ingest pipeline reads content files, pulls metadata from a Solr key/value cache, and feeds the Solr index shards.]
30. Push schema definition to the application
• Not "schema less"
• Just a different owner of the schema!
• The schema may have a common set of fields like id, type, timestamp, version.
• Nothing required.
q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
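The `intensity_i` query above hints at how the application can own the schema: via dynamic-field naming conventions, where a suffix such as `*_i` or `*_s` tells Solr how to index the value. The helper below is an illustrative sketch, not from the deck, assuming the stock Solr dynamic-field suffixes.

```java
import java.util.Map;

// Sketch of application-owned schema conventions, assuming Solr's stock
// dynamic-field suffixes (*_i for int, *_s for string, *_b for boolean).
public class DynamicFields {
    private static final Map<Class<?>, String> SUFFIXES = Map.of(
            Integer.class, "_i",
            String.class, "_s",
            Boolean.class, "_b");

    // Build a field name whose suffix tells Solr how to index the value,
    // so the application, not schema.xml, decides each concrete field.
    public static String fieldName(String base, Class<?> type) {
        String suffix = SUFFIXES.get(type);
        if (suffix == null) {
            throw new IllegalArgumentException("no dynamic field suffix for " + type);
        }
        return base + suffix;
    }

    public static void main(String[] args) {
        System.out.println(fieldName("intensity", Integer.class)); // intensity_i
    }
}
```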
31. Don't do expensive things in Solr
• Tika content extraction, aka Solr Cell
• UpdateRequestProcessorChain
33. Avro!
• Supports serialization of data readable from multiple languages
• It's smart XML
• Handles forward and backward versions of an object
• Compact and fast to read.
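Avro schemas are plain JSON. A minimal sketch (the record and field names are illustrative, not from the deck) of how versioning works: a field declared with a "default" can be filled in when reading data written by an older schema, and readers on an older schema simply ignore writer fields they don't know about.

```
{
  "type": "record",
  "name": "EnrichedDoc",
  "namespace": "com.example",
  "fields": [
    {"name": "id",   "type": "string"},
    {"name": "html", "type": "string"},
    {"name": "entities",
     "type": {"type": "array", "items": "string"},
     "default": []}
  ]
}
```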
Search was the original big data problem. Then Google search came along, and search wandered in the wilderness of internal enterprise search and ecommerce search. Now search is back, with a new, cooler name: "Big Data". Search interfaces are the dominant metaphor for working with big data sets by non-data-scientists.
SOLR-284, back in July 2007, was a first cut at a content extraction library before Tika came along.
And I love agile development processes. I think of agile as business -> requirements -> development -> testing -> systems administration.
\n
And I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in Bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today on building large scaled-out ingestors.
Notice our property style? It made it easy to read properties in both Bash and Java!
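A sketch of that shared property style (the file name and keys here are made up): plain KEY=value lines with no spaces around the "=" parse with java.util.Properties on the Java side and can be `source`d directly by a Bash script.

```java
import java.io.StringReader;
import java.util.Properties;

public class SharedProps {
    public static void main(String[] args) throws Exception {
        // Contents of a hypothetical run.properties file. No spaces around
        // '=', so Bash can simply:  source run.properties; echo $SHARD_COUNT
        String config = "SHARD_COUNT=4\nSOLR_URL=http://search1.o19s.com:8983/solr\n";

        Properties props = new Properties();
        props.load(new StringReader(config));   // the Java side of the bargain
        System.out.println(props.getProperty("SHARD_COUNT")); // 4
    }
}
```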
Try sharding at different sizes using mod. Try sharding by month, week, or hour, depending on your volume of data.
We had huge leftover "enterprise" boxes with ginormous amounts of RAM and CPU.
The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
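Concretely, the flags go on the Solr startup line. This is a sketch: the gc.log path and the start.jar launcher are assumptions for a Solr 3.x-style example install.

```
java -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar start.jar

# then count full collections over time, e.g.:
grep -c 'Full GC' gc.log
```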
Again, horse-racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
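A sketch of that move, using the CSV response writer on the way out and the CSV update handler on the way in. Hostnames and core names are assumed, and the export only contains what you stored, hence "store everything."

```
# Export stored fields as CSV from the source Solr
curl 'http://search1.o19s.com:8983/solr/run2/select?q=*:*&rows=1000000&wt=csv' > dump.csv

# Load the dump into the target Solr through the CSV update handler
curl 'http://search2.o19s.com:8983/solr/run2/update/csv?commit=true' \
     --data-binary @dump.csv -H 'Content-Type: text/csv'
```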
You have many fewer Solrs than you do indexer processes.
Dollar Tree makes crap; stores are always empty or missing items. You don't want your indexing to be like that. The Space Shuttle cost 500 million dollars every time it launched. You don't want your indexing process to be like launching the Space Shuttle.
Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
HAL 9000 misbehaved. Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving. Especially important if you are on a cloud platform: providers implement their servers on the cheapest commodity hardware.
Kaa, the snake from The Jungle Book, hypnotizing Mowgli. danah boyd, among others, has said that Big Data sometimes throws out thousands of years