5. Take some text. Make a list of the words and where they show up.
Of course, being geeks, we throw a lot of features into that.
Indexing
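That word-to-positions list is the classic inverted index. A toy sketch in Python (purely illustrative; Lucene's actual data structures are far more involved):

```python
from collections import defaultdict

def build_index(text):
    """Map each word to the list of positions where it shows up."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return dict(index)

index = build_index("the quick brown fox jumped over the lazy dog")
# "the" shows up at positions 0 and 6
```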
6. Java search library that does indexing. You give it some words, it builds those indexes.
Most of what we will talk about is actually Lucene.
Apache Lucene
7. What is Solr?
Web application interface for Lucene
Essentially RESTful
POST in data, GET with queries
Various administrative features
Various web scaling features
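The POST-in/GET-out shape looks roughly like this (host, port, and core name here are just the stock defaults; adjust for your install):

```python
from urllib.parse import urlencode

# Hypothetical core URL; 8983 is Solr's default port.
SOLR = "http://localhost:8983/solr/books"

# GET with queries: searches hit the /select handler.
query_url = SOLR + "/select?" + urlencode({"q": "title:lucene", "wt": "json"})

# POST in data: documents go to the /update handler.
update_url = SOLR + "/update"
```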
8. Just so you know, I’m going to be blurring Solr and Lucene from here on out.
Still with me?
10. binary, boolean, byte, date, double, external file, float, geohash, int, integer, long, short, string, text, trie
11. Most of the interesting stuff happens here.
Text
12. Adding and updating records, doing statistics, correlating with your SQL database, etc.
Not required, but handy.
Unique key
13. Tokenize on whitespace or non-letter chars.
The standard tokenizer is sort of “type aware” and understands acronyms, URLs, words with apostrophes.
So-called stop words, since we’re not doing actual semantic language search.
Shingles: consecutive n-sized word groups: “the quick” “quick brown” “brown fox” “fox jumped”
Tokenize words
Strip HTML
Normalize case
Normalize accented characters
Word shingles
Stop words
Language stemming
Phonetic stemming
Synonyms
Pattern replacement
Text
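The shingle step is easy to picture with a quick sketch (not Lucene's shingle filter, just the idea):

```python
def shingles(text, n=2):
    """Consecutive n-sized word groups from a piece of text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

shingles("the quick brown fox jumped")
# ['the quick', 'quick brown', 'brown fox', 'fox jumped']
```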
15. Add and Update
Serialize your documents to XML, JSON, and a handful of others.
HTTP POST to your Solr URL.
Solr hands your data to Lucene for processing.
Updates are incremental.
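For example, serializing a document to JSON for the update handler might look like this (field names are made up; you'd POST the body to `<core>/update` with a JSON content type):

```python
import json

# One document as a plain dict; "id" is a typical unique key field.
doc = {"id": "book-1", "title": "Lucene in Action", "language": "en"}

# Solr's JSON update format takes a list of documents.
payload = json.dumps([doc])
```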
20. Yeah, one killer feature here is that Solr supports spatial search.
Give it a lat/lon.
Distance.
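In query terms, that lat/lon-plus-distance combination maps onto Solr's geofilt syntax; a sketch of the request parameters (the field name `store` is hypothetical):

```python
from urllib.parse import urlencode

# Everything within 5 km of a point, filtered on a spatial field.
params = urlencode({
    "q": "*:*",
    "fq": "{!geofilt}",
    "sfield": "store",     # the location field (hypothetical name)
    "pt": "45.15,-93.85",  # lat,lon of the center point
    "d": "5",              # distance in km
})
```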
21. Present the available values so your users can filter by them.
Great for building out rich taxonomies.
Example: facet books by language, author, genre.
Faceting.
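The books example translates to a handful of request parameters (the field names are assumed to exist in your schema):

```python
from urllib.parse import urlencode

# Ask for facet counts on three fields alongside the normal results.
params = urlencode([
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "language"),
    ("facet.field", "author"),
    ("facet.field", "genre"),
])
```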
28. Updates are incremental to keep things running fast.
For performance reasons, they don’t show up in search results until you issue a commit.
Commits are sorta heavy: 200 ms – 2 sec.
Commits
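Issuing a commit is just another request to the update handler; two common shapes, as a sketch:

```python
from urllib.parse import urlencode

# 1. POST the XML commit message to <core>/update:
commit_body = "<commit/>"

# 2. Or fold it into an update request's query string:
params = urlencode({"commit": "true"})
```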
29. Most of the time you don’t have to worry about this,
but it’s easy to screw this up if you flood the system with updates and commits.
Lock the writer
Flush updates to disk
Start a new reader
Warm up the reader’s cache
Register the reader with Solr
Tear down the old reader
Unlock the writer
30. As you’re committing changes, you’re usually creating new files in “segments”.
Optimize takes your index and rewrites it into a more compact number of files.
Good to do this periodically to use less memory and avoid running out of open files.
Optimize
31. Actual replication is pull from the slave and really fast. Like, don’t worry.
Best way to deal with high IO.
Reads go to read cores, writes go to write cores.
Scale read resources separately.
Make sure writes don’t interrupt reads.
Replication.
Stupidly easy.
32. All I’ll say is that it’s really powerful and gives you a lot of rope.
I’ve seen cache warmups take down Tomcat, in particular on a very large index with spatial search.
Caching
33. I’m a Rails generalist.
I like to do things the right way.
Solr is fast, fully featured, and can be scaled separately from the rest of your app.
It takes the load off your database and app servers, and does a better job.
In some cases, it offers features that just aren’t otherwise even possible.
In Conclusion