Solr Powr
Enterprise-grade search for your app
Nick Zadrozny
Hi, my name is Nick.

I’m a webdev, full-time with Rails since 2005.

Generalist background.

Perspective of a relative Solr noob.
Brought my generalist perspective to Websolr about six months ago.

We do hosted search.

I enjoy doing things the Right Way.




                             websolr
What is Solr?
How can we make the most of it?
Take some text

Make a list of the words
and where they show up

Of course, being geeks,
we throw a lot of
features into that




                           Indexing
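The idea above can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene actually stores its index; the build_index name and the sample documents are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the (doc_id, position) pairs where it shows up."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical sample documents.
docs = {1: "the quick brown fox", 2: "the lazy dog"}
index = build_index(docs)

print(index["the"])  # [(1, 0), (2, 0)]
print(index["fox"])  # [(1, 3)]
```

Looking up a word is then a single dictionary access, which is the basic reason an index like this is fast to search.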
Java search library that
does indexing. You give it
some words, it builds
those indexes.

Most of what we will
talk about is actually
Lucene.




                    Apache Lucene
What is Solr?

Web application interface for Lucene

Essentially RESTful

  POST in data, GET with queries

Various administrative features

Various web scaling features
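As a rough sketch of the GET side, here is how a query URL might be assembled in Python. The host, port, and "books" core name are assumptions for illustration, and the exact path depends on your Solr version and setup; q, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to your own install.
SOLR_URL = "http://localhost:8983/solr/books"


def select_url(query, rows=10):
    """Build a GET URL for Solr's select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_URL}/select?{urlencode(params)}"


print(select_url("title:lucene"))
# http://localhost:8983/solr/books/select?q=title%3Alucene&rows=10&wt=json
```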
Just so you know, I’m
going to be blurring Solr
and Lucene from here on
out.




                            Still with me?
Do smarter things with a
little bit of structure.




                           Schema
binary, boolean, byte, date, double,
external file, float, geohash, int, integer,
long, short, string, text, trie
Most of the interesting
stuff happens here




                          Text
Unique key
Not required, but handy for adding and updating records, doing statistics, correlating with your SQL database, etc.
Text

  Tokenize words
  Strip HTML
  Normalize case
  Normalize accented characters
  Pattern replacement
  Stop words
  Language stemming
  Phonetic stemming
  Synonyms
  Word shingles

Tokenizing: split on whitespace or non-letter characters. The standard tokenizer is sort of “type aware” and understands acronyms, URLs, and words with apostrophes.

Stop words: so-called stop words are dropped, since we’re not doing actual semantic language search.

Shingles: consecutive n-sized word groups, such as “the quick”, “quick brown”, “brown fox”, “fox jumped”.
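A few of the analysis steps listed above can be approximated in plain Python. This is a simplified sketch, not Solr's analyzers: the regular expressions, the tiny stop word list, and the function names are all assumptions made for the example.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the"}  # tiny illustrative list


def strip_html(text):
    """Naive tag removal; real HTML needs a proper parser."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lowercase, then keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(tokens, size=2):
    """Consecutive word groups of the given size."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = tokenize(strip_html("<p>The quick brown fox jumped</p>"))
print(tokens)                     # ['the', 'quick', 'brown', 'fox', 'jumped']
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'jumped']
print(shingles(tokens))           # ['the quick', 'quick brown', 'brown fox', 'fox jumped']
```

In Solr itself these steps are configured per field type in the schema rather than written by hand.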
Index rich content
 HTML, PDF, Word, etc.
Add and Update
Serialize your documents to XML, JSON, and a handful of other formats.

HTTP POST to your Solr URL.

Solr hands your data to Lucene for processing.

Updates are incremental.
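A minimal sketch of the serialize-and-POST step, using only the Python standard library. The URL, core name, and document fields are hypothetical, and the update path and accepted JSON shape vary between Solr versions, so treat this as an outline rather than a recipe. The request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical Solr location; adjust to your own install.
SOLR_URL = "http://localhost:8983/solr/books"


def build_update_request(docs):
    """Serialize documents to JSON and wrap them in an HTTP POST request."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_URL}/update",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_request([{"id": "1", "title": "Lucene in Action"}])
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```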
Querying
Powerful query syntax.
  Boolean logic is just the start.
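To give a flavor of the query syntax, here is a small sketch that composes a Lucene-style boolean query string. The helper names and the field names (title, author, status) are invented for the example; the AND, OR, NOT operators and quoted phrases are part of the standard Lucene query syntax.

```python
def phrase(field, text):
    """Match an exact phrase in a field."""
    return f'{field}:"{text}"'


def any_of(*clauses):
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    return "(" + " AND ".join(clauses) + ")"


query = all_of(
    phrase("title", "quick brown fox"),
    any_of("author:smith", "author:jones"),
    "NOT status:draft",
)
print(query)
# (title:"quick brown fox" AND (author:smith OR author:jones) AND NOT status:draft)
```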
min, max, average,
stddev




              Numeric operations.
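One way to ask for these numbers is Solr's stats component. The sketch below only builds the request parameters; the price field is hypothetical, and whether stats are available depends on your Solr version and request handler configuration.

```python
from urllib.parse import urlencode


def stats_params(field, query="*:*"):
    """Parameters asking Solr for statistics on one numeric field."""
    return {
        "q": query,
        "rows": 0,            # we only want the statistics, not the documents
        "stats": "true",
        "stats.field": field,
    }


query_string = urlencode(stats_params("price"))
print(query_string)
```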
Do stuff relative to “now”.




                       Date ranges,
                        date math.
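A short sketch of what date math looks like in a range query. The date_range helper and the published_at field are made up for the example; NOW, unit offsets such as -7DAYS, and rounding with /DAY are Solr date math expressions.

```python
def date_range(field, start, end="NOW"):
    """Build a range query using Solr date math expressions."""
    return f"{field}:[{start} TO {end}]"


# Everything from the last seven days.
print(date_range("published_at", "NOW-7DAYS"))
# published_at:[NOW-7DAYS TO NOW]

# The last month, rounded to whole days.
print(date_range("published_at", "NOW/DAY-1MONTH", "NOW/DAY"))
# published_at:[NOW/DAY-1MONTH TO NOW/DAY]
```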
Yeah, one killer feature
here is that Solr supports
spatial search.

Give it a lat/lon.




                             Distance.
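For illustration, a distance filter might be expressed with parameters like the ones below. This follows the geofilt and geodist syntax from Solr's spatial search documentation, which may differ from the Solr version used at the time of this talk; the location field and the coordinates are hypothetical.

```python
from urllib.parse import urlencode


def nearby_params(field, lat, lon, radius_km):
    """Parameters for 'everything within radius_km of this point, nearest first'."""
    return {
        "q": "*:*",
        "fq": "{!geofilt}",       # filter by distance from pt
        "sfield": field,          # the indexed lat/lon field
        "pt": f"{lat},{lon}",     # center point
        "d": radius_km,           # radius in kilometers
        "sort": "geodist() asc",  # nearest results first
    }


params = nearby_params("location", 32.7157, -117.1611, 10)
print(urlencode(params))
```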
Present the available values for a field so your users
can filter by them.

Great for building out rich taxonomies.

Example: facet books by language, author,
genre.




                                             Faceting.
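Conceptually, a facet is a count of documents per value of a field. The sketch below computes that by hand over a made-up list of books; in Solr you would typically request it with the facet=true and facet.field parameters instead of counting yourself.

```python
from collections import Counter

# Hypothetical documents.
books = [
    {"title": "Book A", "language": "English", "genre": "Fiction"},
    {"title": "Book B", "language": "English", "genre": "History"},
    {"title": "Book C", "language": "Spanish", "genre": "Fiction"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of the field."""
    return Counter(doc[field] for doc in docs if field in doc)


print(facet_counts(books, "language").most_common())
# [('English', 2), ('Spanish', 1)]
```

Each value and its count can be shown as a filter link next to the search results.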
Spelling suggestions for user queries.

Query auto-suggest from popular queries.




                   “Did you mean…?”
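The general idea of a spelling suggestion can be sketched with Python's difflib. This is a stand-in technique for illustration, not Solr's spellcheck implementation; the vocabulary list and the did_you_mean name are invented.

```python
import difflib

# Made-up vocabulary; a search engine would draw this from indexed terms.
INDEXED_TERMS = ["lucene", "solr", "search", "schema"]


def did_you_mean(word, vocabulary=INDEXED_TERMS):
    """Return the closest known term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1)
    return matches[0] if matches else None


print(did_you_mean("lucnee"))  # lucene
print(did_you_mean("zzzz"))    # None
```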
Generate a list of similar
documents. Consider blog
posts.




                             More Like This
Probably more.
Solr in Production
This is why we run Solr.




                     It’s really, really fast.
                           When properly configured.
Average max response
time is 75 ms.

Even the 95th percentile is
way below that.
Updates are incremental to keep things running fast.

For performance reasons, they don’t show up in search results until you issue a commit.

Commits are sorta heavy: 200 ms to 2 sec.




                                              Commits
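Because commits are relatively heavy, one common approach is to send many updates and commit once at the end. The sketch below only produces the JSON bodies to POST; the batch size is arbitrary, the update_payloads name is made up, and the {"commit": {}} command follows Solr's JSON update format, which may not match every Solr version.

```python
import json


def update_payloads(docs, batch_size=100):
    """Yield JSON bodies: batches of documents, then a single commit."""
    for start in range(0, len(docs), batch_size):
        yield json.dumps(docs[start:start + batch_size])
    # One commit for the whole run instead of one per batch.
    yield json.dumps({"commit": {}})


docs = [{"id": str(i)} for i in range(250)]  # hypothetical documents
payloads = list(update_payloads(docs))

print(len(payloads))  # 4: three batches of documents plus one commit
print(payloads[-1])   # {"commit": {}}
```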
Most of the time you don’t have to worry about this, but it’s easy to screw this up if you flood the system with updates and commits.

  Lock the writer
  Flush updates to disk
  Start a new reader
  Warm up the reader’s cache
  Register the reader with Solr
  Tear down the old reader
  Unlock the writer
As you commit changes, you’re usually creating new files, grouped into “segments”.

Optimize takes your index and rewrites it into a smaller number of files.

It’s good to do this periodically to use less memory and avoid running out of open file handles.




                                  Optimize
Replication is pull-based: the slave pulls from the master, and it’s
really fast. Like, don’t worry.

Best way to deal with high IO.

Reads go to read cores, writes go to write
cores.

Scale read resources separately.

Make sure writes don’t interrupt reads.




                                            Replication.
                                             Stupidly easy.
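On the application side, splitting reads from writes can be as simple as picking a different base URL per operation. The class below is a hypothetical sketch with made-up host names; it rotates reads across the read cores and sends every update to the single write core.

```python
import itertools


class SolrRouter:
    """Send updates to the write core and spread queries over the read cores."""

    def __init__(self, write_core, read_cores):
        self.write_core = write_core
        self._read_cycle = itertools.cycle(read_cores)  # simple round-robin

    def url_for(self, operation):
        if operation == "update":
            return f"{self.write_core}/update"
        if operation == "select":
            return f"{next(self._read_cycle)}/select"
        raise ValueError(f"unknown operation: {operation}")


# Hypothetical hosts.
router = SolrRouter(
    "http://write1:8983/solr/books",
    ["http://read1:8983/solr/books", "http://read2:8983/solr/books"],
)
print(router.url_for("update"))  # http://write1:8983/solr/books/update
print(router.url_for("select"))  # http://read1:8983/solr/books/select
print(router.url_for("select"))  # http://read2:8983/solr/books/select
```

Adding read capacity is then a matter of appending another read core to the list.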
All I’ll say is that it’s really
powerful and gives you a lot
of rope.

I’ve seen cache warmups
take down Tomcat — in
particular, on a very large
index with spatial search.




                                   Caching
I’m a Rails generalist

I like to do things the right way.

Solr is fast, fully-featured, and can be
scaled separately from the rest of your
app.

It takes the load off your database and
app servers, and does a better job.

In some cases, it offers features that just
aren’t otherwise even possible.




                                           In Conclusion
Questions?
Thanks!

2010 08-06 - sd ruby - solr
