Solr/ElasticSearch
for CF Developers
By Mary Jo Sminkey
Who Am I?
• Senior Web Developer at CFWebtools, LLC
• ColdFusion Developer since CF3/Allaire
• Cosplayer
• Dog Trainer
• Sewer/Baker/Knitter/Origamist/Assorted Crafts
• Cancer Survivor
• Fibromyalgia/Invisible Disabilities Advocate
What is Solr?
• Standalone Full-Text Search Engine with Apache Lucene Backend
• Open-source, distributed, highly scalable, enterprise grade search
• http://lucene.apache.org/solr/
• Included in ColdFusion since CF9 replacing Verity (cfsearch/cfindex)
• Thru CF11 – Solr 3
• CF2016 – Solr 5 (5.1.2)
• Current Release of Solr is version 6.2.0
Why use Solr/ES instead of CF tags?
• Any CF version prior to 2016 has ancient Solr 3.x versions
• Full Access to latest Solr/ES versions and patches
• Ability to use cloud-based distributed setups (essential for enterprise
sites)
• Access to far more features and use of REST/JSON
• Code more easily converted to other search engines and languages
Solr vs. ElasticSearch – What should I use?
• Solr has been around a lot longer, so it is more mature, very well
documented and has strong backwards compatibility. Developers often
mention that ES is not nearly as well documented, so plan on investing in
other sources to really get a handle on it.
• ES, being younger, is built on modern standards and ideas (particularly
REST), is designed specifically for handling large indices and high query rates,
and since it isn’t as strictly community driven it can often move forward
more quickly with new features, bug fixes, etc.
• Both have very active communities and are very actively still being
developed and moved forward. Solr in particular has pretty much caught
up to many of the advances brought by ElasticSearch entering the
marketplace such as full REST support.
Solr vs. ElasticSearch – What should I use?
(cont)
• Solr excels at text-search applications, ElasticSearch for analytics (lots
of monitoring and metrics exposed).
• In areas like log analysis, ES is by far the more common choice to
use. This is due to its very advanced “aggregations” framework, which
replaced earlier faceting.
• https://www.elastic.co/blog/out-of-this-world-aggregations
• Solr uses a terse syntax, vs. ES which is much more verbose
• This makes ES generally more readable, but the terse syntax of Solr
makes more advanced relevancy possibilities easier to handle, of
particular interest in text-search applications.
Solr vs. ElasticSearch – What should I use?
(cont)
• SearchComponents in Solr allow for much more easily customizable
searches that can be easily reused across multiple applications or
within an application.
• ES is generally considered a bit easier to get started with and to
cluster. Solr generally requires a bit more work to get your head
around; it forces you to read over and learn the config files to get
running, for instance - but this is not necessarily a BAD thing.
• If you are going to use REST, both now have excellent support,
although ES is more REST-compliant. But if you plan to go another
route, Solr tends to have better support; for instance it has excellent
Java support via the SolrJ library.
Amazon CloudSearch
• There are some other Lucene-based searches you can consider.
• The most popular of these is Amazon CloudSearch
• Easy to set up, AWS managed service with automatic scaling
• Provides most commonly used text-search features like highlighting,
autocomplete, simple faceting, grouping, geospatial search, etc.
• It is considerably more limited than Solr and ES when it comes to doing
advanced search relevancy tuning and/or advanced metrics.
Solr vs. ElasticSearch – More Reading
• http://solr-vs-elasticsearch.com/
• https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/
• https://www.datanami.com/2015/01/22/solr-elasticsearch-question/
• http://opensourceconnections.com/blog/2015/12/15/solr-vs-elasticsearch-
relevance-part-one/
• http://opensourceconnections.com/blog/2016/01/22/solr-vs-elasticsearch-
relevance-part-two/
• http://harish11g.blogspot.com/2015/07/amazon-cloudsearch-vs-
elasticsearch-vs-Apache-Solr-comparison-report.html
So let’s look at the features of Solr
(particularly Solr 4+ versions)
• Full REST API for schema management, indexing, searching, etc.
(Solr 5+)
• Wide variety of built in tokenizers and analyzers
• Grouping, faceting, highlighting, spelling suggestions, autocomplete
• Filtering, document and field boosting, custom ranking, etc.
• Near Real-Time Indexing
• Extensible Plugin Architecture
Solr 6 Features
• Parallel SQL – The big WOW feature of version 6 is bringing SQL support to
Solr which works across SolrCloud collections. This is done by a SQL parser
that converts SQL queries to Solr streaming expressions.
• SQL Request Handler – SolrCloud collections can be queried with standard
SQL language using the /sql request handler.
• JDBC Driver – Connect to the SolrCloud collections with any tool that
supports JDBC and query the collection directly
• Still somewhat experimental and not quite ready for primetime usage but
improving rapidly.
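As a rough sketch of what this enables (collection and field names are illustrative, not from the example app), a SQL query sent to the /sql request handler looks like:

```text
POST /solr/classic/sql
stmt=SELECT productname, price FROM classic ORDER BY price DESC LIMIT 10
```

Under the hood the SQL parser converts this statement into Solr streaming expressions and runs it across the SolrCloud collection.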
Solr 6 Features (cont)
• Many other improvements and advancements with streaming
expressions (Merging search with parallel computing, across multiple
sources)
• Push/Pull streaming, request/response streaming
• Solr collections are able to auto-update themselves via these kinds of
streaming commands.
• https://sematext.com/blog/tag/streaming-expressions/
Let’s Focus on Text Searches!
• This is primarily what CF developers would have been using Solr
integration for – cfsearch/cfindex
• While a number of things we’re going to look at are included in the CF
integration, you can do a lot more once you move to standalone Solr.
Our Target Site and Objectives
• Classic industries (classicindustries.com)
• Ecommerce Site for Classic Car Parts
• Customers can select a car model (catalog) and year to filter their search
• Single text box search that needs to search across multiple fields but return
the best possible matches
• Search pages need to also include data like breadcrumb trail, category
menus, nested structure of category totals, etc.
• We would like to add additional elements like spelling suggestions and
highlighting.
Step 1 - Schema
• The schema defines the fields and their types that will be indexed for
searching.
• Solr/ES can both be used with a schema or schema-less, and both
support dynamic field types, etc. Typically you would only use schema-less
mode for development and then switch to a managed schema for production.
• Solr 5 and up can handle most schema changes via the REST
service.
• You can also make schema changes via the Solr admin console
Sample REST – Add Field Type
POST /schema
Content-type: application/json
{
"add-field-type": {
"name": "simpleTextSpell",
"class": "solr.TextField",
"positionIncrementGap":100,
"indexAnalyzer": {
"tokenizer": {
"class": "solr.StandardTokenizerFactory"
},
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.RemoveDuplicatesTokenFilterFactory"
}
]
}
}
}
Sample REST – Add Field
POST /schema
Content-type: application/json
{
"add-field": {
"name": "simpleSpell",
"type": "simpleTextSpell",
"indexed": true,
"stored": true
}
}
Schema - Analyzers, Tokenizers and Filters
• These are used to tell Solr how to prepare the text string for indexing
(and/or querying).
• Proper handling of this step is essential for good search results.
• While Solr has a lot of built-in field types for text fields, you may
often need to add your own field types to get the best results.
• With simple text fields, you often will use the same analyzer for the
indexing and the query steps. The more complex handling your field
needs, the more likely you may need different analyzers for the
indexing vs. the querying.
Schema - Sample Analyzers
<fieldType name="nametext" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Schema - Tokenizers
• Tokenizers determine how the text string will be split up into “tokens”.
A common first step is to split the string on whitespace and/or
punctuation (sentences, etc. split into the individual words so we can
search on them instead of the entire string).
• Solr includes a whole variety of tokenizers, including ones designed for
specific kinds of data, like file paths or email addresses, as well as to
handle multi-lingual text.
• You can also process your tokens using regular expressions… or
return the entire field as a single token.
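For example, two hedged sketches (the type names are made up for illustration): a regex-based tokenizer that splits on commas, and one that keeps the whole field as a single token:

```xml
<fieldType name="commaDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=",\s*"/>
  </analyzer>
</fieldType>
<fieldType name="wholeField" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```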
Schema - Filters
• Filters are used after tokenizing to further manipulate your data.
• Some common filters actions are to convert everything to lowercase
so searches aren’t case-sensitive and discarding common words that
aren’t useful in searches (a, and, the, etc.)
• A dictionary filter might be used on a field that you intend to use for
spelling corrections.
• You can also use a synonym filter to create word mappings to match
on.
Schema - CharFilters
• Unlike regular filters, charFilters are used PRIOR to tokenizing your
data.
• You might use these to do things like strip out HTML tags, comments,
or any other text you don’t want your search to find.
• Solr includes both a charFilter specifically for removing HTML markup
as well as a Regex style replace filter.
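A minimal sketch of a field type that strips HTML before tokenizing, using Solr’s built-in charFilter (the type name is illustrative):

```xml
<fieldType name="htmlText" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```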
Schema - Copy Fields
• Your schema can include fields which just copy data from other fields
to index. Typically they will have a different set of analyzers to
manipulate the data.
• For example, I may want my search to return matches on the original
words higher than the ones that match on a synonym. To do this, I
would use a field copy for the synonym matches which will give lower
ranking.
• Another common use is for spell checking in which you may want to
copy text fields that use different field types to the one you use for
spellchecking.
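In the schema, a copy field is just a source/destination mapping; for example (reusing field names that appear in later slides):

```xml
<copyField source="prodname" dest="prodnamesynonym"/>
<copyField source="prodname" dest="simpleSpell"/>
<copyField source="productinfo" dest="simpleSpell"/>
```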
Schema in Place, Let’s Index Our Data!
• In the past most CF code would use the SolrJ library for standalone
Solr work.
• This is still an option but now we have REST as an alternative. The
REST libraries are generally much easier to work with since you don’t
have to figure out all the nested methods that are holding your data,
it’s just all returned in a simple JSON object.
• There is very little now in Solr that you cannot do through REST,
including adding or modifying cores, making all schema changes, and
of course indexing and searching.
Schema in Place, Let’s Index Our Data! (cont)
• Solr’s REST integration has been continually improved but you may still run
into some gotchas. For instance, the handling of multiple documents in an
indexing request when you need to include additional parameters like a
custom boost makes it impossible in most languages to simply convert a
native object to the necessary JSON object (multiple name-value pairs in an
object with the same name).
• ColdFusion has its own quirks (bugs) that you have to watch out for. The
most common one you’ll run into is it trying to treat a string that is all
numbers as a numeric value and not wrapping it in quotes. A typical hack is
to add some string in front of any such field prior to CF serializing and then
doing a search and replace to remove it prior to the REST request.
Sample REST – Indexing Data
POST /update?wt=json
Content-type: application/json
{
"add": {
"doc": {
"productname": "Sample AC Part",
"sku": "AC234",
"catalogs": ["Camaro", "Impala", "Firebird"],
"id": 8753
},
"boost": 2.0
},
"add": {
"doc": {
"productname": "Sample Body Part",
"sku": "BD495",
"catalogs": ["Camaro"],
"id": 968944
}
}
}
More on Indexing
• Adding and updating data uses the same request format, if you
include a key that is already in the index, Solr will update it.
• However keep in mind that if you are making schema or major data
changes, that updating won’t REMOVE old keys. To do so you need to
either locate those and send delete requests, or you need to purge
your data and then do a clean re-indexing (advantage of SolrCloud
over using same server to index and search).
• You can delete either by key, or by query. For example, purge all data
with deleteByQuery('*:*')
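Over REST, deletes go through the same update handler; a minimal sketch of the delete-by-query form:

```text
POST /update?wt=json
Content-type: application/json

{ "delete": { "query": "*:*" } }
```

To delete a single document instead, the body would be { "delete": { "id": "8753" } } (the id value here is illustrative).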
More on Indexing (cont.)
• You can also index bulk data using a data import handler, which can
be done in a number of formats.
• By default Solr doesn’t have any security on the admin or managed
schema, so you will want to lock it down for production servers.
• The Solr config allows you to specify auto-commit times, replication to
slave servers, etc.
• Soft auto-commits can be used so updates can be made live almost
immediately without the overhead of doing a hard commit (near real-
time search).
We Have Data, Now Let’s Search!
• Solr comes with some built-in request handlers you can use or
customize, or you can add your own.
• The request handler configuration determines what settings, defaults
and components (like spellcheck) are available for requests to that
handler.
• The simplest search is just to send the query parameter “q” to the
search request: /select?q=front+bumper
• Search results can be returned in a variety of formats, including json,
xml, csv and language-specific formats like php and ruby.
Query Parsers
• The parser used determines what parameters you can use.
• For text searching you generally will use the Dismax or Extended
Dismax parsers, which allow for improving the relevance of your
search results.
• Dismax includes term boosting, phrase boosting and minimum-should-
match parameters among others.
• Extended Dismax extends this with even more boosting options
including field boosting, more phrase boosting options, proximity
boosting, and ignoring stopwords at query time.
Filters
• All types of Solr query parsers support filters.
• This is the most basic way of restricting what documents to search.
• In our sample site application, we add filters based on things like the
catalog and year the user selects, if they are looking for new or outlet
products, if they have drilled down into a category, etc.
• You can have any number of filters and they can include complex
boolean expressions.
Filter Examples
• fq=catalogid:1
• fq=year:1967
• fq=newproduct:true
• fq=catalogid:[1 TO 15]
• fq=(discontinueflg:N OR availablecount:[1 TO *])
Search Relevance
• This is a topic we could spend an entire day on.
• Many enterprise sites have regular search audits and do extensive
analysis to look at their relevancy scores and how to improve them.
• We’ll take a quick look at our example site and some things that Solr
allows us to do in order to improve our search relevancy.
• We are using Extended Dismax for the maximum possible options for
controlling search relevancy.
Search Relevance (cont)
• By default Solr is scoring documents by how many times the search
terms are found.
• What we want to do is “boost” fields and documents, etc. that we want
Solr to place more emphasis on.
• We want to also look at how to handle searches that include multiple
terms to search on (phrases).
Search Relevance – Example App Fields
• Product Number (SKU) – we want to put matches on the SKU right at
the top in searches.
• Product Name – next most important is matches on the product name.
All the most relevant search terms are included in the product name.
• Keywords – custom keywords have additional search terms and
abbreviations we want to match for the product so are fairly relevant.
• Product Info – this is full description of the product that can be used for
searches but due to the extensive amount of text and non-related
words it can have, it’s of fairly low importance
Search Relevance – Synonyms
• Solr has support for synonyms which allow you to map words to
similar ones that you want it to also consider a match.
• Synonyms can be one-way or bi-directional. For instance if there is a
common misspelling people use in a search, you would map that in
one direction only, to the correct spelling.
• Solr does not properly handle multi-term synonyms (see the ‘sea
biscuit’ problem). This is a long-standing bug and there are some
plugins to try and correct for it but they often result in issues with more
complex relevancy setups.
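In the synonyms file (syns.txt in the earlier sample analyzer), the two styles look like this (the terms are illustrative):

```text
# bi-directional: each of these terms matches the others
hood, bonnet
# one-way: map a common misspelling to the correct spelling
camero => camaro
```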
Search Relevance – Sample App Synonyms
• Since we want matches on the original search term to always appear
higher than matches for synonyms, we need to copy the fields used so
we can boost them separately. These fields only need to be indexed,
not stored.
• prodnamesynonym – Product Name Synonym Field. This will get a
boost high enough to help matches appear above most of the other
fields, but not as high as the original product name field.
• proddatasynonym – Additional Product Data Synonym Field. We’ll
copy all the other text fields to this one and give it the lowest boost
score.
Search Relevance – Boosting
• The default value that Solr gives for boosts is 1.0
• Solr does not support negative boosts but anything below 1.0 is
basically a negative boost based on the default.
• Keep in mind as well that Solr is going to score documents on how
often the search terms appear as well. You can use a filter in your
schema to remove duplicate tokens if you don’t want it to do this.
• The boosts for your query are set on the “qf” parameter which tells
Solr which fields you want to query.
Sample App Boosting
prodnumbertext^20.0
prodname^10.0
prodnamesynonym^5.0
keywords^2.0
productinfo^1.0
proddatasynonym^0.25
http://localhost:8983/solr/classic/select?q=front+bumper&defType=edismax&qf=proddatasynonym^0.25+productinfo^1.0+keywords^2.0+prodnamesynonym^5.0+prodname^10.0+prodnumbertext^20.0
Search Relevance – Phrase Boosting
• Phrase boosting is used for multi-term searches.
• By default, Solr will score documents the same no matter where the
search terms appear in the documents.
• Phrase boosting allows you to score higher the documents where the
search terms are appearing next to, or close to, each other.
• The original phrase boost from the Dismax query parser boosts only
for all search terms being close together. Edismax adds options for 2
and 3-word phrases in your search terms.
Search Relevance – Phrase Boosting (cont)
• The phrase slop setting is used to set how far away terms can be in
order to be considered a match for the phrase boost.
• If you set 2 and 3 word phrase boosting, you can use different slop
settings for them.
• Phrase boosting doesn’t have any effect on what documents are
returned by the search, ONLY how they get scored.
Sample App Phrase Boosting
prodname^50.0
prodnamesynonym^25.0
keywords^10.0
productinfo^5.0
proddatasynonym^0.25
http://localhost:8983/solr/classic/select?q=front+bumper&defType=edismax&pf=proddatasynonym^0.25+productinfo^5+keywords^10+prodnamesynonym^25+prodname^50&pf2=proddatasynonym^0.25+productinfo^5+keywords^10+prodnamesynonym^25+prodname^50&pf3=proddatasynonym^0.25+productinfo^5+keywords^10+prodnamesynonym^25+prodname^50&ps=1
Search Relevance – More Boosting
• There are other boosting options, such as boosting a specific term in
the search, telling Solr to boost the documents that match that
particular term over other terms in the search.
• You can also create complex functions for boosting documents.
• Another common boost is to apply one during indexing to specific
documents. For instance in our classic car site, we apply a boost to
the products that are our best sellers, so that all things being equal,
those products will appear higher in searches.
Minimum-Should-Match
• Solr supports both AND and OR searches which you can set with the q.op
parameter.
• When doing an OR search with multiple search terms, you can also set the minimum
number of terms that have to be matched to return a document.
• For instance if mm=75% and the user enters 4 search terms, then only 3 have to
match to return the document.
• This helps ensure that as users enter more search terms that have a higher chance
of not matching, you can make sure the results still require matching on more than
just a single term (typical OR search) without causing a high percentage of failed
searches.
• There are various options for how to set the minimum match, and you can customize
it based on the number of search terms that were entered.
Example Minimum-Should-Match
• Match 75% of the terms
mm=75%
• 25% of the terms can be missing (same result as previous)
mm=-25%
• For more than 1 term, allow 1 to be missing. For more than 4 terms,
allow 2 missing. For more than 6 terms, allow 33% missing.
mm=1<-1 4<-2 6<-33%
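The conditional form can be confusing, so here is a rough Python sketch of how Solr interprets these specs. This is a simplified subset of the real mm grammar, for intuition only:

```python
def min_should_match(mm, num_terms):
    """Rough interpretation of Solr's mm (minimum-should-match) parameter.

    Handles the simple and conditional forms shown above; the real Solr
    grammar has more options, so treat this as a sketch, not a reference.
    """
    def apply(spec, n):
        if spec.endswith("%"):
            pct = int(spec[:-1])
            if pct < 0:
                # negative percent: that fraction of terms may be missing
                return n - (n * -pct) // 100
            # positive percent: that fraction of terms must match
            return (n * pct) // 100
        val = int(spec)
        # negative integer: that many terms may be missing
        return n + val if val < 0 else val

    clauses = []
    for part in mm.split():
        if "<" in part:
            threshold, spec = part.split("<", 1)
            clauses.append((int(threshold), spec))
        else:
            # simple form: a single value applies to any number of terms
            return max(0, min(num_terms, apply(part, num_terms)))
    # conditional form: use the clause with the highest threshold below num_terms
    applicable = [c for c in clauses if c[0] < num_terms]
    if not applicable:
        return num_terms  # at or below the lowest threshold, all terms must match
    _, spec = max(applicable)
    return max(0, min(num_terms, apply(spec, num_terms)))

# mm=75%: with 4 terms, 3 must match
print(min_should_match("75%", 4))                # 3
# mm=1<-1 4<-2 6<-33%: with 5 terms, the 4<-2 clause applies, so 3 must match
print(min_should_match("1<-1 4<-2 6<-33%", 5))   # 3
```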
Result Grouping
• Allows you to group results with a similar value in a specific field together.
• This is something you may commonly have to do for something like an
ecommerce site where a single product appears in multiple categories but
you only want to show (or count) one copy of each product.
• Solr gives you a very wide range of options over how to handle the groups.
• You can only group on a single field, which should be indexed and single-valued.
• The best way to show off what grouping can do is look at our example site.
Result Grouping - Example
• Our classic car parts site has products that appear in multiple car
model catalogs as well as in multiple categories (3-tiers of categories).
• We’ve indexed them based on this combination of unique product id,
catalog id and 3 category ids.
• So when we search, we may get multiples of a specific product and
need to group on the product id to get accurate product counts.
• We also want to simplify the groups as much as possible so that we
don’t have to do a lot of extra work to get to the data.
Group Parameters for Search
<cfscript>
var params = {};
params["group"] = true;
params["group.field"] = 'product_id';
// tells Solr to return the total number of groups found
//this will be our total number of products found
params["group.ngroups"] = true;
//this gives us the grouped documents in a flat list
params["group.format"] = 'simple';
//since the groups are just copies of the same product,
//we only need one document in each group
params["group.limit"] = 1;
</cfscript>
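With these parameters, the grouped section of the response comes back roughly like this (counts are illustrative):

```text
"grouped": {
  "product_id": {
    "matches": 1243,
    "ngroups": 387,
    "doclist": { "numFound": 1243, "start": 0, "docs": [ ... ] }
  }
}
```

Here matches is the total matching documents, ngroups is the number of distinct products, and doclist holds one document per group (because group.limit=1 and group.format=simple).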
Result Grouping – Cont.
• This is just one way to use groups. You can create groups based on
functions and do much, much more with them.
• You can sort both the list of groups (based on the most relevant
document in each) as well as inside the groups themselves.
• Likewise paging can be done both inside and external to the grouping.
Collapse and Expand Results
• These are an alternative to result grouping that are useful particularly
for displaying collapsed search results.
• You provide the field to “collapse” on and Solr will provide results with
a single document per group for that field.
• The expand is then used to tell Solr to return the same query but this
time with an “expanded” section that includes all the documents in the
groups.
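Using our example site’s product_id field, a sketch of the request (query terms illustrative):

```text
/select?q=front+bumper&fq={!collapse field=product_id}&expand=true&expand.rows=3
```

The collapse filter returns one document per product_id; the expanded section then carries up to three of the other documents for each collapsed group.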
Faceting
• Faceting is widely used in commerce sites to show the customer result
counts for numerous search criteria.
• This is where a search engine like Solr really shines over using
database-only solutions that require additional queries and often
extensive processing to come up with the same data.
• You can request any number of facets as part of your query, and get
back result counts for each of them.
• Solr provides a number of different ways to do faceting; we’ll look at
some of the most common.
Faceting – Field/Value
• This facets on a single field based on the value.
• For text fields, you typically don’t want any stemming, etc., but just
want to index the value in the field as-is. In cases where you want to search on
the field and do more analysis as well, a copy field can be used for the
faceting.
• You can restrict matches based on criteria like a specific prefix or ones that
contain a specific string.
• You can set a specific number of matches needed to include the facet, and
even whether to include a count of documents that are missing a value for
the field.
Faceting – Field/Value Example Site
• Back to our classic car site. When viewing our product results, we also want
to get a list of the current categories the product appears in and use that for
a side menu.
• For this example we’ll just look at how we do the top-level category menu.
• As with groups, you generally want to facet on a field that has very little
tokenizing, etc. on it so that Solr returns the original, unmodified values in
the field.
• You can sort the facets by the count (highest first) or alphabetically (index
sort).
Faceting – Field/Value Example
?q=front+hood&json.facet={
categories: {type:terms, field:categoryname}
}
Sample Return:
categories= [ { val = ‘Body Components’, count=50 },
{ val = ‘Hood Components’, count=10 } ]
Faceting – Range
• Another common facet method is range facets. You can use ranges on
any date or numeric field.
• A common use case is to return counts in various price ranges
• In addition to start and end parameters, you can further configure how
Solr will facet the ranges by setting a gap to divide by and how to
handle edge cases.
• You can also configure it to include counts for values that fall outside
the range.
Faceting – Range Example
json.facet={ price_ranges: { type:range,
field:price,
start:0,
end:1000,
gap:250 } }
Result:
price_ranges = [ { val = 0, count=50 },
{ val = 250, count=125 },
{ val = 500, count=72 },
{ val = 750, count=52 } ]
Faceting – Use With Filters
• One issue you may run into with facets is when you use them to provide
filters for the customer.
• When you apply the filter, the other options drop out of your search, as well
as your facets.
• For example, if I drill down into a category, the result set only has that
category in it (and so the facet for categories has only that one category as
well). But what if I want to still show a menu of ALL available categories for
that search so the user can change categories?
• Your first thought might be that we’ll have to do another search without the
filter… but WAIT! We actually can do this in the same search request.
Faceting – Domain
• Solr 5.2+ allows you to include the domain for the facet. This allows you to
expand your faceting outside the “domain” of the main search.
• To do this, we’ll first add a name to the filter that selects the category.
fq={!tag=cat}categoryid:100
• Now when we set the facet for the category, we can tell it to ignore the filter
for the category:
json.facet={
categories: { type:terms,
field:categoryname,
domain: { excludeTags:'cat' } }
}
Faceting – Pivot Facets
• Also known as decision trees, these are multi-level facets.
• Return counts for field ‘foo’ for each different field ‘bar’ and so forth.
• Pivot facets can be requested fairly simply in a query, just by passing
the list of fields to pivot on:
facet.pivot=category,subcategory,subsubcategory
Faceting – Subfacets
• Subfacets are a newer version of pivot facets (Solr 5+).
• With subfacets, you can nest any kind of facet under any other kind of
facet, with completely separate settings and criteria of its own.
• Designed specifically with JSON in mind.
• Response format easier to handle.
• Lots of other improvements to allow for more advanced sorting,
calculations, etc. on all levels of the facets.
Faceting – Example Subfacets
json.facet={
categories: { type:terms,
field:categoryname,
domain: { excludeTags:'cat,subcat' },
facet: { subcategories: { type:terms,
field:subcategoryname,
domain: { excludeTags:'cat,subcat' }
}
}
}
}
Faceting – More Info
• http://yonik.com/json-facet-api/
• http://yonik.com/multi-select-faceting/
• http://yonik.com/solr-subfacets/
• https://lucidworks.com/blog/2014/10/03/pivot-facets-
inside-and-out/
Spellchecking
• You can specify in your search that you want to receive spelling
suggestions in the results.
• Generally the field you want to use for spellchecking will have minimal
tokenizing and stemming on it. You often will want to use a copy field
that can be used specifically for the spellchecking.
• There’s a lot of parameters to set for spellchecking, which I won’t go
over here, but you’ll probably need to play around with them to see
what will work best for your application. There is some performance hit
for returning spell suggestions so you may want to turn them off when
not needed.
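A request asking for suggestions might look like this (parameter values illustrative):

```text
/select?q=frnt+bumper&spellcheck=true&spellcheck.count=5&spellcheck.collate=true
```

spellcheck.collate asks Solr to also return the full query rewritten with the top suggestions, which is handy for “Did you mean…?” links.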
Spellchecking (cont)
• You need to make sure the spellcheck component is enabled for the
request handler you are using. If you don’t intend to change the
spellcheck parameters at all during searches, you may want to just set
them as defaults in the request handler.
• Be aware that the spellchecking dictionary is not built automatically.
You can send the parameter spellcheck.build=true on the url to rebuild
it, or in the solr config you can set it to be rebuilt automatically on
commit and/or optimize. Building on commit is generally not a good
idea in production systems.
Highlighting
• You can also have Solr highlight the terms it matched in the search results.
• Again, there’s a lot of ways to customize the highlighter component, both
what it highlights as well as what it wraps the matches in.
• Typically you will want to use termVectors, termPositions, and termOffsets in
your schema definition for the field(s) you will highlight terms in, which allows
you to use the FastVectorHighlighter component or with the standard
highlighter will improve performance.
• With the FastVectorHighlighter you can customize it to highlight matches
with different colors or html classes. It also supports Unicode.
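A sketch of the pieces involved (the field name is from our example app; exact parameters can vary by Solr version):

```text
<field name="productinfo" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

/select?q=front+bumper&hl=true&hl.fl=productinfo&hl.useFastVectorHighlighter=true
```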
Highlighting (cont)
• The highlighting does not use the same tokenizer/stemming components as
the search.
• Part of what you are configuring for the highlighting is how many highlighted
“snippets” to return and how Solr is to find them and pull them out of your
text fields.
• Solr returns the snippets separately so you’ll generally have to do a search-
and-replace on the original field to put the highlighted terms into it.
• With the FastVectorHighlighter you will want to be sure to include a
Boundary Scanner which ensures that it doesn’t truncate words.
Suggester
• Used to provide automatic suggestions for query terms (auto-suggest search
box).
• While technically you could use the spellchecker for this, the suggester is
specifically developed for this use.
• As with the spellchecker, you can configure when the suggester’s dictionary
is built, and you typically will want to copy fields to a field type specifically set
up for this purpose that has minimal analysis on it.
• There are multiple kinds of dictionaries available to use for the suggester,
and you can get suggestions from more than one in a single request.
MoreLikeThis
• Enables users to search for other documents similar to one in their
current results list.
• You can customize how this component works in many ways, from
which fields to use, number of documents to return, term frequency
requirements, minimum and maximum word lengths to use, boosting
and more.
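A sketch of a MoreLikeThis request (field names from our example app, parameter values illustrative):

```text
/select?q=id:8753&mlt=true&mlt.fl=prodname,keywords&mlt.count=5&mlt.mintf=1&mlt.mindf=2
```

mlt.mintf (minimum term frequency) and mlt.mindf (minimum document frequency) control which terms from the source document are considered “interesting” enough to match on.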
But Wait, There’s More!
This covers only a portion of the features of Solr.
Some things we didn’t look at include:
• Pagination and Cursors
• Query Re-Ranking
• Transforming Results
• Result Clustering (tag cloud)
• Spatial/Geospatial Searches
• Term and Term Vectors Components
• Stats Component
• Caching
• Query Elevation
• RealTime Get
• Exporting Result Sets
• Distributed Search and Index Sharding
• Content Streams
Need More Help?
CFWebtools, LLC
11204 Davenport, Ste. 100
Omaha, NE 68154
402 408 3733 ext. 2
https://www.cfwebtools.com
Solr/Elasticsearch for CF Developers (and others)

  • 2. Who Am I? • Senior Web Developer at CFWebtools, LLC • ColdFusion Developer since CF3/Allaire • Cosplayer • Dog Trainer • Sewer/Baker/Knitter/Origamist/Ass. Crafts • Cancer Survivor • Fibromyalgia/Invisibile Disabilities Advocate
  • 3. What is Solr? • Standalone Full-Text Search Engine with Apache Lucene Backend • Open-source, distributed, highly scalable, enterprise grade search • http://lucene.apache.org/solr/ • Included in ColdFusion since CF9 replacing Verity (cfsearch/cfindex) • Thru CF11 – Solr 3 • CF2016 – Solr 5 (5.1.2) • Current Release of Solr is version 6.2.0
  • 4. Why use Solr/ES instead of CF tags? • Any CF version prior to 2016 has ancient Solr 3.x versions • Full Access to latest Solr/ES versions and patches • Ability to use cloud-based distributed setups (essential for enterprise sites) • Access to far more features and use of REST/JSON • Code more easily converted to other search engines and languages
  • 5. Solr vs. ElasticSearch – What should I use? • Solr has been around a lot longer so more mature, very well documented and has strong backwards compatibility. Developers often mention ES not nearly as well documented so plan on investing in other sources to really get a handle on it. • ES being younger is built on modern standards and ideas (particularly REST), designed specifically for handling large indices and high query rates, and since it isn’t as strictly commuinty driven often can move forward quicker with new features, bug fixes, etc. (although co • Both have very active communities and are very actively still being developed and moved forward. Solr in particular has pretty much caught up to many of the advances brought by ElasticSearch entering the marketplace such as full REST support.
  • 6. Solr vs. ElasticSearch – What should I use? (cont) • Solr excels at text-search applications, ElasticSearch for analytics (lots of monitoring and metrics exposed). • In areas like log analysis, ES is by far the more common choice to use. This is due to its very advanced “aggregations” framework, which replaced earlier faceting. • https://www.elastic.co/blog/out-of-this-world-aggregations • Solr uses a terse syntax, vs. ES which is much more verbose • This makes ES generally more readible, but the terse syntax of Solr make more advanced relevancy possibilites easier to handle, of particular interest in text-search applications.
  • 7. Solr vs. ElasticSearch – What should I use? (cont) • SearchComponents in Solr allow for much more easily customizable searches that can be easily reused across multiple applications or within an application. • ES generally considered a bit easier to get started with and do clustering etc. Solr generally requires a bit more work to get your head around, forces you to read over and learn the config files to get running for instance - but this is not necessarily a BAD thing. • If you are going to use REST, both now have excellent support although ES more REST-compliant. But if you plan to go another route, Solr tends to have better support, for instance it has excellent Java support via the Solrj library.
  • 8. Amazon CloudSearch • There are some other Lucene-based searches you can consider. • Most popular of these is Amazon CloudSeach • Easy to set up, AWS managed service with automatic scaling • Provides most commonly used text-search features like highlighting, autocomplete, simple faceting, grouping, geospatial search, etc. • It is considerably more limited that Solr and ES when it comes to doing advanced search relevancy tuning and/or advanced metrics.
  • 9. Solr vs. ElasticSearch – More Reading • http://solr-vs-elasticsearch.com/ • https://sematext.com/blog/2015/01/30/solr-elasticsearch-comparison/ • https://www.datanami.com/2015/01/22/solr-elasticsearch-question/ • http://opensourceconnections.com/blog/2015/12/15/solr-vs-elasticsearch- relevance-part-one/ • http://opensourceconnections.com/blog/2016/01/22/solr-vs-elasticsearch- relevance-part-two/ • http://harish11g.blogspot.com/2015/07/amazon-cloudsearch-vs- elasticsearch-vs-Apache-Solr-comparison-report.html
  • 10. So let’s look at the features of Solr (particularly Solr 4+ versions) • Full REST API for schema management, indexing, searching, etc. (Solr 5+) • Wide variety of built in tokenizers and analyzers • Grouping, faceting, highlighting, spelling suggestions, autocomplete • Filtering, document and field boosting, custom ranking, etc. • Near Real-Time Indexing • Extensible Plugin Architecture
  • 11. Solr 6 Features • Parallel SQL – The big WOW feature of version 6 is bringing SQL support to Solr which works across SolrCloud collections. This is done by a SQL parser that converts SQL queries to Solr streaming expressions. • SQL Request Handler – SolrCloud collections can be queried with standard SQL language using the /sql request handler. • JDBC Driver – Connect to the SolrCloud collections with any tool that supports JDBC and query the collection directly • Still somewhat experimental and not quite ready for primetime usage but improving rapidly.
  • 12. Solr 6 Features (cont) • Many other improvements and advancements with streaming expressions (Merging search with parallel computing, across multiple sources) • Push/Pull streaming, request/response streaming • Solr collections able to auto-update itself via these kinds of streaming commands. • https://sematext.com/blog/tag/streaming-expressions/
  • 13. Let’s Focus on Text Searches! • This is primarily what CF developers would have been using Solr integration for – cfsearch/cfindex • While a number of things we’re going to look at are included in the CF integration, you can do a lot more once you move to standalone Solr.
  • 14. Our Target Site and Objectives • Classic industries (classicindustries.com) • Ecommerce Site for Classic Car Parts • Customers can select a car model (catalog) and year to filter their search • Single text box search that needs to search across multiple fields but return the best possible matches • Search pages need to also include data like breadcrumb trail, category menus, nested structure of category totals, etc. • We would like to add additional elements like spelling suggestions and highlighting.
  • 15. Step 1 - Schema • The schema defines the fields and their types that will be indexed for searching. • Solr/ES can both be used schema or schema-less, support dynamic field types, etc. Typically you would only use schema-less for development and then switch to the managed schema for production. • Solr 5 and up can handle most schema changes via the REST service. • You can also make schema changes via the Solr admin console
  • 16. Sample REST – Add Field Type POST /schema Content-type: application/json { "add-field-type": { "name": "simpleTextSpell", "class": "solr.TextField", "positionIncrementGap":100, "indexAnalyzer": { "tokenizer": { "class": "solr.StandardTokenizerFactory" }, "filters": [ { "class": "solr.LowerCaseFilterFactory" }, { "class": "solr.RemoveDuplicatesTokenFilterFactory" } ] } } }
  • 17. Sample REST – Add Field POST /schema Content-type: application/json { "add-field": { "name": "simpleSpell", ”type": "simpleTextSpell", ”indexed": true, ”stored”: true } }
  • 18. Schema - Analyzers, Tokenizers and Filters • These are used to tell Solr how to prepare the text string for indexing (and/or quering). • Proper handling of this step is essential for good search results. • While Solr has a lot of built-in field types for text fields, you may oftenneed to add your own field types to get the best results. • With simple text fields, you often will use the same analyzer for the indexing and the query steps. The more complex handling your field needs, the more likely you may need different analyzers for the indexing vs. the querying.
  • 19. Schema - Sample Analyzers <fieldType name="nametext" class="solr.TextField"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/> </analyzer> <analyzer type="query”> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
  • 20. Schema - Tokenizers • Tokenizers determine how the text string will be split up into “tokens”. A common first step is to split the string on whitespace and/or punctuation (sentences, etc. split into the individual words so we can search on them instead of the entire */string). • Solr includes a whole variety of tokenizers includes ones designed for specific kinds of data, like file paths or email addresses, as well as to handle multi-lingual text. • You can also process your token using regular expressions…. Or return the entire field as a single token.
  • 21. Schema - Filters • Filters are used after tokenizing to further manipulate your data. • Some common filters actions are to convert everything to lowercase so searches aren’t case-sensitive and discarding common words that aren’t useful in searches (a, and, the, etc.) • A dictionary filter might be used on a field that you intend to use for spelling corrections. • You can also use a synonym filter to create word mappings to match on.
  • 22. Schema - CharFilters • Unlike regular filters, charFilters are used PRIOR to tokenizing your data. • You might use these to do things like strip out HTML tags, comments, or any other text you don’t want your search to find. • Solr includes both a charFilter specifically for removing HTML markup as well as a Regex style replace filter.
  • 23. Schema - Copy Fields • Your schema can include fields which just copy data from other fields to index. Typically they will have a different set of analyzers to manipulate the data. • For example, I may want my search to return matches on the original words higher than the ones that match on a synonym. To do this, I would use a field copy for the synonym matches which will give lower ranking. • Another common use is for spell checking in which you may want to copy text fields that use different field types to the one you use for spellchecking.
  • 24. Schema in Place, Let’s Index Our Data! • In the past most CF code would use the SolrJ library for standalone Solr work. • This is still an option but now we have REST as an alternative. The REST libraries are generally much easier to work with since you don’t have to figure out all the nested methods that are holding your data, it’s just all returned in a simple JSON object. • There is very little now in Solr that you cannot do through REST, including adding or modifying cores, making all schema changes, and of course indexing and searching.
  • 25. Schema in Place, Let’s Index Our Data! (cont) • Solr’s REST integration has been continually improved but you may still run into some gotchas. For instance, the handling of multiple documents in an indexing request when you need to include additional parameters like a custom boost makes it impossible in most languages to simply convert a native object to the necessary JSON object (multiple name-value pairs in an object with the same name). • ColdFusion has its own quirks (bugs) that you have to watch out for. The most common one you’ll run into is it trying to treat a string that is all numbers as a numeric value and not wrapping it in quotes. A typical hack is to add some string in front of any such field prior to CF serializing and then doing a search and replace to remove it prior to the REST request.
  • 26. Sample REST – Indexing Data POST //update?wt=json Content-type: application/json { "add": { { "doc" : { ”productname": ”Sample AC Part", “sku”: “AC234”, ”catalogs" : [”Camaro”, “Impala”,”Firebird”], "id" : 8753 } } , “boost”: 2.0 } "add": { { "doc" : { ”productname": ”Sample Body Part", “sku”: “BD495”, ”catalogs" : [”Camaro”], "id" : 968944} } } }
  • 27. More on Indexing • Adding and updating data uses the same request format, if you include a key that is already in the index, Solr will update it. • However keep in mind that if you are making schema or major data changes, that updating won’t REMOVE old keys. To do so you need to either locate those and send delete requests, or you need to purge your data and then do a clean re-indexing (advantage of SolrCloud over using same server to index and search). • You can delete either by key, or by query. For example, purge all data with deleteByQuery('*.*')
  • 28. More on Indexing (cont.) • You can also index bulk data using a data import handler, which can be done in a number of formats. • By default Solr doesn’t have any security on the admin or managed schema, so you will want to lock it down for production servers. • The solr config allows you to specific auto-commit times, replication to slave servers, etc. • Soft auto-commits can be used so updates can be made live almost immediately without the overhead of doing a hard commit (near real- time search).
  • 29. We Have Data, Now Let’s Search! • Solr comes with some built-in request handlers you can use or customize, or you can add your own. • The request handler configuration determines what settings, defaults and components (like spellcheck) are available for requests to that handler. • The simpliest search is just to send the query parameter “q” to the search request: /select?q=front+bumper • Search results can be returned in a variety of formats, including json, xml, csv and language-specific formats like php and ruby.
  • 30. Query Parsers • The parser used determines what parameters you can use. • For text searching you generally will use the Dismax or Extended Dismax parsers, which allow for improving the relevance of your search results. • Dismax includes term boosting, phrase boosting and minimum-should- match parameters among others. • Extended Dismax extends this with even more boosting options including field boosting, more phrase boosting options, proximity boosting, and ignoring stopwords at query time.
  • 31. Filters • All types of Solr query parsers support filters. • This is the most basic way of restricting what documents to search. • In our sample site application, we add filters based on things like the catalog and year the user selects, if they are looking for new or outlet products, if they have drilled down into a category, etc. • You can have any number of filters and they can include complex boolean expressions.
  • 32. Filter Examples • fq=catalogid:1 • fq=year:1967 • fq=newproduct:true • fq=catalogid:(1 TO 15) • fq=(discontinueflg:N OR availablecount:[1 TO *])
  • 33. Search Relevance • This is a topic we could spend an entire day on. • Many enterprise sites have regular search audits and do extensive analysis to look at their relevancy scores and how to improve them. • We’ll take a quick look at our example site and some things that Solr allows us to do in order to improve our search relevancy. • We are using Extended Dismax for the maximum possible options for controlling search relevancy.
  • 34. Search Relevance (cont) • By default Solr is scoring documents by how many times the search terms are found. • What we want to do is “boost” fields and documents, etc. that we want Solr to place more emphasis on. • We want to also look at how to handle searches that include multiple terms to search on (phrases).
  • 35. Search Relevance – Example App Fields • Product Number (SKU) – we want to put matches on the SKU right at the top in searches. • Product Name – next most important is matches on the product name. All the most relevant search terms are included in the product name. • Keywords – custom keywords have additional search terms and abbreviations we want to match for the product so are fairly relevant. • Product Info – this is full description of the product that can be used for searches but due to the extensive amount of text and non-related words it can have, it’s of fairly low importance
  • 36. Search Relevance – Synonyms • Solr has support for synonyms which allow you to map words to similar ones that you want it to also consider a match. • Synonyms can be one-way or bi-directional. For instance if there is a common misspelling people use in a search, you would map that in one directon only, to the correct spelling. • Solr does not properly handle multi-term synonyms (see the ‘sea biscuit’ problem). This is a long-standing bug and there are some plugins to try and correct for it but they often result in issues with more complex relevancy setups.
  • 37. Search Relevance – Sample App Synonyms • Since we want matches on the original search term to always appear higher than matches for synonyms, we need to copy the fields used so we can boost them separately. These fields only need to be indexed, not stored. • prodnamesynonym – Product Name Synonym Field. This will get a boost high enough to help matches appear above most of the other fields, but not as high as the original product name field. • proddatasynonym – Additional Product Data Synonym Field. We’ll copy all the other text fields to this one and give it the lowest boost score.
• 38. Search Relevance – Boosting • The default boost value Solr applies is 1.0 • Solr does not support negative boosts, but anything below 1.0 effectively acts as a negative boost relative to that default. • Keep in mind that Solr also scores documents on how often the search terms appear. You can use a filter in your schema to remove duplicate tokens if you don’t want it to do this. • The boosts for your query are set on the “qf” parameter, which tells Solr which fields you want to query.
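As a sketch of how the qf boosts from the example app might be sent to Solr (the field names, boost values, and "classic" core name are the ones used in this deck's examples; adjust to your own schema):

```python
from urllib.parse import urlencode

# Field boosts from the example app: product name weighted highest,
# synonym copy fields weighted lower so original-term matches rank first.
params = {
    "q": "front bumper",
    "defType": "edismax",
    # qf lists the fields to query, each with its boost factor
    "qf": "prodname^50.0 prodnamesynonym^25.0 keywords^10.0 "
          "productinfo^5.0 proddatasynonym^0.25",
}
url = "http://localhost:8983/solr/classic/select?" + urlencode(params)
```

The same dict can then be extended with pf/pf2/pf3 for the phrase boosting shown on the following slides.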
  • 40. Search Relevance – Phrase Boosting • Phrase boosting is used for multi-term searches. • By default, Solr will score documents the same no matter where the search terms appear in the documents. • Phrase boosting allows you to score higher the documents where the search terms are appearing next to, or close to, each other. • The original phrase boost from the Dismax query parser boosts only for all search terms being close together. Edismax adds options for 2 and 3-word phrases in your search terms.
• 41. Search Relevance – Phrase Boosting (cont) • The phrase slop setting controls how far apart terms can be and still be considered a match for the phrase boost. • If you set 2 and 3 word phrase boosting, you can use different slop settings for each. • Phrase boosting doesn’t have any effect on which documents are returned by the search, ONLY how they get scored.
• 42. Sample App Phrase Boosting prodname^50.0 prodnamesynonym^25.0 keywords^10.0 productinfo^5.0 proddatasynonym^0.25 http://localhost:8983/solr/classic/select?q=front+bumper&defType=edismax&pf=proddatasynonym^0.25+productinfo^5+keywords^10+prodnamesynonym^25+prodname^50&pf2=proddatasynonym^0.25+productinfo^5+keywords^10+prodnamesynonym^25+prodname^50&pf3=proddatasynonym^0.25+productinfo^5+keywords^10+prodnamesynonym^25+prodname^50&ps=1
  • 43. Search Relevance – More Boosting • There are other boosting options, such as boosting a specific term in the search, telling Solr to boost the documents that match that particular term over other terms in the search. • You can also create complex functions for boosting documents. • Another common boost is to apply one during indexing to specific documents. For instance in our classic car site, we apply a boost to the products that are our best sellers, so that all things being equal, those products will appear higher in searches.
• 45. Minimum-Should-Match • Solr supports both AND and OR searches, which you can set with the q.op parameter. • When doing an OR search with multiple search terms, you can also set the minimum number of terms that have to match to return a document. • For instance, if mm=75% and the user enters 4 search terms, then only 3 have to match to return the document. • As users enter more search terms (each with a higher chance of not matching), this lets you still require more than a single term to match (as a plain OR search would) without causing a high percentage of failed searches. • There are various options for how to set the minimum match, and you can customize it based on the number of search terms that were entered.
• 46. Example Minimum-Should-Match • Match 75% of the terms mm=75% • 25% of the terms can be missing (same result as previous) mm=-25% • For more than 1 term, allow 1 to be missing. For more than 4 terms, allow 2 missing. For more than 6 terms, allow 33% missing. mm=1<-1 4<-2 6<-33%
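The conditional mm syntax can be evaluated roughly like this. This is a sketch of the semantics to make the examples above concrete, not Solr's actual parser; it assumes the clauses are listed in ascending order of term count, as Solr requires:

```python
def required_matches(num_terms, mm):
    """Approximate how many OR terms Solr's mm parameter requires to match."""
    def resolve(value):
        if value.endswith("%"):
            pct = int(value[:-1])
            if pct < 0:
                # negative percent: that share of the terms may be missing
                return num_terms - (-pct * num_terms) // 100
            return (pct * num_terms) // 100
        n = int(value)
        # negative integer: that many terms may be missing
        return num_terms + n if n < 0 else n

    if "<" not in mm:
        required = resolve(mm)
    else:
        required = num_terms  # below the lowest threshold, all terms required
        for clause in mm.split():  # clauses assumed in ascending order
            threshold, value = clause.split("<")
            if num_terms > int(threshold):
                required = resolve(value)
    return max(1, min(num_terms, required))
```

For the slide's spec `1<-1 4<-2 6<-33%`, a 5-term search requires 3 matches, and a 9-term search allows 33% (2 terms) to be missing.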
• 47. Result Grouping • Allows you to group results that share a value in a specific field. • This is something you may commonly have to do for something like an ecommerce site, where a single product appears in multiple categories but you only want to show (or count) one copy of each product. • Solr gives you a very wide range of options over how to handle the groups. • You can only group on a single field, which must be indexed; a single-valued, untokenized field works best. • The best way to show off what grouping can do is to look at our example site.
  • 48. Result Grouping - Example • Our classic car parts site has products that appear in multiple car model catalogs as well as in multiple categories (3-tiers of categories). • We’ve indexed them based on this combination of unique product id, catalog id and 3 category ids. • So when we search, we may get multiples of a specific product and need to group on the product id to get accurate product counts. • We also want to simplify the groups as much as possible so that we don’t have to do a lot of extra work to get to the data.
• 49. Group Parameters for Search
<cfscript>
    var params = {};
    params["group"] = true;
    params["group.field"] = 'product_id';
    // tells Solr to return the total number of groups found
    // this will be our total number of products found
    params["group.ngroups"] = true;
    // this gives us the grouped documents in a flat list
    params["group.format"] = 'simple';
    // since the groups are just copies of the same product,
    // we only need one document in each group
    params["group.limit"] = 1;
</cfscript>
  • 50. Result Grouping – Cont. • This is just one way to use groups. You can create groups based on functions and do much, much more with them. • You can sort both the list of groups (based on the most relevant document in each) as well as inside the groups themselves. • Likewise paging can be done both inside and external to the grouping.
• 51. Collapse and Expand Results • These are an alternative to result grouping, useful particularly for displaying collapsed search results. • You provide the field to “collapse” on and Solr will return a single document per group for that field. • The expand parameter then tells Solr to run the same query but include an “expanded” section containing all the documents in each group.
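A minimal sketch of the request parameters involved, again grouping on the example app's product_id field (the query terms and row count are illustrative):

```python
from urllib.parse import urlencode

# Collapse to one document per product, then ask Solr to return the
# collapsed group members in a separate "expanded" section of the response.
params = {
    "q": "front bumper",
    "fq": "{!collapse field=product_id}",   # collapse query parser as a filter
    "expand": "true",
    "expand.rows": 5,  # collapsed documents to return per group
}
query_string = urlencode(params)
```

Unlike result grouping, the main result list keeps the normal flat response format, which often makes collapse/expand simpler to consume.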
  • 52. Faceting • Faceting is widely used in commerce sites to show the customer result counts for numerous search criteria. • This is where a search engine like Solr really shines over using database-only solutions that require additional queries and often extensive processing to come up with the same data. • You can request any number of facets as part of your query, and get back result counts for each of them. • Solr provides a number of different ways to do faceting, we’ll look at some of the most common.
  • 53. Faceting – Field/Value • This facets on a single field based on the value. • For text fields, you typically don’t want to do any stemming, etc. but just tokenize the value in the field as-is. In cases where you want to search on the field and do more analysis as well, a copy field can be used for the faceting. • You can restrict matches based on criteria like a specific prefix or ones that contain a specific string. • You can set a specific number of matches needed to include the facet, and even whether to include a count of documents that are missing a value for the field.
  • 54. Faceting – Field/Value Example Site • Back to our classic car site. When viewing our product results, we also want to get a list of the current categories the product appears in and use that for a side menu. • For this example we’ll just look at how we do the top-level category menu. • As with groups, you generally want to facet on a field that has very little tokenizing, etc. on it so that Solr returns the original, unmodified values in the field. • You can sort the facets by the count (highest first) or alphabetically (index sort).
• 55. Faceting – Field/Value Example
?q=front+hood&json.facet={ categories: { type:terms, field:categoryname } }
Sample Return:
categories = [
    { val = 'Body Components', count = 50 },
    { val = 'Hood Components', count = 10 }
]
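On the application side, pulling the facet buckets out of the JSON response and into a side menu is a few lines. The response body below is a trimmed, hypothetical version of what the request above returns; the "facets" object sits alongside the normal "response" section:

```python
import json

# Trimmed json.facet response for the categories facet above
raw = """{
  "facets": {
    "count": 60,
    "categories": {
      "buckets": [
        {"val": "Body Components", "count": 50},
        {"val": "Hood Components", "count": 10}
      ]
    }
  }
}"""

buckets = json.loads(raw)["facets"]["categories"]["buckets"]
# Turn the buckets into (label, count) pairs for the side menu
menu = [(b["val"], b["count"]) for b in buckets]
```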
  • 56. Faceting – Range • Another common facet method is range facets. You can use ranges on any date or numeric field. • A common use case is to return counts in various price ranges • In addition to start and end parameters, you can further configure how Solr will facet the ranges by setting a gap to divide by and how to handle edge cases. • You can also configure it to include counts for values that fall outside the range.
• 57. Faceting – Range Example
json.facet={ price_ranges: { type:range, field:price, start:0, end:1000, gap:250 } }
Result:
price_ranges = [
    { val = 0, count = 50 },
    { val = 250, count = 125 },
    { val = 500, count = 72 },
    { val = 750, count = 52 }
]
  • 58. Faceting – Use With Filters • One issue you may run into with facets is when you use them to provide filters for the customer. • When you apply the filter, the other options drop out of your search, as well as your facets. • For example, if I drill down into a category, the result set only has that category in it (and so the facet for categories has only that one category as well). But what if I want to still show a menu of ALL available categories for that search so the user can change categories? • Your first thought might be that we’ll have to do another search without the filter… but WAIT! We actually can do this in the same search request.
• 59. Faceting – Domain • Solr 5.2+ allows you to include a domain for the facet. This allows you to expand your faceting outside the “domain” of the main search. • To do this, we’ll first add a tag to the filter that selects the category: fq={!tag=cat}categoryid:100 • Now when we set the facet for the category, we can tell it to ignore that filter: json.facet={ categories: { type:terms, field:categoryname, domain: { excludeTags: "cat" } } }
  • 60. Faceting – Pivot Facets • Also known as decision trees, these are multi-level facets. • Return counts for field ‘foo’ for each different field ‘bar’ and so forth. • Pivot facets can be requested fairly simply in a query, just by passing the list of fields to pivot on: facet.pivot=category,subcategory,subsubcategory
• 61. Faceting – Subfacets • Subfacets are a newer version of pivot facets (Solr 5+). • With subfacets, you can nest any kind of facet under any other kind of facet, with completely separate settings and criteria of its own. • Designed specifically with JSON in mind. • Response format easier to handle. • Lots of other improvements to allow for more advanced sorting, calculations, etc. on all levels of the facets.
• 62. Faceting – Example Subfacets
json.facet={
    categories: {
        type: terms,
        field: categoryname,
        domain: { excludeTags: "cat,subcat" },
        facet: {
            subcategories: {
                type: terms,
                field: subcategoryname,
                domain: { excludeTags: "cat,subcat" }
            }
        }
    }
}
  • 63. Faceting – More Info • http://yonik.com/json-facet-api/ • http://yonik.com/multi-select-faceting/ • http://yonik.com/solr-subfacets/ • https://lucidworks.com/blog/2014/10/03/pivot-facets- inside-and-out/
• 64. Spellchecking • You can specify in your search that you want to receive spelling suggestions in the results. • Generally the field you want to use for spellchecking will have minimal tokenizing and stemming on it. You often will want to use a copy field specifically for the spellchecking. • There are a lot of parameters to set for spellchecking, which I won’t go over here, but you’ll probably need to play around with them to see what works best for your application. There is some performance hit for returning spell suggestions, so you may want to turn them off when not needed.
• 65. Spellchecking (cont) • You need to make sure the spellcheck component is enabled for the request handler you are using. If you don’t intend to change the spellcheck parameters at all during searches, you may want to just set them as defaults in the request handler. • Be aware that the spellchecking dictionary is not built automatically. You can send the parameter spellcheck.build=true on the URL to rebuild it, or in the Solr config you can set it to be rebuilt automatically on commit and/or optimize. Building on commit is generally not a good idea in production systems.
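A sketch of a spellcheck-enabled request, assuming the spellcheck component is wired into the request handler as described above (the misspelled query is illustrative):

```python
from urllib.parse import urlencode

params = {
    "q": "front bmper",
    "spellcheck": "true",
    # one-off dictionary rebuild; don't send this on every request
    "spellcheck.build": "true",
    # ask Solr for a corrected, re-runnable version of the whole query
    "spellcheck.collate": "true",
}
query_string = urlencode(params)
```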
• 66. Highlighting • You can also have Solr highlight the terms it matched in the search results. • Again, there are a lot of ways to customize the highlighter component, both in what it highlights and in what it wraps the matches in. • Typically you will want to enable termVectors, termPositions, and termOffsets in your schema definition for the field(s) you will highlight, which lets you use the FastVectorHighlighter component and also improves performance with the standard highlighter. • With the FastVectorHighlighter you can customize it to highlight matches with different colors or HTML classes. It also supports Unicode.
• 67. Highlighting (cont) • The highlighting does not use the same tokenizer/stemming components as the search. • Part of what you are configuring for the highlighting is how many highlighted “snippets” to return and how Solr is to find them and pull them out of your text fields. • Solr returns the snippets separately, so you’ll generally have to do a search-and-replace on the original field to put the highlighted terms into it. • With the FastVectorHighlighter you will want to be sure to include a Boundary Scanner, which ensures that it doesn’t truncate words.
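That search-and-replace step can be as simple as stripping the highlight tags to find the raw text a snippet came from, then swapping the tagged version into the stored field. A rough sketch, assuming Solr's default `<em>` wrapper:

```python
import re

def merge_highlight(field_text, snippet, tag="em"):
    """Swap a highlighted snippet back into the original field text.

    Solr returns snippets separately from the stored field, so strip the
    highlight tags to recover the raw text, then replace it in place.
    """
    raw = re.sub(r"</?%s>" % tag, "", snippet)
    return field_text.replace(raw, snippet)
```

Usage: merging the snippet `<em>front</em> <em>bumper</em>` into a product description puts the tagged terms back in context for display.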
• 68. Suggester • Used to provide automatic suggestions for query terms (an auto-suggest search box). • While technically you could use the spellchecker for this, the suggester is developed specifically for this use. • As with the spellchecker, you can configure when the suggester’s dictionary is built, and you typically will want to copy fields to a field type set up specifically for this purpose with minimal analysis on it. • There are multiple kinds of dictionaries available to use with the suggester, and you can get suggestions from more than one in a single request.
  • 69. MoreLikeThis • Enables users to search for other documents similar to one in their current results list. • You can customize how this component works in many ways, from which fields to use, number of documents to return, term frequency requirements, minimum and maximum word lengths to use, boosting and more.
• 70. But Wait, There’s More! This covers only a portion of the features of Solr. Some things we didn’t look at include: • Pagination and Cursors • Query Re-Ranking • Transforming Results • Result Clustering (tag cloud) • Spatial/Geospatial Searches • Term and Term Vectors Components • Stats Component • Caching • Query Elevation • RealTime Get • Exporting Result Sets • Distributed Search and Index Sharding • Content Streams
  • 71. Need More Help? CFWebtools, LLC 11204 Davenport, Ste. 100 Omaha, NE 68154 402 408 3733 ext. 2 https://www.cfwebtools.com