Presented on 10/11/12 at the Boston Elasticsearch meetup held at the Microsoft New England Research & Development Center. This talk gave a very high-level overview of Elasticsearch to newcomers and explained why ES is a good fit for Traackr's use case.
2. About Traackr
A search engine
A people discovery engine
Subscription-based
Migrated from Solr to
Elasticsearch in Q3 ’12
3. About me
14+ years of experience building
full-stack web software systems
with a past focus on
e-commerce and publishing
VP Engineering @ Traackr,
responsible for building
engineering capability to enable
Traackr's growth goals
about.me/george-stathis
4. About this talk
Short intro to Elasticsearch
How search is done @ Traackr
Why Elasticsearch was the right fit
5. About Elasticsearch
Lucene under the covers
Distributed from the ground up
Full support for Lucene Near Real-Time search
Native JSON Query DSL
Automatic schema detection (“schema-less”)
Supports document types
6. Elasticsearch - Distributed
Indices broken into shards
shards have 0 or more replicas
data nodes hold one or more shards
data nodes can coordinate/forward
requests
automatic routing & rebalancing but
overrides available
Default discovery mode is multicast (zen
discovery); unicast is available for
multicast-unfriendly networks. An AWS
plug-in is available, and so is a Zookeeper
plug-in, made possible by Sonian.
YouTube demo: http://youtu.be/l4ReamjCxHo
Source: https://confluence.oceanobservatories.org/display/CIDev/Indexing+with+ElasticSearch
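A minimal sketch of how automatic routing picks a shard, assuming the documented rule that the target shard is the hash of the routing value (by default the document id) modulo the number of primary shards. The function name pick_shard and the use of zlib.crc32 are assumptions made for illustration; crc32 is only a stand-in, not the hash Elasticsearch uses.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document id) to a primary shard."""
    # zlib.crc32 is a stand-in hash chosen for this sketch; it is NOT the
    # hash function Elasticsearch uses internally.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = ["post-1", "post-2", "post-3", "post-4"]

# Default behavior: each document is routed by its own id.
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
print(placement)

# Override: route every post by a shared key (here a hypothetical author id),
# so all of that author's posts land on the same shard.
same_shard = {pick_shard("author-42", 5) for _ in doc_ids}
print(same_shard)
```

The override case is the one that matters later in the talk: keeping related documents on one shard is what makes per-shard parent/child lookups possible.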
7. Elasticsearch - NRT
Uses Lucene’s IndexReader.open(IndexWriter
writer, boolean applyAllDeletes)
Opens a near real time IndexReader from the
IndexWriter
By default, refreshes the index and makes new updates
available to search every second
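A toy model of the near-real-time behavior described above, assuming nothing about Lucene internals: documents are buffered when indexed and only become visible to searches once a refresh opens a new view. The class ToyNrtIndex and its methods are hypothetical names for this sketch.

```python
class ToyNrtIndex:
    """Toy index: documents become searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Stands in for opening a new near-real-time reader:
        # everything buffered so far becomes searchable.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        return [d for d in self._searchable if term in d["body"]]


idx = ToyNrtIndex()
idx.index({"id": 1, "body": "near real time search"})
before = idx.search("real")  # not visible yet
idx.refresh()                # by default this happens every second
after = idx.search("real")
print(len(before), len(after))
```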
9. Elasticsearch - JSON DSL (cont)
# Filtered Query
# Filters are similar to queries, except they do no scoring
# and are easily cached.
# There are many filter types as well, including range and term
curl 'localhost:9200/test/_search?pretty=1' -d '{
    "query" : {
        "filtered" : {
            "query" : {
                "query_string" : {
                    "query" : "tags:scala"
                }
            },
            "filter" : {
                "range" : {
                    "price" : { "gt" : 15 }
                }
            }
        }
    }
}'
Source: https://github.com/kimchy/talks/blob/master/2011/wsnparis/06-search-querydsl.sh
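The same request body can be assembled programmatically, which helps when the query string and the filter bounds come from user input. This is a sketch only: the helper name filtered_query is hypothetical, and sending the body to a cluster is left out.

```python
import json


def filtered_query(query_string: str, field: str, gt) -> dict:
    """Build a filtered query: a scored query_string plus an unscored range filter."""
    return {
        "query": {
            "filtered": {
                # The query part contributes to relevance scoring.
                "query": {"query_string": {"query": query_string}},
                # The filter part only includes or excludes documents.
                "filter": {"range": {field: {"gt": gt}}},
            }
        }
    }


body = filtered_query("tags:scala", "price", 15)
payload = json.dumps(body)  # string suitable for curl's -d argument
print(payload)
```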
10. Elasticsearch - Schema
Dynamic object mapping with intelligent defaults
Can be turned off
Can be overridden globally or on a per index basis:
{
    "_default_" : {
        "date_formats" : ["yyyy-MM-dd", "dd-MM-yyyy", "date_optional_time"]
    }
}
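To make "automatic schema detection" concrete, here is a toy illustration of the idea: look at each JSON value and guess a field type, trying the configured date formats on strings. It is not how Elasticsearch implements dynamic mapping; the function detect_type is hypothetical, and the strptime patterns are assumed equivalents of the first two formats above ("date_optional_time" is left out).

```python
from datetime import datetime

# strptime equivalents of "yyyy-MM-dd" and "dd-MM-yyyy" (assumption of this sketch).
DATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y"]


def detect_type(value):
    """Guess a field type from a JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(value, fmt)
                return "date"
            except ValueError:
                continue
        return "string"
    return "object"


doc = {
    "title": "Refresh API",
    "published": "2012-10-11",
    "updated": "11-10-2012",
    "price": 15,
}
mapping = {field: detect_type(value) for field, value in doc.items()}
print(mapping)
```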
13. Traackr search requirements
Posts are coming in at about 1 million a day
Each author averages several hundred posts
Posts need to be available for search immediately
Relevance and sorting have to be rolled up/grouped at
the author level
14. Early approach to search
search posts
group matched posts by author
for each grouped set, add up the
Lucene scores of the posts
combine sum of post scores with
author social and website metrics
for final group score
sort groups (i.e. authors)
try to do this quickly!
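The steps above can be sketched at the application layer as follows. The talk does not give the actual formula for combining post scores with author metrics, so the weighted sum below is a placeholder, and rank_authors, metric_weight, and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_authors(post_hits, author_metrics, metric_weight=1.0):
    """Group matched posts by author, sum their scores, blend in metrics, sort.

    post_hits: list of (author_id, lucene_score) pairs for matched posts.
    author_metrics: author_id -> one number standing in for the author's
        social and website metrics (in the early approach, a database fetch).
    """
    score_sum = defaultdict(float)
    for author_id, score in post_hits:
        score_sum[author_id] += score  # add up post scores per author

    final = {
        # Placeholder combination; the real formula is not given in the talk.
        author_id: total + metric_weight * author_metrics.get(author_id, 0.0)
        for author_id, total in score_sum.items()
    }
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


hits = [("a1", 1.5), ("a2", 3.0), ("a1", 2.0), ("a3", 0.5)]
metrics = {"a1": 0.25, "a2": 0.5, "a3": 4.0}
ranking = rank_authors(hits, metrics)
print(ranking)
```

Every step here runs outside the search engine, after all matching posts have been pulled back, which is where the cost comes from.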
15. Early approach to search
search posts
group matched posts by author
for each grouped set, add up the
Lucene scores of the posts
combine sum of post scores with
author social and website metrics
for final group score (performance hit)
sort groups (i.e. authors)
try to do this quickly!
16. Room for improvement
How can we avoid the “late binding” performance
penalty?
Get the search engine to do as much of the scoring
as possible
Store all data needed for displaying results in the
search engine (i.e. no db calls)
17. Alternatives - Denormalize?
Index authors and their posts together
under one document.
Pros
straightforward
built-in post relevance sum
Cons
each profile change would trigger the
reindexing of all the author’s posts
each new post would trigger the
reindexing of all the author’s posts +
profile
a non-starter for real-time search
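A back-of-the-envelope calculation shows why this is a non-starter. The one million posts per day comes from the requirements slide; 300 posts per author is an assumed stand-in for "several hundred".

```python
NEW_POSTS_PER_DAY = 1_000_000  # from the requirements slide
POSTS_PER_AUTHOR = 300         # assumption standing in for "several hundred"

# Normalized: each new post is indexed once.
normalized_writes = NEW_POSTS_PER_DAY

# Denormalized: each new post rewrites the author document,
# which embeds all of that author's posts.
denormalized_post_writes = NEW_POSTS_PER_DAY * POSTS_PER_AUTHOR

amplification = denormalized_post_writes // normalized_writes
print(denormalized_post_writes, amplification)
```

Under these assumptions the engine would re-process about 300 million post bodies a day to absorb one million new posts, before counting profile changes.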
18. Alternatives - Solr Join?
“In many cases, documents have relationships between them and it is too expensive to denormalize
them. Thus, a join operation is needed. Preserving the document relationship allows documents to
be updated independently without having to reindex large numbers of denormalized documents.” -
http://wiki.apache.org/solr/Join
E.g. Find all post docs matching "search engines", then join them against author docs and return
that list of authors:
...?q={!join+from=author_id+to=id}search+engines
Pros
addresses the issue of loading author profiles from db
Cons
Does not preserve the post relevance scores -> non-starter
Submit patch to get scores? Wouldn’t touch SOLR-2272 with a ten-foot pole.
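A small in-memory model of the join semantics described on this slide, to show where the relevance information is lost. It models the behavior only and does not call Solr; the sample posts, authors, and the function join_posts_to_authors are hypothetical.

```python
posts = [
    {"id": "p1", "author_id": "a1", "text": "search engines compared"},
    {"id": "p2", "author_id": "a1", "text": "more on search engines"},
    {"id": "p3", "author_id": "a2", "text": "search engines again"},
    {"id": "p4", "author_id": "a3", "text": "cooking"},
]
authors = {
    "a1": {"id": "a1", "name": "Author One"},
    "a2": {"id": "a2", "name": "Author Two"},
    "a3": {"id": "a3", "name": "Author Three"},
}


def join_posts_to_authors(phrase):
    """Model of a join from posts.author_id to authors.id."""
    # Step 1: find matching posts and keep only their author ids.
    matched_author_ids = {p["author_id"] for p in posts if phrase in p["text"]}
    # Step 2: return the author docs with those ids. How many posts matched,
    # or how well, is gone by this point.
    return [authors[author_id] for author_id in sorted(matched_author_ids)]


result = join_posts_to_authors("search engines")
print(result)
```

Author One has two matching posts and Author Two has one, yet the result carries nothing to tell them apart.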
19. Alternatives - Solr Grouping?
Groups results by a given document field (e.g. author_id)
http://wiki.apache.org/solr/FieldCollapsing
...&q=real+time+search&group=true&group.field=author_id
[...]
"grouped":{
  "author_id":{
    "matches":2,
    "groups":[{
        "groupValue":"04e3bc5078344ad1a065815f0bb9f14d",
        "doclist":{"maxScore":3.456747,"numFound":1,"start":0,"docs":[
            {
              "id":"5d09240934eb331bada1ff3f0b773153",
              "title":"Refresh API",
              "url":"http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh.html",
              "author_id":"04e3bc5078344ad1a065815f0bb9f14d"}]
        }},
      {
        "groupValue":"9e4f40e1aa82f2e1a9368748d1268082",
        "doclist":{"maxScore":2.456747,"numFound":2,"start":0,"docs":[
            {
              "id":"831ce82bdff34abeb495f260bc7d67d2",
              "title":"Realtime Search: Solr vs Elasticsearch",
              "url":"http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/",
              "author_id":"9e4f40e1aa82f2e1a9368748d1268082"},
            [...]]
        }}]}}
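A sketch of what the application can read out of a grouped response shaped like the one above. The response dict is trimmed, uses made-up group values, and the helper authors_by_max_score is hypothetical. Each group exposes its best post score and a count, not a sum of post scores.

```python
# Trimmed, made-up response with the same shape as the grouped output above.
response = {
    "grouped": {
        "author_id": {
            "matches": 3,
            "groups": [
                {
                    "groupValue": "author-b",
                    "doclist": {"maxScore": 2.456747, "numFound": 2, "start": 0, "docs": []},
                },
                {
                    "groupValue": "author-a",
                    "doclist": {"maxScore": 3.456747, "numFound": 1, "start": 0, "docs": []},
                },
            ],
        }
    }
}


def authors_by_max_score(resp, field="author_id"):
    """Return (author_id, max_score, num_found) tuples, best group first."""
    groups = resp["grouped"][field]["groups"]
    ranked = sorted(groups, key=lambda g: g["doclist"]["maxScore"], reverse=True)
    return [
        (g["groupValue"], g["doclist"]["maxScore"], g["doclist"]["numFound"])
        for g in ranked
    ]


ranked_authors = authors_by_max_score(response)
print(ranked_authors)
```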
20. Alternatives - Solr Grouping?
Pros
Faster than doing grouping at the app layer: no
need for post counting
Possible to sort groups by sum of post relevance
scores inside the engine (with some custom
work)
Cons
No concept of author; author profiles still need to
be fetched from db, so still suffers from some
performance penalty
Submit patch for group sort options? Not a lot of
interest in sorting groups by anything other than
max score
Don’t want to be stuck maintaining custom
Solr code (been there, done that with HBase:
http://www.slideshare.net/gstathis/finding-the-right-nosql-db-for-the-job-the-path-to-a-nonrdbms-solution-at-traackr )
21. Alternatives - Elasticsearch!
Supports document types and parent/child document mappings:
http://www.elasticsearch.org/guide/reference/mapping/parent-field.html
{
    "post" : {
        "_parent" : {
            "type" : "author"
        }
    }
}
Out-of-the-box support for querying child documents and obtaining their parents:
http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html
curl 'localhost:9200/traackr/_search?pretty=1' -d '{
    "query": {
        "top_children": {
            "type": "post",
            "query": {
                "query_string": {
                    "query": "elasticsearch NRT"
                }
            },
            "score": "sum"
        }
    }
}'
Callout on "score": "sum": can order parent results by sum of child scores!
Con: memory heavy
Parent documents can be sorted but sum/avg/max of [...]
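A toy model of what the "score" option of top_children means for ranking: matching child (post) scores are rolled up to their parent (author) with sum, max, or avg, and parents are sorted by the result. This is a conceptual sketch, not the engine's implementation; the function top_children and the sample scores are made up.

```python
from collections import defaultdict


def top_children(child_hits, score_mode="sum"):
    """Roll child scores up to their parents and rank the parents.

    child_hits: list of (parent_id, child_score) for matching child docs.
    """
    by_parent = defaultdict(list)
    for parent_id, score in child_hits:
        by_parent[parent_id].append(score)

    reducers = {
        "sum": sum,
        "max": max,
        "avg": lambda scores: sum(scores) / len(scores),
    }
    reduce_scores = reducers[score_mode]
    parents = {pid: reduce_scores(scores) for pid, scores in by_parent.items()}
    return sorted(parents.items(), key=lambda item: item[1], reverse=True)


post_hits = [("author-1", 1.0), ("author-1", 1.5), ("author-2", 2.0)]
by_sum = top_children(post_hits, "sum")
by_max = top_children(post_hits, "max")
print(by_sum)
print(by_max)
```

With "sum", the author with two moderately relevant posts outranks the author with one stronger post; with "max", the order flips. The sum behavior is what the author-level relevance requirement asks for.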
22. Alternatives - Elasticsearch!
Big win
24. Other Elasticsearch benefits
Lucene: don’t have to give up query syntax if you come from Solr
In-JVM nodes: can use the Java API to unit test different permutations of indexing configurations (e.g. different analyzers and tokenizers); a great help for testing search on a qualitative basis; allows for embedded ES instances
Index API and Cluster API: a great deal of cluster and index configuration changes can be made on the fly through curl API calls without restarting the cluster; very convenient for testing and cluster management
Warmer API: significant help in avoiding search slowdowns caused by segment merges; https://github.com/elasticsearch/elasticsearch/issues/1913
Percolators: register queries and let the engine tell you which queries match a given document; great potential for real-time; http://www.elasticsearch.org/guide/reference/api/percolate.html
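The percolator idea inverts normal search: queries are stored, and each incoming document is checked against them. A toy version, assuming the simplest possible query model (a set of terms that must all appear), might look like this; register, percolate, and the query names are hypothetical and unrelated to the real percolate API.

```python
registered = {}  # query name -> set of required terms


def register(name, required_terms):
    """Store a query: all of its terms must appear in a document to match."""
    registered[name] = {term.lower() for term in required_terms}


def percolate(doc_text):
    """Return the names of all registered queries that match the document."""
    tokens = set(doc_text.lower().split())
    return sorted(name for name, terms in registered.items() if terms <= tokens)


register("es-nrt", ["elasticsearch", "nrt"])
register("solr", ["solr"])
matches = percolate("Elasticsearch NRT search explained")
print(matches)
```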
- important to differentiate from Solr Cloud
 - Solr Cloud (in trunk but not quite out yet; will come out with Lucene 4.0)
 - Solr Cloud uses Zookeeper to coordinate the cluster; in ES, coordination is built into every node (issue with nodes losing connectivity with the cluster and electing themselves as master; ES can use ZK as a plugin)
 - ES uses multicast, so if the network does not support it, need to switch to unicast
 - Both support distributed NRT
- refer to http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
- talk about how ES differs from Solr in that it detects the fields based on the content; Solr has the wildcard definitions.
- Solr schema.xml vs. ES REST API driven JSON DSL config which can be dynamic
if curl statements get snoozes, show real app demo
Percolators? Don’t trigger when a record is available for searching (Igor’s comment)