The document discusses why a dedicated search engine like Elasticsearch is better than a traditional database for search tasks. It explains that databases are optimized for data storage and retrieval by unique IDs, but are slow and inefficient for full text search. Elasticsearch uses an inverted index which allows it to quickly search text fields and return relevant results. It analyzes, normalizes, and indexes documents upfront so queries can be executed rapidly against the index. Ranking algorithms ensure the most relevant documents are prioritized in results.
3. Why Search?
• What does a dedicated search engine do?
o that a database doesn’t?
• Why not [MySQL | MongoDB | Cassandra | etc.]?
• Why a dedicated search engine?
OpenSource Connections
4. Why not MySQL?
• We’ve got rows of stuff in tables. For example, for SciFi StackExchange, we’ve stored ~20K posts:
PostID | UserId | CreationDate | ViewCount | Body
0 | 1 | 2011-01-11T20:52:46.75 | 1243 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1 | 2 | 2013-02-01T12:44:46.52 | 5255 | <p>Been meaning to read the Foundation Series, what should I read first?</p>
5. Why not MySQL?
• Our mission: find every “Darth Vader” in SciFi StackExchange posts!
P | U | C | V | Body | Result
0 | 1 | 2 | 1 | <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p> | Found!
1 | 2 | 2 | 5 | <p>Been meaning to read the Foundation Series, what should I read first?</p> | Missing!
6. Why not MySQL – SQL LIKE?
• SQL “LIKE” operator – scans all rows for a specific wildcard match
SELECT * FROM posts WHERE body LIKE "%darth vader%"
• Performs a table scan, testing every row in turn: Match? Match? Match? Match?
Approx 300ms to search a measly 20K docs!
(what if we had 20 Million?)
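A minimal sketch of the table scan, for anyone who wants to see it. Assumptions: Python's built-in sqlite3 stands in for MySQL, and the two-row `posts` table with a `post_id` column is a hypothetical cut-down of the real one. The query plan reported by the engine should show a scan of the whole table, since a leading wildcard gives it nothing to look up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (post_id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, "
            "what should I read first?</p>"),
    ],
)

query = "SELECT post_id FROM posts WHERE body LIKE '%darth vader%'"

# The last column of each EXPLAIN QUERY PLAN row describes the chosen strategy.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
matches = [row[0] for row in conn.execute(query)]

print(plan)     # strategy text varies by SQLite version
print(matches)  # ids of posts whose body contains the literal text
```

Every row is visited once per query, so the cost grows in step with the number of posts.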
7. SQL LIKE – other problems
• Can’t search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
• Can’t search for alternate forms of a word:
SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
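Both problems come from LIKE being a literal substring test. A rough sketch, assuming a case-insensitive collation; `like_contains` is a hypothetical helper that mimics `LIKE "%pattern%"`, not a real database call, and the "kitty" text is made up for illustration.

```python
def like_contains(body: str, pattern: str) -> bool:
    """Mimic LIKE '%pattern%': a case-insensitive literal substring test."""
    return pattern.lower() in body.lower()


body = ("What exactly did Obiwan know about Anakin and Darth Vader "
        "before a New Hope started?")

# Same words, same order: the substring is present.
in_order = like_contains(body, "darth vader")

# Same words, different order: the literal text never occurs.
out_of_order = like_contains(body, "vader, darth")

# A different spelling of the same word is a different substring.
alternate_form = like_contains("cute kitty pictures", "kittie pictures")

print(in_order, out_of_order, alternate_form)
```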
8. SQL LIKE – other problems
• No ranking of results – given these two docs:
Doc 1 – directly about Darth Vader:
I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
Doc 2 – Darth Vader is a side topic here:
One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…
Which should come first?
9. SQL LIKE | CTRL+F | grep is
1. Extremely slow
2. Not fuzzy -- needs exact literal matches, no fuzziness!
3. Unranked -- simply says y/n whether there is a match
10. Search needs to be
1. FAST! A data structure that can efficiently take
search terms and return a set of documents
2. FUZZY! A way to record positional and fuzzy
modifications to text to assist matching
3. FRUITFUL! Relevant documents bubble to the top.
11. Let's play with an implementation
• Your database’s full text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom
• Lucene -> Solr and Elasticsearch
o Lucene, 1999, by Doug Cutting: Java library for search
o Solr, 2006, Yonik Seeley: first to put Lucene behind an HTTP interface; still going strong
o Elasticsearch, 2010, Shay Banon: alternative implementation; extremely RESTful
12. Elasticsearch
• Create an index
curl -XPUT http://localhost:9200/stackexchange
• Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
13. What is being built?
The answer can be found in your textbook…
Book index:
• Topics -> page numbers
• Very efficient tool – compare to scanning the whole book!
Lucene uses an inverted index:
• Tokens => document IDs:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
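The token-to-document mapping above can be sketched in a few lines. This is a toy, not Lucene's actual on-disk structure; `build_inverted_index` and the sample token lists are hypothetical, chosen so the output matches the slide.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in index.items()}


# Hypothetical, already-tokenized documents keyed by document id.
docs = {
    0: ["lightsaber"],
    1: ["lightsaber"],
    2: ["laser", "light"],
    4: ["laser"],
    5: ["light", "lightsaber"],
    7: ["lightsaber"],
}

index = build_inverted_index(docs)
print(index["laser"])       # one dictionary lookup, no scan over documents
print(index["lightsaber"])
```

The work happens once, at indexing time. A search for a token is then a single lookup, however many documents there are.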
14. Computers == Dumb
• Humans are smart
o I see “cat” or “cats” in the back of a book, no duh – jump
to page 9
• Computers are dumb,
o “CAT” != “cat” – no match returned
o “cat” != “cats” – no match returned
• Hence, when indexing, normalize text to a more searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu
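A toy version of this normalization, to show why "alumnus" ends up as "alumnu": suffixes are stripped blindly, with no dictionary. `toy_stem` is a hypothetical three-rule sketch, not the stemmer Lucene ships; real stemmers apply many more rules.

```python
def toy_stem(word: str) -> str:
    """Lowercase a word, then strip a few common English suffixes."""
    word = word.lower()
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undouble a trailing doubled consonant: "fitt" -> "fit".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss"):
        # Plural-looking words lose their final "s", whether or not
        # they are really plurals.
        word = word[:-1]
    return word


for raw in ["CAT", "cats", "fitted", "alumnus"]:
    print(raw, "->", toy_stem(raw))
```

The stem does not need to be a real word. It only needs to be the same at index time and at query time.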
15. Normalization aka Text Analysis
• Raw input
o <p>Darth Vader dined with Luke</p>
• Filtered (char filter)
o Darth Vader dined with Luke
• Tokenized
o [Darth] [Vader] [dined] [with] [Luke]
• Token filters (lowercased, synonyms applied, pointless words removed)
o [darth] [vader] [dine] [luke]
• Most importantly: this is highly configurable
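The three stages can be sketched as plain functions. Assumptions: the regular expressions, the `STOPWORDS` set and the `STEMS` lookup are hypothetical stand-ins for configurable Elasticsearch components, not its actual analyzers.

```python
import re

STOPWORDS = {"a", "the", "with"}  # hypothetical list of "pointless" words
STEMS = {"dined": "dine"}         # hypothetical stand-in for a real stemmer


def char_filter(text: str) -> str:
    """Strip HTML tags from the raw input."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> list[str]:
    """Split the filtered text into word tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens: list[str]) -> list[str]:
    """Lowercase, drop stop words, then reduce each token to its stem."""
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in STOPWORDS]
    return [STEMS.get(token, token) for token in kept]


def analyze(text: str) -> list[str]:
    """Run the full chain: char filter, tokenizer, token filters."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("<p>Darth Vader dined with Luke</p>"))
```

Swapping any one function changes how text is matched, which is what "highly configurable" means in practice.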
17. What is being built?
field Body
term darth: doc 1 <metadata>, doc 2 <metadata>
term vader: doc 1 <metadata>
term dine: doc 1 <metadata>
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
"Body": "<p>We love Darth</p>",
"Title": "..."}'
18. Ranking
field Body
term darth: doc 1 <metadata>, doc 2 <metadata>
term vader: doc 1 <metadata>
term dine: doc 1 <metadata>
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
"Body": "<p>We love Darth</p>",
"Title": "..."}'
Can we store anything in <metadata> to help decide how relevant this term is for this doc?
Yes!
• Term frequency
o How much “darth” is in this doc?
• Position within document
o Helps when we search for the phrase “darth vader”
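One way to picture the per-document metadata is a count plus a list of positions for each term. This is a sketch under assumptions: `build_postings` and `phrase_match` are hypothetical helpers, the token lists are assumed analyzer output, and Lucene's real postings format is far more compact.

```python
from collections import defaultdict


def build_postings(docs: dict[int, list[str]]) -> dict:
    """For each term and doc, record term frequency and token positions."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            entry = postings[token].setdefault(doc_id, {"tf": 0, "positions": []})
            entry["tf"] += 1
            entry["positions"].append(position)
    return postings


def phrase_match(postings: dict, first: str, second: str, doc_id: int) -> bool:
    """True if `second` appears immediately after `first` in the doc."""
    first_pos = postings.get(first, {}).get(doc_id, {}).get("positions", [])
    second_pos = set(postings.get(second, {}).get(doc_id, {}).get("positions", []))
    return any(position + 1 in second_pos for position in first_pos)


# Assumed analyzed forms of the two posts indexed above.
docs = {
    1: ["darth", "vader", "dine", "luke"],
    2: ["we", "love", "darth"],
}

postings = build_postings(docs)
print(postings["darth"])                            # metadata per matching doc
print(phrase_match(postings, "darth", "vader", 1))  # adjacent in doc 1
print(phrase_match(postings, "darth", "vader", 2))  # doc 2 has no "vader"
```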
19. Query Documents
• When did Darth Vader and Luke have dinner?
• User query: luke darth dinner
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" \
-d '
{
"query": {
"match": {
"Body": "luke darth dinner"
}
}
}'
20. What happens when we query?
• User query: luke darth dinner
• Analysis: [luke] [darth] [dine]
• How to consult index for matches?
o [darth]: score for [darth] docs (1 and 2)
o [dine]: score for [dine] docs (1)
• Return sorted docs to client
field Body
term darth: doc 1 <metadata>, doc 2 <metadata>
term vader: doc 1 <metadata>
term dine: doc 1 <metadata>
...
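The query flow above, as a rough sketch. Assumptions: `INDEX` holds made-up term frequencies for the two example docs, `QUERY_STEMS` is a hypothetical stand-in for the analyzer, and summing raw term frequencies is a deliberate simplification. Lucene's real scoring also weighs things like how rare a term is.

```python
# Hypothetical index: term -> {doc id: term frequency}.
INDEX = {
    "darth": {1: 1, 2: 1},
    "vader": {1: 1},
    "dine": {1: 1},
    "luke": {1: 1},
}

QUERY_STEMS = {"dinner": "dine"}  # stand-in for the analysis step


def analyze_query(text: str) -> list[str]:
    """Apply the same normalization to the query as to the documents."""
    return [QUERY_STEMS.get(token, token) for token in text.lower().split()]


def search(index: dict, query: str) -> list[tuple[int, int]]:
    """Look up each query term, add up scores per doc, sort best first."""
    scores = {}
    for term in analyze_query(query):
        for doc_id, term_frequency in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + term_frequency
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


results = search(INDEX, "luke darth dinner")
print(results)  # (doc id, score) pairs, most relevant first
```

Only the postings for the query terms are touched. Documents that share no term with the query are never looked at.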
21. So Elasticsearch!
• FAST!
o Inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation
• FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not
shown here)
• FRUITFUL!
o Relevant documents are scored based on relative term
frequency
22. BUT WAIT THERE’S MORE
• Many non-traditional applications of “search”
o Rank file directory by proximity to current directory
o Geographic-aided search, rank based on distance and
search relevancy
o Q & A systems – Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!
• And many features!
o Spellchecking
o Facets
o More-like-this documents