How Search Engines Work (A Thing I Didn't Learn in University)

Toria Gibbs
@scarletdrive

Agenda
1 Introduction / who is this lady?
2 Text Search / inverted indexes
  a Performance
  b Quality
3 Relevance / quality part 2: ranking the results
4 Open Source Tools / free search engines!
5 Conclusion / bye

Who is this lady?
Toria Gibbs
Bachelor of Computer Science 2010

2010 → 2020
😱

Raise your hand if you learned
about search engines in university

Story time!
📖

Search index?
🤔

Database index!
💡

They hired me!
😁

(even though I was wrong)
😬

🙋🏽 💁🏻‍♀
Hey Toria, can I get
some help?
Heck yes you can,
buddy!

🙋🏽
How much disk space
will my new field use?

🤷🏻‍♀
...............???

torias-pet-emporium.myshopify.com

Assume we have a database...

id  title                  price
1   red cat mittens        14.99
2   blue dog mittens       24.99
3   blue hat for cats      8.00
4   vacation hat for dog   12.99
5   cat hat                5.00
6   red and blue dog hat   10.49
7   kitten mittens         11.99
8   dog booties            11.99

Query: cat
SELECT *
FROM items
WHERE title LIKE '%cat%'

2 / Text Search
A / Performance

n = items in database
m = max length of title strings
Cost of scanning every title: O(n·m)

With m fixed (m = 256), that simplifies to O(n)

n            n · m (m = 256)
10           2 560
100          25 600
1 000        256 000
10 000       2 560 000
100 000      25 600 000
1 000 000    256 000 000
10 000 000   2 560 000 000
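
A rough sketch of what the LIKE query has to do, assuming the eight items from the example database live in a plain Python list (the names ITEMS and like_scan are made up for illustration). Every title is checked, and every check is a substring search, which is where the n·m comes from.

```python
# Hypothetical in-memory stand-in for the `items` table from the slides.
ITEMS = [
    (1, "red cat mittens"),
    (2, "blue dog mittens"),
    (3, "blue hat for cats"),
    (4, "vacation hat for dog"),
    (5, "cat hat"),
    (6, "red and blue dog hat"),
    (7, "kitten mittens"),
    (8, "dog booties"),
]


def like_scan(items, needle):
    """Mimic WHERE title LIKE '%needle%': look at every title, one by one."""
    return [item_id for item_id, title in items if needle in title]


print(like_scan(ITEMS, "cat"))  # [1, 3, 4, 5]
```

Item 4 shows up because "vacation" contains the letters "cat", which is the quality problem discussed in the next section.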

Let’s make it faster
We can look up an item by its
ID in constant time.

Inverted Index
Map from words to sets of IDs of records
which contained those words

term      ids
red       [1, 6]
cat       [1, 3, 5]
mitten    [1, 2, 7]
blue      [2, 3, 6]
hat       [3, 4, 5, 6]
dog       [2, 4, 6, 8]
vacation  [4]
kitten    [7]
boot      [8]

Query: cat
cat → [1, 3, 5]

O(1)
● Assumes perfect hash function
● Trade-offs: storage, pre-processing, complexity
● Additional lookup step still required

Looking up the matching records by ID:
O(r)
r = number of results found

...but we usually only ask for a fixed
number of results at a time
O(25) → O(1)
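
A minimal sketch of the two steps, assuming the inverted index and the items table are both plain Python dicts (INDEX, ITEMS_BY_ID, search and page_size are illustrative names, not any search engine's API). The term lookup is one hash lookup, and fetching records is capped by the page size.

```python
# The inverted index from the slides, as a dict from term to item IDs.
INDEX = {
    "red": [1, 6],
    "cat": [1, 3, 5],
    "mitten": [1, 2, 7],
    "blue": [2, 3, 6],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
    "vacation": [4],
    "kitten": [7],
    "boot": [8],
}

# Hypothetical primary-key lookup table for the items.
ITEMS_BY_ID = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def search(term, page_size=25):
    """Return at most page_size (id, title) pairs for a single term."""
    ids = INDEX.get(term, [])   # one hash lookup, independent of n
    page = ids[:page_size]      # only fetch a fixed number of results
    return [(item_id, ITEMS_BY_ID[item_id]) for item_id in page]


print(search("cat"))
```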

Search engines provide faster results
than a database for text search

2 / Text Search
B / Quality

Query: cat
SELECT *
FROM items
WHERE title LIKE '%cat%'
● Search for “cat” incorrectly
returns “vacation hat for dog”
● Search for “cat” doesn’t return
“kitten mittens”

Query: cats
SELECT *
FROM items
WHERE title LIKE '%cats%'
● Search for “cats” doesn’t return
“cat hat” or “red cat mittens”

SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...

How did we do this?

Analyzers
1. Tokenizers
2. Normalizers (a.k.a. filters)
   ○ Stemmers
   ○ Lowercase, character filters
   ○ Stop words
   ○ Synonyms

Tokenization
string: “cat hat”
array: [“cat”, “hat”]
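
One simple way to tokenize, sketched with a regular expression that keeps runs of letters and digits and drops everything else. Real tokenizers have many more rules; the function name tokenize is only an illustration.

```python
import re


def tokenize(text):
    """Split text into word tokens, discarding spaces and punctuation."""
    return re.findall(r"\w+", text)


print(tokenize("cat hat"))                # ['cat', 'hat']
print(tokenize("red-and-blue dog hat."))  # ['red', 'and', 'blue', 'dog', 'hat']
```

Splitting on punctuation here is what replaces the long chain of LIKE patterns for “cat.”, “,cat” and “-cat-”.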

Stemming
“dogs” → “dog”
“walking” → “walk”
“fetched” → “fetch”
“ran” → “run” (irregular form: needs a dictionary-based stemmer or lemmatizer, not just suffix stripping)
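
A toy suffix-stripping stemmer, just to show the idea. The suffix list and the minimum stem length are assumptions picked to fit the examples in this talk; production stemmers use much larger rule sets, and this sketch makes no attempt at irregular forms.

```python
# Toy rules, checked in order. Not a real stemming algorithm.
SUFFIXES = ("ies", "ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("dogs", "walking", "fetched", "booties", "red"):
    print(word, "->", stem(word))
```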

Lowercase
“Toria” → “toria”
“WOW” → “wow”
Character Filters
“résumé” → “resume”
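
A small sketch of both filters using only the Python standard library. Accents are removed by decomposing each character and dropping the combining marks; normalize is a made-up name for this example.

```python
import unicodedata


def normalize(token):
    """Lowercase a token and strip accents such as the ones in 'résumé'."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize("Toria"))   # toria
print(normalize("WOW"))     # wow
print(normalize("résumé"))  # resume
```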

Stop Words
Remove “the”, “and”,
“or”, “but”, etc...
Synonyms
“colour” → “color”
“lb” → “pound”
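
Both filters can be sketched as a set and a dict applied to a list of tokens. The word lists below contain only the examples from this slide; real stop word and synonym lists are longer and usually configurable.

```python
STOP_WORDS = {"the", "and", "or", "but"}
SYNONYMS = {"colour": "color", "lb": "pound"}


def filter_tokens(tokens):
    """Drop stop words, then map each remaining token to its canonical synonym."""
    return [SYNONYMS.get(token, token) for token in tokens if token not in STOP_WORDS]


print(filter_tokens(["the", "colour", "and", "the", "lb"]))  # ['color', 'pound']
```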

Quality Problems
1. “cat” search returned “vacation hat for dog”

id  title                 price
4   vacation hat for dog  12.99
5   cat hat               5.00

Tokenize it
[“vacation”, “hat”, “for”, “dog”]
[“cat”, “hat”]
Remove stop words
[“vacation”, “hat”, “dog”]
[“cat”, “hat”]

term      ids
cat       [5]
hat       [4, 5]
dog       [4]
vacation  [4]

Query: cat
cat → [5]
● Search for “cat” does not return “vacation hat for dog” due to tokenization

Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat”

id  title              price
3   blue hat for cats  8.00
5   cat hat            5.00

Tokenize it
[“blue”, “hat”, “for”, “cats”]
Stem it
[“blue”, “hat”, “for”, “cat”]
Remove stop words
[“blue”, “hat”, “cat”]

term  ids
blue  [3]
cat   [3, 5]
hat   [3, 5]

Query: cats
cats → ???

All transformations performed on
the input data for the index
are also performed on the query

Query: cats
Stem it
cat → [3, 5]
● Search for “cats” does return
“cat hat” due to stemming

Quality Problems
2. “cats” search does not return “cat hat” ✓
3. “cat” search does not return “kitten mittens”

id  title           price
5   cat hat         5.00
7   kitten mittens  11.99

Tokenize it
[“kitten”, “mittens”]
Stem it
[“kitten”, “mitten”]
Swap synonyms
[“cat”, “mitten”]

term    ids
cat     [5, 7]
hat     [5]
mitten  [7]

Query: cat
cat → [5, 7]
● Search for “cat” returns all
items with “cat” or “kitten” due
to synonyms

Query: kitten
Swap synonym
cat → [5, 7]
● Search for “kitten” returns all
items with “cat” or “kitten” due
to synonyms

Quality Problems
2. “cats” search does not return “cat hat” ✓
3. “cat” search does not return “kitten mittens” ✓

Search engines provide faster and
better quality results than a
database for text search
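
Putting the pieces together: a compact sketch that runs the same analyzer over the titles at index time and over the query at search time. Everything here is illustrative (toy stemmer rules, a stop word list extended with “for”, a single synonym), chosen so the result matches the index shown in this talk. It is not how any particular search engine implements analysis.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "or", "but", "for"}
SYNONYMS = {"kitten": "cat"}
SUFFIXES = ("ies", "ing", "ed", "s")  # toy stemmer rules

TITLES = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Tokenize, lowercase, stem, swap synonyms, remove stop words."""
    terms = []
    for token in re.findall(r"\w+", text.lower()):
        term = SYNONYMS.get(stem(token), stem(token))
        if term not in STOP_WORDS:
            terms.append(term)
    return terms


def build_index(docs):
    """Map each analyzed term to the sorted list of IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in dict.fromkeys(analyze(docs[doc_id])):  # de-duplicate per doc
            index[term].append(doc_id)
    return dict(index)


def search(index, query):
    """Analyze the query the same way, then union the matching ID lists."""
    ids = set()
    for term in analyze(query):
        ids.update(index.get(term, []))
    return sorted(ids)


INDEX = build_index(TITLES)
print(search(INDEX, "cat"))     # [1, 3, 5, 7]
print(search(INDEX, "cats"))    # [1, 3, 5, 7]
print(search(INDEX, "kitten"))  # [1, 3, 5, 7]
```

“vacation hat for dog” (item 4) no longer matches “cat”, while “kitten mittens” (item 7) now does.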

🙋🏽
How much disk space
will my new field use?
👩🏻‍🎓
I learned things!
I can help!

👩🏻‍🎓
Can you tell me anything about the
characteristics of these strings?
🙋🏽
It’s a string field, but it’s only going
to be 100 characters long, max.

Items: 8 rows
id  title
1   red cat mittens
2   blue dog mittens
3   blue hat for cats
4   vacation hat for dog
5   cat hat
6   red and blue dog hat
7   kitten mittens
8   dog booties

Index: 8 rows
term      ids
red       [1, 6]
cat       [1, 3, 5, 7]
mitten    [1, 2, 7]
blue      [2, 3, 6]
hat       [3, 4, 5, 6]
dog       [2, 4, 6, 8]
vacation  [4]
boot      [8]

Items: 100 rows
id   text
1    good dog
2    bad dog
3    good dog
4    bad dog
5    good dog
6    bad dog
7    good dog
8    bad dog
...
100  bad dog

Index: 3 rows
term  ids
good  [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, … 99]
bad   [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, … 100]
dog   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, … 99, 100]
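
A quick way to see why the characteristics of the strings matter: count how many distinct terms (index rows) and how many ID entries an index would hold. This is a sketch with whitespace tokenization only and made-up names (index_shape, DOCS); actual disk usage also depends on how the engine stores and compresses the lists.

```python
def index_shape(docs):
    """Return (number of distinct terms, total number of ID entries)."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            postings.setdefault(term, []).append(doc_id)
    total_entries = sum(len(ids) for ids in postings.values())
    return len(postings), total_entries


# The 100-row example: odd IDs are "good dog", even IDs are "bad dog".
DOCS = {i: ("good dog" if i % 2 else "bad dog") for i in range(1, 101)}
print(index_shape(DOCS))  # (3, 200)
```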

🙋🏽
Why yes, they are categories
which are static and well-defined.
👩🏻‍🎓
AWESOME.

categories
Pet Accessories
Pet Beds

term       ids
pet        ?
accessory  ?
bed        ?

3 / Relevance
When there are many results, what
order should we display them in?

id   title                    price
5    cat hat                  5.00
22   feather cat toy          7.99
124  cat and mouse t-shirt    24.50
128  cat t-shirt              31.80
329  “cats rule” sticker      0.99
420  catnip joint for cats    5.99
455  cat toy                  7.00
...  ...                      ...

tf-idf
term frequency
inverse document frequency

Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)

Query: cat
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give
every cat a special cat toy.

1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3)
Result order = [1, 3, 2]
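
The numbers above can be reproduced with a few lines of Python. This is a sketch under simple assumptions: words are lowercased, split on non-letters, and a trailing "s" is dropped so that "Cats" counts as "cat". Because every document contains "cat", IDF(cat) is 0 here, so the sketch ranks by TF alone, as the slide does.

```python
import math
import re

DOCS = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}


def terms(text):
    """Lowercase, tokenize, and crudely strip a plural 's' (toy stemming)."""
    words = re.findall(r"\w+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def tf(term, text):
    words = terms(text)
    return words.count(term) / len(words)


def idf(term, docs):
    containing = sum(1 for text in docs.values() if term in terms(text))
    return math.log(len(docs) / containing)


ranking = sorted(DOCS, key=lambda doc_id: tf("cat", DOCS[doc_id]), reverse=True)
print(ranking)           # [1, 3, 2]
print(idf("cat", DOCS))  # 0.0
```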

Query: cat
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give
every cat a special cat toy.

1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3)
Result order = [2, 1, 3]

Query: orange cat
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain
“cat” in them)

Query: orange cat
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9

TF(orange, doc1) = 1/7 = 0.14
TF(orange, doc2) = 1/5 = 0.20

score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = score(cat, doc2) + score(orange, doc2) = TF(cat, doc2)*0.0 + 0.20*3.9 = 0.78
Result order = [2, 1]
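
The same calculation as a sketch in Python. The corpus size (100) and document frequencies (100 for "cat", 2 for "orange") come from the example; the text of document 2 is assumed to be "my cat ate an orange" from the earlier slide, which fits the 1/5 term frequency. Without rounding the intermediate values, document 1 scores about 0.56 rather than 0.55; the order is the same.

```python
import math

TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}  # number of docs containing each term

DOCS = {
    1: "the orange cat is a good cat",
    2: "my cat ate an orange",  # assumed text for document 2
}


def tf(term, text):
    words = text.split()
    return words.count(term) / len(words)


def idf(term):
    return math.log(TOTAL_DOCS / DOC_FREQ[term])


def score(query, text):
    """Sum of TF * IDF over the query terms."""
    return sum(tf(term, text) * idf(term) for term in query.split())


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [2, 1]
```

The common word "cat" contributes nothing to either score, so the rarer word "orange" decides the order.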

Better Relevance
● Phrase matching
● Fuzzy matching, spelling correction
● User factors: location, language
● Other factors: quality, recency, randomness

bm25
is the cool new thing
RIP tf-idf
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
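
For comparison with tf-idf, here is a sketch of the commonly published BM25 score for a single query term. The parameter defaults k1 = 1.2 and b = 0.75 and the "+1 inside the log" form of IDF are assumptions based on the widely used variant; check your search engine's documentation for what it actually does. The function name is made up.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score.

    tf is the raw count of the term in the document (not a ratio).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Repeating a term helps, but less and less each time.
for count in (1, 2, 4, 8):
    print(count, round(bm25_term_score(count, 10, 10, 100, 2), 3))
```

Two differences from plain tf-idf show up here: term frequency saturates instead of growing without bound, and longer-than-average documents are penalized.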

● Inverted index
● Basic tokenization and normalization
● Ranking
● Replication, sharding, and distribution
● Caching and warming
● Advanced tokenization and normalization
● Advanced ranking
● Plugins

Which one should I pick?
It doesn’t matter

● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/

Better for advanced
customization
Easier to learn, faster to
start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~

Recap
● Inverted index for text search
  ○ Faster than a database
  ○ Better quality than a database
● Ranking for relevance with tf-idf (or bm25)
● Solr and Elasticsearch are great open source solutions

Thank you!
careers.shopify.com
engineering.shopify.com (blog)
